pybliometrics.scopus.ScopusSearch

ScopusSearch() implements the Scopus Search API. It performs a query to search for documents and then retrieves the records of the query.

Documentation

class pybliometrics.scopus.ScopusSearch(query, refresh=False, view=None, verbose=False, download=True, integrity_fields=None, integrity_action='raise', subscriber=True, **kwds)

Interaction with the Scopus Search API.

Parameters
  • query (str) – A string of the query as used in the Advanced Search on scopus.com. All fields except “INDEXTERMS()” and “LIMIT-TO()” work.

  • refresh (Union[bool, int], optional) – Whether to refresh the cached file if it exists. If an int is passed, the cached file is refreshed if the number of days since its last modification exceeds that value.

    Default: False

  • view (Optional[str], optional) – Which view to use for the query, see https://dev.elsevier.com/sc_search_views.html. Allowed values: STANDARD, COMPLETE. If None, defaults to COMPLETE if subscriber=True and to STANDARD if subscriber=False.

    Default: None

  • verbose (bool, optional) – Whether to print a download progress bar.

    Default: False

  • download (bool, optional) – Whether to download results (if they have not been cached).

    Default: True

  • integrity_fields (Union[List[str], Tuple[str, …], None], optional) – Names of fields whose completeness should be checked. ScopusSearch will perform the action specified in integrity_action if elements in these fields are missing. This helps avoid idiosyncratically missing elements that should always be present (e.g., EID or source ID).

    Default: None

  • integrity_action (str, optional) – What to do if the integrity of the provided fields cannot be verified. Possible actions:

      - “raise”: Raise an AttributeError
      - “warn”: Issue a UserWarning

    Default: 'raise'

  • subscriber (bool, optional) – Whether you access Scopus with a subscription or not. For subscribers, Scopus’s cursor navigation will be used. The number of entries in each query iteration is set to the maximum number allowed by the corresponding view.

    Default: True

  • kwds (str) – Keywords passed on as query parameters. Must contain fields and values mentioned in the API specification at https://dev.elsevier.com/documentation/ScopusSearchAPI.wadl.

Raises
  • ScopusQueryError – For non-subscribers, if the number of search results exceeds 5000.

  • ValueError – If any of the parameters integrity_action, refresh or view is not one of the allowed values.

Return type

None

Notes

The directory for cached results is {path}/{view}/{fname}, where path is specified in your configuration file and fname is the md5-hashed version of query.
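
As a rough illustration of this naming scheme, the following sketch builds such a path by hand. It assumes that the hash is the hexadecimal MD5 digest of the UTF-8 encoded query. The helper name cache_path and the base directory are hypothetical, and the library's own implementation may differ in detail.

```python
import hashlib
from pathlib import Path


def cache_path(base_dir, view, query):
    """Build {path}/{view}/{fname} for a query (illustrative only)."""
    # Assumption: fname is the hex MD5 digest of the UTF-8 encoded query
    fname = hashlib.md5(query.encode("utf8")).hexdigest()
    return Path(base_dir) / view / fname


# Hypothetical base directory; the real one comes from your configuration file
p = cache_path("/tmp/scopus_search", "COMPLETE", "FIRSTAUTH ( kitchin  j.r. )")
print(p.parent.name, len(p.name))
```

Because the file name depends only on the query string, two queries that differ in whitespace alone would map to different cached files under this assumption.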

property results

A list of namedtuples in the form (eid doi pii pubmed_id title subtype subtypeDescription creator afid affilname affiliation_city affiliation_country author_count author_names author_ids author_afids coverDate coverDisplayDate publicationName issn source_id eIssn aggregationType volume issueIdentifier article_number pageRange description authkeywords citedby_count openaccess fund_acr fund_no fund_sponsor). Field definitions correspond to https://dev.elsevier.com/guides/ScopusSearchViews.htm and return the values as-is, except for afid, affilname, affiliation_city, affiliation_country, author_names, author_ids and author_afids: the values of these fields are joined on “;”. If an author has multiple affiliations, they are joined on “-” (e.g. Author1Aff;Author2Aff1-Author2Aff2).

Raises

ValueError – If the elements provided in integrity_fields do not match the actual field names (listed above).

Notes

The list of authors and the list of affiliations per author are deduplicated.
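
To work with the joined fields, you will usually split them again. The following is a minimal sketch, assuming that author_names, author_ids and author_afids of one result have the same number of “;”-separated entries. The helper split_authors is hypothetical and not part of pybliometrics; the sample values come from the example table in the Examples section.

```python
def split_authors(author_names, author_ids, author_afids):
    """Turn the ";"-joined author fields of one result into per-author records."""
    names = author_names.split(";")
    ids = author_ids.split(";")
    # Each author's affiliations are themselves joined on "-"
    afids = [a.split("-") for a in author_afids.split(";")]
    return list(zip(names, ids, afids))


# Sample values from the example table (second row)
records = split_authors(
    "Kitchin, John R.;Gellman, Andrew J.",
    "7004212771;35514271900",
    "60027950;60027950",
)
for name, auth_id, affs in records:
    print(name, auth_id, affs)
```

Fields without a value are returned as None, so check for that before splitting.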

get_eids()

EIDs of retrieved documents.

get_cache_file_age()

Return the age of the cached file in days.

Return type

int

get_cache_file_mdate()

Return the modification date of the cached file.

Return type

str

get_key_remaining_quota()

Return the number of remaining requests for the current key and the current API (relative to the last actual request).

Return type

Optional[str]

get_key_reset_time()

Return the time when the current key is reset (relative to the last actual request).

Return type

Optional[str]

get_results_size()

Return the number of results (works even if download=False).

Return type

int

Examples

The class is initialized with a search query. Any query that works in the Advanced Search on scopus.com will work, with two exceptions: “LIMIT-TO()”, which only affects the display of the results on scopus.com but not the selection of results per se, and “INDEXTERMS()”. An invalid search query raises an error.

>>> from pybliometrics.scopus import ScopusSearch
>>> s = ScopusSearch('FIRSTAUTH ( kitchin  j.r. )')

You can obtain a search summary just by printing the object:

>>> print(s)
Search 'FIRSTAUTH ( kitchin  j.r. )' yielded 13 documents of 2020-04-15:
    2-s2.0-85048443766
    2-s2.0-85019169906
    2-s2.0-84971324241
    2-s2.0-84930349644
    2-s2.0-84930616647
    2-s2.0-67449106405
    2-s2.0-40949100780
    2-s2.0-37349101648
    2-s2.0-20544467859
    2-s2.0-13444307808
    2-s2.0-2942640180
    2-s2.0-0141924604
    2-s2.0-0037368024

Non-subscribers must instantiate the class with subscriber=False. They can retrieve at most 5000 results per query; a query that yields more raises a ScopusQueryError. This limit does not exist for subscribers.

You can obtain the number of results programmatically via .get_results_size():

>>> s.get_results_size()
13

This method works even if you choose not to download results. It thus helps you decide programmatically whether to proceed with the download:

>>> from pybliometrics.scopus import ScopusSearch
>>> other = ScopusSearch('AUTHLASTNAME(Brown)', download=False)
>>> other.get_results_size()
259526
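
This two-step pattern can be wrapped in a small helper. The sketch below is not part of pybliometrics: fetch_if_small, its search_cls parameter and the threshold of 5000 are illustrative choices. The search class is passed in as an argument so that the logic can be tried without network access.

```python
def fetch_if_small(search_cls, query, max_results=5000, **kwds):
    """Download results only if the query yields at most max_results documents."""
    # First pass: ask for the result count without downloading anything
    probe = search_cls(query, download=False, **kwds)
    n_results = probe.get_results_size()
    if n_results > max_results:
        return None  # too large; let the caller refine the query
    # Second pass: the default download=True retrieves (and caches) the results
    return search_cls(query, **kwds).results
```

With pybliometrics installed and configured, you would call it as fetch_if_small(ScopusSearch, 'AUTHLASTNAME(Brown)'). For the query above it would return None, because 259526 exceeds the threshold.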

The main attribute of the class, results, returns a list of namedtuples. They can be used neatly with pandas to form DataFrames:

>>> import pandas as pd
>>> df = pd.DataFrame(s.results)
>>> df.columns
Index(['eid', 'doi', 'pii', 'pubmed_id', 'title', 'subtype', 'subtypeDescription', 'creator',
       'afid', 'affilname', 'affiliation_city', 'affiliation_country', 'author_count',
       'author_names', 'author_ids', 'author_afids', 'coverDate',
       'coverDisplayDate', 'publicationName', 'issn', 'source_id', 'eIssn',
       'aggregationType', 'volume', 'issueIdentifier', 'article_number',
       'pageRange', 'description', 'authkeywords', 'citedby_count',
       'openaccess', 'fund_acr', 'fund_no', 'fund_sponsor'],
      dtype='object')
>>> df.shape
(12, 34)
>>> pd.set_option('display.max_columns', None)
>>> df.head()
                  eid                         doi                pii  \
0  2-s2.0-85019169906   10.1007/s00799-016-0173-7               None
1  2-s2.0-84971324241           10.1002/aic.15294               None
2  2-s2.0-84930349644  10.1016/j.susc.2015.05.007  S0039602815001326
3  2-s2.0-84930616647    10.1021/acscatal.5b00538               None
4  2-s2.0-67449106405  10.1103/PhysRevB.79.205412               None

  pubmed_id                                              title subtype  \
0      None    Automating data sharing through authoring tools      ar
1      None  High-throughput methods using composition and ...      ar
2      None                    Data sharing in Surface Science      ar
3      None  Examples of effective data sharing in scientif...      re
4      None  Correlations in coverage-dependent atomic adso...      ar

      creator                        afid  \
0  Kitchin J.  60027950;60027950;60027950
1  Kitchin J.                    60027950
2  Kitchin J.                    60027950
3  Kitchin J.                    60027950
4  Kitchin J.                    60027950

                                           affilname  \
0  Carnegie Mellon University;Carnegie Mellon Uni...
1                         Carnegie Mellon University
2                         Carnegie Mellon University
3                         Carnegie Mellon University
4                         Carnegie Mellon University

                   affiliation_city  \
0  Pittsburgh;Pittsburgh;Pittsburgh
1                        Pittsburgh
2                        Pittsburgh
3                        Pittsburgh
4                        Pittsburgh

                         affiliation_country author_count  \
0  United States;United States;United States            4
1                              United States            2
2                              United States            1
3                              United States            1
4                              United States            1

                                        author_names  \
0  Kitchin, John R.;Van Gulick, Ana E.;Zilinski, ...
1                Kitchin, John R.;Gellman, Andrew J.
2                                   Kitchin, John R.
3                                   Kitchin, John R.
4                                   Kitchin, John R.

                           author_ids                author_afids   coverDate  \
0  7004212771;50761335600;55755405700  60027950;60027950;60027950  2017-06-01
1              7004212771;35514271900           60027950;60027950  2016-11-01
2                          7004212771                    60027950  2016-05-01
3                          7004212771                    60027950  2015-06-05
4                          7004212771                    60027950  2009-05-01

  coverDisplayDate                                    publicationName  \
0      1 June 2017         International Journal on Digital Libraries
1  1 November 2016                                      AIChE Journal
2       1 May 2016                                    Surface Science
3      5 June 2015                                      ACS Catalysis
4       1 May 2009  Physical Review B - Condensed Matter and Mater...

       issn    source_id     eIssn aggregationType volume issueIdentifier  \
0  14325012       145200  14321300         Journal     18               2
1  00011541        16275  15475905         Journal     62              11
2  00396028        12284      None         Journal    647            None
3  21555435  19700188320      None         Journal      5               6
4  10980121  11000153773  1550235X         Journal     79              20

  article_number  pageRange  \
0           None      93-98
1           None  3826-3835
2           None    103-107
3           None  3894-3899
4         205412       None

                                         description  \
0  © 2016, Springer-Verlag Berlin Heidelberg. In ...
1                                               None
2  © 2015 Elsevier B.V. All rights reserved. Surf...
3  © 2015 American Chemical Society. We present a...
4  The adsorption energy of an adsorbate can depe...

                                      authkeywords citedby_count openaccess  \
0  Authoring | Data sharing | Embedding | Org-mode             1          0
1                                             None             3          0
2                                     Data sharing             2          1
3                                             None             8          1
4                                             None            50          0

  fund_acr       fund_no                 fund_sponsor
0     None     undefined                         None
1      NSF  DE-SC0004031  National Science Foundation
2      CMU  DE-SC0004031   Carnegie Mellon University
3     None     undefined                         None
4     None     undefined                         None

Keep in mind that the search results include no more than 100 authors per document.

The EIDs of documents can be used for the AbstractRetrieval() class and the Scopus Author IDs in column “author_ids” for the AuthorRetrieval() class.
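
Before looking up authors, you will typically want each Author ID only once. The following sketch collects them from the results; Doc and sample are stand-ins for entries of s.results so that the snippet runs without API access, and unique_author_ids is a hypothetical helper.

```python
from collections import namedtuple

# Stand-in for entries of s.results, reduced to two of the fields listed above
Doc = namedtuple("Doc", "eid author_ids")
sample = [
    Doc("2-s2.0-85019169906", "7004212771;50761335600;55755405700"),
    Doc("2-s2.0-84971324241", "7004212771;35514271900"),
    Doc("2-s2.0-84930349644", "7004212771"),
]


def unique_author_ids(results):
    """Collect author IDs across results, keeping first-seen order."""
    seen = {}
    for doc in results:
        if doc.author_ids:  # the field may be None
            for auth_id in doc.author_ids.split(";"):
                seen.setdefault(auth_id, None)
    return list(seen)


print(unique_author_ids(sample))
```

With real data you would pass s.results instead of sample.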

Downloaded results are cached to speed up subsequent analysis. This information may become outdated. To refresh the cached results if they exist, set refresh=True, or provide an integer that will be interpreted as the maximum allowed number of days since the last modification date. For example, if you want to refresh all cached results older than 100 days, set refresh=100. Use s.get_cache_file_mdate() to get the date of the last modification, and s.get_cache_file_age() to get the number of days since the last modification.
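
The rule for refresh can be written down in a few lines. This is only a sketch of the behaviour described here, not the library's internal code; needs_refresh is a hypothetical name.

```python
def needs_refresh(refresh, age_days):
    """Decide whether a cached file of the given age should be refreshed."""
    # bool is a subclass of int in Python, so test for it first
    if isinstance(refresh, bool):
        return refresh
    # An int is the maximum allowed age in days
    return age_days > refresh


print(needs_refresh(100, 101), needs_refresh(100, 100), needs_refresh(False, 1000))
```

Here age_days plays the role of the value returned by s.get_cache_file_age().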

Sometimes a field is missing in the returned results although it exists in the Scopus database. For example, the EID may be missing, even though every document has an EID. This is not a bug in pybliometrics; it stems from a problem in the download process from the Scopus database. To check for completeness of specific fields, use the parameter integrity_fields, which accepts any iterable. With the parameter integrity_action you choose what happens if the integrity check fails: set integrity_action="warn" to issue a UserWarning, or set integrity_action="raise" to raise an AttributeError.

>>> s = ScopusSearch('FIRSTAUTH ( kitchin  j.r. )',
...                  integrity_fields=["eid"], integrity_action="warn")
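
Conceptually, such a check can be pictured as follows. This is a minimal sketch that assumes a missing element shows up as None; check_fields, Doc and docs are hypothetical stand-ins and do not reproduce the library's implementation.

```python
import warnings
from collections import namedtuple


def check_fields(results, fields, action="raise"):
    """Raise or warn if any result has no value in one of the given fields."""
    for field in fields:
        # Assumption: a missing element shows up as None
        if any(getattr(doc, field) is None for doc in results):
            msg = f"Incomplete information in field '{field}'"
            if action == "raise":
                raise AttributeError(msg)
            warnings.warn(msg, UserWarning)


Doc = namedtuple("Doc", "eid source_id")  # stand-in for entries of s.results
docs = [Doc("2-s2.0-85019169906", "145200"), Doc(None, "16275")]

check_fields(docs, ["source_id"])           # passes silently
check_fields(docs, ["eid"], action="warn")  # issues a UserWarning
```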

If you care about integrity of specific fields, you can attempt to refresh the downloaded file:

from pybliometrics.scopus import ScopusSearch

def robust_query(q, refresh=False, fields=("eid",)):
    """Wrapper function for individual ScopusSearch query."""
    try:
        return ScopusSearch(q, refresh=refresh, integrity_fields=fields).results
    except AttributeError:
        # Integrity check failed: retry once with a forced refresh of the cache
        return ScopusSearch(q, refresh=True, integrity_fields=fields).results

The Scopus Search API offers different depths of information via views. The COMPLETE view is the highest unrestricted view and contains all information that the STANDARD view includes. It is therefore the default view for subscribers. However, when speed is an issue, choose the STANDARD view.
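
The way the view is chosen when view=None, as documented for the view parameter, can be sketched like this. The function resolve_view is hypothetical and only mirrors the documented rule.

```python
def resolve_view(view=None, subscriber=True):
    """Return the view to use, following the documented default rule."""
    allowed = ("STANDARD", "COMPLETE")
    if view is None:
        # Subscribers default to COMPLETE, non-subscribers to STANDARD
        return "COMPLETE" if subscriber else "STANDARD"
    if view not in allowed:
        raise ValueError(f"view must be one of {allowed}")
    return view


print(resolve_view(), resolve_view(subscriber=False), resolve_view("STANDARD"))
```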

For convenience, method s.get_eids() returns the list of EIDs:

>>> s.get_eids()
['2-s2.0-85019169906', '2-s2.0-84971324241', '2-s2.0-84930349644',
'2-s2.0-84930616647', '2-s2.0-67449106405', '2-s2.0-40949100780',
'2-s2.0-37349101648', '2-s2.0-20544467859', '2-s2.0-13444307808',
'2-s2.0-2942640180', '2-s2.0-0141924604', '2-s2.0-0037368024']