pybliometrics.scopus.ScopusSearch

ScopusSearch() implements the Scopus Search API. It executes a query to search for documents and retrieves the resulting records. Any query that works in the Advanced Document Search on scopus.com will work (with two exceptions, see below), but with ScopusSearch() you achieve this programmatically, faster and without the download size cap.

Documentation

class pybliometrics.scopus.ScopusSearch(query, refresh=False, view=None, verbose=False, download=True, integrity_fields=None, integrity_action='raise', subscriber=True, **kwds)

Interaction with the Scopus Search API.

Parameters:
  • query (str) – A string of the query as used in the Advanced Search on scopus.com. All fields except “INDEXTERMS()” and “LIMIT-TO()” work.

  • refresh (Union[bool, int], optional) – Whether to refresh the cached file if it exists or not. If int is passed, cached file will be refreshed if the number of days since last modification exceeds that value.

    Default: False

  • view (str, optional) – Which view to use for the query, see https://dev.elsevier.com/sc_search_views.html. Allowed values: STANDARD, COMPLETE. If None, defaults to COMPLETE if subscriber=True and to STANDARD if subscriber=False.

    Default: None

  • verbose (bool, optional) – Whether to print a download progress bar.

    Default: False

  • download (bool, optional) – Whether to download results (if they have not been cached).

    Default: True

  • integrity_fields (Union[List[str], Tuple[str, ...]], optional) – Names of fields whose completeness should be checked. ScopusSearch will perform the action specified in integrity_action if elements in these fields are missing. This helps to avoid idiosyncratically missing elements that should always be present (e.g., EID or source ID).

    Default: None

  • integrity_action (str, optional) – What to do in case integrity of provided fields cannot be verified. Possible actions:

      - “raise”: Raise an AttributeError
      - “warn”: Issue a UserWarning

    Default: 'raise'

  • subscriber (bool, optional) – Whether you access Scopus with a subscription or not. For subscribers, Scopus’s cursor navigation will be used. Sets the number of entries in each query iteration to the maximum number allowed by the corresponding view.

    Default: True

  • kwds (str) – Keywords passed on as query parameters. Must contain fields and values mentioned in the API specification at https://dev.elsevier.com/documentation/ScopusSearchAPI.wadl.

Raises:
  • ScopusQueryError – For non-subscribers, if the number of search results exceeds 5000.

  • ValueError – If any of the parameters integrity_action, refresh or view is not one of the allowed values.

Notes

The directory for cached results is {path}/{view}/{fname}, where path is specified in your configuration file and fname is the md5-hashed version of query.
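
As a rough illustration of this layout, the following sketch builds such a path by hand. The helper name cache_file and the UTF-8 encoding of the query are assumptions made for this example; pybliometrics may derive fname differently in detail (for example, when additional keywords are passed).

```python
import hashlib
from pathlib import Path


def cache_file(path, view, query):
    """Build {path}/{view}/{fname} with fname = md5 hex digest of the query.

    Hypothetical helper for illustration; not part of pybliometrics.
    """
    # Assumption: the query string is encoded as UTF-8 before hashing
    fname = hashlib.md5(query.encode("utf8")).hexdigest()
    return Path(path) / view / fname


# The base directory below is a placeholder, not the configured default
target = cache_file("/tmp/scopus_search", "COMPLETE", "REF(2-s2.0-85068268027)")
print(target.parent.name)  # the view directory
print(len(target.name))    # length of the md5 hex digest
```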

property results: List[NamedTuple] | None

A list of namedtuples in the form (eid doi pii pubmed_id title subtype subtypeDescription creator afid affilname affiliation_city affiliation_country author_count author_names author_ids author_afids coverDate coverDisplayDate publicationName issn source_id eIssn aggregationType volume issueIdentifier article_number pageRange description authkeywords citedby_count openaccess freetoread freetoreadLabel fund_acr fund_no fund_sponsor). Field definitions correspond to https://dev.elsevier.com/guides/ScopusSearchViews.htm, and values are returned as-is, except for afid, affilname, affiliation_city, affiliation_country, author_names, author_ids and author_afids: the entries of these fields are joined on “;”. If an author has multiple affiliations, they are joined on “-” (e.g. Author1Aff;Author2Aff1-Author2Aff2).

Raises:

ValueError – If the elements provided in integrity_fields do not match the actual field names (listed above).

Notes

The list of authors and the list of affiliations per author are deduplicated.
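
To work with the joined fields, they have to be split again. The following is a minimal sketch for author_afids only; the helper name split_author_afids is hypothetical. Splitting on “-” is safe for ID fields, but probably not for name fields such as affilname, where a hyphen can be part of the name itself.

```python
def split_author_afids(author_afids):
    """Split 'a;b-c' into [['a'], ['b', 'c']], one inner list per author.

    Hypothetical helper for illustration; not part of pybliometrics.
    """
    if not author_afids:  # the field may be None or empty
        return []
    # ";" separates authors, "-" separates one author's affiliations
    return [entry.split("-") for entry in author_afids.split(";")]


print(split_author_afids("60021199;60021199-60027509"))
# [['60021199'], ['60021199', '60027509']]
```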

get_eids()

EIDs of retrieved documents.

get_cache_file_age()

Return the age of the cached file in days.

Return type:

int

get_cache_file_mdate()

Return the modification date of the cached file.

Return type:

str

get_key_remaining_quota()

Return the number of remaining requests for the current key and the current API (relative to the last actual request).

Return type:

str | None
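
Because the quota comes back as a string (or None), it needs converting before it can be compared with a number. A minimal sketch, with the hypothetical helper name quota_as_int; that None means "no request has been made yet" is an assumption, not something stated in this documentation.

```python
def quota_as_int(raw):
    """Convert the value of get_key_remaining_quota() to an int.

    Hypothetical helper for illustration; not part of pybliometrics.
    """
    if raw is None:  # assumption: no request has been made yet
        return None
    return int(raw)


# Usage with a search object s (requires a configured API key):
# remaining = quota_as_int(s.get_key_remaining_quota())
print(quota_as_int("19500"))
```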

get_key_reset_time()

Return the time when the current key is reset (relative to the last actual request).

Return type:

str | None

get_results_size()

Return the number of results (works even if download=False).

Return type:

int

Examples

The class is initialized with a search query. There are only two exceptions to the keywords allowed in the Advanced Document Search: “LIMIT-TO()”, as this only affects the display of the results on scopus.com, but not the selection of results per se; and “INDEXTERMS()”. An invalid search query raises an error. Setting verbose=True prints the download progress.

>>> from pybliometrics.scopus import ScopusSearch
>>> q = "REF(2-s2.0-85068268027)"
>>> s = ScopusSearch(q, verbose=True)
Downloading results for query "REF(2-s2.0-85068268027)":
100%|████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.07s/it]

You can obtain a search summary just by printing the object:

>>> print(s)
Search 'REF(2-s2.0-85068268027)' yielded 88 documents as of 2023-12-03:
    2-s2.0-85174697039
    2-s2.0-85169560066
    2-s2.0-85163820321
    2-s2.0-85153853127
    2-s2.0-85174908092
    2-s2.0-85163278918
    2-s2.0-85169708007
    2-s2.0-85165723386
    2-s2.0-85162924068
    2-s2.0-85150443997
    2-s2.0-85131324015
    2-s2.0-85164415056
# output truncated

Non-subscribers must instantiate the class with subscriber=False. They may only get 5,000 results per query, whereas this limit does not exist for subscribers.

Users can determine the number of results programmatically using the .get_results_size() method:

>>> s.get_results_size()
88

This method works even if you choose not to download results. It thus lets subscribers decide programmatically whether to proceed with the download:

>>> other = ScopusSearch('AUTHLASTNAME(Brown)', download=False)
>>> other.get_results_size()
311543
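
This check can be wrapped in a small helper. The sketch below is one possible way to do so; the name fetch_if_small and the max_results threshold are made up for this example. The search class is passed in as an argument so that the logic can be tried out without an API key.

```python
def fetch_if_small(query, max_results, search_cls, **kwds):
    """Return results only if the query yields at most max_results documents.

    Hypothetical helper for illustration; not part of pybliometrics.
    search_cls is expected to behave like ScopusSearch: it accepts
    download=False and offers get_results_size() and results.
    """
    # First request: obtain the result count without downloading
    probe = search_cls(query, download=False, **kwds)
    n = probe.get_results_size()
    if n > max_results:
        print(f"Skipping download: {n} results exceed {max_results}")
        return None
    # Second request: download (or read from cache)
    return search_cls(query, download=True, **kwds).results


# Usage with the real class (requires a configured API key):
# from pybliometrics.scopus import ScopusSearch
# docs = fetch_if_small("AUTHLASTNAME(Brown)", 10_000, ScopusSearch)
```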

The main attribute of the class, results, returns a list of namedtuples. They can be efficiently converted into DataFrames using pandas:

>>> import pandas as pd
>>> df = pd.DataFrame(s.results)
>>> df.columns
Index(['eid', 'doi', 'pii', 'pubmed_id', 'title', 'subtype', 'subtypeDescription',
       'creator', 'afid', 'affilname', 'affiliation_city', 'affiliation_country',
       'author_count', 'author_names', 'author_ids', 'author_afids', 'coverDate',
       'coverDisplayDate', 'publicationName', 'issn', 'source_id', 'eIssn',
       'aggregationType', 'volume', 'issueIdentifier', 'article_number',
       'pageRange', 'description', 'authkeywords', 'citedby_count',
       'openaccess', 'freetoread', 'freetoreadLabel', 'fund_acr', 'fund_no', 'fund_sponsor'],
      dtype='object')
>>> df.shape
(88, 36)
>>> pd.set_option('display.max_columns', None)  # just for display
>>> df.head()
                      eid                           doi                pii  \
0  2-s2.0-85174697039   10.1016/j.softx.2023.101565  S2352711023002613
1  2-s2.0-85169560066  10.1016/j.respol.2023.104874  S0048733323001580
2  2-s2.0-85163820321    10.1186/s40537-023-00793-6               None
3  2-s2.0-85153853127           10.1162/qss_a_00236               None
4  2-s2.0-85174908092          10.3390/jmse11101855               None

  pubmed_id                                              title subtype  \
0      None  PyblioNet – Software for the creation, visuali...      ar
1      None  The role of gender and coauthors in academic p...      ar
2      None  Bibliometric mining of research directions and...      ar
3      None  How reliable are unsupervised author disambigu...      ar
4      None  Machine Learning Solutions for Offshore Wind F...      re

  subtypeDescription      creator               afid  \
0            Article    Müller M.           60018373
1            Article  Schmal W.B.  60025310;60006341
2            Article  Lundberg L.           60016636
3            Article    Abramo G.  60027509;60021199
4             Review   Masoumi M.           60017789

                                           affilname     affiliation_city  \
0                              Universität Hohenheim            Stuttgart
1  Heinrich-Heine-Universität Düsseldorf;Universi...  Dusseldorf;Mannheim
2                         Blekinge Tekniska Högskola           Karlskrona
3  Università degli Studi di Roma "Tor Vergata";C...            Rome;Rome
4                                  Manhattan College             New York

  affiliation_country author_count  \
0             Germany            1
1     Germany;Germany            3
2              Sweden            1
3         Italy;Italy            2
4       United States            1

                                     author_names  \
0                                Müller, Matthias
1  Schmal, W. Benedikt;Haucap, Justus;Knoke, Leon
2                                  Lundberg, Lars
3       Abramo, Giovanni;D’angelo, Ciriaco Andrea
4                                 Masoumi, Masoud

                           author_ids                author_afids   coverDate  \
0                         58302698300                    60018373  2023-12-01
1  57350833800;6602422284;57377238100  60025310;60025310;60006341  2023-12-01
2                          7103325657                    60016636  2023-12-01
3             22833445200;57219528028  60021199;60021199-60027509  2023-12-01
4                         56362456200                    60017789  2023-10-01

  coverDisplayDate                            publicationName      issn  \
0    December 2023                                  SoftwareX      None
1    December 2023                            Research Policy  00487333
2    December 2023                        Journal of Big Data      None
3      Winter 2023               Quantitative Science Studies      None
4     October 2023  Journal of Marine Science and Engineering      None

     source_id     eIssn aggregationType volume issueIdentifier  \
0  21100422153  23527110         Journal     24            None
1        22900      None         Journal     52              10
2  21100791292  21961115         Journal     10               1
3  21101062805  26413337         Journal      4               1
4  21100830140  20771312         Journal     11              10

  article_number pageRange                                        description  \
0         101565      None  PyblioNet is a software tool for the creation,...
1         104874      None  This paper contributes to the literature on di...
2            112      None  In this paper a program and methodology for bi...
3           None   144-166  Assessing the performance of universities by o...
4           1855      None  The continuous advancement within the offshore...

                                        authkeywords  citedby_count  \
0  Bibliometrics | Network | Python | Science map...              0
1  Academic publishing | DEAL | Elsevier | Gender...              0
2  Bibliometrics | Fields of science and technolo...              0
3  author name disambiguation | evaluative scient...              1
4  offshore energy | offshore wind | wind farm | ...              0

   openaccess         freetoread freetoreadLabel fund_acr             fund_no  \
0           1               None            None     None           undefined
1           0       repositoryam           Green      MSI  235577387/GRK 1974
2           1       repositoryam           Green      BTH           undefined
3           1       repositoryam           Green     None           undefined
4           1  publisherfullgold            Gold     None           undefined

                                      fund_sponsor
0                                             None
1  Ministry of Science and Innovation, New Zealand
2                       Blekinge Tekniska Högskola
3                              Universiteit Leiden
4                                             None

Note that the search results include no more than 100 authors per document.

The EIDs of documents can be used for the AbstractRetrieval() class and the Scopus Author IDs in column “author_ids” for the AuthorRetrieval() class.
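
Since the author IDs of each document are joined on “;”, they need to be split and deduplicated across documents before being passed on. A minimal sketch, using stand-in records instead of a live query; the helper name unique_author_ids is hypothetical, and the sample values are taken from the table above.

```python
from collections import namedtuple


def unique_author_ids(results):
    """Collect distinct Scopus Author IDs, preserving first-seen order.

    Hypothetical helper for illustration; not part of pybliometrics.
    """
    seen = {}  # dict keys keep insertion order
    for doc in results or []:  # results may be None
        if doc.author_ids:
            for auth_id in doc.author_ids.split(";"):
                seen.setdefault(auth_id, None)
    return list(seen)


# Stand-in records carrying only the fields used here
Doc = namedtuple("Doc", "eid author_ids")
sample = [
    Doc("2-s2.0-85169560066", "57350833800;6602422284;57377238100"),
    Doc("2-s2.0-85153853127", "22833445200;57219528028"),
    Doc("2-s2.0-85174697039", None),
]
print(unique_author_ids(sample))
```

With a real search object, the call would be unique_author_ids(s.results).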

Downloaded results are cached to expedite subsequent analyses. This information may become outdated. To refresh existing cached results, set refresh=True, or provide an integer that will be interpreted as the maximum allowed number of days since the last modification date. For example, if you want to refresh all cached results older than 100 days, set refresh=100. Use s.get_cache_file_mdate() to obtain the date of last modification, and s.get_cache_file_age() to determine the number of days since the last modification.
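
The meaning of the two kinds of values for refresh can be summarized in a few lines. This is only a sketch of the documented behaviour, not the code pybliometrics runs; the function name needs_refresh is hypothetical.

```python
def needs_refresh(age_days, refresh):
    """Decide whether a cached file of the given age would be refreshed.

    Hypothetical helper mirroring the documented semantics of refresh.
    """
    # Check bool first, because bool is a subclass of int in Python
    if isinstance(refresh, bool):
        return refresh
    # An int is the maximum allowed age in days
    return age_days > refresh


print(needs_refresh(120, 100))   # older than 100 days
print(needs_refresh(30, 100))    # recent enough
print(needs_refresh(30, True))   # always refresh
```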

Occasionally, some fields may be missing in the returned results, even though they exist in the Scopus database. For example, the EID may be missing, even though every document has an EID. This is not a bug in pybliometrics; it appears to stem from a problem in the download process from the Scopus database. For completeness checks of specific fields, use the integrity_fields parameter, which accepts any iterable. With the integrity_action parameter you can choose between two actions if the integrity check fails: set integrity_action="warn" to issue a UserWarning, or set integrity_action="raise" to raise an AttributeError.

>>> s = ScopusSearch(q, integrity_fields=["eid"],
...                  integrity_action="warn")
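
To make the two actions concrete, here is a sketch of what such a check could look like when applied to a list of namedtuples. It is an illustration under assumptions, not the implementation in pybliometrics: the name check_integrity is hypothetical, a field counts as missing only if it is None, and field names are not validated.

```python
import warnings
from collections import namedtuple


def check_integrity(results, fields, action="raise"):
    """Complain if any record has no value in one of the given fields.

    Hypothetical helper for illustration; not part of pybliometrics.
    """
    if action not in ("raise", "warn"):
        raise ValueError("action must be 'raise' or 'warn'")
    for field in fields:
        # Assumption: a missing element shows up as None
        missing = sum(1 for doc in results if getattr(doc, field) is None)
        if not missing:
            continue
        msg = f"{missing} of {len(results)} records lack '{field}'"
        if action == "raise":
            raise AttributeError(msg)
        warnings.warn(msg, UserWarning)


# Stand-in records; the second one lacks all values
Doc = namedtuple("Doc", "eid doi")
sample = [Doc("2-s2.0-85174697039", "10.1016/j.softx.2023.101565"),
          Doc(None, None)]
try:
    check_integrity(sample, ["eid"])
except AttributeError as err:
    print(err)
```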

If you care about integrity of specific fields, you can attempt to refresh the downloaded file:

def robust_query(q, refresh=False, fields=("eid",)):
    """Wrapper function for individual ScopusSearch query."""
    try:
        return ScopusSearch(q, refresh=refresh, integrity_fields=fields).results
    except AttributeError:
        # Integrity check failed: download again and check once more
        return ScopusSearch(q, refresh=True, integrity_fields=fields).results

The Scopus Search API offers varying depths of information through views. The view ‘COMPLETE’ is the highest unrestricted view and contains all information also included in the ‘STANDARD’ view. It is therefore the default view. However, when speed is an issue, choose the STANDARD view.
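
How the default view follows from the subscriber parameter can be sketched as below. The function resolve_view is hypothetical and only restates the documented rules (allowed values STANDARD and COMPLETE, ValueError otherwise); it is not taken from the pybliometrics source.

```python
def resolve_view(view=None, subscriber=True):
    """Return the view that would be used for a query.

    Hypothetical helper restating the documented defaults.
    """
    allowed = ("STANDARD", "COMPLETE")
    if view is None:
        # Subscribers default to the richer view
        return "COMPLETE" if subscriber else "STANDARD"
    if view not in allowed:
        raise ValueError(f"view must be one of {allowed}")
    return view


print(resolve_view())                   # subscriber default
print(resolve_view(subscriber=False))   # non-subscriber default
print(resolve_view("STANDARD"))         # explicit choice for speed
```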

For convenience, the s.get_eids() method returns the list of EIDs:

>>> s.get_eids()
['2-s2.0-85174697039', '2-s2.0-85169560066', '2-s2.0-85163820321',
'2-s2.0-85153853127', '2-s2.0-85174908092', '2-s2.0-85163278918',
'2-s2.0-85169708007', '2-s2.0-85165723386', '2-s2.0-85162924068',
#...
'2-s2.0-85087770000', '2-s2.0-85086243347', '2-s2.0-85084027658']