pybliometrics.scopus.ScopusSearch
ScopusSearch() implements the Scopus Search API. It executes a query to search for documents and retrieves the resulting records. Any query that works in the Advanced Document Search on scopus.com will work (with two exceptions, see below), but with ScopusSearch() you achieve this programmatically, faster and without the download size cap.
Documentation
- class pybliometrics.scopus.ScopusSearch(query, refresh=False, view=None, verbose=False, download=True, integrity_fields=None, integrity_action='raise', subscriber=True, **kwds)
Interaction with the Scopus Search API.
- Parameters:
query (str) – A string of the query as used in the Advanced Search on scopus.com. All fields except “INDEXTERMS()” and “LIMIT-TO()” work.
refresh (Union[bool, int], optional) – Whether to refresh the cached file if it exists. If an int is passed, the cached file is refreshed if the number of days since its last modification exceeds that value. Default: False
view (str, optional) – Which view to use for the query, see https://dev.elsevier.com/sc_search_views.html. Allowed values: STANDARD, COMPLETE. If None, defaults to COMPLETE if subscriber=True and to STANDARD if subscriber=False. Default: None
verbose (bool, optional) – Whether to print a download progress bar. Default: False
download (bool, optional) – Whether to download results (if they have not been cached). Default: True
integrity_fields (Union[List[str], Tuple[str, ...]], optional) – Names of fields whose completeness should be checked. ScopusSearch will perform the action specified in integrity_action if elements in these fields are missing. This helps to avoid idiosyncratically missing elements that should always be present (e.g., EID or source ID). Default: None
integrity_action (str, optional) – What to do if the integrity of the provided fields cannot be verified. Possible actions: “raise” (raise an AttributeError) or “warn” (issue a UserWarning). Default: 'raise'
subscriber (bool, optional) – Whether you access Scopus with a subscription or not. For subscribers, Scopus’s cursor navigation will be used. Sets the number of entries in each query iteration to the maximum number allowed by the corresponding view. Default: True
kwds (str) – Keywords passed on as query parameters. Must contain fields and values mentioned in the API specification at https://dev.elsevier.com/documentation/ScopusSearchAPI.wadl.
- Raises:
ScopusQueryError – For non-subscribers, if the number of search results exceeds 5000.
ValueError – If any of the parameters integrity_action, refresh or view is not one of the allowed values.
Notes
The directory for cached results is {path}/{view}/{fname}, where path is specified in your configuration file and fname is the md5-hashed version of query.
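As a rough illustration of the naming scheme in this note, the following sketch hashes a query with MD5 and assembles a path of the form {path}/{view}/{fname}. The function name `sketch_cache_path` and the base directory `scopus_search` are placeholders, and the UTF-8 encoding is an assumption; pybliometrics may derive the file name differently in detail.

```python
import hashlib
from pathlib import Path

def sketch_cache_path(cache_root, view, query):
    """Build {path}/{view}/{fname}, with fname the MD5 hex digest of the query."""
    # Assumption: the query string is encoded as UTF-8 before hashing.
    fname = hashlib.md5(query.encode("utf8")).hexdigest()
    return Path(cache_root) / view / fname

# "scopus_search" stands in for the path set in your configuration file.
p = sketch_cache_path("scopus_search", "COMPLETE", "REF(2-s2.0-85068268027)")
print(p.parent.name, len(p.name))  # COMPLETE 32
```

Because the file name depends only on the query string, two queries that differ in whitespace or capitalization would map to different cache files under this scheme.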
- property results: List[NamedTuple] | None
A list of namedtuples in the form (eid doi pii pubmed_id title subtype subtypeDescription creator afid affilname affiliation_city affiliation_country author_count author_names author_ids author_afids coverDate coverDisplayDate publicationName issn source_id eIssn aggregationType volume issueIdentifier article_number pageRange description authkeywords citedby_count openaccess freetoread freetoreadLabel fund_acr fund_no fund_sponsor). Field definitions correspond to https://dev.elsevier.com/guides/ScopusSearchViews.htm and return the values as-is, except for afid, affilname, affiliation_city, affiliation_country, author_names, author_ids and author_afids: This information is joined on “;”. If an author has multiple affiliations, they are joined on “-” (e.g. Author1Aff;Author2Aff1-Author2Aff2).
- Raises:
ValueError – If the elements provided in integrity_fields do not match the actual field names (listed above).
Notes
The list of authors and the list of affiliations per author are deduplicated.
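To work with the joined fields, one typically splits them again. The sketch below takes an author_afids string in the format described above and returns one list of affiliation IDs per author. The helper name `split_author_afids` is hypothetical and not part of pybliometrics.

```python
def split_author_afids(author_afids):
    """Return one list of affiliation IDs per author.

    Authors are separated by ";", multiple affiliations of one author by "-".
    """
    if not author_afids:          # the field may be None or empty
        return []
    return [entry.split("-") for entry in author_afids.split(";")]

parsed = split_author_afids("60021199;60021199-60027509")
print(parsed)  # [['60021199'], ['60021199', '60027509']]
```

The same two-step split applies to any field that follows the “;” and “-” convention; fields joined on “;” only need the outer split.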
- get_cache_file_age()
Return the age of the cached file in days.
- Return type:
int
- get_cache_file_mdate()
Return the modification date of the cached file.
- Return type:
str
- get_key_remaining_quota()
Return the number of remaining requests for the current key and the current API (relative to the last actual request).
- Return type:
str | None
- get_key_reset_time()
Return the time when the current key is reset (relative to the last actual request).
- Return type:
str | None
- get_results_size()
Return the number of results (works even if download=False).
- Return type:
int
Examples
The class is initialized with a search query. There are only two exceptions to the keywords allowed in the Advanced Document Search: “LIMIT-TO()”, as this only affects the display of the results on scopus.com, but not the selection of results per se; and “INDEXTERMS()”. An invalid search query raises an error. Setting verbose=True shows the download progress.
>>> from pybliometrics.scopus import ScopusSearch
>>> q = "REF(2-s2.0-85068268027)"
>>> s = ScopusSearch(q, verbose=True)
Downloading results for query "REF(2-s2.0-85068268027)":
100%|████████████████████████████████████████| 4/4 [00:03<00:00,  1.07s/it]
You can obtain a search summary just by printing the object:
>>> print(s)
Search 'REF(2-s2.0-85068268027)' yielded 88 documents as of 2023-12-03:
    2-s2.0-85174697039
    2-s2.0-85169560066
    2-s2.0-85163820321
    2-s2.0-85153853127
    2-s2.0-85174908092
    2-s2.0-85163278918
    2-s2.0-85169708007
    2-s2.0-85165723386
    2-s2.0-85162924068
    2-s2.0-85150443997
    2-s2.0-85131324015
    2-s2.0-85164415056
    # output truncated
Non-subscribers must instantiate the class with subscriber=False. They may only get 5,000 results per query, whereas this limit does not exist for subscribers.
Users can determine the number of results programmatically using the .get_results_size() method:
>>> s.get_results_size()
88
This method works even if you choose not to download results. It thus lets you decide programmatically whether to proceed with the download:
>>> other = ScopusSearch('AUTHLASTNAME(Brown)', download=False)
>>> other.get_results_size()
311543
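The decision described here can be wrapped in a small helper. The following is a minimal sketch that relies only on the download parameter and the get_results_size() method documented above; the function name `search_if_small`, the `search_cls` argument and the threshold `max_results` are illustrative assumptions.

```python
def search_if_small(query, search_cls, max_results=5000):
    """Run the query in full only if it yields at most max_results documents.

    search_cls is expected to behave like ScopusSearch: it accepts
    download=False and offers get_results_size().
    """
    probe = search_cls(query, download=False)   # obtains the result count only
    if probe.get_results_size() > max_results:
        return None                             # too large: refine the query first
    return search_cls(query)                    # download, or read from cache
```

With pybliometrics installed and configured, a call such as `search_if_small('AUTHLASTNAME(Brown)', ScopusSearch)` would return None for a query as large as the one above.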
The main attribute of the class, results, returns a list of namedtuples. They can be efficiently converted into DataFrames using pandas:
>>> import pandas as pd
>>> df = pd.DataFrame(s.results)
>>> df.columns
Index(['eid', 'doi', 'pii', 'pubmed_id', 'title', 'subtype',
       'subtypeDescription', 'creator', 'afid', 'affilname',
       'affiliation_city', 'affiliation_country', 'author_count',
       'author_names', 'author_ids', 'author_afids', 'coverDate',
       'coverDisplayDate', 'publicationName', 'issn', 'source_id', 'eIssn',
       'aggregationType', 'volume', 'issueIdentifier', 'article_number',
       'pageRange', 'description', 'authkeywords', 'citedby_count',
       'openaccess', 'freetoread', 'freetoreadLabel', 'fund_acr', 'fund_no',
       'fund_sponsor'],
      dtype='object')
>>> df.shape
(88, 36)
>>> pd.set_option('display.max_columns', None)  # just for display
>>> df.head()
                  eid                           doi                pii  \
0  2-s2.0-85174697039   10.1016/j.softx.2023.101565  S2352711023002613
1  2-s2.0-85169560066  10.1016/j.respol.2023.104874  S0048733323001580
2  2-s2.0-85163820321    10.1186/s40537-023-00793-6               None
3  2-s2.0-85153853127           10.1162/qss_a_00236               None
4  2-s2.0-85174908092          10.3390/jmse11101855               None

  pubmed_id                                              title subtype  \
0      None  PyblioNet – Software for the creation, visuali...      ar
1      None  The role of gender and coauthors in academic p...      ar
2      None  Bibliometric mining of research directions and...      ar
3      None  How reliable are unsupervised author disambigu...      ar
4      None  Machine Learning Solutions for Offshore Wind F...      re

  subtypeDescription      creator               afid  \
0            Article    Müller M.           60018373
1            Article  Schmal W.B.  60025310;60006341
2            Article  Lundberg L.           60016636
3            Article    Abramo G.  60027509;60021199
4             Review   Masoumi M.           60017789

                                           affilname     affiliation_city  \
0                              Universität Hohenheim            Stuttgart
1  Heinrich-Heine-Universität Düsseldorf;Universi...  Dusseldorf;Mannheim
2                         Blekinge Tekniska Högskola           Karlskrona
3  Università degli Studi di Roma "Tor Vergata";C...            Rome;Rome
4                                  Manhattan College             New York

  affiliation_country  author_count  \
0             Germany             1
1     Germany;Germany             3
2              Sweden             1
3         Italy;Italy             2
4       United States             1

                                     author_names  \
0                                Müller, Matthias
1  Schmal, W. Benedikt;Haucap, Justus;Knoke, Leon
2                                  Lundberg, Lars
3       Abramo, Giovanni;D’angelo, Ciriaco Andrea
4                                 Masoumi, Masoud

                           author_ids                author_afids   coverDate  \
0                         58302698300                    60018373  2023-12-01
1  57350833800;6602422284;57377238100  60025310;60025310;60006341  2023-12-01
2                          7103325657                    60016636  2023-12-01
3             22833445200;57219528028  60021199;60021199-60027509  2023-12-01
4                         56362456200                    60017789  2023-10-01

  coverDisplayDate                            publicationName      issn  \
0    December 2023                                  SoftwareX      None
1    December 2023                            Research Policy  00487333
2    December 2023                        Journal of Big Data      None
3      Winter 2023               Quantitative Science Studies      None
4     October 2023  Journal of Marine Science and Engineering      None

     source_id     eIssn aggregationType volume issueIdentifier  \
0  21100422153  23527110         Journal     24            None
1        22900      None         Journal     52              10
2  21100791292  21961115         Journal     10               1
3  21101062805  26413337         Journal      4               1
4  21100830140  20771312         Journal     11              10

  article_number pageRange                                        description  \
0         101565      None  PyblioNet is a software tool for the creation,...
1         104874      None  This paper contributes to the literature on di...
2            112      None  In this paper a program and methodology for bi...
3           None   144-166  Assessing the performance of universities by o...
4           1855      None  The continuous advancement within the offshore...

                                        authkeywords  citedby_count  \
0  Bibliometrics | Network | Python | Science map...              0
1  Academic publishing | DEAL | Elsevier | Gender...              0
2  Bibliometrics | Fields of science and technolo...              0
3  author name disambiguation | evaluative scient...              1
4  offshore energy | offshore wind | wind farm | ...              0

   openaccess         freetoread freetoreadLabel fund_acr             fund_no  \
0           1               None            None     None           undefined
1           0       repositoryam           Green      MSI  235577387/GRK 1974
2           1       repositoryam           Green      BTH           undefined
3           1       repositoryam           Green     None           undefined
4           1  publisherfullgold            Gold     None           undefined

                                      fund_sponsor
0                                             None
1  Ministry of Science and Innovation, New Zealand
2                       Blekinge Tekniska Högskola
3                              Universiteit Leiden
4                                             None
Note that the search results include at most 100 authors per document.
The EIDs of documents can be used for the AbstractRetrieval() class and the Scopus Author IDs in column “author_ids” for the AuthorRetrieval() class.
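Before passing author IDs on, the “;”-joined author_ids values (see the field list under results) need to be split and deduplicated across documents. The sketch below does this on stand-in namedtuples reduced to two fields; the `Doc` type, the helper `unique_author_ids` and the last entry with a missing author list are illustrative, not output of pybliometrics.

```python
from collections import namedtuple

# Stand-in for the namedtuples in s.results, reduced to two fields.
Doc = namedtuple("Doc", "eid author_ids")
results = [
    Doc("2-s2.0-85169560066", "57350833800;6602422284;57377238100"),
    Doc("2-s2.0-85153853127", "22833445200;57219528028"),
    Doc("hypothetical-eid", None),  # illustrative: a field may be None
]

def unique_author_ids(results):
    """Collect author IDs across documents, keeping first-seen order."""
    seen = {}                       # dicts preserve insertion order
    for doc in results or []:       # results may be None
        for auth_id in (doc.author_ids or "").split(";"):
            if auth_id:
                seen.setdefault(auth_id, None)
    return list(seen)

ids = unique_author_ids(results)
print(len(ids))  # 5
```

Each element of `ids` could then be passed to AuthorRetrieval(), one request per author.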
Downloaded results are cached to expedite subsequent analyses. This information may become outdated. To refresh the cached results if they exist, set refresh=True, or provide an integer that will be interpreted as the maximum allowed number of days since the last modification date. For example, if you want to refresh all cached results older than 100 days, set refresh=100. Use s.get_cache_file_mdate() to obtain the date of last modification, and s.get_cache_file_age() to determine the number of days since the last modification.
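The semantics of the refresh parameter can be summarized in a few lines. This is only a sketch of the rule as documented here, not the library's actual implementation; the function name `needs_refresh` is hypothetical.

```python
def needs_refresh(refresh, cache_age_days):
    """Decide whether a cached file should be downloaded again.

    refresh=True/False forces or suppresses the refresh; an int is read as
    the maximum allowed age of the cached file in days.
    """
    # Check bool first: in Python, bool is a subclass of int.
    if isinstance(refresh, bool):
        return refresh
    if isinstance(refresh, int):
        return cache_age_days > refresh
    raise ValueError("refresh must be a bool or an int")

print(needs_refresh(100, 120))    # True
print(needs_refresh(100, 30))     # False
print(needs_refresh(False, 500))  # False
```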
Occasionally, some fields may be missing in the returned results, even though they exist in the Scopus database. For example, the EID may be missing, even though every document has an EID. This is not a bug in pybliometrics; it stems from a problem in the download process from the Scopus database. For completeness checks of specific fields, use the integrity_fields parameter, which accepts any iterable. With the integrity_action parameter you can choose between two actions if the integrity check fails: set integrity_action="warn" to issue a UserWarning, or set integrity_action="raise" to raise an AttributeError.
>>> s = ScopusSearch(q, integrity_fields=["eid"], integrity_action="warn")
If you care about integrity of specific fields, you can attempt to refresh the downloaded file:
from pybliometrics.scopus import ScopusSearch

def robust_query(q, refresh=False, fields=("eid",)):
    """Wrapper function for individual ScopusSearch query."""
    try:
        return ScopusSearch(q, refresh=refresh, integrity_fields=fields).results
    except AttributeError:
        # Integrity check failed: download again and check once more
        return ScopusSearch(q, refresh=True, integrity_fields=fields).results
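To make the integrity check more concrete, the sketch below shows one way such a check could look on a list of namedtuples. It mirrors the documented behaviour (AttributeError for "raise", UserWarning for "warn", ValueError for unknown field names) but is an assumption about the mechanics, not the code pybliometrics runs. The `Doc` type and the entry with a missing EID are made up for illustration.

```python
import warnings
from collections import namedtuple

def check_integrity(results, fields, action="raise"):
    """Raise or warn if any of the given fields is None in some result."""
    if action not in ("raise", "warn"):
        raise ValueError("action must be 'raise' or 'warn'")
    for field in fields:
        if results and field not in results[0]._fields:
            raise ValueError(f"unknown field '{field}'")
        n_missing = sum(getattr(res, field) is None for res in results)
        if n_missing == 0:
            continue
        msg = f"{n_missing} of {len(results)} entries lack field '{field}'"
        if action == "raise":
            raise AttributeError(msg)
        warnings.warn(msg, UserWarning)

# Illustrative data: the second entry lacks its EID.
Doc = namedtuple("Doc", "eid source_id")
docs = [Doc("2-s2.0-85174697039", "21100422153"), Doc(None, "22900")]

check_integrity(docs, ["source_id"])   # complete field: passes silently
try:
    check_integrity(docs, ["eid"])
except AttributeError as exc:
    print(exc)  # 1 of 2 entries lack field 'eid'
```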
The Scopus Search API offers varying depths of information through views. The view ‘COMPLETE’ is the highest unrestricted view and contains all information also included in the ‘STANDARD’ view. It is therefore the default view for subscribers (non-subscribers default to ‘STANDARD’). However, when speed is an issue, choose the STANDARD view.
For convenience, the s.get_eids() method returns the list of EIDs:
>>> s.get_eids()
['2-s2.0-85174697039', '2-s2.0-85169560066', '2-s2.0-85163820321',
 '2-s2.0-85153853127', '2-s2.0-85174908092', '2-s2.0-85163278918',
 '2-s2.0-85169708007', '2-s2.0-85165723386', '2-s2.0-85162924068',
 #...
 '2-s2.0-85087770000', '2-s2.0-85086243347', '2-s2.0-85084027658']