pybliometrics.scopus.AffiliationSearch

AffiliationSearch() implements the Affiliation Search API. It performs a query to search for affiliations and then retrieves the corresponding records.

Documentation

class pybliometrics.scopus.AffiliationSearch(query, refresh=False, verbose=False, download=True, integrity_fields=None, integrity_action='raise', count=200, **kwds)[source]

Interaction with the Affiliation Search API.

Parameters:
  • query (str) – A string of the query. For allowed fields and values see https://dev.elsevier.com/sc_affil_search_tips.html.

  • refresh (Union[bool, int], optional) – Whether to refresh the cached file if it exists or not. If int is passed, cached file will be refreshed if the number of days since last modification exceeds that value.

    Default: False

  • verbose (bool, optional) – Whether to print a download progress bar.

    Default: False

  • download (bool, optional) – Whether to download results (if they have not been cached).

    Default: True

  • integrity_fields (Union[List[str], Tuple[str, ...]], optional) – Names of fields whose completeness should be checked. AffiliationSearch will perform the action specified in integrity_action if elements in these fields are missing. This helps to catch idiosyncratically missing elements that should always be present (e.g. EID or name).

    Default: None

  • integrity_action (str, optional) – What to do if the integrity of the provided fields cannot be verified. Possible actions:

      - “raise”: Raise an AttributeError
      - “warn”: Issue a UserWarning

    Default: 'raise'

  • count (int, optional) – (deprecated) The number of entries to be displayed at once. A smaller number means more queries with each query having fewer results.

    Default: 200

  • kwds (str) – Keywords passed on as query parameters. Must contain fields and values mentioned in the API specification at https://dev.elsevier.com/documentation/AffiliationSearchAPI.wadl.

Raises:
  • ScopusQueryError – If the number of search results exceeds 5000, which is the API’s maximum number of results returned. The error prevents the download attempt and avoids making use of your API key.

  • ValueError – If any of the parameters integrity_action or refresh is not one of the allowed values.

Notes

The directory for cached results is {path}/STANDARD/{fname}, where path is specified in your configuration file and fname is the md5-hashed version of query.
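
As a rough illustration of the naming scheme described above, the following sketch derives a cache file path from a query with hashlib. The UTF-8 encoding, the hexadecimal digest, the helper name cache_file_for, and the base directory are all assumptions made for this example; the exact call inside pybliometrics may differ.

```python
import hashlib
from pathlib import Path


def cache_file_for(query: str, base_dir: str) -> Path:
    """Build {base_dir}/STANDARD/{md5 of query} (illustrative only)."""
    # Assumption: the query is encoded as UTF-8 before hashing.
    fname = hashlib.md5(query.encode("utf8")).hexdigest()
    return Path(base_dir) / "STANDARD" / fname


# "my_cache/affiliation_search" is a made-up directory, not a pybliometrics default.
path = cache_file_for("AFFIL(Max Planck Institute)", "my_cache/affiliation_search")
print(path.name)  # a 32-character hexadecimal string
```

Because the file name depends only on the query string, two searches with identical queries would map to the same cached file under this scheme.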

property affiliations: List[NamedTuple] | None

A list of namedtuples storing affiliation information, where each namedtuple corresponds to one affiliation. The information in each namedtuple is (eid name variant documents city country parent).

All entries are strings or None. Field variant combines variants of names with a “;”.

Raises:

ValueError – If the elements provided in integrity_fields do not match the actual field names (listed above).

get_cache_file_age()

Return the age of the cached file in days.

Return type:

int

get_cache_file_mdate()

Return the modification date of the cached file.

Return type:

str

get_key_remaining_quota()

Return the number of remaining requests for the current key and the current API (relative to the last actual request).

Return type:

str | None

get_key_reset_time()

Return the time when the current key is reset (relative to the last actual request).

Return type:

str | None

get_results_size()

Return the number of results (works even if download=False).

Return type:

int

Examples

The class is initialized using a search query, details of which can be found in the Affiliation Search Guide. An invalid search query results in an error.

>>> from pybliometrics.scopus import AffiliationSearch
>>> query = "AFFIL(Max Planck Institute for Innovation and Competition Munich)"
>>> s = AffiliationSearch(query)

You can obtain a search summary just by printing the object:

>>> print(s)
Search 'AFFIL(Max Planck Institute for Innovation and Competition Munich)' yielded
2 affiliations as of 2021-01-15:
    Max Planck Institute for Innovation and Competition
    Max Planck Institute for Competition and Innovation

The primary function of the class is to provide a list of namedtuples, each storing information about the affiliation. One of them is the affiliation ID which you can use for the AffiliationRetrieval class:

>>> s.affiliations
[Affiliation(eid='10-s2.0-60105007', name='Max Planck Institute for Innovation and Competition',
             variant='Max Planck Institute For Innovation And Competition', documents=380,
             city='Munich', country='Germany', parent='0'),
 Affiliation(eid='10-s2.0-117495104', name='Max Planck Institute for Competition and Innovation',
             variant='Max-plank Institut', documents=3, city='Munich', country='Germany',
             parent='0')]
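
The sketch below shows one way to pull the numeric affiliation ID out of each EID. It assumes that the ID is the part of the EID after the last hyphen; the helper affiliation_id is hypothetical and not part of pybliometrics, and the namedtuples are stand-ins built from the output shown above so that the example runs without an API key.

```python
from collections import namedtuple

# Stand-in for the namedtuples in s.affiliations (values copied from the output above).
Affiliation = namedtuple(
    "Affiliation", "eid name variant documents city country parent"
)
affiliations = [
    Affiliation("10-s2.0-60105007",
                "Max Planck Institute for Innovation and Competition",
                "Max Planck Institute For Innovation And Competition",
                380, "Munich", "Germany", "0"),
    Affiliation("10-s2.0-117495104",
                "Max Planck Institute for Competition and Innovation",
                "Max-plank Institut", 3, "Munich", "Germany", "0"),
]


def affiliation_id(eid: str) -> str:
    """Return the part of the EID after the last hyphen (assumed to be the ID)."""
    return eid.rsplit("-", 1)[-1]


ids = [affiliation_id(a.eid) for a in affiliations]
print(ids)  # ['60105007', '117495104']
```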

Working with namedtuples is straightforward. With pandas, for example, you can quickly convert the results into a DataFrame:

>>> import pandas as pd
>>> pd.set_option('display.max_columns', None)
>>> print(pd.DataFrame(s.affiliations))
                 eid                                               name  \
0   10-s2.0-60105007  Max Planck Institute for Innovation and Compet...
1  10-s2.0-117495104  Max Planck Institute for Competition and Innov...

                                             variant  documents    city  \
0  Max Planck Institute For Innovation And Compet...        380  Munich
1                                 Max-plank Institut          3  Munich

   country parent
0  Germany      0
1  Germany      0

Comparing the EIDs, notice that the first affiliation’s EID starts with 10-s2.0-6, while the other begins with 10-s2.0-1. The latter denotes a non-org affiliation type. See the tips section for more on the different types of affiliations.
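
A minimal sketch of how the prefix rule could be applied in code. The helper affiliation_type is hypothetical, and labelling the 10-s2.0-6 prefix as "org" is an inference from the contrast drawn above rather than something the API states; any other prefix is reported as "unknown".

```python
def affiliation_type(eid: str) -> str:
    """Classify an affiliation EID by its prefix (illustrative rule only)."""
    if eid.startswith("10-s2.0-1"):
        return "non-org"
    if eid.startswith("10-s2.0-6"):
        # Assumption: this prefix marks an organization profile.
        return "org"
    return "unknown"


for eid in ("10-s2.0-60105007", "10-s2.0-117495104"):
    print(eid, affiliation_type(eid))
```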

Downloaded results are cached to expedite subsequent analyses, so the information may become outdated. To refresh cached results if they exist, set refresh=True, or provide an integer that will be interpreted as the maximum allowed number of days since the last modification date. For example, if you want to refresh all cached results older than 100 days, set refresh=100. Use s.get_cache_file_mdate() to obtain the date of the last modification, and s.get_cache_file_age() to determine the number of days since the last modification.
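
The refresh semantics can be summarized in a few lines. This is a sketch of the rule as described in this section, not the library's own code; the function name needs_refresh is made up for the example.

```python
from typing import Union


def needs_refresh(refresh: Union[bool, int], age_days: int) -> bool:
    """Decide whether a cached file of the given age should be downloaded again."""
    # bool is a subclass of int in Python, so it has to be checked first.
    if isinstance(refresh, bool):
        return refresh
    # An integer is a maximum age in days; refresh only if the age exceeds it.
    return age_days > refresh


print(needs_refresh(True, 0))     # True: always refresh
print(needs_refresh(False, 999))  # False: never refresh
print(needs_refresh(100, 101))    # True: older than 100 days
print(needs_refresh(100, 30))     # False: recent enough
```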

You can determine the number of results using the .get_results_size() method, even before you download the results:

>>> query = "AFFIL(Max Planck Institute)"
>>> s = AffiliationSearch(query, download=False)
>>> s.get_results_size()
398

Sometimes, information that exists in the Scopus database is missing from the returned results. For example, the EID may be missing even though every affiliation has one. This is not a bug in pybliometrics; it appears to be related to a problem in the download process from the Scopus database. To check specific fields for completeness, use the integrity_fields parameter, which accepts any iterable. With the integrity_action parameter, you can choose what happens if the integrity check fails: set integrity_action="warn" to issue a UserWarning, or set integrity_action="raise" to raise an AttributeError.

>>> s = AffiliationSearch(query, integrity_fields=["eid"], integrity_action="warn")
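
To make the behaviour concrete, here is a simplified sketch of what such a completeness check might look like. It is an illustration written for this page, not the implementation used by pybliometrics: the function check_integrity, the two-field namedtuple, and the record with a missing EID are all invented for the example.

```python
import warnings
from collections import namedtuple

# Reduced stand-in for the result namedtuples; only two fields are modelled.
Affiliation = namedtuple("Affiliation", "eid name")
records = [
    Affiliation("10-s2.0-60105007",
                "Max Planck Institute for Innovation and Competition"),
    Affiliation(None, "Made-up record with a missing EID"),
]


def check_integrity(records, fields, action="raise"):
    """Raise or warn if any record has None in one of the given fields."""
    if action not in ("raise", "warn"):
        raise ValueError("action must be 'raise' or 'warn'")
    for field in fields:
        if records and field not in records[0]._fields:
            raise ValueError(f"unknown field: {field}")
        n_missing = sum(1 for r in records if getattr(r, field) is None)
        if n_missing == 0:
            continue
        msg = f"{n_missing} record(s) lack a value for '{field}'"
        if action == "raise":
            raise AttributeError(msg)
        warnings.warn(msg, UserWarning)


check_integrity(records, ["name"])  # passes silently: every record has a name
```

Calling check_integrity(records, ["eid"], action="warn") on the same data would issue a UserWarning, while action="raise" would raise an AttributeError.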

Occasionally, the number of search results may exceed Scopus’ limit, which is currently capped at 5,000 results. In this case the only solution is to narrow down the search, i.e., instead of “affil(‘Harvard Medical School’)” you search for “affil(‘Harvard Medical School Boston’)”.
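
One way to automate this narrowing is to try progressively more specific queries and keep the first one that fits under the limit. The sketch below assumes a caller-supplied function that returns the result count for a query; the helper first_query_within_limit and the result counts are made up so that the example runs without an API key.

```python
from typing import Callable, Iterable, Optional

SCOPUS_LIMIT = 5000  # maximum number of results, as stated above


def first_query_within_limit(
    queries: Iterable[str],
    get_size: Callable[[str], int],
    limit: int = SCOPUS_LIMIT,
) -> Optional[str]:
    """Return the first query whose result count does not exceed the limit."""
    for query in queries:
        if get_size(query) <= limit:
            return query
    return None


# Invented result counts, ordered from broad to narrow.
fake_sizes = {
    "AFFIL(Harvard Medical School)": 7200,
    "AFFIL(Harvard Medical School Boston)": 310,
}
chosen = first_query_within_limit(fake_sizes, fake_sizes.get)
print(chosen)  # AFFIL(Harvard Medical School Boston)
```

With real data, get_size could be something like lambda q: AffiliationSearch(q, download=False).get_results_size(), which reports the count without downloading the results.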