arXiv
Function: programmatic access to arXiv, Cornell’s open-access e-print repository of primarily physics, math, and computer science research, including its data, searching, and linking facilities
Access: API calls are made using any web-enabled client (e.g. a web browser) to make an HTTP GET or POST request to an appropriate URL. API users can use the programming language of their choice
Result Format: ATOM
Registration: no registration or API key required
Limitations: none stated, but high volume users should contact arXiv through https://arxiv.org/help/contact
Development Contact: arXiv Google Group: https://groups.google.com/forum/#!forum/arxiv-api
More Information: https://arxiv.org/help/api/index
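As a rough illustration of the HTTP GET and ATOM pattern described in this entry, the Python sketch below builds a query URL and pulls titles out of an Atom feed. The endpoint `http://export.arxiv.org/api/query` and the `search_query`, `start`, and `max_results` parameters reflect the arXiv API documentation as commonly cited, but should be checked against the current manual. The sample feed is invented for illustration and is not real arXiv output.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Assumed endpoint; confirm against the arXiv API documentation.
ARXIV_ENDPOINT = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_arxiv_url(search_query, start=0, max_results=10):
    """Build a GET URL for an arXiv search (no request is sent)."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    }
    # Keep ':' readable in field prefixes such as 'all:electron'.
    return ARXIV_ENDPOINT + "?" + urllib.parse.urlencode(params, safe=":")


def parse_atom_titles(atom_xml):
    """Return the entry titles from an Atom feed, with whitespace collapsed."""
    root = ET.fromstring(atom_xml)
    titles = []
    for entry in root.findall("atom:entry", ATOM_NS):
        raw = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        titles.append(" ".join(raw.split()))
    return titles


# Invented sample feed, used only to exercise the parser offline.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>First   sample
    title</title></entry>
  <entry><title>Second sample title</title></entry>
</feed>"""

print(build_arxiv_url("all:electron", max_results=5))
print(parse_atom_titles(SAMPLE_FEED))
```

To fetch live results, the URL could be passed to `urllib.request.urlopen` and the response body handed to `parse_atom_titles`.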
BioMed Central
Function: retrieves BMC's latest articles, BMC editors' picks, data on article subscription and access, and bibliographic search data
Access: RESTful interface, queries are made as HTTP GET requests
Result Format: JSON and Prism Aggregate (PAM)
Registration: no registration required
Limitations: none stated
Development Contact: info@biomedcentral.com
More information: https://www.biomedcentral.com/getpublished/indexing-archiving-and-acces…
SAO/NASA
Function: provides access to ADS database of bibliographic data on astronomy and physics publications
Access: HTTP GET requests, or via an unofficial Python client
Result Format: varies
Registration: free; API key required
Limitations: rate limits apply, but are not specified
Development Contact: adshelp@cfa.harvard.edu
More information: general information; terms of use
CORE
Function: gives programmatic access to metadata and full-text of millions of OA research papers. Major data sources include the PubMed OA subset, archive.org, and DOAJ.
Access: RESTful interface, queries are made as HTTP GET requests
Result Format: JSON
Registration: personal account is free to use, API key required
Limitations: queries of up to 50,000 records. For queries between 50,000 and 100,000 records, a result set will be created and you will be assigned a token that allows you to scroll through these sets. For queries larger than 100,000, or to inquire about a researcher account, please contact theteam@core.ac.uk
Development Contact: theteam@core.ac.uk
More information: https://core.ac.uk/services/api/
CrossRef REST
Function: allows access to metadata records for over 75 million scholarly works that have CrossRef DOIs, covering around 5000 publishers. Can be used for text- and data-mining, checking against funder mandates, and to obtain metadata in a variety of representations.
Access: RESTful interface
Result Format: JSON
Registration: no registration required
Limitations: no stated limitations
Development Contact: support@crossref.org
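A minimal sketch of querying the REST interface for JSON metadata, in Python. The `https://api.crossref.org/works` route, the `query`, `rows`, and `mailto` parameters, and the `message` / `items` / `DOI` response layout are taken from the public Crossref REST API documentation as best understood here and should be verified before use. The sample response and its DOIs are invented.

```python
import json
import urllib.parse

# Assumed route; confirm against the Crossref REST API documentation.
CROSSREF_WORKS = "https://api.crossref.org/works"


def build_crossref_url(query, rows=5, mailto=None):
    """Build a GET URL for a bibliographic search (no request is sent)."""
    params = {"query": query, "rows": rows}
    if mailto:
        # Optional contact address identifying the caller to Crossref.
        params["mailto"] = mailto
    return CROSSREF_WORKS + "?" + urllib.parse.urlencode(params)


def extract_dois(response_text):
    """Return the DOI of each work in a JSON work-list response."""
    data = json.loads(response_text)
    return [item["DOI"] for item in data["message"]["items"]]


# Invented sample response with placeholder DOIs.
SAMPLE_RESPONSE = json.dumps(
    {
        "status": "ok",
        "message": {
            "items": [
                {"DOI": "10.0000/example.1"},
                {"DOI": "10.0000/example.2"},
            ]
        },
    }
)

print(build_crossref_url("text mining", rows=2))
print(extract_dois(SAMPLE_RESPONSE))
```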
Dataverse Network
Function: multiple APIs available to allow programmatic access to data and metadata in the Dataverse Network, which includes the Scholar’s Portal Ontario University Dataverse Network, Harvard Dataverse Network, MIT Libraries-purchased data, and data deposited in other Dataverse Network repositories
Access: HTTPS. A Dataverse community-written software program can also be used to access the APIs via an RCurl package
Result Format: XML; Byte Stream for Data Access Requests
Registration: metadata access does not require registration. Data set downloads require a user account and agreement to terms of use; users interested in data sets should contact DVN support (dvn_support@help.hmdc.harvard.edu). Access to restricted data sets requires approval by data owners.
Limitations: no limitations on public data set downloads after agreeing to terms of use. No limitations on restricted data set downloads after access is granted by data owners.
Development Contact: dvn_support@help.hmdc.harvard.edu; Questions can also be posted in https://groups.google.com/forum/#!forum/dataverse-community
More information: https://guides.dataverse.org/en/latest/api/index.html
Europe PubMed Central
Function: a RESTful Web Service giving you access to all of the publications and related information in the Europe PubMed Central database.
Access: RESTful interface
Result Format: XML, JSON, or Dublin Core
Registration: no registration required
Limitations: none stated
More information: https://groups.google.com/a/ebi.ac.uk/forum/#!forum/epmc-webservices
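The following Python sketch shows one way to build a search request and choose among the three result formats listed above. The endpoint `https://www.ebi.ac.uk/europepmc/webservices/rest/search`, the `query` and `format` parameters, the format codes `xml`, `json`, and `dc`, and the `resultList` / `result` response layout are assumptions based on the public Europe PMC documentation and should be checked there. The sample response is invented.

```python
import json
import urllib.parse

# Assumed endpoint; confirm against the Europe PMC web service docs.
EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"
# Assumed codes for XML, JSON, and Dublin Core output.
ALLOWED_FORMATS = {"xml", "json", "dc"}


def build_epmc_url(query, result_format="json"):
    """Build a GET URL for a Europe PMC search (no request is sent)."""
    if result_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {result_format!r}")
    params = {"query": query, "format": result_format}
    return EPMC_SEARCH + "?" + urllib.parse.urlencode(params)


def extract_titles(response_text):
    """Return the titles from a JSON search response."""
    data = json.loads(response_text)
    results = data.get("resultList", {}).get("result", [])
    return [record.get("title", "") for record in results]


# Invented sample response for offline testing.
SAMPLE_RESPONSE = json.dumps(
    {
        "hitCount": 2,
        "resultList": {
            "result": [
                {"id": "1", "title": "Sample title A"},
                {"id": "2", "title": "Sample title B"},
            ]
        },
    }
)

print(build_epmc_url("malaria"))
print(extract_titles(SAMPLE_RESPONSE))
```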
HathiTrust Bibliographic
Function: retrieves bibliographic and rights information for items in the HathiTrust Digital Library
Access: RESTful Interface
Result Format: JSON
Registration: no registration required
Limitations: none stated, but not designed for large scale data retrieval
Development Contact: feedback@issues.hathitrust.org
More information: https://www.hathitrust.org/bib_api
HathiTrust Data
Function: retrieves content (page images, OCR, and in some cases whole volume packages), and metadata for HathiTrust Digital Library volumes
Access: RESTful Interface
Result Format: XML, JSON, or Binary depending on the resource queried
Registration: two methods of access: via a Web client, requiring authentication (users who are not members of a HathiTrust partner institution must sign up for a University of Michigan “Friend” Account), or programmatically using an access key that can be obtained at https://babel.hathitrust.org/cgi/kgs/request
Limitations: no stated limitations, but it is not designed for large scale data retrieval
Development Contact: feedback@issues.hathitrust.org
More information: https://www.hathitrust.org/data_api
IEEE Xplore
Function: provides flexible query and retrieval of metadata records for more than 4 million documents comprising IEEE journals, conference proceedings, and technical standards
Access: HTTP requests using structured URL queries
Result Format: JSON, XML
Registration: required - https://developer.ieee.org/getting_started
Limitations: a maximum of 200 results may be retrieved in a single query. A query term can only contain a maximum of 10 words
Development Contact: onlinesupport@ieee.org
More information: https://developer.ieee.org/
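The two limits stated for this API (200 results per query, 10 words per query term) can be checked on the client side before any request is sent. The Python sketch below is an illustration only: it assumes that a larger result set can be collected by issuing several queries with an offset, which the entry above does not confirm. The function names and the zero-based offsets are hypothetical; the real paging parameters should be taken from the IEEE developer documentation.

```python
MAX_RESULTS_PER_QUERY = 200  # limit stated in the entry above
MAX_QUERY_WORDS = 10         # limit stated in the entry above


def check_query(query):
    """Return the query unchanged, or raise if it exceeds the word limit."""
    words = query.split()
    if len(words) > MAX_QUERY_WORDS:
        raise ValueError(
            f"query has {len(words)} words; the limit is {MAX_QUERY_WORDS}"
        )
    return query


def plan_batches(total_wanted, batch_size=MAX_RESULTS_PER_QUERY):
    """Split a desired result count into (offset, count) pairs.

    Offsets are zero-based here (a hypothetical convention); adjust to
    whatever indexing the real API uses.
    """
    if total_wanted < 0:
        raise ValueError("total_wanted must not be negative")
    if not 1 <= batch_size <= MAX_RESULTS_PER_QUERY:
        raise ValueError("batch_size must be between 1 and 200")
    batches = []
    offset = 0
    while offset < total_wanted:
        count = min(batch_size, total_wanted - offset)
        batches.append((offset, count))
        offset += count
    return batches


print(check_query("deep learning for speech recognition"))
print(plan_batches(450))
```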
JSTOR Data for Research
Function: this is not a true API, but allows computational analysis and selection of JSTOR’s scholarly journal and primary resource collections. Includes tools for faceted searching and filtering, text analysis, topic modeling, data extraction, and visualization
Access: web interface
Result Format: CSV
Registration: free, but registration is required to obtain results. An institutional affiliation is not required
Limitations: datasets are capped by default at 1,000 articles; users seeking larger results are asked to contact JSTOR Data for Research
Development Contact: https://www.jstor.org/contact-us/
National Library of Medicine
Multiple APIs and other data tools for accessing various NLM databases.
Includes: Entrez Programming Utilities, Digital Collection Web Service, Open-i-Open Access Image Search, PMC Open Access Web Service
More Information: https://eresources.nlm.nih.gov/nlm_eresources/
OECD Data
Function: allows programmatic access to a selection of OECD datasets
Access: two RESTful APIs available for queries in SDMX-JSON or SDMX-ML formats
Result Format: JSON and XML
Registration: no registration required
Limitations: one million data points; not all OECD datasets are covered
Development Contact: OECDdotStat@oecd.org
OpenAlex
Function: gives programmatic access to metadata for over 200 million scientific publication records. The database is built on data from Microsoft Academic Graph (MAG), which is now retired. Data has been standardized and enhanced using sources such as Crossref, ORCID and ROR.
Access: RESTful interface, queries are made as HTTP GET requests
Result Format: JSON
Registration: free to use; no registration required, but an e-mail may be provided to enter the “polite pool”, which provides faster response time. To do so, add the mailto=you@example.com parameter in your API request
Limitations: 100,000 requests per day. Please contact team@ourresearch.org for larger requests
Development Contact: team@ourresearch.org
More information: https://docs.openalex.org/api
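A short Python sketch of adding the `mailto` parameter mentioned above to a request URL. The `https://api.openalex.org/works` route, the `search` parameter, and the `results` / `display_name` response fields are assumptions drawn from the public OpenAlex documentation and should be verified. The sample response is invented.

```python
import json
import urllib.parse

# Assumed route; confirm against the OpenAlex API documentation.
OPENALEX_WORKS = "https://api.openalex.org/works"


def build_openalex_url(search, mailto=None):
    """Build a GET URL for a works search, optionally with mailto."""
    params = {"search": search}
    if mailto:
        # Supplying an e-mail address opts the request into the polite pool.
        params["mailto"] = mailto
    return OPENALEX_WORKS + "?" + urllib.parse.urlencode(params)


def extract_names(response_text):
    """Return the display name of each work in a JSON response."""
    data = json.loads(response_text)
    return [work.get("display_name", "") for work in data.get("results", [])]


# Invented sample response with placeholder identifiers.
SAMPLE_RESPONSE = json.dumps(
    {
        "meta": {"count": 2},
        "results": [
            {"id": "https://openalex.org/W0000000001", "display_name": "Sample work A"},
            {"id": "https://openalex.org/W0000000002", "display_name": "Sample work B"},
        ],
    }
)

print(build_openalex_url("text mining", mailto="you@example.com"))
print(extract_names(SAMPLE_RESPONSE))
```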
ORCID
Function: queries and searches the ORCID researcher identifier system and obtains researcher profile data
Access: RESTful interface
Result Format: HTML, XML, or JSON
Registration: two options: users can access the Public API, which only returns data marked as “public”; or become an ORCID member to receive API credentials
Limitations: data retrieved through Public API is limited
Development Contact: https://support.orcid.org/hc/en-us/requests/new
PLOS Article-Level Metrics
Function: retrieves article-level metrics (including usage statistics, citation counts, and social networking activity) for articles published in PLOS journals and articles added to PLOS Hubs: Biodiversity
Access: queries made as HTTP GET requests through a RESTful interface
Result Format: XML, JSON, CSV
Registration: free to register; API key needed; Go to https://api.plos.org/registration/
Limitations: Results limited to batches of 50 at a time
Development Contact: alm@plos.org; questions can also be posted in PLoS API Google Group
PLOS Search
Function: allows PLOS content to be queried for integration into web, desktop, or mobile applications
Access: RESTful interface, queries are made as HTTP GET requests
Result Format: XML
Registration: free to register; API key needed; go to https://api.plos.org/registration/.
Limitations: maximum of 7200 requests a day, 300 per hour, 10 per minute; users should wait 5 seconds for each query to return results; requests should not return more than 100 rows; high-volume users should contact api@plos.org; API users are limited to no more than five concurrent connections from a single IP address
Development Contact: alm@plos.org; Questions can also be posted in PLoS API Google Group
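The request limits above imply a simple pacing rule. The hourly cap is the tightest: 300 requests per hour works out to one request every 12 seconds, which also keeps a client at 5 per minute (under the cap of 10) and at 7,200 per day (24 × 300). The Python sketch below enforces that spacing on the client side. The `Throttle` class is a hypothetical helper, not part of any PLOS library, and it handles pacing only, not the five-connection limit.

```python
import time

# 3600 seconds / 300 requests per hour = 12 seconds between requests.
MIN_INTERVAL_SECONDS = 3600 / 300


class Throttle:
    """Block so that successive calls are at least min_interval apart."""

    def __init__(self, min_interval=MIN_INTERVAL_SECONDS,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable for testing
        self._sleep = sleep   # injectable for testing
        self._last = None     # time of the previous request, if any

    def wait(self):
        """Sleep if needed; return the number of seconds slept."""
        now = self._clock()
        waited = 0.0
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
                now = self._clock()
        self._last = now
        return waited


# Offline demonstration with a fake clock, so nothing actually sleeps.
_fake_now = [0.0]


def _fake_clock():
    return _fake_now[0]


def _fake_sleep(seconds):
    _fake_now[0] += seconds


demo = Throttle(clock=_fake_clock, sleep=_fake_sleep)
print(demo.wait())  # first call: no wait
print(demo.wait())  # immediate second call: waits the full interval
```

In real use, `Throttle()` with the default clock would be created once and `wait()` called before each request.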
Scholars Portal (SP) Journals API (Beta)
Function: gives programmatic access to the metadata and full-text of over 65 million journal articles in the Scholars Portal Journals Collection. Articles are licensed from a variety of vendors, with major sources including Springer Nature, De Gruyter, Taylor & Francis, and Wiley.
Access: RESTful interface, queries are made as HTTP GET requests. Sample Python scripts for harvesting or generating a corpus are available.
Result Format: JSONL (JSON lines file)
Registration: registration is not required, but access is restricted to Canadian University IP addresses. You must be on campus or using a University VPN.
Limitations: there are no limitations. However, only articles licensed by your University are accessible (not all institutions license all collections). Attempts to retrieve articles that are not licensed will result in an error.
Development Contact: journals@scholarsportal.info
More information: https://github.com/scholarsportal/text-mining
Science Direct
Function: multiple APIs available for different use cases, including text mining of full-text content, search widgets, displaying journal or book level data, federated searching, and indexing
Access: varies
Result Format: varies
Registration: free to register
Limitations: varies
Development Contact: integrationsupport@elsevier.com
SCOPUS
Function: multiple APIs available for different use cases, including displaying publications on a website, showing cited-by counts on a website, federated searching, populating repositories with metadata, populating VIVO profiles, and others
Access: varies
Result Format: varies
Registration: free to register
Limitations: varies
Development Contact: integrationsupport@elsevier.com
Springer
Function: multiple APIs providing access to Springer Nature metadata and open access content
Access: RESTful interface, using structured URL requests
Result Format: XML, JSON, PRISM, A++ depending on query specifications
Registration: free to register, API key required
Limitations: a single query returns at most 100 results for metadata queries, or 20 results for open access queries
Development Contact: support.api@springer.com
More information: https://dev.springer.com/; https://dev.springer.com/docs; https://dev.springer.com/restfuloperations
Wiley Text and Data Mining
Function: allows text- and data-mining access to content in the Wiley Online Library
Access: accessible via CrossRef’s TDM service; RESTful interface
Result Format: JSON
Registration: must be part of a subscribing institution to have full text access. Users will encounter a click-through agreement and will receive a Client API Token, which is needed when requesting full text of articles
Limitations: rate-limits implemented through CrossRef rate-limiting headers, exact limitations not specified
Development Contact: TDM@wiley.com
More information: https://olabout.wiley.com/WileyCDA/Section/id-826542.html
Chronicling America
Function: provides access to information about historic newspapers and select digitized newspapers
Access: RESTful interface
Result Format: HTML, ATOM, JSON
Registration: none required
Limitations: none stated
More information: https://github.com/LibraryofCongress/chronam
Digital Public Library of America
Function: programmatic access to metadata in DPLA collections, including partner data from Harvard, New York Public Library, The Smithsonian, ARTstor, and others
Access: RESTful Interface
Result Format: structured JSON-LD objects
Registration: free to use; API key must be requested with information here: https://pro.dp.la/developers/api-codex#get-a-key
Limitations: none stated
Development Contact: https://pro.dp.la/developers/api-codex#get-a-key
Europeana
Function: a suite of four APIs available to allow access to metadata, annotation, and download of Europeana data
Access: RESTful Interface
Result Format: varies by API
Registration: Services and Tools | Europeana Pro
Limitations: none stated
Development Contact: https://groups.google.com/forum/?pli=1#!forum/europeanaapi
Library of Congress
Function: APIs available to download bibliographic data and search Library of Congress digital collections, including images, public radio and television, and historic newspapers
Access: varies by API
Result Format: varies by API used
Registration: free to use; most APIs do not require an API key
Limitations: not specified
Development Contact: https://labs.loc.gov/lc-for-robots/
Metropolitan Museum of Art Collection
Function: provides datasets of information on more than 470,000 artworks in its collection for unrestricted commercial and noncommercial use.
Access: RESTful interface
Result Format: JSON
Registration: none required
Limitations: none stated
Development Contact: openaccess@metmuseum.org
More information: https://github.com/metmuseum/openaccess
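A minimal Python sketch of the two-step pattern commonly used with this API: search for object IDs, then request each object record. The base URL `https://collectionapi.metmuseum.org/public/collection/v1`, the `search?q=` and `objects/{id}` routes, and the `objectIDs` response field are assumptions based on the Met's public API documentation and should be confirmed there. The sample response is invented.

```python
import json
import urllib.parse

# Assumed base URL; confirm against the Met Collection API documentation.
MET_BASE = "https://collectionapi.metmuseum.org/public/collection/v1"


def search_url(term):
    """Build the URL for a keyword search (no request is sent)."""
    return MET_BASE + "/search?" + urllib.parse.urlencode({"q": term})


def object_url(object_id):
    """Build the URL for a single object record."""
    return f"{MET_BASE}/objects/{int(object_id)}"


def object_ids(search_response_text, limit=None):
    """Return object IDs from a search response, optionally truncated."""
    data = json.loads(search_response_text)
    ids = data.get("objectIDs") or []  # tolerate a null or missing list
    return ids if limit is None else ids[:limit]


# Invented sample search response.
SAMPLE_SEARCH = json.dumps({"total": 3, "objectIDs": [101, 102, 103]})

print(search_url("sunflowers"))
for oid in object_ids(SAMPLE_SEARCH, limit=2):
    print(object_url(oid))
```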
OCLC WorldCat Search
Function: search WorldCat and retrieve bibliographic records for cataloged items such as books, videos, music and more in WorldCat
Access: RESTful interface
Result Format: XML, Dublin Core; item level information available in standard bibliographic formats
Registration: users must be affiliated with a catalogue-subscribing library. API key required.
Limitations: 50,000 queries/24 hours; limit of 100 records per batch
More information: WorldCat Search API | OCLC Developer Network
STAT!Ref OpenSearch
Function: bibliographic search service for displaying STAT!Ref results on a website.
Access: OpenSearch Specifications
Result Format: RSS, ATOM, HTML
Registration: free to register for users from a subscribing institution
Limitations: not specified
Development Contact: support@statref.com
More information: https://online.statref.com/Resources/StatRefOpenSearch.aspx
Caselaw Access Project
Function: provides queryable access to all published US court decisions
Access: in-browser API viewer or RESTful interface, also available as bulk download
Result Format: structured XML, presentation HTML, or plain text
Registration: most queries do not require registration. Jurisdictions with access restrictions require a free API key
Limitations: full text of cases limited to 500 cases per person per day, unless otherwise authorized
Development Contact: https://case.law/api/#problems
More information: https://case.law/
Data.gov
Function: access to metadata of US government open datasets
Access: RESTful interface
Result Format: JSON
Registration: API key required
Limitations: 1000 requests/hour
More information: https://api.data.gov/about/#how-it-works
Open Canada
Function: access to Canadian provincial and federal open datasets
Access: RESTful interface
Result Format: JSON
Registration: registration not required
Limitations: only supports GET requests
More information: https://ckan.org/portfolio/api/
Open Parliament
Function: access to data related to Canadian federal MPs, bills, debates, and committees. This is not a government site, but rather a repository of information gathered by external stakeholders
Access: RESTful interface
Result Format: JSON
Registration: no registration required
Limitations: limits exist, but are unspecified
Development Contact: https://github.com/michaelmulley/openparliament
World Bank
Function: provides access to World Bank statistical databases, indicators, projects, and loans, credits, financial statements and other data related to financial operations
Access: three RESTful APIs available to provide access to different datasets: Indicators (time series data), Projects (data on the World Bank’s operations), Finances (World Bank financial data)
Result Format: XML, JSON, RDF, and Atom, depending on specific API used
Registration: no registration required
Limitations: request volumes are unspecified
Development Contact: data@worldbank.org or “Contact support” link here
More information: https://datahelpdesk.worldbank.org/knowledgebase/topics/125589
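As an illustration of the Indicators API mentioned above, the Python sketch below builds a time-series request URL and reads values out of a JSON response. The base URL `https://api.worldbank.org/v2`, the `country/{code}/indicator/{code}` path, the `format` and `date` parameters, and the two-element response layout (metadata first, rows second) are assumptions based on the World Bank developer documentation and should be verified. The sample response values are invented.

```python
import json
import urllib.parse

# Assumed base URL; confirm against the World Bank API documentation.
WB_BASE = "https://api.worldbank.org/v2"


def indicator_url(country, indicator, date_range=None):
    """Build a GET URL for one indicator in one country, as JSON."""
    path = "{}/country/{}/indicator/{}".format(
        WB_BASE, urllib.parse.quote(country), urllib.parse.quote(indicator)
    )
    params = {"format": "json"}
    if date_range:
        params["date"] = date_range  # e.g. "2000:2002"
    # Keep ':' readable in the date range.
    return path + "?" + urllib.parse.urlencode(params, safe=":")


def values_by_year(response_text):
    """Map year to value, skipping years with no reported value."""
    payload = json.loads(response_text)
    rows = payload[1] or []  # element 0 is paging metadata
    return {row["date"]: row["value"] for row in rows if row["value"] is not None}


# Invented sample response; the numbers are placeholders, not real data.
SAMPLE_RESPONSE = json.dumps(
    [
        {"page": 1, "pages": 1, "total": 2},
        [
            {"date": "2001", "value": 31.0},
            {"date": "2000", "value": None},
        ],
    ]
)

print(indicator_url("CA", "SP.POP.TOTL", "2000:2002"))
print(values_by_year(SAMPLE_RESPONSE))
```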
UN Comtrade
Function: allows access to data on International Merchandise Trade Statistics (IMTS) and the work of the International Merchandise Trade Statistics Section (IMTSS) of the United Nations Statistics Division
Access: some services in REST, some in SOAP
Result Format: XML, CSV
Registration: Comtrade Web Services require IP authentication, and users must have a site license account; however, access to metadata and data availability is not restricted
Limitations: depending on access rights, the following data can be obtained: Comtrade Data, Tariff Line Data, Total Trade, Annual Totals, Processed Data, or Original Data. The last three are restricted to data exchange between the UN and OECD.
Development Contact: comtrade@un.org
More information: https://comtrade.un.org/ws/
Text mining and data mining are related methods for identifying patterns within large bodies of text and data, respectively. A number of different techniques are associated with these methods.
Voyant Tools is a web-based platform for generating statistical information about text corpora that may offer preliminary information about your text(s).
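To give a sense of the kind of preliminary statistics such a platform reports, the Python sketch below counts term frequencies in a short text. It is a generic illustration using only the standard library and does not reproduce Voyant Tools' own methods. The tokenizer (lowercase runs of letters and apostrophes) and the stopword list are simplifying assumptions.

```python
import re
from collections import Counter


def term_frequencies(text, stopwords=frozenset()):
    """Count how often each word occurs, ignoring case and stopwords."""
    # Simplified tokenizer: runs of ASCII letters and apostrophes only.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in stopwords)


sample_text = "The cat sat on the mat. The mat was flat."
stopwords = {"the", "on", "was"}

counts = term_frequencies(sample_text, stopwords)
# List terms from most to least frequent.
for term, count in counts.most_common():
    print(term, count)
```

A real corpus would usually call for a fuller stopword list and a tokenizer that handles accented characters and hyphenation.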
Some vendors, publishers, journals, and other organizations have made text and data available via application programming interfaces (APIs). Please see our list of APIs available to researchers. University of Toronto Libraries has some locally loaded materials available for text mining as well. Some openly accessible collections may also be useful; the University of Illinois at Urbana Champaign has compiled a list of open resources for text mining. For text-wrangling and text mining skills, consult the University of Southern California's excellent list of training resources.
For help with using APIs or to inquire about available materials for text mining, contact us.