Text and data mining for researchers
Our API allows researchers to easily harvest full-text documents from all participating members, regardless of whether the content is open access or subscription. The member is responsible for delivering the full-text content requested, so open access content can simply be delivered, while subscription content is available through access control systems.
To mine our metadata, you should have a list of DOIs for the content you want to download, and a safelist of licenses that you accept. You can get a list of DOIs from citations, our metadata search, our metadata API, or another source.
For each DOI, you should:
- Use content negotiation to get the metadata for the DOI
- Check to see if the DOI has license and full-text details in its metadata
- Check the license against your safelist of acceptable licenses
- If you agree to the license, follow the link and download the full-text of the content item.
The absence of a license does not mean that the full-text can be used without one. Members should deposit both the license and the full-text link at the same time.
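The per-DOI steps above can be sketched as a small loop. This is a minimal illustration rather than a reference implementation: `mine`, `fetch_metadata`, and `download` are hypothetical names, and the two callables stand in for whatever metadata lookup and download code your TDM tool already has. The sketch is deliberately conservative: it skips a DOI unless every license listed for it is on the safelist, and it ignores embargo dates.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple


def mine(
    dois: Iterable[str],
    safelist: Set[str],
    fetch_metadata: Callable[[str], Tuple[List[str], List[str]]],
    download: Callable[[str, str], None],
) -> Dict[str, str]:
    """Apply the per-DOI checks and return the skipped DOIs with a reason.

    fetch_metadata(doi) is assumed to return (license_urls, fulltext_urls).
    download(doi, url) is assumed to fetch and store the full-text.
    """
    skipped: Dict[str, str] = {}
    for doi in dois:
        licenses, links = fetch_metadata(doi)
        # No license or no full-text link: do not assume the content is free to use.
        if not licenses or not links:
            skipped[doi] = "no license or full-text link in metadata"
            continue
        # Conservative check: every listed license must be on the safelist.
        if not all(lic in safelist for lic in licenses):
            skipped[doi] = "license not on safelist"
            continue
        download(doi, links[0])
    return skipped
```

The returned dictionary gives you a list of DOIs to revisit, for example after you have reviewed a new license.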
Watch a basic introduction or a more detailed presentation on how to perform TDM using our API.
Example using the cURL utility
Integrating the API with your TDM software should be straightforward.
Step 1
Fetch the metadata: at its simplest, you can issue an HTTP GET request for a Crossref DOI and use DOI content negotiation. For example, the following cURL command retrieves the metadata for the DOI 10.5555/515151:
curl -L -iH "Accept: application/vnd.crossref.unixsd+xml" http://0-dx-doi-org.library.alliant.edu/10.5555/515151
This will return the metadata for the specified DOI, as well as a link header which points to several representations of the full-text on the member’s site:
HTTP/1.1 200 OK
Date: Wed, 31 Jul 2013 11:24:14 GMT
Server: Apache/2.2.3 (CentOS)
Link: <http://0-annalsofpsychoceramics-labs-crossref-org.library.alliant.edu/fulltext/10.5555/515151.pdf>; rel="http://0-id-crossref-org.library.alliant.edu/schema/fulltext"; type="application/pdf", <http://0-annalsofpsychoceramics-labs-crossref-org.library.alliant.edu/fulltext/10.5555/515151.xml>; rel="http://0-id-crossref-org.library.alliant.edu/schema/fulltext"; type="application/xml"
Vary: Accept
Content-Length: 2189
Status: 200 OK
Connection: close
Content-Type: application/vnd.crossref.unixsd+xml;charset=utf-8
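If your tool needs to pick one representation (for example, XML rather than PDF), the Link header value has to be split into its individual links. The following is a minimal sketch, assuming every link parameter is a quoted string as in the response above; `parse_link_header` and `fulltext_by_type` are hypothetical helper names, and a production tool may prefer a full parser for the Link header syntax.

```python
import re
from typing import Dict, List

# One link: <url> followed by zero or more ; name="value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_link_header(value: str) -> List[Dict[str, str]]:
    """Split a Link header value into dicts with 'url' plus its parameters."""
    links = []
    for url, params in LINK_RE.findall(value):
        entry = {"url": url}
        entry.update(dict(PARAM_RE.findall(params)))
        links.append(entry)
    return links


def fulltext_by_type(value: str) -> Dict[str, str]:
    """Map MIME type to URL for links whose rel ends in /schema/fulltext."""
    return {
        link["type"]: link["url"]
        for link in parse_link_header(value)
        if link.get("rel", "").endswith("/schema/fulltext") and "type" in link
    }
```

Given the header shown above, `fulltext_by_type(header).get("application/xml")` would return the URL of the XML representation, or `None` if the member did not register one.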
Access this full-text link information using Ruby:
require 'open-uri'
# Use URI.open: on Ruby 3.0 and later, Kernel#open no longer opens URLs
r = URI.open("http://0-dx-doi-org.library.alliant.edu/10.5555/515151", "Accept" => "application/vnd.crossref.unixsd+xml")
puts r.meta['link']
Access this full-text link information using Python:
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('Accept', 'application/vnd.crossref.unixsd+xml')]
r = opener.open('http://0-dx-doi-org.library.alliant.edu/10.5555/515151')
print(r.info()['Link'])
Access this full-text link information using R:
library(httr)
r = content(GET('http://0-dx-doi-org.library.alliant.edu/10.5555/515151', add_headers(Accept = 'application/vnd.crossref.unixsd+xml')))
r
If present, the full-text URL will also be returned in the metadata for the DOI. For instance, in our unixref schema, you would also see this in the returned metadata:
http://0-annalsofpsychoceramics-labs-crossref-org.library.alliant.edu/fulltext/10.5555/515151.pdf
http://0-annalsofpsychoceramics-labs-crossref-org.library.alliant.edu/fulltext/10.5555/515151.xml
Step 2
Deciding what to do: members who enable mining through us need to register a stable license URL using the <license_ref> element. For example, this unixref extract shows that the DOI is licensed under the Creative Commons CC-BY license:
<license_ref>http://creativecommons.org/licenses/by/3.0/deed.en_US</license_ref>
But this shows that the DOI is licensed under a member’s proprietary license:
<license_ref>http://www.annalsofpschoceramics.org/art_license.html</license_ref>
The license that the URL points to does not have to be machine-readable. Check the license URL against your safelist. If it is on the safelist, you can proceed. If it is not, put it in a list of licenses to review later, and then add it to either your safelist or your blocklist.
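The safelist check can be sketched as a three-way triage. This is only an illustration: `triage` and the three sets are hypothetical names, and the comparison is an exact string match on the license URL, so two URLs that differ only in scheme or trailing slash would be treated as different licenses.

```python
from typing import Set


def triage(license_url: str, safelist: Set[str], blocklist: Set[str], review_queue: Set[str]) -> str:
    """Return 'accept', 'reject', or 'review' for a license URL.

    Unknown licenses are added to review_queue so a person can read them
    later and move them to the safelist or the blocklist.
    """
    if license_url in safelist:
        return "accept"
    if license_url in blocklist:
        return "reject"
    review_queue.add(license_url)
    return "review"
```

Only content whose license comes back as `accept` should be downloaded; the review queue grows as your tool encounters licenses nobody has read yet.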
If a content item is under embargo, a slight complication arises: the member can use a start_date attribute on the <license_ref> element. In this example, the content item is under a proprietary license for a year after its publication date, after which it is licensed under a CC-BY license:
<license_ref start_date="2013-02-03">https://0-www-crossref-org.library.alliant.edu/license</license_ref>
<license_ref start_date="2014-02-03">http://creativecommons.org/licenses/by/3.0/deed.en_US</license_ref>
TDM tools can use a combination of the <license_ref> element(s) and the start_date attribute to determine if the content item is currently under embargo.
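One way a tool might make that determination is sketched below. It rests on two assumptions that are not stated in the text: that start_date values are ISO dates (as in the example), and that the license with the most recent start_date that has already passed supersedes earlier ones. `parse_license_refs` and `license_in_effect` are hypothetical helper names, and the element is matched by local name so the sketch works whether or not the metadata uses an XML namespace.

```python
import xml.etree.ElementTree as ET
from datetime import date
from typing import List, Optional, Tuple

LicenseRef = Tuple[Optional[date], str]


def parse_license_refs(xml_text: str) -> List[LicenseRef]:
    """Collect (start_date, url) for every license_ref element in the XML."""
    refs = []
    for el in ET.fromstring(xml_text).iter():
        # Strip any "{namespace}" prefix before comparing the element name.
        if el.tag.rsplit("}", 1)[-1] == "license_ref":
            start = el.get("start_date")
            refs.append((date.fromisoformat(start) if start else None, (el.text or "").strip()))
    return refs


def license_in_effect(refs: List[LicenseRef], on: date) -> Optional[str]:
    """Return the URL of the license that applies on the given day, if any.

    A license_ref without a start_date is treated as always in effect.
    """
    started = [(start or date.min, url) for start, url in refs if (start or date.min) <= on]
    if not started:
        return None
    return max(started, key=lambda ref: ref[0])[1]
```

For the embargo example above, a date in mid-2013 would yield the proprietary license URL, and any date from 2014-02-03 onward would yield the CC-BY URL. The result can then be checked against your safelist.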
If you are not interested in receiving the metadata for the DOI, you can simply issue an HTTP HEAD request and you will get the Link header without the rest of the DOI record.
Step 3
Fetching the full-text: you can now perform a standard GET request on the URL to download the full-text from the member’s site. Because bulk downloading of large amounts of data may put a strain on the member’s servers, we have defined a set of rate-limiting HTTP headers. You are not obliged to test for and act on these headers, and not all members will use them, but doing so will avoid surprises.
An example session using rate limiting
curl -k "https://0-annalsofpsychoceramics-labs-crossref-org.library.alliant.edu/fulltext/515151" -D - -L -O
HTTP/1.1 200 OK
Date: Fri, 02 Aug 2013 07:10:53 GMT
Server: Apache/2.2.22 (Ubuntu)
X-Powered-By: Phusion Passenger (mod_rails/mod_rack) 3.0.13
CR-TDM-Client-Token: hZqJDbcbKSSRgRG_PJxSBA
CR-TDM-Rate-Limit: 5
CR-TDM-Rate-Limit-Remaining: 4
CR-TDM-Rate-Limit-Reset: 1375427514
X-Content-Type-Options: nosniff
Last-Modified: Tue, 23 Apr 2013 15:52:01 GMT
Status: 200
Content-Length: 9426
Content-Type: application/pdf
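A client could act on these headers by pausing when its allowance runs out. The sketch below makes assumptions about the header semantics that the session above suggests but does not spell out: that CR-TDM-Rate-Limit-Remaining is the number of requests left in the current window, and that CR-TDM-Rate-Limit-Reset is a Unix timestamp in seconds (the sample value falls about a minute after the Date header). `seconds_to_wait` is a hypothetical helper name.

```python
from typing import Mapping


def seconds_to_wait(headers: Mapping[str, str], now: float) -> float:
    """Return how long to pause before the next request to this member.

    headers: response headers from the previous full-text request.
    now: current time as a Unix timestamp, e.g. time.time().
    """
    remaining = headers.get("CR-TDM-Rate-Limit-Remaining")
    reset = headers.get("CR-TDM-Rate-Limit-Reset")
    # Not all members send these headers; without them there is nothing to act on.
    if remaining is None or reset is None:
        return 0.0
    if int(remaining) > 0:
        return 0.0
    # Allowance used up: wait until the window resets (never a negative time).
    return max(0.0, float(reset) - now)
```

A download loop might call `time.sleep(seconds_to_wait(response.headers, time.time()))` before each request. Note that a plain `dict` is case-sensitive, so header names must match the spelling used here unless your HTTP library provides a case-insensitive mapping.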
Problems accessing full-text URLs using our API
If you are having trouble accessing the full-text URLs returned to you in the Link header, this may be because:
- You have hit a rate limit (learn more about rate-limiting headers)
- You are trying to access content from a publisher that requires you to accept a TDM license; consider modifying your tools to work with such publishers’ licenses.