Start citing data now. Not later
Recording data citations supports data reuse and aids research integrity and reproducibility. Crossref makes it easy for our members to submit data citations to support the scholarly record.
TL;DR
Citations are core metadata that all members should submit for all articles, conference proceedings, preprints, and books. Submitting data citations to Crossref has long been possible, and it’s easy. You just need to:
- Include data citations in the references section as you would for any other citation
- Include a DOI or other persistent identifier for the data if it is available - just as you would for any other citation
- Submit the references to Crossref through the content registration process as you would for any other record
And your data citations will flow through all the normal processes that Crossref applies to citations. They will be distributed openly to the community (including DataCite!) via Crossref’s services and APIs. All data citations deposited with Crossref will be exposed in the (soon-to-be-launched) Data Citation Corpus.
And then, you can sit back and congratulate yourself for making your publication more useful to researchers who want to be able to reuse the data underlying your publications.
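As a rough illustration of the three steps above, here is a minimal Python sketch that builds a reference list in which a dataset sits alongside an ordinary article reference. It assumes the `citation_list`, `citation`, `doi`, and `unstructured_citation` element names from Crossref’s deposit schema; the reference keys, the reference text, and the `10.5555/12345678` DOI are placeholders invented for the example, so check the current schema documentation before adapting it.

```python
import xml.etree.ElementTree as ET


def build_citation_list(references):
    """Build a <citation_list> element from (key, doi, text) tuples.

    Element names are assumed from Crossref's deposit schema; data and
    non-data references are handled in exactly the same way.
    """
    citation_list = ET.Element("citation_list")
    for key, doi, text in references:
        citation = ET.SubElement(citation_list, "citation", {"key": key})
        if doi:
            # Include the persistent identifier whenever one is available.
            ET.SubElement(citation, "doi").text = doi
        if text:
            ET.SubElement(citation, "unstructured_citation").text = text
    return citation_list


# Hypothetical reference list: one article and one dataset.
references = [
    ("ref1", "10.5555/12345678", "Placeholder journal article reference."),
    ("ref2", "10.5061/dryad.854j2", "Placeholder dataset reference."),
]

citation_xml = ET.tostring(build_citation_list(references), encoding="unicode")
print(citation_xml)
```

The point of the sketch is that nothing in it is specific to data: the dataset reference goes through the same code path as the article reference.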
Background
You might ask, “So if submitting data citations to Crossref has long been possible, why do you have to write this?”
Historically, authors did not cite data in the way they cited publications. Instead, they would often refer to the data in the main text of the article. This has made it hard to determine what data lay behind the research and to access that data.
But the research community has increasingly recognized that data is a first-class research output and that we should treat it as such. In short, we should formally cite data.
But because citing data is a comparatively new practice, it has been subject to a lot of new analysis. And unsurprisingly, people analyzing data citation have discovered that there is a lot of nuance to citation of any kind.
There are lots of reasons for citing something. There are lots of internalized conventions for citing things. And there are different conventions for citation for different research objects. And citation practice in the social sciences and humanities (SSH) differs from STEM. And legal citation practices are different from scholarly citation practices. And citation practices even vary by subdiscipline and by journal.
Those who have been looking at what it means to “cite data” have naturally stumbled into a thicket of divergent practices - some of which are historical holdovers, some of which are stylistic preferences, and some of which are clearly adaptations to deal with the specific needs of certain research objects/containers or different disciplines.
The temptation has been to try and rationalize this before extending the practice of citation to data.
“Maybe because data is a distinct record type, we should include the fact that it is a data citation in the citation itself?”
“Maybe because people cite data for different reasons, we should include a typology of citation types in all data citations?”
And so you may hear some people say, “hold off on data citation - we don’t have an optimal way to do it yet, and it can be very complicated.”
But guess what?
We currently don’t label citations to monographs as “citation to monograph.”
And we don’t currently include the reason for citation when we are citing a journal article.
It would be very cool if we did. And it would likely make citations even more useful if we did.
But citations are already useful even without these features. And so, to delay citing data indefinitely because we have an opportunity to improve the act of citation is just perverse. Our community has always opted for progress over perfection.
For one thing - the efforts are not mutually exclusive. We can start citing data with the current limitations of citation practices and simultaneously propose mechanisms for making citation more useful in the future, including new guidelines to deal with the unique issues that citing data poses.
But in the meantime, we will be doing researchers a giant favour if we at least include our imperfect, ambiguous, and unconventional references to data in the references section of an article so that they can be accessed and processed along with all the other imperfect, ambiguous, and variant citations that we find so useful.
Some of our members are already doing this. They have been for a long time. And they haven’t found it any more complicated than managing non-data references in the past.
Join them and make your metadata more useful.
Cite data now. Don’t put it off.
And Crossref will continue to work with DataCite and the rest of the community to make the distribution even easier and more useful.
So who is already citing data?
Top 10 members depositing data citations from November-May 2022
(broken down by DOI prefix, which is why you see some publishers listed twice):
Prefix | Member name | Data citations deposited
10.1038 | Springer Science and Business Media LLC | 7174
10.1016 | Elsevier BV | 6527
10.1007 | Springer Science and Business Media LLC | 4748
10.5194 | Copernicus GmbH | 3017
10.1080 | Informa UK Limited | 2346
10.1177 | SAGE Publications | 2082
10.1002 | Wiley | 2048
10.1111 | Wiley | 1888
10.1108 | Emerald | 1876
10.3390 | MDPI AG | 1827
Top 10 data citations per deposited work
(again, broken down by prefix)
Member name | Prefix | Data citations deposited | Data citations per work |
Consortium Erudit | 10.7202 | 580 | 1.149 |
SLACK, Inc. | 10.3928 | 462 | 0.646 |
S. Karger AG | 10.1159 | 1653 | 0.532 |
Proceedings of the National Academy of Sciences | 10.1073 | 973 | 0.502 |
American Academy of Pediatrics (AAP) | 10.1542 | 486 | 0.397 |
F1000 Research Ltd | 10.12688 | 552 | 0.341 |
American Association for the Advancement of Science (AAAS) | 10.1126 | 952 | 0.317 |
Springer Science and Business Media LLC | 10.1038 | 7174 | 0.231 |
JMIR Publications Inc. | 10.2196 | 864 | 0.187 |
American Geophysical Union (AGU) | 10.1029 | 692 | 0.166 |
These figures cover only the prefixes with the most data citations deposited (more than 500 in 6 months), so smaller members may be doing better than this.
Summaries are great, but I want to see some actual examples!
Here are some examples showing how data is cited by our members:
And here are some example API requests for discovering more data citations. You can use these requests as starting points and adapt them to your own needs.
Find all the DOIs that cite Dataset X (identified by DOI)
https://api.eventdata.crossref.org/v1/events?rows=20&scholix=true&obj-id=10.5061/dryad.854j2
Find all data citations from Crossref member X (identified by member prefix)
https://api.eventdata.crossref.org/v1/events?rows=20&scholix=true&subj-id.prefix=10.7202
Find papers with supplementary data
https://api.crossref.org/v1/works?filter=prefix:10.3390,relation.type:is-supplemented-by
Find all data citations to Crossref member X
https://api.eventdata.crossref.org/v1/events?rows=20&scholix=true&obj-id.prefix=10.7202
Find all data citations to DataCite member X
https://api.eventdata.crossref.org/v1/events?rows=20&scholix=true&obj-id.prefix=10.5061
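If you want to adapt these requests in a script, a minimal Python sketch along the following lines may help. It only assembles the query URLs from the parameters shown above. The base hosts `https://api.eventdata.crossref.org` and `https://api.crossref.org` are assumed to be the canonical ones, and the helper names `event_data_url`, `works_url`, and `fetch_json` are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed canonical base URLs for the two APIs used in the examples above.
EVENT_DATA_BASE = "https://api.eventdata.crossref.org/v1/events"
WORKS_BASE = "https://api.crossref.org/v1/works"


def event_data_url(filters, rows=20):
    """Build an Event Data query that returns Scholix-formatted links."""
    params = {"rows": rows, "scholix": "true", **filters}
    # Keep "/" unescaped so DOIs stay readable in the URL.
    query = urlencode(params, safe="/")
    return EVENT_DATA_BASE + "?" + query


def works_url(filters):
    """Build a works query from filter name/value pairs."""
    filter_value = ",".join(f"{name}:{value}" for name, value in filters.items())
    query = urlencode({"filter": filter_value}, safe="/:,")
    return WORKS_BASE + "?" + query


def fetch_json(url):
    """Fetch a URL and decode the JSON body (requires network access)."""
    with urlopen(url) as response:
        return json.load(response)


# The example requests from the list above, rebuilt from their parameters.
cites_dataset = event_data_url({"obj-id": "10.5061/dryad.854j2"})
from_member = event_data_url({"subj-id.prefix": "10.7202"})
to_datacite_member = event_data_url({"obj-id.prefix": "10.5061"})
with_supplements = works_url(
    {"prefix": "10.3390", "relation.type": "is-supplemented-by"}
)

for url in (cites_dataset, from_member, to_datacite_member, with_supplements):
    print(url)
```

Swapping in a different DOI or prefix is then a one-line change, and `fetch_json(cites_dataset)` would retrieve the response for further processing.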