

Citing Data Sets


Tony Hammond – 2007 March 30

In Citation, Data

This D-Lib paper by Altman and King looks interesting: “A Proposed Standard for the Scholarly Citation of Quantitative Data”. (Thanks to Herbert Van de Sompel for drawing attention to it.) The gist (Sect. 3) is:

_“We propose that citations to numerical data include, at a minimum, six required components. The first three components are traditional, directly paralleling print documents. … Thus, we add three components using modern technology, each of which is designed to persist even when the technology changes: a unique global identifier, a universal numeric fingerprint, and a bridge service. They are also designed to take advantage of the digital form of quantitative data.

An example of a complete citation, using this minimal version of the proposed standards, is as follows:

Micah Altman; Karin MacDonald; Michael P. McDonald, 2005, “Computer Use in Redistricting”,

hdl:1902.1/AMXGCNKCLU UNF:3:J0PkMygLPfIyT1E/8xO/EA==

http://id.thedata.org/hdl%3A1902.1%2FAMXGCNKCLU

”_
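Judging from the example alone, the bridge URL in the last line looks like the handle from the line above it, percent-encoded and appended to the `id.thedata.org` address. A minimal sketch of that mapping follows. The names `BRIDGE_BASE` and `bridge_url` are my own, and it is only an assumption that the service builds its URLs this way.

```python
from urllib.parse import quote

# Base address taken from the example citation; treating it as a fixed
# prefix is an assumption, not something the paper excerpt states.
BRIDGE_BASE = "http://id.thedata.org/"


def bridge_url(identifier):
    """Build a browser-friendly URL for a global identifier (hypothetical helper)."""
    # safe="" makes quote() escape ":" and "/" inside the identifier as well.
    return BRIDGE_BASE + quote(identifier, safe="")


url = bridge_url("hdl:1902.1/AMXGCNKCLU")
print(url)  # http://id.thedata.org/hdl%3A1902.1%2FAMXGCNKCLU
```

The printed URL matches the one in the quoted citation.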

So the abbreviated citation (author, date, title, unique ID) is supplemented by a UNF, which fingerprints the data. UNFs appear to be a sort of super MD5: they provide a signature of the data content that is independent of how the data is serialized to a filestore.

_“Thus, we add as the fifth component a Universal Numeric Fingerprint or UNF. The UNF is a short, fixed-length string of numbers and characters that summarize all the content in the data set, such that a change in any part of the data would produce a completely different UNF. A UNF works by first translating the data into a canonical form with fixed degrees of numerical precision and then applies a cryptographic hash function to produce the short string. The advantage of canonicalization is that UNFs (but not raw hash functions) are format-independent: they keep the same value even if the data set is moved between software programs, file storage systems, compression schemes, operating systems, or hardware platforms.

Finally, since most web browsers do not currently recognize global unique identifiers directly (i.e., without typing them into a web form), we add as the sixth and final component of the citation standard a bridge service, which is designed to make this task easier in the medium term.”_
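To make the canonicalize-then-hash idea from the UNF description concrete, here is a toy sketch. It is not the real UNF algorithm: the function name `toy_fingerprint`, the choice of SHA-256, the 7-significant-digit default, and the newline-joined canonical form are all assumptions picked for illustration.

```python
import base64
import hashlib


def toy_fingerprint(values, digits=7):
    """Toy canonicalize-then-hash fingerprint (NOT the real UNF algorithm)."""
    # Canonical form: every number rounded to a fixed number of significant
    # digits and written in one fixed exponential notation.
    canonical = "\n".join(f"{float(v):.{digits - 1}e}" for v in values)
    # Hash the canonical text rather than the bytes of any particular file.
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


# The same numbers as they might arrive from two different serializations.
from_csv = [float(s) for s in "1.50,2,0.1".split(",")]
from_memory = [1.5, 2, 0.1]

same = toy_fingerprint(from_csv) == toy_fingerprint(from_memory)
changed = toy_fingerprint([1.5, 2, 0.2]) != toy_fingerprint(from_csv)
print(same, changed)
```

Both flags come out `True`: the fingerprint survives a change of serialization but not a change in the data, which is the property a plain MD5 of the file bytes would lack.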

Certainly looks promising. I’m not sure whether there are any other contenders in this arena.


Page owner: Tony Hammond   |   Last updated 2007-March-30