And the DOI is …
Once structured metadata is added to a file then retrieving a given metadata element is usually a doddle. For example, for PDFs with embedded XMP one can use Phil Harvey’s excellent Exiftool utility.
Exiftool is a Perl library and application, which I’ve blogged about here earlier, and is available as a ‘.zip’ file for Windows (no Perl required) or a ‘.dmg’ for MacOS. Note that Phil maintains this actively and has done so over the last five years. (And when I say actively I mean just that. I once made the mistake of printing out the change file.)
If Perl’s not your thing, then there’s a Ruby wrapper gem (MiniExiftool) to access the Exiftool command in trouper OO fashion. Here’s an example Ruby one-liner to get the DOI from a PDF (broken here to meet column width restriction):
% ruby -rubygems -e 'require "mini_exiftool";
    puts MiniExiftool.new("test.pdf")["doi"]'
10.1038/nphoton.2008.200
Of course, that could also have been run against an image, audio or video file with an XMP packet.
(Makes one wonder vaguely about the feasibility of having a Swiss Army knife type of utility that could read any file to get the DOI using the embedded XMP, RDFa, RDF, HTML headers, COinS, etc. Possibly even, as a last resort, falling back to scanning the raw text, if any.)
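To make that last-resort idea a little more concrete, here is a minimal Ruby sketch of scanning raw text for something DOI-shaped. The helper name `find_doi` and the pattern itself are assumptions made up for illustration: the regular expression is a rough heuristic (the "10." prefix, a 4 to 9 digit registrant code, a slash, then a run of non-space characters), not an official DOI grammar, and it will miss or mangle some legitimate DOIs.

```ruby
# Hypothetical helper: a last-resort scan of raw text for a DOI-like string.
# The pattern is a heuristic assumption, not an official DOI grammar.
DOI_PATTERN = %r{\b10\.\d{4,9}/[^\s"'<>]+}

def find_doi(text)
  match = text[DOI_PATTERN]
  return nil unless match

  # Drop trailing punctuation that most likely belongs to the
  # surrounding sentence rather than to the DOI itself.
  match.sub(/[.,;:)\]]+\z/, "")
end

puts find_doi("See doi:10.1038/nphoton.2008.200.")
```

A real utility would presumably try the structured sources first (XMP, RDFa, HTML headers, COinS) and only reach for something like this when they all come up empty.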