Feedback on automatic digital preservation and self-healing DOIs
Thank you to everyone who responded with feedback on the Op Cit proposal. This post clarifies, defends, and amends the original proposal in light of the responses that have been sent. We have endeavoured to respond to every point that was raised, either here or in the document comments themselves.
We strongly prefer for this to be developed in collaboration with CLOCKSS, LOCKSS, and/or Portico, i.e. through established preservation services that already have arrangements in place, are properly funded, and understand the problem space. There is a low level of trust in the Internet Archive, given a number of ongoing court cases and erratic behavior in the past. People question the sustainability and stability of the IA, and because it is not funded by publishers or other major STM stakeholders, there is low confidence that the IA will set its priorities in a way that aligns with those of the publishing industry.
We acknowledge that some of our members have a low level of trust in the Internet Archive, but many of our (primarily open access) members work very closely with the IA, and our research has shown that, without the IA, the majority of our smaller open access members would have almost no preservation at all. We have already had conversations with CLOCKSS and Portico about involvement in the pilot and about what scaling to production would look like. That said, for a proof-of-concept, the Internet Archive presents a very easy way to get off the ground, with a stable system that has been running for almost 30 years.
This seems to be a service for OA content only, but people wonder for how long. Someone already spotted an internal Crossref comment on the working doc that suggested “why not just make it default for everything & everyone”, and that raises concern.
The primary audience for this service is small OA publishers that are, at present, poorly preserved. These publishers present a problem for the whole scholarly environment because linking to their works can prove non-persistent if preservation is not well handled. Enhancing preservation for this sector therefore benefits the entire publishing industry by creating a persistent linking environment. We have no plans to make this the “default for everything and everyone” because the licensing challenges alone are massive, but also because it isn’t necessary. Large publishers like Elsevier are doing a good job of digitally preserving their content. We want this service to target the areas that are currently weaker.
Crossref will always respect the content rights of our members. We will never release content through Crossref that members have not asked us to release.
The purpose of the Op Cit project is to make it easier for our members to fulfil commitments they already made when they joined Crossref.
Crossref is fundamentally an infrastructure for preserving citations and links in the scholarly record. We cannot do that if the content being cited or linked to disappears.
When signing the Crossref membership agreement, members agree to use their best efforts to preserve their content with archiving services so that Crossref can continue to link citations to it even in extremis, for example if they have ceased operations.
Some of our members already do this well. They have already made arrangements with the major archiving providers and do not need the Op Cit service to help them with archiving. However, the Op Cit service will still help them ensure that the DOIs they cite continue to work. So it will benefit them even if they don’t use it directly.
However, our research shows that many of our members are not fulfilling the commitments they made when joining Crossref. Over the next few years, we will be trying to fix this, primarily through outreach: encouraging members to set up archiving arrangements with the archives of their choice and to record those arrangements with Crossref.
But we know some members will find this too technically challenging and/or costly. [And frankly, given what we’ve learned of the archiving landscape, we can see their point.] The proposed Op Cit service is for these members. The vast majority of them are Open Access publishers, so the “rights” questions are far more straightforward, making the implementation of such a service much more tractable.
Someone asked what this means for the publisher-specific DOI prefix for this content, and whether it would be lost.
There is concern about the interstitial page that Crossref would build that gives the user access options. The value of Crossref to publishers is adding services that are invisible and beneficial to users, not adding a visible step that requires user action.
There is nothing in Crossref’s terms that says that we have to be invisible. The basic truth is that detecting content drift is really hard and several efforts to do so before have failed. Without a reliable way of knowing whether we should display the interstitial page, which may become possible in future, we have to display something for now, or the preservation function will not work.
Crossref has also supported user-facing interstitial services for over a decade, including:
- Multiple Resolution
- Crossref Metadata Search
- REST API
So we have a long track record of non-B2B service provision.
There is confusion about why Crossref seems to want to build the capacity to “lock” records in the absence of flexibility. People feel no need for Crossref to get involved here.
This is a misunderstanding of the terminology. The Internet Archive allows the domain owner to request content to be removed. This would mean that, in future, if a new domain owner wanted, they could remove previously preserved material from the archive, thereby breaking the preservation function. When we say we want to “lock” a record, we mean that a future domain owner cannot remove content from the preservation archive. This also prevents domain hijackers from compromising the digital preservation.
There is concern about the possibility of hacking this system to gain uncontrolled access to all full-text content by attacking publishing systems and making them unavailable. This is an unhappy-path scenario, but it is something on people’s minds.
The system only works on content that is provided with an explicitly stated open license (see response above).
I think this project would be improved by better addressing the people doing the preservation maintenance work that this requires. Digital preservation is primarily a labor problem, as the technical challenges are usually easier than the challenge of consistently paying people to keep everything maintained over time. Through that lens, this is primarily a technical solution to offload labor resources from small repositories to (for now) the Internet Archive, where you can get benefits from the economies of scale. There are definitely cases where that could be useful! But I think making this more explicit will further a shared understanding of advantages and disadvantages and help you all see future roadblocks and opportunities for this approach.
This consultation phase was designed, precisely, to ensure that those working in the space could have their say. While this is a technical project, we recognize that any solution must value and understand labor. That means that any scaling to production must and will also include a funding solution to address the social labor challenge.
Is there any sense in polling either the IA Wayback Machine or the LANL Memento Aggregator first to determine if snapshot(s) already exist?
We could do this, but it would add an additional hop/lookup on deposit. Plus, we want to store the specific version deposited at the specific time it is done, including re-deposits.
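For reference, such a pre-check is straightforward because the Internet Archive exposes a public availability endpoint that reports the closest existing snapshot for a URL. A minimal sketch (the `archive.org/wayback/available` endpoint is the IA's documented public API; the helper names are ours):

```python
import json
import urllib.parse
import urllib.request

def parse_availability(payload: dict):
    """Extract the closest available snapshot URL from the
    wayback/available JSON response, or None if there is none."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None

def wayback_snapshot(url: str):
    """Ask the Wayback Machine whether a snapshot of `url` already exists."""
    api = ("https://archive.org/wayback/available?url="
           + urllib.parse.quote(url, safe=""))
    with urllib.request.urlopen(api, timeout=10) as resp:
        return parse_availability(json.load(resp))
```

As noted above, though, this tells you only that *a* snapshot exists, not that it matches the version being deposited at deposit time.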
I would encourage looking at a distributed file system like IPFS (https://en.wikipedia.org/wiki/InterPlanetary_File_System). This would allow easy duplication, switching, and peering of preservation providers. Correctly leveraged with IPNS, resolution, version tracking, and version immutability also become benefits. Later, after beta, the IPNS metadata could be included as DOI metadata.
We had considered IPFS for other projects but, for this, we want to go with recognised archives rather than end up running our own preservation infrastructure.
It might be useful to look into the 10320/loc option for the Handle server: https://www.handle.net/overviews/handle_type_10320_loc.html. I can imagine a use case where a machine agent might want to access an archive directly without needing to go through an interstitial page.
It is good to see reference to the Handle system and alternative ways that we might use it. We will consult internally on the technical viability of this.
In general, though, we prefer to use web-native mechanisms when they are available. We already support direct machine access via HTTP redirects and by exposing resource URLs in the metadata that can be retrieved via content negotiation. In this case, we would be looking at supporting the 300 (Multiple Choices) semantics.
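The content-negotiation path mentioned above is already usable against the doi.org resolver. A sketch of how a machine agent negotiates for metadata (the CSL JSON media type is one the resolver supports; the helper name and example DOI are ours):

```python
import urllib.request

CSL_JSON = "application/vnd.citationstyles.csl+json"

def doi_metadata_request(doi: str, accept: str = CSL_JSON) -> urllib.request.Request:
    """Build a content-negotiated request for a DOI. Opening it with
    urllib.request.urlopen() follows the redirect to the registration
    agency's metadata service, which honours the Accept header."""
    return urllib.request.Request("https://doi.org/" + doi,
                                  headers={"Accept": accept})
```

Opening such a request for a Crossref DOI returns CSL JSON describing the work rather than the landing page, with no interstitial step involved.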
I’m curious to see how this will work for DOI versioning mechanisms like in Zenodo, where you have one DOI to reference all versions as well as version specific DOIs. If your record contains metadata + many files and a new version just versions one of the several files my assumption is that within the proposed system an entire new set (so all files) is archived. In theory this could also be a logical package, where simply the delta is stored, but I guess in a distributed preservation framework like the one proposed here, this would be hard to achieve.
This is a good point, and it could lead to many more frustrating hops before the user reaches the content. We will conduct further research into this scenario, but we also note that Zenodo’s DOIs come not from Crossref but from DataCite.
There’s a decent body of research at this point on automated content drift detection. This recent paper: https://ceur-ws.org/Vol-3246/10_Paper3.pdf likely has links to other relevant articles.
We have no illusions about the difficulty of detecting semantic drift, but this is helpful and interesting. We will read this material and related articles to appraise the current state of content drift detection.
Out of curiosity, will we be using one type of archive (i.e., IA or CLOCKSS or LOCKSS or whatever) or will it possibly be a combination of a few archives? Reading the comments, it looks like some of them charge a fee, so I see why we’d use open source solutions first. Also, eventually could it be something that the member chooses? i.e. which archive they might want to use. Again, the latter question isn’t something for the prototype, but I’m curious about this use case. Also, I wonder about the implementation details if it is more than one archive. The question is totally moot of course, if we’re sticking with one archive for now.
The design will allow for deposit in multiple archives – and we will have to design a sustainability model that will cover those archives that need funding. As above, this is an important part of the move to production.
Will be good for future interoperability to make sure at least one of the hashes is a SoftWare Hash IDentifier (see swhid.org). The ID is not really software specific and will interoperate with the Software Heritage Archive and git repositories.
We will certainly ensure best practices for checksums.
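For what it is worth, a content SWHID is computed exactly like a Git blob identifier (SHA-1 over a `blob <length>\0` header plus the file bytes), so it can be reproduced with the standard library alone. A sketch:

```python
import hashlib

def swhid_for_content(data: bytes) -> str:
    """Compute the swh:1:cnt: identifier for raw file content.
    Content SWHIDs reuse Git's blob hash: sha1(b"blob <len>\\0" + data)."""
    header = b"blob %d\x00" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()
```

For example, `swhid_for_content(b"")` yields `swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391`, the same hash Git assigns to an empty blob, which is what makes these identifiers interoperable with git repositories and the Software Heritage archive.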
Comments on the Interstitial Page
I’d keep the interstitial page without planning its eradication. (See why in the last paragraph)
I’d even advocate for it to be a beautiful and useful reminder to users that “This content is preserved”.
I’d go further and recommend that publishers deposit alternate URLs of other preservation agents, like PMC, that would also be displayed. This page could even be merged with the multiple resolution system.
The why: I’m concerned about hackers and predatory publishers exploiting the spider heuristics by hijacking small journals, keeping just enough metadata in them to fool the resolver, and then adding links to whatever products, scams, and whatnot…
Technical. Scraping landing pages is hard. We’ve had a lot of projects to do this over the years. You can mitigate the risk by tiering / heuristics. Maybe even feedback loop to publishers to encourage them to put the right metadata on the landing page.
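To illustrate the tiering idea (purely a hypothetical sketch, not a proposed implementation), the cheapest first-tier signal is simply whether the deposited title still appears on the landing page:

```python
import re

def _norm(s: str) -> str:
    """Collapse whitespace and case so cosmetic page changes don't trigger."""
    return re.sub(r"\s+", " ", s).strip().lower()

def title_still_present(deposited_title: str, page_text: str) -> bool:
    """Crude first-tier drift check: does the landing page text still
    contain the title that was deposited with the DOI?"""
    return _norm(deposited_title) in _norm(page_text)
```

A real pipeline would escalate from cheap checks like this to heavier comparisons (authors, abstract, embedded metadata tags) only when the cheap tier raises doubt, and could feed failures back to publishers as the comment suggests.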
This is the only part of this proposal that I don’t like. People are used to DOIs resolving directly to content, and I don’t think that should be changed unless absolutely necessary. I would prefer that the DOI resolves to the publisher’s copy if it exists, and the IA copy otherwise.
We will continue the discussion about the interstitial page. The basic technical fact, as above, is that detecting content drift is hard and so we may need, at least, to start with the page. However, some commentators presented reasons for keeping it.
We have also supported interstitial pages for multiple resolution and co-access for over a decade.
It is the member’s choice whether they wish to deposit alternative URLs, and we already have a mechanism for this.