So here I am, apologizing again. Have I mentioned that I hate computers?
We had a large data center outage that lasted 17 hours. It meant that pretty much all Crossref services were unavailable - our main website, our content registration system, our reports, and our APIs. 17 hours is a long time for us - and it was also an inconvenient time for numerous members, service providers, integrators, and users. We apologise for this.
In my blog post on October 6th, I promised an update on what caused the outage and what we are doing to avoid it happening again. This is that update.
Crossref hosts its services in a hybrid environment. Our original services are all hosted in a data center in Massachusetts, but we host new services with a cloud provider. We also have a few R&D systems hosted with Hetzner.
We know an organization our size has no business running its own data center, and we have been slowly moving services out of the data center and into the cloud.
On October 6 at ~14:00 UTC, our data center outside of Boston, MA went down. This affected most of our network services - even ones not hosted in the data center. The problem was that our primary and backup network connections both went down at the same time. We're not sure why yet, and we are consulting with our network provider. It took us 2 hours to get our systems back online.
TL;DR: We missed an error that led to the resolution URLs of some 500,000+ records being incorrectly updated. We have reverted the incorrect resolution URLs affected by this problem, and we're putting in place checks and changes in our processes to ensure this does not happen again.
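One plausible form such a check could take is a pre-deposit review of a bulk URL update, flagging entries that look risky before they are applied. This is a minimal sketch, not Crossref's actual process; the function name, the tuple format, and the rules (malformed URL, domain change) are all illustrative assumptions.

```python
# Hypothetical sanity check for a bulk resolution-URL update.
# Flags entries where the registered domain changes or the new URL is
# malformed, so a human can review them before the update is deposited.
from urllib.parse import urlparse

def flag_suspicious_updates(updates):
    """updates: list of (doi, old_url, new_url) tuples.
    Returns (doi, reason) pairs that warrant manual review."""
    suspicious = []
    for doi, old_url, new_url in updates:
        old_host = urlparse(old_url).netloc
        new_host = urlparse(new_url).netloc
        if not new_host:
            # Relative or malformed URLs have no network location.
            suspicious.append((doi, "malformed new URL"))
        elif old_host != new_host:
            # Resolution moving to a different domain is worth a second look.
            suspicious.append((doi, f"domain change {old_host} -> {new_host}"))
    return suspicious

updates = [
    ("10.5555/example1", "https://old.example.com/a", "https://old.example.com/b"),
    ("10.5555/example2", "https://old.example.com/a", "https://new.example.org/a"),
    ("10.5555/example3", "https://old.example.com/a", "not-a-url"),
]
for doi, reason in flag_suspicious_updates(updates):
    print(doi, "-", reason)
```

A real pipeline would add more rules (e.g. checking that the new URLs actually resolve), but even a coarse filter like this catches the kind of wholesale mistake described above before it reaches half a million records.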
How we got here

Our technical support team was contacted in late June by Wiley about updating the resolution URLs for their content. It's a common request for our technical support team - one meant to make the URL update process more efficient - but this was a particularly large one.