Rebalancing our REST API traffic
Since we first launched our REST API around 2013 as a Labs project, it has evolved well beyond a prototype into arguably Crossref’s most visible and valuable service. It reflects the work of 20,000 organisations around the world that have spent many years curating and sharing metadata about their various resources, from research grants to research articles and other component inputs and outputs of research.
The REST API is relied on by a large part of the research information community and beyond, seeing around 1.8 billion requests each month. Just five years ago, that average monthly number was 600 million. Our members are the heaviest users, using it for all kinds of information about their own records or picking up connections like citations and other relationships. Databases, discovery tools, libraries, and governments all use the API. Research groups use it for all sorts of things such as analysing trends in science or recording retractions and corrections.
So the chances are high that almost any tool you rely on in scientific research has, somewhere along the way, incorporated metadata from us.
For some time, we’ve been noticing reduced performance in a number of ways, and periodically we have a flurry of manually blocking and unblocking IP addresses of requesters that are hammering the service and degrading it for everyone else - which is, of course, only minimally effective and very short-term. You can always watch our status page for alerts; this is the current one about REST API performance: https://0-status-crossref-org.library.alliant.edu/incidents/d7k4ml9vvswv.
As the number of users and requests has grown, our strategies for serving those requests must evolve. This post discusses how we’re balancing the growth in usage for the immediate term and shares some thoughts about things we could try in the future, on which we’ll gladly take feedback and advice.
Load balancing
In 2018, we started routing users through three different pools (public, polite, and plus). This coincided with the launch of Metadata Plus, a paid-for service with monthly data dumps and very high rate limits. Note that all metadata is exactly the same and real-time across all pools. We also, more recently, introduced an internal pool. Here’s more about them:
- Plus: This is the aforementioned premium option; it’s intended for ‘enterprise-wide’ use in production services and is not really relevant here.
- Public: This is the default and is the one that is struggling at the moment. You don’t have to identify yourself and, in theory, we don’t have to work through the night to support it if it’s struggling (although we often do). Public currently receives around 30,000 requests per minute.
- Polite: Traffic is routed to polite simply by detecting a mailto in the header. Any system or person including an email is routed to this currently-quieter pool, which means we can always get in touch for troubleshooting (and only troubleshooting); there’s a sketch of how to do this after the list. Polite currently receives around 5,000 requests per minute.
- Internal: In 2021, we introduced a new pool just for our own tools where we can control and predict the traffic. Internal currently receives around 1,000 requests per minute.
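To make the polite mechanism concrete, here is a minimal Python sketch (using the requests library) that includes a contact email in the User-Agent header and, as is common practice, as a mailto query parameter too. The tool name, email address, and query are placeholders, not recommendations.

```python
import requests

# Placeholder contact address -- substitute your own.
CONTACT = "name@example.org"

headers = {
    # Including a contact email identifies the request so it can be
    # routed to the polite pool and contacted for troubleshooting.
    "User-Agent": f"my-metadata-tool/1.0 (https://example.org; mailto:{CONTACT})"
}

params = {
    "mailto": CONTACT,             # contact email as a query parameter as well
    "rows": 5,                     # keep result pages small
    "query.bibliographic": "metadata citation matching",
}

response = requests.get("https://0-api-crossref-org.library.alliant.edu/works",
                        headers=headers, params=params, timeout=30)
response.raise_for_status()

for item in response.json()["message"]["items"]:
    print(item.get("DOI"), "-", (item.get("title") or ["(no title)"])[0])
```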
The volumes of traffic across the public, polite, and internal pools are very different, and yet each pool has always had similar resources. The purpose of each of these pools has long been established, but our efforts to ask the community to use polite by default have not been particularly successful, and it is clear that we don’t have the right balance.
The internal pool has been dedicated to our own services that have predictable usage and whose requests are not initiated by external users. It has previously included reference matching but not Crossmark, Event Data, or search.crossref.org, which all use the polite pool instead, along with the community. We have the capacity on the internal pool to shift all of this “internal” traffic across, and in doing so we will create more capacity for genuine polite users and redefine what we consider to be “internal”.
Creating more capacity on polite will also give us the opportunity to load-balance requests to both polite and public across the two pools. We are at a point where we cannot eke more performance out of the API without architectural changes. In order to buy ourselves time to address this properly, we will modify the routing of polite and public and evenly distribute requests to the two pools 50/50.
The public and polite pools have equal resources at the moment yet handle very different volumes of traffic (30,000 req/min vs 5,000 req/min), and with the proposed changes to internal traffic the polite pool would handle only a fraction of its current 5,000. The result would look something like 31,000 req/min in total, evenly distributed across public and polite - roughly 15,500 req/min per pool.
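We won’t go into our proxy configuration here, but purely as an illustration of what an even split means, here is a toy Python sketch of a router that sends each incoming request to one of the two pools with equal probability. The pool names and the simulation are illustrative only and say nothing about how our proxy actually works.

```python
import random

# Illustrative only: approximate a 50/50 split of incoming requests
# across the two pools; this is not our actual proxy configuration.
POOLS = ("public", "polite")

def choose_pool() -> str:
    """Pick a pool for one incoming request with equal probability."""
    return random.choice(POOLS)

# Simulate 10,000 requests and check that the split is roughly even.
counts = {pool: 0 for pool in POOLS}
for _ in range(10_000):
    counts[choose_pool()] += 1
print(counts)  # e.g. {'public': 5023, 'polite': 4977}
```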
Rate limiting
Our rate limiting also needs review. We track a number of metrics in our web proxy but only deny requests based on one of them - the number of requests per second. On public and polite we limit each IP address to 50 req/sec, and if this rate is exceeded the requester is denied access for 10 seconds. These limits are generous, and we cannot realistically support this volume of requests for every user of the public or polite API.
However, when requests are taking a long time to return, we potentially have a separate problem of high concurrency, as hundreds of requests could be sent before the first one has returned. We intend to identify and impose an appropriate limit on concurrent requests from each IP address, to prevent a small number of users with long-running queries from disproportionately affecting everyone else.
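From the client side, one way to stay well under both kinds of limit is to cap the number of requests in flight and back off briefly when the server refuses a request. The Python sketch below is illustrative only: the concurrency cap, the backoff interval, the assumption that a refusal arrives as HTTP 429, and the DOI are all placeholders rather than official guidance.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Illustrative client-side limits; they are not an official recommendation.
MAX_CONCURRENT = 5     # cap on requests in flight at any one time
BACKOFF_SECONDS = 10   # pause when the server refuses a request
MAX_ATTEMPTS = 5       # give up rather than retrying forever

session = requests.Session()
# Placeholder contact details -- substitute your own.
session.headers["User-Agent"] = "my-metadata-tool/1.0 (mailto:name@example.org)"

def fetch(doi: str) -> dict:
    """Fetch one work record, backing off if the server refuses the request."""
    url = f"https://0-api-crossref-org.library.alliant.edu/works/{doi}"
    for _ in range(MAX_ATTEMPTS):
        response = session.get(url, timeout=30)
        # If the server signals "too many requests" (assumed here to be HTTP 429),
        # wait before retrying rather than piling on more load.
        if response.status_code == 429:
            time.sleep(BACKOFF_SECONDS)
            continue
        response.raise_for_status()
        return response.json()["message"]
    raise RuntimeError(f"Gave up fetching {doi} after {MAX_ATTEMPTS} attempts")

dois = ["10.5555/12345678"]  # placeholder DOI list
# The thread pool size caps how many requests can ever be in flight at once.
with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
    for record in pool.map(fetch, dois):
        print(record.get("DOI"), record.get("type"))
```

The same idea applies whatever HTTP client you use; the point is simply to bound concurrency and respect whatever rate limits are in place.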
Longer-term
So, in the short-term we will revise our pool traffic as described above. We’ll do that this week. Then we will review the current rate limits and reduce them to something more reasonable for the majority of users. And we’ll identify and introduce a rate limit for concurrent requests from each user.
Longer-term, we need to rearchitect our Elasticsearch pools so that we can:
- Reduce shard sizes to improve performance of queries
- Balance data shards and replicas more evenly
- Optimise our instance types for our workload
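For readers less familiar with Elasticsearch: shard count is fixed when an index is created (changing it means reindexing), while the number of replicas can be adjusted on a live index. The sketch below shows those two settings via Elasticsearch’s REST interface; the endpoint, index name, and numbers are entirely illustrative and say nothing about our actual configuration.

```python
import requests

ES = "http://localhost:9200"  # illustrative Elasticsearch endpoint

# Shard count is set at index creation time; changing it later means reindexing.
requests.put(
    f"{ES}/works-v2",
    json={"settings": {"number_of_shards": 12, "number_of_replicas": 1}},
    timeout=30,
).raise_for_status()

# Replica count, by contrast, can be adjusted on a live index at any time.
requests.put(
    f"{ES}/works-v2/_settings",
    json={"index": {"number_of_replicas": 2}},
    timeout=30,
).raise_for_status()
```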
Want to help?
Thanks for asking!
Firstly, please, everyone, do always put an email address in your API request headers (as in the sketch above) - while the short-term plan will help stabilise performance, this habit will always help us troubleshoot, e.g. we can always contact you instead of blocking you!
Secondly, we know many of you incorporate Crossref metadata, add lots of value to it in order to deliver important services, and also develop APIs of your own. We’d love any comments or recommendations from those of you handling similar situations on scaling and optimising API performance. You can comment on this post which is managed via our Discourse forum. We’ll also be adding updates to this thread as well as on status.crossref.org. If you’d like to be in touch with any of us directly, all our emails are firstinitiallastname@crossref.org.