Resolved - Sporadic 50x errors [Updated 19 Apr 12]
In mid- to late March, we began to receive reports from customers of sporadic 502, 503 and 504 errors when making requests to Solr. Over the last few weeks, identifying, understanding and fixing these errors has been our top priority, and we have been addressing them from a couple of different angles.
On one front, we have been working directly with affected customers through our support channels to gather data about the kinds of errors they were seeing and the conditions under which they occurred. In addition, we have been providing one-on-one, application-specific recommendations to help mitigate the impact of these errors (e.g., how to queue requests and retry failed ones; see the sketch below). Finally, we have submitted patches to two popular Solr clients to help them interact with Solr more gracefully by default.
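For illustration, here is a minimal sketch of the kind of retry behavior we recommend for transient 50x errors. It assumes a Python client built on the requests library and a hypothetical index URL; the function name and parameters are our own for this example, not the exact logic in the patched clients.

```python
import time

import requests

SOLR_URL = "http://example.com/solr/collection1/select"  # hypothetical index URL
RETRYABLE = {502, 503, 504}  # transient gateway/availability errors worth retrying

def query_with_retry(params, max_attempts=4, base_delay=0.5):
    """Issue a Solr query, retrying transient 50x errors with exponential backoff."""
    last_response = None
    for attempt in range(max_attempts):
        last_response = requests.get(SOLR_URL, params=params, timeout=10)
        if last_response.status_code not in RETRYABLE:
            last_response.raise_for_status()  # raise on any other error status
            return last_response.json()
        # Back off before retrying: 0.5s, 1s, 2s, ...
        time.sleep(base_delay * (2 ** attempt))
    last_response.raise_for_status()  # still failing after all attempts

if __name__ == "__main__":
    data = query_with_retry({"q": "*:*", "wt": "json", "rows": 10})
    print(data["response"]["numFound"])
```

The key idea is simply to treat 502/503/504 as retryable and to space retries out so that a brief routing hiccup does not surface as an application error.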
On the infrastructure side, we have been working hard to better instrument our stack and gather data on the errors involved. We quickly determined that the error rate was around 0.005% of all requests, distributed somewhat randomly. More recently, we have made large strides in isolating the cause: brittleness in our routing layer during periods of increased packet loss within our provider's datacenters.
We have recently deployed a series of updates which continue to enhance our instrumentation and data collection, eliminate possible contributing factors, and gradually improve resilience during periods of increased packet loss.
April 13, 2012 - Failed deploy
Around 3:00am PT on April 13, 2012, we deployed some updates intended to help improve our network resiliency, which we expected to all but eliminate the issues with sporadic 50x errors. Unfortunately, this deploy included a bug which escaped the detection of our usual pre-deployment load testing and initial isolated canary deploy.
The effect of this bug was to cause all requests to Solr to return with an HTTP 400 error. Once identified, we rolled back the deploy. The total duration of the failing requests was about 45 minutes.
April 15, 2012 - Enforcing tighter rate limits
Around 2:00pm PT on April 15, 2012, we deployed a series of updates which improve network resiliency and enforce tighter rate limits.
Previously, rate limits were enforced at thresholds well above our advertised plan limits. Indexes that exceed their limits will now receive an HTTP 429 Too Many Requests error. Currently only a small number of customers have exceeded these limits, and we have reached out to them with specific recommendations for curtailing traffic. One way to handle 429 responses on the client side is sketched below.
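As a rough illustration, the snippet below backs off and retries when an update request is rate limited. The endpoint URL, function name, and default wait are hypothetical, and it assumes a Python client using the requests library and that a Retry-After header may or may not accompany the 429 response; it is a sketch, not our prescribed client behavior.

```python
import time

import requests

SOLR_UPDATE_URL = "http://example.com/solr/collection1/update"  # hypothetical index URL

def post_with_backoff(docs, max_attempts=5, default_wait=2.0):
    """Send documents to Solr, backing off when the service answers 429 Too Many Requests."""
    for attempt in range(max_attempts):
        response = requests.post(
            SOLR_UPDATE_URL,
            json=docs,                  # JSON body; requests sets the Content-Type header
            params={"commit": "true"},
            timeout=10,
        )
        if response.status_code != 429:
            response.raise_for_status()
            return
        # Honor a Retry-After header when one is provided; otherwise wait a default interval.
        wait = float(response.headers.get("Retry-After", default_wait))
        time.sleep(wait)
    raise RuntimeError("Still rate limited after %d attempts; reduce request rate." % max_attempts)

if __name__ == "__main__":
    post_with_backoff([{"id": "doc-1", "title": "hello"}])
```

Queueing writes and draining the queue at a steady rate achieves the same effect for applications that batch their indexing.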
April 19, 2012 - Resolving incident
After monitoring the situation for several days, we have not observed any further occurrences, and we believe the changes of April 15 have resolved the problem.