503 Service Unavailable
Occasionally our service will return a "503 Service Unavailable" error. These occur periodically, on the order of 0.05% of all requests, and can be due to a number of causes.
One relatively mundane cause is normal system maintenance, during which we may make an index read-only for roughly 1–2 minutes to restart an instance of Solr. This can happen once or twice a month, and should not affect searches.
Other causes are trickier to pin down, arising from esoteric combinations of factors such as network packet loss and JVM garbage collection pauses.
We have a few recommendations to harden your application in the event of these errors:
- Upgrade your index to a more recent version of Solr. Some of our older indexes (early 2012 and before) are running on Solr 1.4, which has proven more prone to the stability issues that contribute to these kinds of 503 errors. We strongly recommend that any Solr 1.4 index which experiences problems be replaced with a newer index on Solr 3.6 or Solr 4.0.
- Retry your requests. A 503 error in our systems is almost always intermittent, so the request may be retried immediately, or multiple times with an exponential backoff. In particular, we recommend that incremental updates be processed in a queue, which makes automatic retries easier.
- Upgrade to a dedicated cluster. Having dedicated resources available can provide more consistency by mitigating some classes of Solr memory management issues experienced in multitenant shared clusters.
- Report the problem. If your application has experienced a high rate of 503s sustained for more than a few minutes, and we haven't announced a larger outage on @websolrstatus, it may be indicative of a larger problem that we need to know about. Let us know your index URL via help.websolr.com or an email to firstname.lastname@example.org.
As for retrying requests, the implementation will vary based on your platform and Solr client. For example, recent versions of Sunspot include an optional session proxy which can automatically retry these kinds of errors. You could add something like this to a Rails initializer:
Sunspot.session = Sunspot::SessionProxy::Retry5xxSessionProxy.new(Sunspot.session)
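For clients without a built-in retry proxy, the same idea can be sketched by hand. The following is a minimal illustration in plain Ruby, not part of Sunspot or any Websolr library: `SolrUnavailable`, `with_retries`, `solr_get`, and the attempt and delay defaults are all hypothetical names and values chosen for the example.

```ruby
require "net/http"
require "uri"

# Hypothetical error raised when Solr answers with a 5xx status.
class SolrUnavailable < StandardError; end

# Runs the block, retrying on SolrUnavailable with exponential backoff.
# max_attempts and base_delay are illustrative defaults, not recommended values.
# The sleeper argument exists only so the wait can be replaced in tests.
def with_retries(max_attempts: 5, base_delay: 0.5, sleeper: ->(s) { sleep(s) })
  attempt = 0
  begin
    attempt += 1
    yield
  rescue SolrUnavailable
    # Give up and re-raise once the attempt budget is spent.
    raise if attempt >= max_attempts
    # Wait base_delay, then 2x, 4x, ... before the next attempt.
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Sends a GET request and raises SolrUnavailable on any 5xx response.
def solr_get(url)
  response = Net::HTTP.get_response(URI(url))
  raise SolrUnavailable, "HTTP #{response.code}" if response.code.start_with?("5")
  response.body
end

# Usage, with your own index URL in place of index_url:
#   body = with_retries { solr_get(index_url) }
```

Only the 5xx case is retried here. Other failures, such as a 400 from a malformed query, are left alone, since repeating the request would not help.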
Index updates triggered by user activity, which gradually creates or updates single records over time, should be queued with a system such as Resque. That way, temporary errors such as a 503 are isolated from the everyday operation of the rest of your application, and failed jobs can be more easily retried.
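As a rough sketch of what such a queued update might look like, here is a Resque-style job. It assumes a Rails app where Resque and Sunspot are already loaded and where the model is not also indexed automatically on save. `SolrIndexJob`, the `:solr_index` queue name, and the `Post` model in the usage comment are hypothetical.

```ruby
# A Resque job is a plain class with a @queue name and a self.perform method.
# In a Rails app, Resque and Sunspot would be loaded by Bundler, so neither
# is required here. SolrIndexJob and :solr_index are hypothetical names.
class SolrIndexJob
  @queue = :solr_index

  # Resque serializes job arguments as JSON, so pass the class name and id
  # rather than the record itself, and look the record up inside the job.
  def self.perform(class_name, record_id)
    record = Object.const_get(class_name).find(record_id)
    Sunspot.index(record)
  end
end

# Enqueue from a model callback or controller instead of indexing inline:
#   Resque.enqueue(SolrIndexJob, "Post", post.id)
```

If a 503 surfaces as an exception inside `perform`, Resque records the job as failed rather than interrupting the user's request, and the job can be retried from there.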
If your index is experiencing a high rate of errors that is impacting your site's ability to operate, please get in touch with us at help.websolr.com or email@example.com so that we can help get your index back on track.