Detailed Rundown of our Crisis Response

This post addresses the sequence of events that led to Assistly’s service interruptions on November 9-11, and how our team responded. We want our customers to have detailed information about how we addressed this crisis and the steps we have been taking to restore their service. (For additional background, please refer to No Excuses and Detailed Background on Service Interruption.)

This post was originally published on 11/12/11. It was updated for clarity on 11/13/11.

On Wednesday, November 9, we started seeing erratic behavior on our search clusters. By way of background, our search clusters run a technology called ElasticSearch. It’s a highly distributed search engine, and it provides the backbone for much of Assistly’s functionality. We have been working closely with Shay Banon, the lead developer of ElasticSearch, who has helped us immensely, working tirelessly alongside our team to isolate and correct some of the problems that were causing our service interruptions.

In an attempt to fix some of the search cluster problems, we performed a maintenance update on Thursday, November 10, during our normal maintenance window at 11 PM Pacific Time. As is normal during these updates, we re-enabled indexing on our search clusters as soon as we finished the evening’s maintenance. Immediately, one of our search clusters (Cluster B) started to show extremely high latency. Cluster B is responsible for case filters, customer searches, and API-based searching, and within minutes it had become completely frozen.

Working through the night, our team engaged the lead developer of ElasticSearch and we began diagnosing the frozen search servers. We gathered network information and stack traces from the server for analysis. We attempted to restart the Cluster B search servers three times, but the problems continued, so we built a new search cluster (Cluster E) as an emergency replacement for Cluster B. We were able to bring up Cluster E successfully; however, this did not address the core problem. We now believe Cluster E came up without incident because it was carrying no traffic when it started, whereas Cluster B had traffic on it.

We also identified problems in our “keep alive” code, which manages connections between our servers and was not performing optimally. We developed fixes for the keep-alive connections and deployed them immediately. This relieved the socket congestion and allowed more requests to get through.
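For readers less familiar with the idea, here is a minimal Python sketch of what keep-alive connection handling generally looks like. It is not our production code: the `KeepAlivePool` class, the `connect` factory, and the `max_idle_seconds` setting are all hypothetical names chosen for illustration. The point it shows is that reusing an idle connection avoids opening a new socket for every request, which is the kind of socket congestion described above.

```python
import time
from collections import deque


class KeepAlivePool:
    """Illustrative pool that reuses idle connections (hypothetical, not Assistly's code)."""

    def __init__(self, connect, max_idle_seconds=30.0, clock=time.monotonic):
        self._connect = connect            # hypothetical factory that opens a new connection
        self._max_idle = max_idle_seconds  # assumed idle limit before a connection is dropped
        self._clock = clock
        self._idle = deque()               # entries are (connection, time it was released)
        self.opened = 0                    # number of new connections opened so far

    def acquire(self):
        # Try the most recently released connection first. If even that one
        # has been idle too long, every older one has too, so close them all.
        while self._idle:
            conn, released_at = self._idle.pop()
            if self._clock() - released_at <= self._max_idle:
                return conn
            conn.close()
        # Nothing reusable: open a fresh connection.
        self.opened += 1
        return self._connect()

    def release(self, conn):
        # Keep the connection around for the next caller instead of closing it.
        self._idle.append((conn, self._clock()))
```

A caller would `acquire()` a connection, send its request, and `release()` the connection afterwards so that the next request can reuse the same socket.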

During the workday on Friday, the problems continued as Cluster E exhibited the same symptoms that Cluster B had the previous day. Fortunately, we were provided a patch to ElasticSearch that we believed would reduce stress on the search calls. This patch fixed a bug in ElasticSearch’s asynchronous I/O. It was a known problem that had been present for a few previous versions, but it likely only began affecting us because of the size of our searches.

This patch was deployed immediately and search performance improved. After further review, we decided to increase throughput and capacity by adding three additional search nodes. This happened quickly and, while we saw a noticeable improvement in throughput, response times were still inconsistent. Further analysis showed that our search nodes were not optimally balanced, so we added two more nodes to achieve the proper balancing.

This was completed around 2 PM Pacific Time on Friday, November 11, and since then all of our systems have been responding within their expected ranges.

If you’re keeping score, here are the search clusters we currently employ:

1. Cluster A – This cluster maintains the tickets visible within the agent
2. Cluster C – This cluster maintains the content for the KB and Q&A content within the customer support center
3. Cluster E – This cluster replaced Cluster B and maintains the content presented when agents conduct searches and when searches are performed via the API

As of this weekend, all of the clusters are running the newly patched ElasticSearch code and we are optimistic this has resolved many of the problems we were seeing.

We enter the weekend with guarded optimism about our progress, but we will continue to invest significant time and effort throughout the weekend. Our goal is to push forward several improvements that reduce stress on the system and provide better backoffs in the event we see erratic behavior in our search clusters. We are also working on an “offline portal” project that will allow our customers’ Help Center portals to remain up even if other parts of the system are taxed or completely down.
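To give a sense of what backing off means in practice, here is a minimal Python sketch of one common approach: capped exponential backoff with random jitter. This is an illustration under our own assumptions, not the code we are shipping. The function name `call_with_backoff` and the `base`, `cap`, and `attempts` values are hypothetical.

```python
import random
import time


def call_with_backoff(operation, attempts=5, base=0.5, cap=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), waiting longer after each failure (illustrative sketch).

    The wait before retry number n is a random fraction of
    min(cap, base * 2**n), so callers that fail together spread out their
    retries instead of hitting a struggling search cluster all at once.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            # Out of retries: surface the original error to the caller.
            if attempt == attempts - 1:
                raise
            ceiling = min(cap, base * (2 ** attempt))
            sleep(rng() * ceiling)
```

A search call wrapped this way would be retried a few times with growing pauses and would then fail cleanly, rather than piling more requests onto a cluster that is already slow.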
