Detailed Background on Service Interruption, Week of 11/7/11

Over the past three days our search clusters have been behaving erratically, which has caused Assistly to be unresponsive. You can read how seriously we take this issue in my email that is going out to all our customers.

Search is a critical part of the way Assistly operates: the Assistly agent desktop relies on the search engine to maintain real-time filter updates, to search for cases, and to power the help center searches. The search technology we use is called Elastic Search. It is a distributed search engine designed, among other things, to prevent issues like the ones we saw over the last few days.

For some background, we currently have two search clusters. The first cluster provides results for our case filters, knowledge base, the help center, and our community Q&A capability.  In addition, our timed business rules use this first search cluster. Our second search cluster provides results for searches within the agent desktop and for search queries from our API.

Normally, CPU utilization in our search clusters is balanced across all the nodes. However, this past week we noticed an anomaly: the CPU utilization of a single node in one of the clusters would spike, and that node would become non-responsive. As the load grew, the issue spread to the other nodes in the cluster. We went through various recovery steps (stopping individual nodes, restarting them, analyzing logs), but the issue continued. Three times now, we have had to stop all nodes in the cluster. When all the nodes are stopped, the search index is cleared and must be rebuilt from the database.
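
For readers who want a concrete picture of the anomaly described above, here is a minimal sketch in Python of how one hot node can be told apart from a cluster that is simply busy. It is an illustration only and is not our monitoring code: the function name `find_hot_nodes`, the node names, and the `ratio` and `floor` thresholds are all assumptions made for this example.

```python
from statistics import median


def find_hot_nodes(cpu_by_node, ratio=2.0, floor=50.0):
    """Return the nodes whose CPU is far above the cluster's typical value.

    cpu_by_node maps a node name to its CPU utilization in percent.
    ratio and floor are illustrative thresholds, not tuned values.
    """
    if not cpu_by_node:
        return []
    # The median is not dragged upward by one spiking node,
    # so it stays a fair picture of a "balanced" cluster.
    typical = median(cpu_by_node.values())
    return sorted(
        node
        for node, cpu in cpu_by_node.items()
        if cpu >= floor and cpu > ratio * typical
    )


# Hypothetical readings: three nodes are balanced, one is spiking.
readings = {"node-1": 22.0, "node-2": 25.0, "node-3": 96.0, "node-4": 24.0}
print(find_hot_nodes(readings))
```

Comparing each node against the median rather than a fixed limit means a cluster that is uniformly busy is not flagged, while a single node that pulls away from its peers is.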

Our investigation uncovered a “keepalive” configuration that we updated approximately six weeks ago. We suspect that the change introduced a problem that allowed additional load to spread through the clusters unchecked and dramatically impair performance. The keepalive configuration is an important part of how we optimize our search: it ensures that connections are reused and extra connections are not spawned. We have deployed a change to this configuration as an emergency hotfix, and early indications are that it is performing as it should.
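
To illustrate why connection reuse matters, here is a toy sketch in Python. It does not reflect our actual configuration or Elastic Search's client code; the `KeepAlivePool` class, its `max_idle` parameter, and the `run_queries` helper are hypothetical names invented for this example. It only counts how many connections get opened when idle connections are kept for reuse versus when they are discarded after every request.

```python
class KeepAlivePool:
    """Toy connection pool (hypothetical, for illustration only).

    max_idle is the number of idle connections kept open for reuse.
    max_idle=0 models a setup with keepalive effectively disabled.
    """

    def __init__(self, max_idle):
        self.max_idle = max_idle
        self._idle = []   # connections waiting to be reused
        self.opened = 0   # total connections ever opened

    def acquire(self):
        # Reuse an idle connection when one is available.
        if self._idle:
            return self._idle.pop()
        # Otherwise a brand-new connection has to be opened.
        self.opened += 1
        return object()   # stand-in for a real socket

    def release(self, conn):
        # Keep the connection alive if there is room; otherwise drop it.
        if len(self._idle) < self.max_idle:
            self._idle.append(conn)


def run_queries(pool, count):
    """Run `count` sequential requests and report connections opened."""
    for _ in range(count):
        conn = pool.acquire()
        pool.release(conn)
    return pool.opened


print(run_queries(KeepAlivePool(max_idle=2), 100))  # connections are reused
print(run_queries(KeepAlivePool(max_idle=0), 100))  # a new connection per request
```

In this sketch the first pool serves all 100 requests over a single connection, while the second opens 100 of them. Each extra connection costs setup work on both ends, which is the kind of additional load a keepalive misconfiguration can create.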

We also identified a problem in our search code while troubleshooting with our partners at Elastic Search. We are deploying a fix for this now and we will continue to monitor it and provide updates until service is fully restored.

Moreover, we’ve split the help center search (the one your customers use) from the case filter search on the agent desktop. Separating these two search functions reduces the risk that one will affect the other.  Finally, we’re in the process of adding a technology called Varnish which will segregate traffic going to the customer-facing help centers from traffic going to the agent interface. We expect this will make your customer-facing help center more resilient, as it will be less susceptible to interruptions affecting the agent desktop.

As of this morning, we are still seeing intermittent problems with search but we have our entire team, along with scaling and search experts from outside of Assistly, working on the problem around the clock. We will continue to provide regular posts in the Product Updates category of our blog and, more frequently, on @assistlyops.

We are also experimenting with a real-time status board, which you can visit at any time at http://status.assistly.com. It is still in beta, and we will continue to refine the data that appears on it so that it becomes a reliable and instant source of information for you. Our goal is to be completely transparent with you and show you the status of our service in real time.

  • Belinda

    Thanks for the detailed explanation. Having had to deal with search farm outages and buggered nodes in previous jobs, I totally can sympathize with this problem. I look forward to being back up on Assistly and hearing about future steps to prevent future outages of this nature. 

  • mb

    Sounds like your entire set up is flawed. Figuring out that it was one setting that destroyed your entire service isn’t very reassuring.

    How are you going to prevent problems like this from ever occurring again?

    Unlike Belinda, I have no sympathy with this problem. This is what you do. If you can’t do it right, then something is wrong.

    • http://about.me/matthew.trifiro Matthew_Trifiro

      MB – There is no excuse for this service interruption. You entrust us with your brand and your daily productivity. But we will make it right. 

  • http://twitter.com/johnbonobos John Rote

    Alex & Crew, thanks for the explanation.  I’m bummed about the outage like other customers are, but I think you did a good job of providing an explanation, but not making an excuse. 

    The explanation gave me confidence that the problem is IDed and a short term fix is on the horizon. I’m with MB in wanting to hear about what the long-term fix is so it doesn’t happen again, and I’m with Belinda in sympathizing with the challenge.   

    I hope the fix is imminent, but if any Assistly customers want to talk about potential immediate workarounds to keep your agents productive during the outage, let me know as I think we’ve found a pretty good short-term workaround.  @johnbonobos:twitter or john@bonobos.com.

    • http://about.me/matthew.trifiro Matthew_Trifiro

      John,

      Thanks for your support. We will follow up, probably next week, with details on our longer term plans. In advance of that, I am happy to share our thinking with you by phone (and will extend that to any other customer).

      Matt

  • http://joel.meador.myopenid.com/ Joel Meador

    Bad shit happens sometimes in the real world. It sounds like you are fixing your technical problems, doing whatever it takes to make your customers happy, and giving people better insight into your business. Thanks for being upfront.

  • http://www.facebook.com/profile.php?id=748593193 Osama Mohmad Abdalglil

    omr amry