Over the past three days our search clusters have been behaving erratically, which has caused Assistly to be unresponsive. You can read how seriously we take this issue in my email that is going out to all our customers.
Search is a critical part of the way Assistly operates: the Assistly agent desktop relies on the search engine to maintain real-time filter updates, search for cases, and power the help center searches. The search technology we use is called Elastic Search, a distributed search engine designed to, among other things, prevent issues like the ones we saw over the last few days.
For some background, we currently have two search clusters. The first cluster provides results for our case filters, knowledge base, the help center, and our community Q&A capability. In addition, our timed business rules use this first search cluster. Our second search cluster provides results for searches within the agent desktop and for search queries from our API.
Normally, the CPU utilization in our search clusters is balanced across all the nodes. However, this past week, we noticed an anomaly: the CPU utilization of a single node in one of the clusters would spike, and that node would become non-responsive. As the load grew, the issue spread to the other nodes in the cluster. We went through various recovery steps (stopping individual nodes, starting them back up, analyzing logs), but the issue continued. Three times now, we have had to stop all nodes in the cluster. When all the nodes are stopped, the search index is cleared and must be rebuilt from the database.
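To illustrate the kind of imbalance described above, here is a minimal sketch of how a single overloaded node could be flagged by comparing each node's CPU utilization against the cluster median. The function name `find_hot_nodes`, the node names, the sample readings, and the 30-point margin are all hypothetical and chosen for illustration; this is not the monitoring we actually run.

```python
from statistics import median


def find_hot_nodes(cpu_by_node, margin=30.0):
    """Return the names of nodes whose CPU % exceeds the cluster median by more than `margin`.

    `cpu_by_node` maps a node name to its current CPU utilization (0-100).
    The margin is an illustrative assumption, not a tuned threshold.
    """
    typical = median(cpu_by_node.values())
    return sorted(
        name for name, cpu in cpu_by_node.items() if cpu - typical > margin
    )


# Hypothetical readings: two nodes near the usual load, one spiking.
sample = {"search-1": 21.0, "search-2": 24.0, "search-3": 96.0}
print(find_hot_nodes(sample))
```

Comparing against the median rather than the mean keeps one spiking node from dragging the baseline up and hiding itself.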
Our investigation uncovered a “keepalive” configuration that we updated approximately six weeks ago. We suspect that the change introduced a problem that allowed additional load to spread throughout the clusters unchecked and dramatically impair performance. The keepalive configuration is an important part of how we optimize our search: it ensures that connections are reused and extra connections are not spawned. We have deployed a corrected configuration as an emergency hotfix, and early indications are that it is performing as it should.
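For readers curious why connection reuse matters, the following toy model counts how many connections a client opens with and without keepalive. It simulates connections with plain objects and makes no network calls; the class `SearchClient` and the function `connections_needed` are hypothetical names for illustration and do not reflect our actual search code or configuration.

```python
class SearchClient:
    """Toy client that either reuses one connection (keepalive) or opens one per query."""

    def __init__(self, keepalive):
        self.keepalive = keepalive
        self.connections_opened = 0
        self._idle = None  # connection kept open between queries, if any

    def _open(self):
        # Stand-in for the cost of opening a real connection.
        self.connections_opened += 1
        return object()

    def query(self, text):
        # Reuse the idle connection when keepalive is on; otherwise open a new one.
        if self.keepalive and self._idle is not None:
            conn = self._idle
        else:
            conn = self._open()
        # With keepalive the connection is kept for the next query;
        # without it the connection is dropped after a single use.
        self._idle = conn if self.keepalive else None
        return f"results for {text!r}"


def connections_needed(num_queries, keepalive):
    """Run `num_queries` queries and report how many connections were opened."""
    client = SearchClient(keepalive)
    for i in range(num_queries):
        client.query(f"case {i}")
    return client.connections_opened


print(connections_needed(100, keepalive=True))
print(connections_needed(100, keepalive=False))
```

In this model, 100 queries need a single connection with keepalive and 100 connections without it, which is the kind of extra load a misconfigured keepalive setting can place on every node.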
We also identified a problem in our search code while troubleshooting with our partners at Elastic Search. We are deploying a fix for this now and we will continue to monitor it and provide updates until service is fully restored.
Moreover, we’ve split the help center search (the one your customers use) from the case filter search on the agent desktop. Separating these two search functions reduces the risk that one will affect the other. Finally, we’re in the process of adding a technology called Varnish which will segregate traffic going to the customer-facing help centers from traffic going to the agent interface. We expect this will make your customer-facing help center more resilient, as it will be less susceptible to interruptions affecting the agent desktop.
As of this morning, we are still seeing intermittent problems with search but we have our entire team, along with scaling and search experts from outside of Assistly, working on the problem around the clock. We will continue to provide regular posts in the Product Updates category of our blog and, more frequently, on @assistlyops.
We are also experimenting with a real-time status board which you can visit at any time via http://status.assistly.com. While it is still in beta, we will continue to refine the data that appears on this status board so that it becomes a reliable and instant source of information for you. Our goal is to be completely transparent with you and show you the status of our service in real time.