How we switched Elasticsearch clusters without anybody noticing


We’ve been stuck on an old (0.90) version of Elasticsearch for some time now, looking for a way to upgrade to 1.3 so we could get all the benefits of the scalable percolator and other fun features. We’ve recently been working on some interesting Feed functionality that lets you follow searches on the site and be alerted when new items are posted, and we needed the scalability of the percolator in 1.3 to make this work smoothly.

The challenge was this: switch from a single 0.90 node to a cluster of 1.3 nodes on a system handling 1,000 requests per minute (sure, not huge, but significant), with no downtime for our customers and, ideally, without anyone being the wiser.

Here’s how we did it:

Step 1: Migrate from Stretcher to Elasticsearch-Ruby

We had been using Stretcher, a gem that had been essentially abandoned since December 2013 and was not up to date with the newest ES APIs. Meanwhile, Elasticsearch had put out an official elasticsearch-ruby gem, which was pretty nice.

Because we didn’t want to deal with long-running branches that we couldn’t merge and that would pile up with merge conflicts, our first priority was to make sure that the code we wrote was compatible with both ES 0.90 and ES 1.3. Luckily there were only two spots with major differences: the percolator, and the dynamic scripting language change from MVEL to Groovy (we were using dynamic scripts to do id-based sorting, but that’s another story).
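
To give a feel for the percolator difference, here’s a rough sketch using the elasticsearch-ruby client (the index name, the query id, the bare `client` variable, and the Reverb.config.es_version flag are all made up for illustration): on 0.90, percolation queries are registered in the special _percolator index with your index name as the type, while on 1.x they live inside your own index under the .percolator type.

```
# Sketch only: illustrative names, not our production code.
# `client` is an Elasticsearch::Client from the elasticsearch-ruby gem.
if Reverb.config.es_version >= "1.0"
  # ES 1.x: percolator queries are documents of the .percolator type,
  # stored inside the index they percolate against.
  client.index index: "products",
               type:  ".percolator",
               id:    "les-paul-alert",
               body:  { query: { match: { title: "les paul" } } }
else
  # ES 0.90: percolator queries live in the special _percolator index,
  # keyed by the target index name.
  client.index index: "_percolator",
               type:  "products",
               id:    "les-paul-alert",
               body:  { query: { match: { title: "les paul" } } }
end
```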

The first step was building an abstraction layer around the gem, something we had done kind of haphazardly the first time around with Stretcher. The abstraction layer provided an interface between Reverb’s code and the naked elasticsearch gem.

In the abstraction layer, we had a config flag that checked the version of ES and served up either the 0.90 or the 1.3 code paths. These conditionals ended up costing us only about ten lines of code, which was really nice (thanks to ES for not making huge breaking API changes across versions). Not only does this make swapping versions easy, it also makes it easy to swap in an entirely different backing gem in the future, should we so decide.
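
A minimal sketch of what such a layer can look like (the class name, constructor arguments, and method names here are illustrative, not Reverb’s actual code):

```
require "elasticsearch"

# Thin wrapper that hides the raw gem (and the ES version) from the rest
# of the app. Callers talk to this class and never touch the elasticsearch
# gem or version-specific request bodies directly.
class SearchClient
  def initialize(url:, es_version:)
    @client     = Elasticsearch::Client.new(url: url)
    @es_version = es_version
  end

  def search(index:, body:)
    @client.search(index: index, body: body)
  end

  def index_document(index:, type:, id:, body:)
    @client.index(index: index, type: type, id: id, body: body)
  end

  # One of the handful of version-specific branches: dynamic scripts are
  # MVEL on 0.90 and Groovy on 1.3.
  def script_lang
    new_cluster? ? "groovy" : "mvel"
  end

  private

  def new_cluster?
    Gem::Version.new(@es_version) >= Gem::Version.new("1.0")
  end
end
```

In the sketches that follow, assume a SEARCH_CLIENT constant holds a configured instance of this class.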

Step 2: Get the new indexes ready

We merged the code we wrote in Step 1 and pushed it to production. We now had code that knew about ES 1.3 but also knew how to continue supporting our production 0.90 node. As far as the rest of the codebase was concerned, it was consuming our abstraction layer class and was none the wiser about the version of ES underneath.

We then created a new branch of our code, this time flipping the config to point to a new cluster (we used qbox.io to build this ES cluster) and flagging the ES code to use the 1.3-friendly code paths.

We then ran a set of rake scripts that rebuilt all our indexes. These scripts were designed around a timestamp algorithm so that we could index just the data that came in within a certain window. This was important because a full index rebuild takes a really long time, and we wanted to be able to keep the index up to date while we were preparing for the switchover.
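
In spirit, the windowed portion of those scripts looks something like this (a sketch; the Product model, the as_indexed_json serializer, and the task name are stand-ins, and SEARCH_CLIENT is the abstraction layer instance from earlier):

```
# lib/tasks/reindex.rake (sketch)
namespace :search do
  desc "Index products updated since a given timestamp"
  task :reindex_since, [:since] => :environment do |_t, args|
    since = Time.parse(args[:since])

    # Only touch rows that changed inside the window, so we can keep the
    # new index fresh without paying for a full rebuild every time.
    Product.where("updated_at >= ?", since).find_each do |product|
      SEARCH_CLIENT.index_document(
        index: "products",
        type:  "product",
        id:    product.id,
        body:  product.as_indexed_json
      )
    end
  end
end
```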

We also wanted to make sure we had the option to live-reindex things in production from now on, so we built a system around index aliases: our ‘products’ index is actually an alias pointing to something like ‘products_v_[timestamp]’. When we build a new index, we build it in parallel with the old one and then switch the alias once it’s done.
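
With the elasticsearch-ruby client, that alias flip can be done in a single update_aliases call, roughly like this (sketch; PRODUCT_MAPPINGS and the index names are illustrative):

```
new_index = "products_v_#{Time.now.to_i}"

# Build the new index alongside the old one...
client.indices.create(index: new_index, body: { mappings: PRODUCT_MAPPINGS })
# ...(reindex everything into new_index here)...

# ...then repoint the 'products' alias in one request.
old_indices = client.indices.get_alias(name: "products").keys
client.indices.update_aliases body: {
  actions: old_indices.map { |idx| { remove: { index: idx, alias: "products" } } } +
           [{ add: { index: new_index, alias: "products" } }]
}
```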

Step 3: The reindexing loop

Normally our app servers push content into a reindexing queue, which is then drained by Sidekiq workers that serialize records and write them into ES.
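
The worker side of that pipeline has the usual Sidekiq shape, something like this sketch (class, queue, and model names are illustrative):

```
require "sidekiq"

# Drains the reindexing queue: loads the record, serializes it, and writes
# it to ES through the abstraction layer, so the worker doesn't care which
# cluster or version is behind it.
class ReindexWorker
  include Sidekiq::Worker
  sidekiq_options queue: :reindex

  def perform(product_id)
    product = Product.find_by(id: product_id)
    return if product.nil? # record was deleted before the job ran

    SEARCH_CLIENT.index_document(
      index: "products",
      type:  "product",
      id:    product.id,
      body:  product.as_indexed_json
    )
  end
end

# App servers enqueue a job on every change:
# ReindexWorker.perform_async(product.id)
```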

So now we had our production app servers happily pushing data to the old ES node, and we had one new app server that was not serving any production traffic but was used to communicate with the new ES cluster.

We then launched what was basically an infinitely looping script on the new app slice. The job of this script was to pick up any database changes that came in and index them into the new cluster. Now we had the old app servers pushing data to the old ES node and the new app server continuously pushing data to the new ES cluster.
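
The loop itself can be as simple as repeatedly reindexing whatever changed since the last pass (a sketch under the same assumed names as above):

```
# Run on the temporary app slice that points at the NEW cluster.
last_run = Time.now - 3600 # start with a generous overlap

loop do
  window_start = last_run
  last_run     = Time.now

  # Push anything that changed since the previous pass into the new cluster.
  Product.where("updated_at >= ?", window_start).find_each do |product|
    SEARCH_CLIENT.index_document(
      index: "products",
      type:  "product",
      id:    product.id,
      body:  product.as_indexed_json
    )
  end

  sleep 30 # small pause between passes
end
```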

Step 4: Switch the IP!

We then pushed a configuration change with a restart that caused all of our app servers to point to the new ES cluster. One by one as they restarted, they came up pointing to the new server. Because we have rolling restarts with Unicorn, there was no downtime. As servers came back up, they were both reading from and writing to the new cluster.
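
The ‘switch’ itself is nothing more exotic than changing the URL the abstraction layer is built with and letting the rolling restart pick it up, something like this sketch (the initializer path and environment variable names are ours, not Reverb’s):

```
# config/initializers/search.rb (sketch)
# Flipping ELASTICSEARCH_URL from the old node to the new cluster and
# rolling-restarting unicorn is the entire cutover for the app servers.
SEARCH_CLIENT = SearchClient.new(
  url:        ENV.fetch("ELASTICSEARCH_URL"),
  es_version: ENV.fetch("ELASTICSEARCH_VERSION", "1.3")
)
```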

Step 5: Make sure the old server is really dead

We used our New Relic monitoring to watch traffic to the old server. At the time of the switchover, its CPU usage dropped to nearly zero. We SSH’ed into the server just to verify and saw almost no network traffic (using nload) and almost no CPU activity. Looks like it worked! We shut down ES on the old server to make sure.

Step 6: Kill the reindexing loop

Since the app servers were now automatically indexing into the new cluster, we no longer needed the reindexing loop running on the temporary new app slice, so we killed it.

Not too bad for a week’s work!

Many thanks to Kyle Crum, who did a lot of the heavy lifting writing the switchover scripts and the abstractions around index versioning. Maybe one day we’ll open source this stuff :)

If you’re an Elasticsearch guru itching for a chance to put your skills to use improving and innovating on search at Reverb, we’re hiring! Contact us at jobs at reverb dot com.

Till next time,
Yan Pritzker
CTO
