How We Updated 10+ Million DB Rows
Here at Reverb, our mission is to make the world more musical. So far, we’ve seen millions of instruments, pedals, synths, and accessories trade hands across the world. In other words, lots of database rows written and updated.
In the normal course of application development and growth, we sometimes need data from one place to be put someplace else, or to be reformatted. In our specific case, we had some columns using the ‘double’ type that we needed to convert to the more accurate ‘numeric’ type. We therefore added a new numeric column to which we could copy the data. Once the data were backfilled, we could remove the old ‘double’ column and rename the ‘numeric’ column to take its place without changing any application code.
This case study applies to Reverb’s technology stack, which uses Ruby on Rails with a PostgreSQL backend. The examples are simplified for general applicability. The example columns, ‘rate_new’ and ‘rate_old’, are chosen for readability: ‘rate_new = rate’, while technically more accurate, would muddle the explanation with the concern of legacy code that refers to the ‘rate’ column being migrated to ‘rate_new’. We also left out the DDL definition of the new column; you can read ‘rate_new = rate_old’ as an implicit conversion from the old column type to the new, thanks to PostgreSQL.
So, you want to update a hundred million rows?
That’s a lot of data to be updating. If you consider reindexing, vacuuming, reading, and writing, it has the potential to create no small amount of load on your database. As the saying goes, “One does not simply Product.update_all(‘rate_new = rate_old’).”
The first problem is that your database now has to figure out where in the version timeline of your table to put this update. Your instruction is, “Right now, please,” but other connections are inserting into and updating this table. So everyone must be told to stop what they’re doing while we pull all the files out, update each one, and put them all back. It’s a guaranteed timeout not just for your query, but for every other client connection that was about to do something, which could bubble up to users in the form of spinners, error messages, or lost updates.
So, rather than updating all at once, you need to update the table incrementally. Say, take 100 rows, update them, and put them back.
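A minimal sketch of a single batch might look like this (the pluck-then-update pattern is our choice for the illustration; the column names come from the setup above):

```ruby
# Copy rate_old into rate_new for one batch of 100 rows.
ids = Product.order(:id).limit(100).pluck(:id)
Product.where(id: ids).update_all("rate_new = rate_old")
```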
Since you’re interested in more than the first 100 products, you need a loop, and Rails has your back here.
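A sketch using ‘in_batches’, which walks the table in primary-key order and hands each chunk to the block:

```ruby
# Walk the whole table, 100 rows at a time, updating each chunk in a single statement.
Product.in_batches(of: 100) do |batch|
  batch.update_all("rate_new = rate_old")
end
```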
This deals with the timeout/locking problem. However, the loop will run as fast as it possibly can, and because there are now many statements instead of one, the cycle of reading data, writing data, and updating indexes is even more burdensome. All of this activity leads to higher and higher latency for your other clients and customers, and more potential customers seeing errors.
So, the next improvement is to insert a small pause between each update so that the database has a bit of time to serve other requests.
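Something like this, where the one-second pause is illustrative:

```ruby
# Same loop, but pause between batches so the database can serve other traffic.
Product.in_batches(of: 100) do |batch|
  batch.update_all("rate_new = rate_old")
  sleep(1)
end
```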
All problems are now solved! Except for one… when will this backfill end? One hundred million rows in batches of 100 is one million batches, and with a one-second pause after each batch, that’s one million seconds of sleeping even if the update_all statement itself takes zero time. What’s one million seconds? Eleven and a half days.
Now, this can be fine-tuned by adjusting the batch size and sleep time. Perhaps one second is a bit too long, and the batch size could be ten times larger. Adjusting in this way, you can get the backfill down to just under three hours.
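As a back-of-the-envelope check, assuming batches of 1,000 and a tenth-of-a-second pause (assumed numbers for illustration, not exact production settings):

```ruby
rows       = 100_000_000
batch_size = 1_000
pause      = 0.1                 # seconds between batches (assumed)

batches = rows / batch_size      # => 100_000 batches
batches * pause / 3600.0         # => ~2.8 hours spent pausing alone
```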
Finally, there’s the small problem of what to do if your console crashes, is terminated, or the power goes out. We eventually gave in to the real need to add a column to the table itself to track the progress of the backfill.
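In sketch form, with a nullable ‘backfilled_at’ timestamp column doing the bookkeeping:

```ruby
# Resumable loop: only rows not yet stamped are touched, and each batch is
# stamped in the same statement that backfills it.
Product.where(backfilled_at: nil).in_batches(of: 1_000) do |batch|
  batch.update_all("rate_new = rate_old, backfilled_at = NOW()")
  sleep(0.1)
end
```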
With this in place, we can safely pick up where we left off if anything happens. And that closes the book, except for a couple more what-if chapters:
- What if your compliance and security protocols prevent you from opening a production write console for 2.7 hours?
- What if instead of 100 million, you have 300 million rows?
- What if you have a lot of indices, columns, and constraints on this table and you cannot increase the batch size or reduce the sleep without taxing the database?
Sidekiq to the Rescue
Because of our scale and compliance constraints, we plan ahead using source control whenever large backfills are needed, and we trust Sidekiq to orchestrate the asynchronous work. It is an extremely robust and efficient system for asynchronously processing small, idempotent jobs. Using Sidekiq, we were able to perform several large backfills of this style in a hands-off, throttled, and safe way.
However, Sidekiq is not without limitations and constraints of its own, or more specifically, those of the Redis backend hosting Sidekiq’s data and of your provisioned worker processes. First, Sidekiq follows at-least-once semantics for job processing: if you schedule a job, Sidekiq will read and execute it at least once, but it might be twice if the process crashes the instant before reporting its work done. Therefore, idempotency is the golden rule of job design. Let’s translate the script into a Sidekiq job.
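A sketch of that translation (the class name is illustrative, not our production code):

```ruby
# The whole backfill loop, wrapped in a job class so Sidekiq owns the execution.
class BackfillProductRatesJob
  include Sidekiq::Worker

  def perform
    Product.where(backfilled_at: nil).in_batches(of: 1_000) do |batch|
      batch.update_all("rate_new = rate_old, backfilled_at = NOW()")
      sleep(0.1)
    end
  end
end
```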
Now, you can simply queue this job from a Rake task or console session and let it do its thing. Eventually it will finish, and if it crashes, Sidekiq will automatically reschedule it; it will begin anew, picking up where it left off thanks to the backfilled_at column. You may choose to decorate this example with notifiers or metrics emitters for better observability, but this is sufficient.
But what if you need to go faster? With triple the rows, this design’s timeline stretches back out to eight hours, and with ten times the rows, the runtime exceeds a full day. Further, if your table is very wide or has lots of indices, each update statement takes longer to execute, and increasing the batch size beyond 1,000 raises the risk of statement timeouts. Put it all together and it becomes clear, as we experienced at Reverb, that this simple approach wouldn’t meet our expectations.
Many Small Idempotent Jobs
That’s the advice straight from Sidekiq’s own best practices. Following it, let’s say we queue 1,000-row jobs using simple identifiers as arguments, and those jobs do the actual updating when they are picked up by a process.
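In sketch form, that is a small job that backfills exactly the rows it is handed, plus a queuing loop that enqueues one job per thousand IDs (the class name is illustrative):

```ruby
# Each job receives an array of IDs and idempotently backfills just those rows.
class BackfillBatchJob
  include Sidekiq::Worker

  def perform(ids)
    Product.where(id: ids, backfilled_at: nil)
           .update_all("rate_new = rate_old, backfilled_at = NOW()")
  end
end

# Enqueue one job per 1,000 rows. Every one of these ID arrays lands in Redis.
Product.where(backfilled_at: nil).in_batches(of: 1_000) do |batch|
  BackfillBatchJob.perform_async(batch.pluck(:id))
end
```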
There’s a big problem with this, but a subtle one. Job arguments are serialized as JSON and stored in Redis, so the IDs for each of these jobs land in Redis like so:
[["1", "2", "3", "4" …]]
For 1,000 IDs with an average length of 8 bytes, each job’s payload turns into 23kB of data! At this batch size, you would insert 300,000 jobs for 300 million rows, or 6.5GB of data, into Redis. And that’s not counting whatever overhead Redis needs to index and maintain a structure that big. No matter how we slice it, that’s too much data to simply stuff into Redis. It doesn’t just threaten to exceed the memory allocated to Redis; it places a significant burden on all of Redis’s other clients.
Self-Chaining Jobs
Since we could not queue jobs carrying the target IDs, each job would need to incrementally find the next 1,000 rows to update, update them, and promptly exit. Such a job could be scheduled via cron and invoked from a Rake task every few seconds, but deploying new cron entries, observing the progress, and signalling a stop in case of a problem would all need to be built as well. With Sidekiq, we suspected there was a less orthodox but viable solution that would self-throttle and self-terminate: a Sidekiq worker that queues itself.
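Below is a sketch of what such a self-enqueueing worker can look like. The class name, the HALT_BACKFILL environment variable, and the cycle ceiling are illustrative stand-ins, not our exact production code:

```ruby
class SelfChainingBackfillWorker
  include Sidekiq::Worker

  MAX_CYCLES = 200_000 # hard ceiling so a bug can never chain forever

  def perform(batch_size = 1_000, offset = 0, cycle = 0)
    return if ENV["HALT_BACKFILL"] == "true" # operator kill switch
    return if cycle >= MAX_CYCLES

    # Skip past the rows that lower-offset jobs are (probably) working on,
    # then claim the next batch of rows that still need backfilling.
    ids = Product.where(backfilled_at: nil)
                 .order(:id)
                 .offset(offset * batch_size)
                 .limit(batch_size)
                 .pluck(:id)
    return if ids.empty? # nothing left to do, so the chain ends itself

    Product.where(id: ids).update_all("rate_new = rate_old, backfilled_at = NOW()")

    # Crude throttle: schedule the next link in the chain a few seconds out.
    self.class.perform_in(5, batch_size, offset, cycle + 1)
  end
end
```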
This job runs until it is instructed to stop by the environment variable, exceeds the hard-coded maximum number of cycles, or runs out of rows to update. The inclusion of batch_size and offset, in particular, allows us to parallelize like so:
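For example, kicking off ten chains at staggered offsets:

```ruby
# Ten parallel chains, each starting at a different window of the remaining rows.
10.times do |i|
  SelfChainingBackfillWorker.perform_async(1_000, i)
end
```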
Although these jobs will sometimes stumble over one another, the 0th job, with its zero offset, is always taking up the rear. We rely on the database and the idempotency of the backfill’s effect to alleviate any concern about jobs occasionally updating the same set of rows.
As subscribers to Sidekiq Enterprise, we also use the Batch API for observability into the progress. However, because we add jobs to the batch dynamically, the batch appears to be perpetually 10 jobs away from completion until the backfill actually finishes. The Batch API’s invalidation feature can also replace the environment variable as the stop switch.
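Roughly, the kickoff wrapped in a batch looks like this (Sidekiq Pro/Enterprise Batch API; the description string is our own):

```ruby
# Open a batch around the initial kickoff so progress is visible in the Web UI.
batch = Sidekiq::Batch.new
batch.description = "rate_new backfill"
batch.jobs do
  10.times { |i| SelfChainingBackfillWorker.perform_async(1_000, i) }
end
# Inside the worker, each chain adds its next link to the same batch by
# reopening it, e.g. Sidekiq::Batch.new(bid).jobs { ... }
```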
Finally, rather than the crude throttle, we use the Limiters API to enforce a jobs-per-second limit. Altogether, this approach is satisfactory: it consumes very few resources, is easy to start, stop, and monitor, and can be resumed.
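A rough sketch of that swap, using Sidekiq Enterprise’s rate limiting API (the limiter name and rate are illustrative):

```ruby
# A windowed limiter: at most two backfill updates per second, shared across chains.
limiter = Sidekiq::Limiter.window("rate-backfill", 2, :second)

limiter.within_limit do
  # perform one batch's update here, instead of sleeping between links
end
```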
Conclusion: Sidekiq is Up to the Task
We’ve covered the challenges of doing very large backfills and walked through how a solution can be built up incrementally from basic principles. Sidekiq is a fast and flexible toolkit for handling asynchronous jobs, and in our case, it was up to the task. Hopefully you find our research and experiences in this area useful for your own projects!
Interested in solving challenges like this one and making the world more musical? We’re hiring! Visit Reverb Careers to see open positions and learn more about our team, values, and benefits.
The trademarks referenced in this post are trademarks of their respective owners. Use of these trademarks does not state or imply that Reverb is affiliated with, endorsed, sponsored, or approved by the trademark owners.