Automated Redshift Setup with Protobufs
Recently, Reverb wrote a service (the log router) that parses log lines from S3 and adds them to an event-specific table in Redshift. While tackling this project, we wanted to make sure that when we decided to track and analyze a new event, updating the log router to do so would be relatively painless. Ultimately, we added functionality to the service that automatically populates our Redshift database with the tables and columns for the events we want to analyze, with the help of Google’s Protocol Buffers (abbreviated Protobufs).
We’ve been using Protobufs, Google’s strongly-typed binary serialization format, to define the events we want to track. When events cross services and languages, defining them as Protobufs ensures we collect and log the same data every time. It also gives our log router a complete understanding of what data it needs to load into Redshift before it starts filtering and routing log lines.
How did we do it? We start by defining a Protobuf for each event and adding its name to an environment variable for the log router; this variable defines the set of events we want to track.
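To make this concrete, here’s roughly what such an event definition might look like. The message and field names below are hypothetical, for illustration only, not Reverb’s actual schema:

```protobuf
// Hypothetical event definition: a user viewed a listing.
// Nested messages (like User) are what make flattening necessary later.
syntax = "proto2";

message ListingViewed {
  message User {
    optional int64  id   = 1;
    optional string name = 2;
  }

  optional User   user       = 1;
  optional int64  listing_id = 2;
  optional string referrer   = 3;
}
```

The log router could then be configured with something like `TRACKED_EVENTS=ListingViewed` (the variable name is ours, also for illustration).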
When the log router starts up, it checks Redshift to make sure all of the tables exist. If it finds a table is missing (for instance, when we start routing logs for a new event), it generates the table with some default columns:
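Here’s a minimal sketch of that startup check in Python, assuming a psycopg2 connection to Redshift; the default column names are our guesses, not necessarily what Reverb’s service uses:

```python
import psycopg2

# Illustrative defaults; every event table starts with these columns.
DEFAULT_COLUMNS = 'event_id VARCHAR(64), created_at TIMESTAMP'

def ensure_table(conn, event_name):
    """Create the event's table with default columns if it is missing."""
    with conn.cursor() as cur:
        # Identifiers can't be bound as query parameters, so the table
        # name is interpolated directly; event names come from our own
        # environment variable, not from user input.
        cur.execute('CREATE TABLE IF NOT EXISTS %s (%s)'
                    % (event_name.lower(), DEFAULT_COLUMNS))
    conn.commit()
```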
After creating the table, the service determines what columns need to be added by looking at the Protobuf. Because events can have nested information (like a user object that has a name and id), some recursion is required to flatten the event:
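Here’s one way that recursion could look, using the descriptors the protobuf library generates for each message. The Redshift type mapping and the underscore naming convention (user_name, user_id) are assumptions on our part:

```python
from google.protobuf.descriptor import FieldDescriptor

# Partial mapping from protobuf scalar types to Redshift column types.
TYPE_MAP = {
    FieldDescriptor.TYPE_STRING: 'VARCHAR(255)',
    FieldDescriptor.TYPE_INT32:  'INTEGER',
    FieldDescriptor.TYPE_INT64:  'BIGINT',
    FieldDescriptor.TYPE_BOOL:   'BOOLEAN',
    FieldDescriptor.TYPE_DOUBLE: 'FLOAT8',
}

def flatten_columns(descriptor, prefix=''):
    """Walk a message descriptor, returning a list of (name, type) pairs.

    Nested messages become prefixed columns, so a user object with a
    name and id flattens to user_name and user_id. (Repeated fields
    would need extra handling, elided here.)
    """
    columns = []
    for field in descriptor.fields:
        name = prefix + field.name
        if field.type == FieldDescriptor.TYPE_MESSAGE:
            # Recurse into the nested message, prefixing its fields.
            columns.extend(flatten_columns(field.message_type, name + '_'))
        else:
            columns.append((name, TYPE_MAP.get(field.type, 'VARCHAR(255)')))
    return columns
```

For a generated message class, this would be called with its descriptor, e.g. `flatten_columns(ListingViewed.DESCRIPTOR)`.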
After determining the table’s columns, the service adds each one to Redshift via an ALTER TABLE statement:
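Continuing the sketch (Redshift accepts only one ADD COLUMN per ALTER TABLE statement, hence the loop):

```python
def add_columns(conn, table_name, columns):
    """Add each flattened (name, type) column to an existing table.

    A real implementation would first check the catalog and skip
    columns that already exist, since ADD COLUMN fails on duplicates.
    """
    with conn.cursor() as cur:
        for name, col_type in columns:
            cur.execute('ALTER TABLE %s ADD COLUMN %s %s'
                        % (table_name, name, col_type))
    conn.commit()
```

At the end, we get a simple Redshift table that looks something like this (shown for the hypothetical ListingViewed event above):

```
listingviewed
-------------
event_id     VARCHAR(64)
created_at   TIMESTAMP
user_id      BIGINT
user_name    VARCHAR(255)
listing_id   BIGINT
referrer     VARCHAR(255)
```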
With this relatively simple code, we’ve made our log routing service expandable with almost no further effort on our part. This is just one of the many scenarios in which we’ve found Protocol Buffers immensely useful, and just a tiny piece of Reverb’s growing analytics pipeline.
See previous posts: Owning Our Data and Exposing Our Data