Exposing Our Data
A few months ago I wrote a post about Owning Our Data. In it, I detailed Reverb’s quest to enhance personalization and search relevancy, beginning with standardizing our event data and funneling it all through one pipeline. Today, I’d like to follow up with a sequel of sorts, focusing on exposing our now-homogeneous data.
After streaming all our data through a single source, we ended up aggregating our events in two places: an Elasticsearch cluster and an S3 bucket. The ES cluster gives us a Kibana web UI for searching logs, building visualizations, and assembling them into dashboards. The S3 bucket serves as a more fail-safe source of truth that is intended to remain untouched (no TTLs).
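For the rest of this post it helps to picture what one of those standardized events might look like. The shape below is a hypothetical sketch (the field names are assumptions, not our actual schema): a small common envelope shared by every event, plus an event-specific payload.

```go
package events

import (
	"encoding/json"
	"time"
)

// Envelope is a hypothetical sketch of the common fields every logged
// event might share after standardization; the payload differs per event type.
type Envelope struct {
	EventType  string          `json:"event_type"`  // e.g. "product_view" (assumed name)
	OccurredAt time.Time       `json:"occurred_at"` // when the event happened
	UserID     string          `json:"user_id"`     // who triggered it, if known
	Data       json.RawMessage `json:"data"`        // event-specific payload, left unparsed here
}
```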
A few steps were missing before we could truly begin to understand our customers. We had the data, but we didn’t have a clear way to use it. We wanted to analyze it, but JSON events strewn across a multitude of giant files aren’t the most accessible format, nor is S3 the most accessible place for this information (and Elasticsearch/Kibana offer only limited analytical capability).
The Case for Redshift
Redshift is an AWS-hosted Postgres-based relational database. It’s also column-oriented, allowing for quick analysis of massive amounts of data. In addition, the low infrastructure cost provided by AWS made it a relatively simple solution to our problem. Hosting our data in Redshift would allow our data science team to explore logged events in an efficient, open-ended manner, but we needed something to take our log blobs from S3 and load them into Redshift.
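To give a feel for why the columnar layout matters, here’s a minimal sketch of the kind of open-ended query the data science team can run once events live in Redshift. Because Redshift speaks the Postgres wire protocol, a standard Postgres driver works from Go; the cluster endpoint, credentials, and table/column names below are hypothetical.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq" // Redshift accepts Postgres wire-protocol connections
)

func main() {
	// Cluster endpoint, credentials, and table/column names are hypothetical.
	db, err := sql.Open("postgres",
		"host=example-cluster.redshift.amazonaws.com port=5439 dbname=analytics user=analyst password=secret sslmode=require")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Because Redshift stores data by column, an aggregate that touches only
	// a couple of columns stays fast even over hundreds of millions of rows.
	rows, err := db.Query(`
		SELECT occurred_at::date AS day, COUNT(*)
		FROM product_view_events
		GROUP BY 1
		ORDER BY 1`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var day time.Time
		var count int64
		if err := rows.Scan(&day, &count); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s: %d\n", day.Format("2006-01-02"), count)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```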
The First Attempt
Our initial attempt to move our logged events into Redshift used a pre-built AWS Lambda function that copies data from S3 to Redshift, building the Redshift tables from the top-level keys in each line’s JSON. Because it had no way to create columns for nested JSON, we ended up with one giant JSON blob in a single DATA column. We wanted to analyze specific types of events, and this schema made that painfully slow: we had to unmarshal the entire column just to learn whether a row held an event we cared about.
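To illustrate the pain, here’s a hedged sketch (the table and field names are assumptions) of what filtering for a single event type looked like against that schema: every row’s blob has to come back over the wire and be unmarshaled before we even know whether it interests us.

```go
package analysis

import (
	"database/sql"
	"encoding/json"
)

// eventEnvelope pulls out just the one field we need; the "event_type"
// key name is an assumption about the logged JSON.
type eventEnvelope struct {
	EventType string `json:"event_type"`
}

// countProductViews shows the cost of the single-blob schema: we must fetch
// and unmarshal every row's DATA column just to check its event type.
func countProductViews(db *sql.DB) (int, error) {
	rows, err := db.Query(`SELECT data FROM events`) // hypothetical table name
	if err != nil {
		return 0, err
	}
	defer rows.Close()

	count := 0
	for rows.Next() {
		var blob []byte
		if err := rows.Scan(&blob); err != nil {
			return 0, err
		}
		var env eventEnvelope
		if err := json.Unmarshal(blob, &env); err != nil {
			continue // skip rows we can't parse
		}
		if env.EventType == "product_view" {
			count++
		}
	}
	return count, rows.Err()
}
```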
Enter the Log Router
We decided to build our own service to filter the S3 log files and load the data into Redshift. This would give us total freedom to organize the data however we wanted, including creating a separate table for each type of event we log. We chose to write the service in Go: its concurrency and speed were appealing, and the AWS SDK for Go gave us an easy interface to SQS, the simple message queue hosted by AWS that we use to trigger the Log Router.
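A minimal sketch of that trigger loop, using the AWS SDK for Go, might look like the following. The queue URL and region are hypothetical, and a real worker would only delete a message after the file it references has been routed successfully.

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	svc := sqs.New(sess)
	queueURL := "https://sqs.us-east-1.amazonaws.com/123456789012/log-router" // hypothetical

	for {
		// Long-poll SQS for notifications that a new log file has landed in S3.
		out, err := svc.ReceiveMessage(&sqs.ReceiveMessageInput{
			QueueUrl:            aws.String(queueURL),
			MaxNumberOfMessages: aws.Int64(10),
			WaitTimeSeconds:     aws.Int64(20),
		})
		if err != nil {
			log.Printf("receive: %v", err)
			continue
		}
		for _, msg := range out.Messages {
			// The message body names the S3 object to route (an assumption
			// about the notification format).
			go route(aws.StringValue(msg.Body))

			// Deleting here keeps the sketch short; a production worker would
			// delete only once routing has finished.
			if _, err := svc.DeleteMessage(&sqs.DeleteMessageInput{
				QueueUrl:      aws.String(queueURL),
				ReceiptHandle: msg.ReceiptHandle,
			}); err != nil {
				log.Printf("delete: %v", err)
			}
		}
	}
}

func route(body string) {
	log.Printf("routing log file referenced by: %s", body)
}
```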
We split our service into two distinct binaries (which allows us to tweak their workers separately). One pulls a log file from S3, sends each line to a stream depending on the event it represents, and then writes new S3 files for each event. The other one uses Redshift’s COPY command to load a new filtered log file directly from S3.
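Here’s a hedged sketch of both halves, assuming a hypothetical `event_type` field, bucket, table, and IAM role: the first function splits a raw log file into one group of lines per event type, and the second asks Redshift to COPY a filtered file straight from S3.

```go
package logrouter

import (
	"bufio"
	"database/sql"
	"encoding/json"
	"fmt"
	"io"
)

// routeLines mirrors the first binary: it splits a raw log file into one
// group of lines per event type. The "event_type" field name is an assumption.
func routeLines(r io.Reader) (map[string][]string, error) {
	byEvent := make(map[string][]string)
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		line := scanner.Text()
		var env struct {
			EventType string `json:"event_type"`
		}
		if err := json.Unmarshal([]byte(line), &env); err != nil || env.EventType == "" {
			continue // skip lines we can't classify
		}
		byEvent[env.EventType] = append(byEvent[env.EventType], line)
	}
	return byEvent, scanner.Err()
}

// loadFromS3 mirrors the second binary: Redshift pulls a filtered file
// straight from S3 via COPY. Table name, bucket, and IAM role are hypothetical.
func loadFromS3(db *sql.DB, table, s3Key string) error {
	stmt := fmt.Sprintf(
		`COPY %s FROM 's3://reverb-filtered-logs/%s'
		 IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
		 FORMAT AS JSON 'auto'`, table, s3Key)
	_, err := db.Exec(stmt)
	return err
}
```

Writing each filtered group back to S3, rather than inserting rows one at a time, lets COPY do the parallel bulk loading Redshift is designed for.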
Now we have all of our events automatically loaded into Redshift within a few minutes of their creation. The Log Router service is resilient, blazing fast, and allows us to backfill months of data from our S3 “source of truth” in hours.
Benefits Provided
We can now run large, massively parallel queries across our full event history, a requirement for developing our own relevance engines, learning about our users, and eventually creating more engaging, dynamically delivered content. We’ve also linked Redshift to Chartio, a powerful data-modeling and dashboard tool that we use to track user trends across our primary and analytics databases.
Ultimately, this is just the next step on our journey to data-mastery. I hope you’ll join us next time as we continue to enhance the Reverb experience through personalization and discovery tooling!