Owning Our Data
When a company is young, time, money, and resources are scarce, and it's usually easiest to rely on external services for non-critical needs. As a company matures, the room for error shrinks, which often necessitates replacing these services with custom-built solutions. At Reverb.com, we've recently begun the process of owning our data. Today I want to briefly summarize our first steps in this process and the benefits we hope to reap.
But first, some context.
What do I mean by “data”?
In general, I’m speaking of analytics data generated by user actions and the aggregation of these “events”. A few months ago, our analytics were a mess. We sent back-end events to a variety of services including our own ELK (Elasticsearch, Logstash, Kibana) stack. We piped our front-end events to external services as well, but not the same ones that received our back-end events. Our mobile apps also had their own services for analytics tracking.
With data spread all over the place, it became difficult to get a big-picture view of our analytics. The services we used allowed us to answer most of our smaller, more feature-focused questions, but we wanted to start consolidating our data to derive deeper insights.
Controlling the Pipeline
We started by refining our own pipeline. Instead of sending events to a variety of services, we wanted a single place to send and store our data. We created an event-logging microservice in Go with a simple bulk /events endpoint that takes a JSON payload containing an event name and arbitrary attributes. The events are delivered directly to Fluentd, which fans those streams out to a number of collectors, including Elasticsearch. We wired this API up to our front-end, with the intention of having our mobile apps use the same API in the near future.
Standardizing Events
Once our data was flowing to the same place, we worked on standardizing our events. Previously we had data logged in a variety of different formats. Some events were logged using a custom logger that placed the majority of its information under a "data" key in a hash. Some used the built-in Rails logger with no formatting. Still others, like our API, used their own unique formats. The result was that events contained a variety of information under a variety of different keys (or no key at all). Searching through logs was tedious at best and impossible at worst.
To resolve this, we configured all of our loggers to use a custom formatter. With every event formatted the same way, we were able to iterate through several different universal formats for events, ultimately settling on JSON objects with top-level @event_name, @event_source, and data keys. This allows us to quickly find events by their name ("analytics.worker.add_to_wishlist") or their origin (Rails, API, Elasticsearch, etc.), and gives freeform, more detailed information a clear home.
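Our actual formatters live in our Ruby loggers, but the target format can be sketched in a few lines of Go. The function and field names below are hypothetical; only the three top-level keys come from the format described above:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// StandardEvent mirrors the standardized log format: top-level
// @event_name and @event_source keys plus a freeform data payload.
type StandardEvent struct {
	EventName   string                 `json:"@event_name"`
	EventSource string                 `json:"@event_source"`
	Data        map[string]interface{} `json:"data"`
}

// FormatEvent renders an event as a single standardized JSON line,
// the shape every logger emits regardless of where the event began.
func FormatEvent(name, source string, data map[string]interface{}) (string, error) {
	b, err := json.Marshal(StandardEvent{EventName: name, EventSource: source, Data: data})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	line, _ := FormatEvent("analytics.worker.add_to_wishlist", "rails",
		map[string]interface{}{"listing_id": 123})
	fmt.Println(line)
	// prints {"@event_name":"analytics.worker.add_to_wishlist","@event_source":"rails","data":{"listing_id":123}}
}
```

Because @event_name and @event_source are always top-level, Kibana queries and Elasticsearch mappings stay simple, while anything event-specific can vary freely under data.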
What do we get out of this?
There are immediate benefits to making our data more searchable: developers can troubleshoot problems more effectively, and data scientists have more confidence in their research. But that's really just the tip of the iceberg. With the ability to generate more comprehensive business analytics, we're now squarely on the path to better site monitoring and personalization.
We can now more easily track performance of systems (like search), allowing us to iterate on algorithms until we nail something that works well. We can also A/B test more effectively, allowing data to better inform our design decisions.
With full control of our analytics, we can also gain deep insights into what our customers are doing, what they like, and what they want to see. We can use this information to build relevance engines that help us profile our customers and determine which content matters most to them. We can then use these tools to create more pertinent and engaging pages, emails, and articles, all surfaced dynamically according to the (analytics-driven) interests of each user.
We’re excited about these possibilities and more! As an envoy of Reverb’s newly-minted Discovery team, I hope you’ll join us in the future as we continue to document our quest to own our data and, ultimately, to bring world-class personalization and search to Reverb!
Joe Levering
joe@reverb.com