How We Decreased GraphQL Latency by 60%
At Reverb, we like to learn quickly, and much of our learning comes from experimentation. With a constantly growing site and many successful experiments, it’s easy for pages to get slower over time as more users interact with more features. In the tradition of Agile and Lean, we empower engineers to discover performance bottlenecks and work together to find solutions.
In our previous post, we described how we sped up page renders by making changes to the HTML. However, that’s just one of many optimizations we’ve made to improve our user experience.
In this post, we’ll talk about how we modified data loading within GraphQL to make our pages snappier and decrease our customers’ wait time.
What Is GraphQL?
GraphQL is an open-source data query and manipulation language for APIs, and a runtime for fulfilling queries with existing data. It allows clients to request only the data that they need and leaves the fetching of that data up to the GraphQL layer.
To understand the benefits of GraphQL, it’s helpful to compare it to REST. Let’s say I want to see a listing as well as information about the shop that sells it. In a standard REST world, I would first have to fetch the listing data:
GET /api/listings/1234
{
  id: 1234,
  …
  shop_id: 432
}
With this data, I would then fetch the shop data:
GET /api/shops/432
{
  id: 432,
  name: "Planet of the Moogs",
  …
}
This has several disadvantages:
- The client has to manually keep track of each HTTP request. If the first request, for the listing, fails, the client must know not to request the shop. Managing state across multiple requests can be difficult, especially if the client is on a slow or flaky connection such as a mobile network.
- There is an implicit contract between the server and the client about the shape of the returned data. Therefore, if the structure or type of the data changes on the server, it could break clients in unexpected ways.
- Some of the data returned in the response might not ever be used by the client.
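To make the first and third drawbacks concrete, here is a minimal TypeScript sketch of the two-step REST flow. The injected `getJson` transport, the `Listing` and `Shop` shapes, and the sample listing title are assumptions for illustration, not Reverb's client code.

```typescript
// Hypothetical response shapes; real payloads carry many more fields.
interface Listing { id: number; title: string; shop_id: number; }
interface Shop { id: number; name: string; }

// The transport is injected so the sketch runs without a network.
type GetJson = (path: string) => Promise<unknown>;

async function loadListingWithShop(
  getJson: GetJson,
  listingId: number
): Promise<{ title: string; shopName: string }> {
  // Request 1: the shop cannot be requested until this resolves,
  // because the shop id is only known from the listing payload.
  const listing = (await getJson(`/api/listings/${listingId}`)) as Listing;

  // Request 2: skipped entirely if request 1 throws. The client owns
  // that sequencing and the error handling around it.
  const shop = (await getJson(`/api/shops/${listing.shop_id}`)) as Shop;

  // Only two fields are used; the rest of both payloads was still sent.
  return { title: listing.title, shopName: shop.name };
}

// In-memory stand-in for the REST API (sample data).
const fakeApi: { [path: string]: unknown } = {
  "/api/listings/1234": { id: 1234, title: "Sample listing", shop_id: 432 },
  "/api/shops/432": { id: 432, name: "Planet of the Moogs" },
};
const requested: string[] = [];
const fakeGetJson: GetJson = async (path) => {
  requested.push(path);
  if (!(path in fakeApi)) throw new Error(`404 ${path}`);
  return fakeApi[path];
};
```

Calling `loadListingWithShop(fakeGetJson, 1234)` issues the listing request first and the shop request second; asking for a listing id that is not in `fakeApi` rejects before any shop request is made.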
With GraphQL, the client just asks for the data it wants, and the GraphQL layer decides how to fulfill that data request:
listing(id: 1234) {
  title
  shop {
    name
  }
}
How Reverb Does GraphQL
There are multiple ways to implement GraphQL. At Reverb, we use Apollo both as a client and a server to request and fulfill data requests. The Apollo server layer can see what data is being requested and then makes HTTP API requests to endpoints specially designed to be consumed by GraphQL. In this way, GraphQL works like a middleware for fetching data.
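As a rough illustration of this middleware role, the sketch below writes resolvers as plain functions with the conventional `(parent, args, context)` argument order. It does not import Apollo, so it runs on its own. The `Api` data source, its method names, and the stub data are hypothetical.

```typescript
// Hypothetical payloads returned by the internal HTTP API.
interface ListingRecord { id: number; title: string; shop_id: number; }
interface ShopRecord { id: number; name: string; }

// Hypothetical data source; a real server would issue HTTP requests here.
interface Api {
  getListing(id: number): Promise<ListingRecord>;
  getShop(id: number): Promise<ShopRecord>;
}
interface Context { api: Api; }

// Resolver map keyed by type and field name.
const resolvers = {
  Query: {
    listing: (_parent: unknown, args: { id: number }, ctx: Context) =>
      ctx.api.getListing(args.id),
  },
  Listing: {
    // Field resolvers run only for selected fields, so a query that
    // does not ask for `shop` never triggers the second HTTP call.
    shop: (parent: ListingRecord, _args: unknown, ctx: Context) =>
      ctx.api.getShop(parent.shop_id),
  },
};

// Stub API that records which calls were made (sample data).
const apiCalls: string[] = [];
const stubApi: Api = {
  getListing: async (id) => {
    apiCalls.push(`listing:${id}`);
    return { id, title: "Sample listing", shop_id: 432 };
  },
  getShop: async (id) => {
    apiCalls.push(`shop:${id}`);
    return { id, name: "Planet of the Moogs" };
  },
};
```

A GraphQL executor would call `Query.listing` first and then `Listing.shop` with the listing as `parent`; the client sees one response and never learns that two HTTP calls were involved.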
To load data faster, we try to bundle requests as much as possible. Take an example query like this, which may load many listings and shops:
search(keyword: "moog") {
  listings {
    title
    shop {
      name
    }
  }
}
Fulfilling this request and loading this data would do the following:
- Run a search to get back the relevant listing ids (e.g. ids 827, 382, 298).
- Fetch information about all of those listings at once (e.g. GET /api/listings?id[]=827&id[]=382&id[]=298).
- Use the shop ids from those responses to fetch the shop data (e.g. GET /api/shops?id[]=….)
As you can see, this hides the details from the client, but we still have to make multiple requests in sequence to get all of the data needed. The result looks much like a waterfall: each request must succeed before the next one can start.
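The bundling step can be sketched as a small helper that turns a list of ids into one batched request path in the `id[]=` style shown above. The helper names, the de-duplication of ids, and the sample shop ids are assumptions made for this example.

```typescript
// Build one batched request path, e.g. /api/listings?id[]=827&id[]=382.
function batchedPath(resource: string, ids: number[]): string {
  // Drop duplicates while keeping first-seen order (an assumption here;
  // several listings often belong to the same shop).
  const unique = ids.filter((id, index) => ids.indexOf(id) === index);
  const query = unique.map((id) => `id[]=${id}`).join("&");
  return `/api/${resource}?${query}`;
}

// Minimal listing shape needed to derive the follow-up shop request.
interface ListingRow { id: number; shop_id: number; }

// The shop request depends on the listings response: shop ids are only
// known once the listings have loaded. This is the waterfall step.
function shopPathFor(listings: ListingRow[]): string {
  return batchedPath("shops", listings.map((listing) => listing.shop_id));
}
```

With this shape, three listings cost one listings request and one shops request rather than one request per record, but the two requests still run one after the other.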
Shortening the Waterfall
One technique for loading data more quickly is to shorten the waterfall. Take the following GraphQL request:
search(keyword: "moog") {
  listings {
    … listing attributes …
    categories {
      id
      name
    }
  }
}
Originally, when fetching a product’s categories (e.g. “12-string”, “electric guitars”), GraphQL made a request to the products API to fetch all the matching products and waited for that call to resolve. Because the products API response includes each product’s category IDs, GraphQL then collected those IDs from the response and passed them to the categories API.
Notice how the categories call is fired off after everything else:
Instead of sending the categories API request only after the products API call had finished, we introduced a new categories API endpoint that supports looking up categories by product IDs. We then updated the code in GraphQL to stop waiting for the products API call to finish, allowing us to fire off the products and categories API requests in parallel.
In the flame graph below, you can see the categories API call running in parallel with the rest of the calls:
We saw a ~40 ms reduction in latency across all percentiles (p95, p90, etc.).
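A minimal sketch of the before and after, assuming a hypothetical `CatalogApi` whose method names stand in for the products and categories endpoints. The stub resolves each call on a later tick and logs when it starts and ends, so the ordering can be inspected without a network.

```typescript
interface Product { id: number; categoryIds: number[]; }
interface Category { id: number; name: string; }

// Hypothetical API surface; names and signatures are illustrative.
interface CatalogApi {
  getProducts(ids: number[]): Promise<Product[]>;
  getCategoriesByIds(ids: number[]): Promise<Category[]>;
  // Stand-in for the newer lookup: categories by product id.
  getCategoriesByProductIds(productIds: number[]): Promise<Category[]>;
}

// Before: the categories call cannot start until products resolve,
// because its input (category ids) comes from the products response.
async function loadSequential(api: CatalogApi, productIds: number[]) {
  const products = await api.getProducts(productIds);
  const categoryIds: number[] = [];
  for (const product of products) categoryIds.push(...product.categoryIds);
  const categories = await api.getCategoriesByIds(categoryIds);
  return { products, categories };
}

// After: both calls need only product ids, so they start together.
async function loadParallel(api: CatalogApi, productIds: number[]) {
  const [products, categories] = await Promise.all([
    api.getProducts(productIds),
    api.getCategoriesByProductIds(productIds),
  ]);
  return { products, categories };
}

// Stub API: logs the start and end of each call (sample data).
const callLog: string[] = [];
function traced<T>(label: string, value: T): Promise<T> {
  callLog.push(`start ${label}`);
  return Promise.resolve().then(() => {
    callLog.push(`end ${label}`);
    return value;
  });
}
const stubCatalog: CatalogApi = {
  getProducts: (ids) =>
    traced("products", ids.map((id) => ({ id, categoryIds: [1] }))),
  getCategoriesByIds: () =>
    traced("categories", [{ id: 1, name: "electric guitars" }]),
  getCategoriesByProductIds: () =>
    traced("categories", [{ id: 1, name: "electric guitars" }]),
};
```

With `loadSequential`, the log shows the products call ending before the categories call starts. With `loadParallel`, both calls start before either ends, so the total wait approaches the slower of the two calls rather than their sum.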
Pre-calculating Expensive Data
Since GraphQL is flexible enough to change where it gets its data, one technique we use to decrease query times is to pre-calculate computationally expensive data and then fetch that data rather than calculating it on the fly.
In one instance, we saw that some pages were timing out. After looking into it, we noticed that product review data was slow to resolve. Users rely on data such as the total number of reviews and the review average when making decisions on what to buy. However, it can get expensive to calculate and show this information on every page load.
This trace is for an endpoint which is used to fetch product reviews:
We use this endpoint in search to fetch product reviews for ~15 products and display their average rating and total review count. The SQL queries (in green) were performant most of the time, but a lot of time was being spent in Rails’ ORM instantiating records. After some investigation, we found that we were loading each product’s reviews into memory and doing a lot of in-memory calculations (for the average rating and count, amongst other things). In the best-case scenario, this wasn’t too bad. But for pages with popular products whose instruments have hundreds of reviews, this in-memory approach added significant latency.
Since we were already storing some metadata about products in Elasticsearch, we decided to store this expensive data there too. We then wired up some new fields in GraphQL, “reviewsCount” and “averageReviewRating”, and updated the client code to read from those new fields. The results spoke for themselves: average p95 latency dropped by almost 600 ms, and our error rate also dropped considerably.
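The trade-off can be sketched as follows. The field names `reviewsCount` and `averageReviewRating` come from the post; everything else is an assumption for illustration, including the `Map` standing in for the search index, the function names, the moment at which stats are recomputed, and the zero default for products without reviews.

```typescript
interface Review { productId: number; rating: number; }

// Stats stored alongside each product document.
interface ReviewStats { reviewsCount: number; averageReviewRating: number; }

// Slow path: load every review and aggregate in memory on each request.
function statsOnTheFly(reviews: Review[], productId: number): ReviewStats {
  const ratings = reviews
    .filter((review) => review.productId === productId)
    .map((review) => review.rating);
  const sum = ratings.reduce((total, rating) => total + rating, 0);
  return {
    reviewsCount: ratings.length,
    averageReviewRating: ratings.length === 0 ? 0 : sum / ratings.length,
  };
}

// Write path: aggregate ahead of time (for example when reviews change)
// and store the result. A Map stands in for the search index here.
function reindexReviewStats(
  index: Map<number, ReviewStats>,
  reviews: Review[]
): void {
  reviews.forEach((review) => {
    if (!index.has(review.productId)) {
      index.set(review.productId, statsOnTheFly(reviews, review.productId));
    }
  });
}

// Read path: the resolver reads two stored numbers per product and
// never touches individual reviews.
function statsFromIndex(
  index: Map<number, ReviewStats>,
  productId: number
): ReviewStats {
  const stored = index.get(productId);
  return stored !== undefined
    ? stored
    : { reviewsCount: 0, averageReviewRating: 0 };
}
```

The read path does the same amount of work whether a product has three reviews or several hundred. The aggregation cost moves to write time, where it is paid once per change instead of once per page load.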
Conclusion
Performance improvements can happen in any part of the stack, and it’s important to look at the full trace of a request to see where bottlenecks lie. By changing how we load data in GraphQL, we were able to make pages load faster and improve the shopping experience for our buyers.
Are you looking for a new challenge and intrigued by the way we work at Reverb? We’re hiring, and it’s an incredible time to join our fast-growing musical instruments marketplace! Visit Reverb Careers to see open positions and learn more about our team, values, and benefits.