At JustWatch, we’re collecting and aggregating millions of different streaming offers from all kinds of Video on Demand providers (such as Netflix or Amazon Instant Video) on our platform. After being collected by our custom web scraper built in Go, they get ingested into our Content API in order to be matched, combined, cleaned and finally being stored into our PostgreSQL and ElasticSearch backends.
Recently, we found ourselves upgrading the PostgreSQL database hardware behind our Content API pretty often. On one hand, some of this was expected - as within just one and a half years, we went from indexing just a handful of streaming providers in three countries to indexing 70 streaming providers in 17 countries on a daily base. Meanwhile, on the traffic side we were growing from zero to several million active users per month.
On the other hand, something still seemed off - and just crunching the numbers on top of my head, our ingest rate was getting slightly slower every day than it actually should, even when accounting for additional catalogue size and traffic growth. Our database load used to be rather write heavy, but the reads increased in a disproportionate amount every day, and we kept throwing money at the problem without really knowing why.
But then one day, I went to actually find out - and this story is meant to guide you in a similar effort.
First check: Black box metrics
A good first step in diagnosing performance problems is simply looking at the database from the outside. Fortunately, we’re using Amazon RDS for PostgreSQL, which in this case brought us a lot of necessary metrics right out of the box.
Actually, this didn’t look too far off. Yes, the CPU is quite high but I knew we were rather CPU bound and did do so on purpose. Lots of RAM still freeable, connections look stable, no sudden spiky behavior anywhere. If there would have been simple resource starvations or problems with too frequent AUTOVACUUMs, we would have seen more pronounced changes or flatlines at this stage already.
Second check: Peeking inside
If we can’t see anything from the outside, let’s have a look inside. For a quick first check on any databases running wild, I’m usually using the super helpful PGHero tool by the Instacart people:
Having a look on the dashboard after connecting reveals nothing that’s particularly off. Index size and usage looks good, just some yellow tabs that aren’t that important to us right now: Cache hit rate is “low” at 96% which is still pretty okay, also there are some long running maintenance and VACUUM queries lingering around. Nothing to see here, move on.
Third check: Index size
Now one of the caveats to watch out for when working with RDBMS is the point where heavily used indices don’t fit into RAM anymore. As a rule of thumb, if you see this happening, it looks like your queries per second graph is hitting a brick wall and everything will get worse pretty suddenly. With SSDs, this will be a tad less pronounced but still very visible. For this reason, it was less likely to be the root problem here, but I wanted to check it anyway. As this isn’t one of the metrics PG will effortlessly tell you, we need a trick here.
One of the more awesome things about relational databases like MySQL or PostgreSQL is that they heavily dogfood in a pretty cool way. Which means that a lot of their own internal runtime metadata is stored in relational tables and in that way is available to query by SQL - just like you already know how to do. So let’s use that!
-- adapted from https://wiki.postgresql.org/wiki/Disk_Usage -- ...you'll obviously need superuser permissions for this WITH entries AS ( SELECT *, total_bytes-index_bytes-COALESCE(toast_bytes,0) AS table_bytes FROM ( SELECT c.oid,nspname AS table_schema, relname AS TABLE_NAME , c.reltuples AS row_estimate , pg_total_relation_size(c.oid) AS total_bytes , pg_indexes_size(c.oid) AS index_bytes , pg_total_relation_size(reltoastrelid) AS toast_bytes FROM pg_class c LEFT JOIN pg_namespace n ON n.oid = c.relnamespace WHERE relkind = 'r' AND nspname = 'contentdb' -- insert your DB name here ) a ) SELECT SUM(total_bytes), pg_size_pretty(SUM(index_bytes)) FROM entries ;
…which yields 13 GiB, quite close to RAM size but not enough to explain the increased reads. Not it.
(Note that you can look up some of these metrics via PGHero too, but that would be just half the fun.)
Fourth check: Slow queries
Slow queries usually stem from queries not using indices for whatever reason. If your use case is mainly OLTP - which means rather short and transactional queries on individual entities, compared to a reporting setup where long queries are expected - slow queries are some of the things you should consider logging. Be sure to take a reasonable minimum time (100ms is probably a good first start), as logging also costs significant performance and disk space.
We already had them activated, but scanning the slow query logs, I didn’t see any queries that regularly came up. This was getting harder.
At this point, I knew we hadn’t done any obvious mistakes and the server was just “busy” doing things.
But busy with what exactly? When a server is running millions of queries per day, there are some that might just not be as optimized as others or have minor different problems. These queries might not be “slow” per se, but they might still be a magnitude slower than others - and this can add up pretty quickly.
Usually, this problem affects a similar type of query, but the parameters might be different, making them hard to group. Logging all of them away will produce large amounts of logs and impact performance - and you need to set it up first. Tools like PGAnalyze can be great helpers here, but they’re expensive, also need some setup and have their own security and performance implications.
I needed something quick and simple. I just had a PostgreSQL client and I was ready to use it.
So what do you do if you don’t know what the database is currently busy with?
Let’s remember the pg_catalog with the runtime metadata and go for a different table this time:
SELECT * FROM pg_stat_activity
Now what we have here is a list of all activities that Postgres is currently happily executing for us - and the ones idling around or waiting for locks as well. Nice!
What we don’t know yet is how much every activity is taxing the database. But we have one proxy that’s good enough for us: Query time!
SELECT (CURRENT_TIMESTAMP - query_start) as query_time , * -- ^^ SQL protip: separating SELECT parameters by linebreaks and putting the -- comma first lets you quickly comment out single parameters FROM pg_stat_activity ORDER BY query_time DESC
Ordering the activities by query time gives us the equivalent of a “poor man’s dbtop”. On the very top you might have some long-running statements like autovacuum operations, idling connections or long-running transactions - but directly below, you should see the offenders already.
What did I say? Bingo!
Well - almost. Copying out the query column from Postico didn’t go all the way just yet, as queries in pg_stat_activity are truncated to 1kb. However, I now had enough context to identify which query was running wild and I quickly found it in our code repository. Here it is in its full beauty:
SELECT a.old_retail_price as last_change_retail_price , a.retail_price - a.old_retail_price as last_change_difference , CASE WHEN a.old_retail_price = 0::money THEN 999::real ELSE -(1-(a.retail_price/a.old_retail_price))*100::real END as last_change_percent , to_char(a.created_at, 'YYYY-MM-DD') as last_change_date , to_char(a.created_at, 'YYYY-MM-DD') || '_' || a.provider_id as last_change_date_provider_id FROM ( SELECT ao.old_retail_price, ao.new_retail_price, o.retail_price, ao.created_at, ao.provider_id FROM movie_reference ref JOIN audit_offers ao ON ($1 = ao.object_type AND ref.id = ao.reference_id) JOIN offer o ON ($1 = o.object_type AND ref.id = o.reference_id) WHERE ref.title_id = $2 AND ref.provider_id = $3 AND ref.id = ao.reference_id AND ref.provider_id = ao.provider_id AND ref.id = o.reference_id AND ref.provider_id = o.provider_id AND ao.country_id = $4 AND ao.country_id = o.country_id AND ao.monetization_type = $5 AND ao.monetization_type = o.monetization_type AND ao.presentation_type = $6 AND ao.presentation_type = o.presentation_type AND ao.created_at > NOW() - INTERVAL '14 DAY' LIMIT 1 ) a WHERE a.old_retail_price IS NOT NULL AND a.new_retail_price IS NOT NULL AND a.retail_price = a.new_retail_price
Behind the scenes
By now, you’ll probably need a little bit of context on what’s going on here: We’re using PostgreSQL as our main data store for streaming content, caring about correctness, normalization, constraints and consistency. To make the site fast (and enabling multi-language full-text search), we’re constantly materializing all our entities and sub-entities into large, nested ElasticSearch title documents that contain a denormalized view on our content catalogue.
Also, in order to always able to track what happened, we’re never writing metadata into a movie entity directly for example - but we’re writing them into a movie reference. A reference to us is the representation of metadata as seen from a certain provider (i.e. iTunes or Hulu). When these are matched to a movie, we are composing all the different knowledge pieces to a coherent JustWatch title and materializing the result back into ElasticSearch for retrieval and display.
As we have multiple types of titles we are dealing with (movies, shows, seasons, …), we opted for modelling them as separate tables, joined by a composite
(object_type, object_id) key. There is no standard way of modelling object inheritance into SQL, but after comparing benefits and drawbacks, we decided for a multi-table inheritance model in the end.
What happens in this query now is a part of this materialization process that summarizes the available streaming offers for a title. Furthermore, as we want our users to filter for both absolute and relative price drops, we are tracking our own history of price changes in the audit_offers table and use this data to materialize the differences between the historic and the current price into the final title document as well.
Continuing the chase
So this query was quite a lot slower than the other queries we were executing.
Unfortunately, it wasn’t anything obvious. We already had nicely fitting indices on all the relevant
JOINed columns here (composite index on
PRIMARY index on
(id)). Any completely missing indices for this query we’d probably have found in the previous steps anyway.
So I did an
EXPLAIN ANALYZE and looked for clues in the query plan - table scans are nearly always bad, index scans come in a close second - but still no good lead. By the way: If you’re not familiar deciphering execution plans in text form, use this amazing visualizer tool to help you out, especially in the beginning. “Explaining EXPLAIN” will get you to the nitty gritty details, should really want to go down this rabbit hole later on.
At first, I still was lost - but when I slowly scrolled through the execution plan a second time, I got pretty suspicious about the cost estimate for joining the
audit_offers table. Something wasn’t right about that - and then it suddenly hit me. The
audit_offers table might be the actual problem here, as it’s an append only table: That means it we were throwing all price changes in there, but never ever cleaning it up. Now what happens if we have lots of price changes for that one movie?
Let’s find out:
-- get amount of historic offers in worst case scenario SELECT object_type, object_id, count(*) FROM audit_offers GROUP BY object_type, reference_id ORDER BY count(*) DESC LIMIT 1 -- result: 1204
Ouch. 1200 row lookups for materializing one single title.
We’re getting onto something here. For each title, we have lots of different providers and offer types - and there are apparently some providers that change their movie prices almost every day… No wonder this one query does lots of random I/O. But while the database is looking back to the beginning of time, the query makes clear we’re obviously only interested in price changes from the last 14 days here. So what can we do?
Towards a solution
Now the most common solution to less row lookups would be using an index. We could create one on
created_at and call it a day. But we’d probably only see a very small speedup. Why is that?
As we’re not interested in “the last 14 days of price changes overall” here, but only in “the last 14 days of price changes of one single reference”, what the database would need to do with two separate indices is called a
bitmap scan. It would first look up
everything that ever happened to the offers of this reference from the index, then fetch
everything that happened within the last 14 days, then doing a bitmap scan (which is a fancy word for calculating the intersection between the slices) and only then look up the actual rows that lighted up in both result sets.
If this sounds complex and taxing, it actually is - and for this reason it’s an optimization that often times so small the database won’t even attempt it.
Fortunately, in this case, there’s a much better way. Remember that we have an index of
(object_type, object_id) already. Now a nice property of btree indices (which is the standard case) is that all the values inside the index are sorted.
This means that on the last column of a composite index you can’t just make exact lookups, but efficient range lookups too - but only if you always use all columns at once. This comes in super helpful in this case - all we need to do is add
created_at at the end of the already existing index.
For a visual clue on how and why this works, remember how this index is organized:
object_type, object_id, created_at movie, 14, 2015-05-18_12:34:56 movie, 15, 2015-05-15_12:34:56 movie, 15, 2015-05-16_12:34:56 movie, 15, 2015-05-17_12:34:56 <-- querying for (movie, 15, '> 2015-05-17') movie, 15, 2015-05-18_12:34:56 <-- will need to fetch only these two rows movie, 17, 2015-05-15_12:34:56
Getting into prod
After getting to the root of the problem, all that’s left is fixing it. Let’s write a migration in order to get this out into production. I wrote a migration (you’re using migrations, right?) in order to fix the issue:
SET SCHEMA 'contentdb'; -- Extend index to help deal with an ever growing audit_offers table -- that presents an O(time) problem. Used in often executed -- query 'history.lazy.sql' CREATE INDEX CONCURRENTLY audit_offers_created_idx ON audit_offers (object_type, reference_id, created_at); DROP INDEX audit_offers_idx;
As we’re caring for high availability, I opted for creating this index concurrently. Compared to MySQL, Postgres fortunately fully supports this feature out of the box - which is worth its weight in gold if you’re trying to stay up and not lock any tables. (Braintree did a nice overview of how to do HA operations on a Postgres backend).
Unfortunately, there’s a single drawback to this option: It absolutely cannot run within a transaction.
This would not have been a problem if our (otherwise totally awesome) migration tool Flyway would not enforce transactions around every migration - which also usually is a pretty sensible feature.
In this case, it meant we had to execute the migrations live on the DB, letting our migrations diverge from the actual database schema. Fortunately, the Flyway team let us co-sponsor support for non-transactional migrations in their upcoming 4.1 release and we’re already looking very much forward to trying them out.
After waiting for almost an hour in order for the index building to complete (doing this
CONCURRENTly will run in a kind of background thread, taking significantly longer), I was curious for the effects.
And voilà, our read IOPS more than halved!
That was it, lots more room to scale and one less headache for our SRE team - we hoped you learned some useful tips and tricks along the way about the best general purpose open source RDMBS of our time.
I’m Dominik Raute, CTO and part-time Database Whisperer at JustWatch, signing off from our first post. If you like to learn more about databases, we’re currently hiring Data Engineers, SREs and Business Intelligence Specialists.
Stay tuned for more - we’ve only just begun.