Making a PostgreSQL query 1000 times faster
Abstract
There’s a handy feature in PostgreSQL called row constructor comparisons that allows me to compare tuples of columns. Instead of doing CreateAt > ?1 OR (CreateAt = ?1 AND Id > ?2), we can do (CreateAt, Id) > (?1, ?2).
Mattermost uses Elasticsearch in large deployments to reduce the stress the database suffers when running search queries while returning even better, fine-tuned results. For this to work, Elasticsearch needs to index all data we want to search for, so that it can retrieve it quickly when requested. Once the data is indexed, all works as expected, our users are happy, our developers are happy, and life is good.
However, I recently tested something I hadn’t tried in a while: indexing a fairly large database (with 100 million posts) completely from scratch. When the database is already indexed, indexing of new posts and files is quite fast, so the normal usage of Elasticsearch is flawless, but an index from scratch is slow:
This screenshot shows our job system informing us that the Elasticsearch indexing job has been running for around 18 hours and hasn’t even finished half of what it needs to do 🙁 And the progress was not linear, slowing down more and more the further it progressed! Something was clearly wrong here.
The culprit
Let’s start the investigation by identifying what exactly is slow here, since there are many moving parts: it could be the database, the Mattermost server, the Elasticsearch server, the network, or an under-resourced machine.
Looking at our Mattermost Performance Monitoring Grafana dashboard while the indexing job was running, the problem was crystal clear at first sight:
The graph above shows the top 10 database calls by duration, which comes down (simplifying it a bit) to the following Prometheus query:
topk(10,
sum(increase(mattermost_db_store_time_sum[5m])) by (method)
/
sum(increase(mattermost_db_store_time_count[5m])) by (method)
)
We take the average time each database method needs to finish, computed over the last 5 minutes, plot that over time, and show only the top 10 methods.
And looking at the graph, there’s a clear outlier there: PostStore.GetPostsBatchForIndexing, which is taking more and more time as the indexing job progresses, eventually hitting the 30-second mark, when it simply times out. Looking at the code, we see the exact query that’s causing all these problems:
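The statement in the store layer boils down to something like this (a simplified sketch, not the verbatim code; ?1 is the CreateAt cursor, ?2 the post ID cursor and ?3 the batch size):

SELECT Posts.*, Channels.TeamId
FROM Posts
LEFT JOIN Channels ON Posts.ChannelId = Channels.Id  -- only needed to return the post's TeamId
WHERE Posts.CreateAt > ?1
   OR (Posts.CreateAt = ?1 AND Posts.Id > ?2)
ORDER BY Posts.CreateAt ASC, Posts.Id ASC
LIMIT ?3;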
This query does a simple job: it orders all posts by their creation timestamp and returns the first N posts (a limit provided by the caller) that are newer than a provided timestamp. The query also checks the ID of the post to cover the rare case of two posts being created at the exact same millisecond. It also joins with the Channels table, but only to return the post’s TeamId along with the rest of its data.
The indexing job is then built on top of this query, running it over and over again until all posts are indexed. In pseudo-code, it looks something like this:
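The server is written in Go, so a rough sketch of that loop (with hypothetical helper names, not the actual functions in the codebase) looks like this:

// Post is a trimmed-down version of what we actually index.
type Post struct {
	Id       string
	CreateAt int64 // creation timestamp, in milliseconds
}

// indexAllPosts is an illustrative sketch of the job, not the real Mattermost code.
// fetchBatch stands in for the store call (PostStore.GetPostsBatchForIndexing) and
// indexBatch for the call that pushes a batch of posts to Elasticsearch.
func indexAllPosts(
	fetchBatch func(createAt int64, id string, limit int) []Post,
	indexBatch func([]Post) error,
	batchSize int,
) error {
	lastCreateAt, lastID := int64(0), ""
	for {
		// Get the next batch of posts newer than the cursor.
		posts := fetchBatch(lastCreateAt, lastID, batchSize)
		if len(posts) == 0 {
			return nil // nothing left to index
		}
		if err := indexBatch(posts); err != nil {
			return err
		}
		// Advance the cursor to the newest post we just indexed.
		last := posts[len(posts)-1]
		lastCreateAt, lastID = last.CreateAt, last.Id
	}
}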
The first investigation
Now that we understand what we’re dealing with and what piece of the whole system is misbehaving, we can start the real investigation.
Trying to optimize a SQL query never follows a predefined plan, but one can apply some methodologies that can help find the problem and possibly the solution faster and more consistently. I didn’t do any of that, though, and my investigation was chaotic and instinct-driven at first. Never do this at home.
My first concern with this query was: why on earth does it take longer and longer each time it gets executed? The answer was actually pretty simple, but I had no idea at the beginning. So the first thing I looked at was the data I already had at hand: the AWS Performance Insights for the query:
This gave me the first clue: the query spends nearly 100% of its running time on CPU. This means the query is not bottlenecked by the time it takes to read or write data, but by the time it takes to process that data. Interesting.
Then I started playing with the query. The first thing I did was, of course, run an EXPLAIN ANALYZE. Lately, I’ve gotten into the habit of adding the BUFFERS option to this, so the EXPLAIN gives a bit more data. There’s a fantastic blog post on the BUFFERS option, so I won’t go into much detail here, but by copying the definition from pgMustard, which is short and to the point, we can understand what this does:
BUFFERS extends EXPLAIN by adding values to describe the data read/written by each operation.
We’re not only getting info on the query plan, but also on the actual data we move around. Neat.
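For our query, that just means prefixing the statement (the ?1, ?2 and ?3 placeholders need to be replaced with real values before running it):

EXPLAIN (ANALYZE, BUFFERS)
SELECT Posts.*, Channels.TeamId
FROM Posts
LEFT JOIN Channels ON Posts.ChannelId = Channels.Id
WHERE Posts.CreateAt > ?1
   OR (Posts.CreateAt = ?1 AND Posts.Id > ?2)
ORDER BY Posts.CreateAt ASC, Posts.Id ASC
LIMIT ?3;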
Now, choosing some data for the arguments the query receives (the timestamp, the post ID and the limit), I ran EXPLAIN (ANALYZE, BUFFERS) and got the following result:
That’s… a lot to digest. Some key points:
- The second-to-last line, the execution time, says this query took around 24 seconds to finish. This is what we want to fix.
- The query plan shows the nodes. From inner to outer:
- First, a couple of Index scan nodes:
- One on the Posts table, with a Filter applying the WHERE condition, using an index covering the CreateAt field. Why is it not using the other index this table has on (CreateAt, Id), though? We don’t know yet.
- Another on the Channels table, with an Index Cond applying the join condition (the post’s ChannelId equals the channel’s Id), using the index on the primary key, which is the Id.
- Then, it performs the JOIN itself between the Posts and Channels tables.
- Finally, it does an Incremental Sort, using Posts.CreateAt and Posts.Id as the sort keys, noting that one of them, Posts.CreateAt, is already sorted because we used the idx_posts_create_at index before.
- The number of shared hit buffers is forty million on the Index Scan node for the Posts table and on the wrapping Nested Loop Left Join node. That’s a lot.
That third point is important. Although the number of hits inside a nested loop has its nuances, it gives us another clue: this is where the CPU is being spent. Even if this doesn’t mean we’re reading 40 million blocks from the cache (we would need a pretty large cache for that), it does mean we’re processing 40 million blocks. We can actually check how much data that is:
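Taking the roughly 40 million buffer hits at face value, and knowing that PostgreSQL blocks are 8 kB by default, we can let the database do the math for us:

SELECT pg_size_pretty(40000000 * 8192::numeric);  -- ≈ 305 GB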
So that query is processing three hundred gigabytes. Not bad.
The first solution
The first thing I tried in order to fix this was removing the JOIN with the Channels table. Both tables are pretty large, so I was worried the join was adding a lot of overhead to the query. That ended up not being the case: the time taken by the query without the JOIN was virtually the same as with it.
What next, then? The next step was mostly luck. I wanted to keep understanding the query, so I kept simplifying it. If the JOIN was not the problem, the only other complexity was in the WHERE condition: CreateAt > t OR (CreateAt = t AND Id > id).
So I ran EXPLAIN with the query skipping the second part of the OR; i.e.:
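In sketch form, that’s the same query with only the first branch of the WHERE left in:

SELECT Posts.*, Channels.TeamId
FROM Posts
LEFT JOIN Channels ON Posts.ChannelId = Channels.Id
WHERE Posts.CreateAt > ?1
ORDER BY Posts.CreateAt ASC, Posts.Id ASC
LIMIT ?3;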
And I got a very interesting result:
See that execution time? That’s 30 milliseconds, or ~1000 times less than the original query! And the shared hit buffers are down to 13949, or a more sane 109 MiB. Before I could keep looking at this plan, I immediately tried to run a query with only the second part of the OR condition; i.e.:
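Again in sketch form:

SELECT Posts.*, Channels.TeamId
FROM Posts
LEFT JOIN Channels ON Posts.ChannelId = Channels.Id
WHERE Posts.CreateAt = ?1 AND Posts.Id > ?2
ORDER BY Posts.CreateAt ASC, Posts.Id ASC
LIMIT ?3;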
Without thinking too much about it, I honestly expected this to take around 24 seconds again, so that I could focus on optimizing only this part. But here’s what I got:
It took 0.047 milliseconds!!!! That’s… well, ~500000 (five hundred thousand) times quicker than the original query. What????
Ah, wait, it returned no rows, that makes sense. And it didn’t return rows because it covers a quite rare case, since we’re looking for a post happening in the same millisecond as another one and whose ID is larger than the given one. Not impossible, but not frequent either.
But the important question was: where did my original 24 seconds go? Why are these two separate queries taking 30 milliseconds in total instead of 30 seconds?
I still had too many questions and not enough answers, but I made a decision: I’ll split the original query into those two, run them one after the other, and collect the results in the Mattermost server code afterwards. It will make the code a bit more complex, but it’ll be worth it. I’ll do that tomorrow morning, I said to myself.
Actually understanding everything
I didn’t do that the next morning either, of course, and I instead went back to understand what was happening. Why were we getting so many shared hit buffers in the original query? Why was it taking longer each time it got executed? Why did the split queries return the exact same answer but 1000 times faster?
Well, I already had all the data I needed to understand everything; I just hadn’t paid it the attention it deserved.
Looking back at the query plans of the original query and the one filtering only by CreateAt > ..., we see a very important difference. Let’s look a bit closer at the Index Scan node in both cases.
First, the original query:
Now, the one filtering only by CreateAt > ...:
Did you spot the difference? The first one uses a Filter, removing 40 million rows in the process, while the second one uses an Index Cond, just checking the index itself.
And that’s the main difference. Going back to the short and to-the-point descriptions of pgMustard:
Index Cond is the condition used to find the locations of rows from the index. Postgres uses the structured nature of the index to quickly jump to the rows it’s looking for.
Although similar in purpose to “Filter”, the implementation is completely different. In a “Filter”, rows are retrieved and then discarded based on their values. As such, you can find an “Index Cond” and a “Filter” on the same operation.
So that’s the key difference: the original query is going through all the rows in the Posts table, then discarding or keeping them based on the condition, while the second one just looks at the index and retrieves the rows it needs. One question answered!
But this also explains why the query takes longer and longer each time it gets executed: as it orders the posts from oldest to newest, and each query uses the newest timestamp of the last batch, it needs to discard more and more rows each time. To see this more clearly, let’s say our Posts table has 1000 posts, and we index batches of 100 posts each time. Then the job runs as follows:
- The first batch will return posts 0 to 99, discarding no posts at all, so it will be quite fast. In total, it has processed 100 posts, as one could expect.
- The second batch will return posts 100 to 199, but only after having discarded the first 100 posts. Those first 100 posts were read, though, so in total, it has processed 200 posts.
- The third batch will return posts 200 to 299, discarding the first 200 posts. That’s 300 posts processed in total.
- …
- The tenth and last batch will return posts 900 to 999, discarding the first 900 posts. This last batch processed 1000 posts, that is, the whole table.
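Adding it up: indexing this 1,000-post table processes 100 + 200 + … + 1,000 = 5,500 rows, five and a half times the size of the table. In general, indexing P posts in batches of B processes roughly P²/(2B) rows, which grows quadratically with the table size, and that’s exactly the slowdown curve we saw in the job.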
My other two questions have been answered as well: the original query had so many shared hit buffers and was taking longer and longer simply because it was eventually going through the one hundred million posts in our database. Poor query.
The beautiful solution
Although it’s true that we now understand everything, we are still left with the original task: split the original query in two, run them independently, and collect the results afterwards in the server.
There’s an alternative, though! Now that I understood everything, I was able to look for similar cases with the exact incantation needed to summon the gods of StackOverflow. And they delivered. Thank you, StackOverflow. And thank you, Laurenz Albe.
It turns out there’s a handy feature in PostgreSQL called row constructor comparisons that allows me to compare tuples of columns. That’s exactly what we need. Instead of doing CreateAt > ?1 OR (CreateAt = ?1 AND Id > ?2), we can do (CreateAt, Id) > (?1, ?2). And row constructor comparisons are lexicographical, meaning that it’s semantically the same as what we had before!
So we just need to convert our original query to the following:
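In sketch form, the only change is the WHERE clause:

SELECT Posts.*, Channels.TeamId
FROM Posts
LEFT JOIN Channels ON Posts.ChannelId = Channels.Id
WHERE (Posts.CreateAt, Posts.Id) > (?1, ?2)
ORDER BY Posts.CreateAt ASC, Posts.Id ASC
LIMIT ?3;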
And what’s the result? Well, let’s see:
34 milliseconds, yay! A query that’s 1000 times faster and even more readable than the original one: a textbook win-win.
If you take a look at the Index Scan node, you’ll notice that PostgreSQL is now using an Index Cond, thus checking the index to find those posts, and only afterwards applying a Filter. The shared hit buffers went down to 9955, which is only 78 MiB. Isn’t it beautiful?
(Why the query planner doesn’t automatically convert a condition like x > a OR (x == a AND y > b) to (x, y) > (a, b) is something I still don’t understand to this day, though.)
The solution we need
So I was done. But then I remembered that we not only support PostgreSQL, but also MySQL. And that made me shiver.
Of course, the behaviour in MySQL was the complete opposite 🙁
Let’s run an EXPLAIN ANALYZE (no BUFFERS in MySQL, sorry) on the query running in MySQL. First, the original one:
0.18 seconds. That’s… actually good. What about the new one?
Ah, good, 50 times slower. Uuuugh.
This outcome had no easy escape: I had to split the code flow in two, using the good-looking query on PostgreSQL and the original one on MySQL. And that’s the change that finally got merged. A bit sad, but it is what it is.
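The shape of that change is roughly the following (an illustrative sketch with hypothetical names; the real change lives in the post store and goes through its query builder):

// getPostsBatchCondition returns the WHERE condition for the batch query,
// depending on which database we're talking to. Illustrative sketch only.
func getPostsBatchCondition(driverName string) string {
	if driverName == "postgres" {
		// Row constructor comparison: PostgreSQL turns this into an Index Cond.
		return "(Posts.CreateAt, Posts.Id) > (?1, ?2)"
	}
	// MySQL performs better with the original OR condition.
	return "Posts.CreateAt > ?1 OR (Posts.CreateAt = ?1 AND Posts.Id > ?2)"
}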
What we learned
This was a fun journey after all, and definitely a good optimization: who doesn’t love making something a thousand times faster? It ended up landing in v9.7.0, and was backported to the 9.5 ESR as well, starting in v9.5.3.
Also, I learned a couple of very interesting things:
- Always use BUFFERS when running an EXPLAIN. It gives some data that may be crucial for the investigation.
- Always, always try to get an Index Cond (called Index range scan in MySQL) instead of a Filter.
- Always, always, always assume PostgreSQL and MySQL will behave differently. Because they do.