I am working on learning my SQL and optimization skills, and I am currently using the recently released Snapchat phone number list as my test data. This data consists of 2 tables: A records table containing 4.6 million phone number and username pairs, and a areacodes table containing a 3-digit area code from the phone number and a location name for it. I'm using postgresql for it, as that's the database I am currently using the most.
I am trying to produce, and optimize the production of, a table of all of the area codes listing how many phone numbers are in each one. My first try looked like this, using a left join to account for my guess that there might be area codes in the records table not in the area codes table:
This worked, but it took about 27 seconds to run. Looking at the execution plan:
I get the not terribly surprising result that most of the time seems to be spent in sorting the entire records table based on the result of the substring function. I start reading this site some more, and thinking of ways I can do better. First, I create an index on the phone column in records - it's a string column, to handle the fact that the numbers all have the last 2 digits obfuscated as XX. Then I start writing some smaller queries to test the speed. I first try to get the count for a single area code using the following:
With the execution plan:
Takes a full second to run, scans the table, and doesn't use the index at all. Not too surprising. I decide to try a range instead of a function to actually use the index:
Now I get an index only scan, taking 22ms. That's more like it! I then rewrite the full query in this style:
And get the execution plan:
Hmm, 8.5 seconds, not what I was expecting. I figured it could do this query in roughly 22ms*76 rows = 1.6s. It looks like the trouble is that Recheck Cond and Bitmap Heap scan. Seems that, after it runs the indexed query to get the count of phone numbers in that area code, it then rescans the records table to get the full row for those records. I'm not sure why it's doing that, as I'm not using any information about those rows, and the way the query is set up, I don't think I could easily put in any information about them if I wanted to. I've tried changing things around in several ways, but I can't get it to stop doing that time-consuming bitmap heap scan. Does anybody have any idea how I could change my query to prevent that?
asked Jan 04 at 20:13
Let's start with getting the same results as you did. Instead of snapchat data, I've just generated test data using
Now, lets look at your questions:
Why does PG do a Bitmap Heap Scan (not Index Only Scan?)
First of all, you seem to expect to get the
The Bitmap Index+Heap Scan operations are an optimization of the regular Index Scan: Instead of accessing the Heap right after fetching a row from the index, the Bitmap Index Scan completes the index lookup first, keeping track of all rows that might be interesting (in a, you guess it, bitmap). The Bitmap Heap Scan than accesses all the interesting heap pages sequentially. For that, it first sorts the bitmap. The aim is to reduce the random IO by using a little memory and CPU for the bitmap. The Bitmap Index+Heap Scan's make only sense if you access many rows. Further the Bitmap itself makes only sense if you intend to fetch data from the heap. Consequently, the Bitmap operations are not the right tool to implement index-only scans. Only the Index Only Scan operation actually implements index-only scans.
So, why doesn't it do an Index Only Scan?
Simple: Because you didn't
Right after that, the execution plan looks like this:
The reason this works is two-fold: * Analyze - Analyzing the table after filling it provides correct statistics to the optimizer to make it's estimates. * Vacuum - Here it comes: PostgreSQL's Index-Only Scan only works on vacuumed tables. If you don't vacuum, even the Index Only Scan operation must visit the heap to check if the row is visible. This is just a limitation how PostgreSQL implements Indexes (without visibility information) and how they made Index Only Scans working nevertheless—if vacuumed ;)
Now let's continue with questions you didn't ask:
But what about the original query?
Considering that you have an index on
But that means nothing more than we need to create an index that has the order the group by needs — an expression index:
Now, the index order is in line with the order required by the sort operation and the execution plan looks like this:
Hmm, let's analyze the table again, just because that should be done when adding expression indexes. Now the execution plan looks like this:
Although the new index isn't used, the sort operation has gone. Instead it uses a HashAggregation (was previously GroupAggregate).
Here is the fun part: the pure presence of the index (and new stats!) has influenced the execution plan and execution speed, even though the index wasn't used. But as a matter of fact, the statistics about the index were used during optimization.
Note the following difference between the execution plans:
Without the index, and it's statistics, the optimize estimated that the group by operation will yield 4.5 million rows (essentially not grouping anything). However, as
Wanna see the statistic that makes all the difference?
Note how the result of this query doesn't change by just creating the index, but by analyze after creating the index.
This problems might be more nicely solvable (just adding stats without the useless index) by a feature like Cross-Column Statistics. However, this feature doesn't exist yet, so it is hard to predict if it would be able to solve this problem too.
But why the heck does it still not use the index?
Here comes a very unfortunate answer: PostgreSQL, as of 9.3, can't use expression indexes for index only scans (ref).
That means, the query planner can vary only these options * SeqScan vs. Index Only Scan on the full phone index * HashAggragate vs. Sort+GroupAggregate.
Obviously, HashAggregate is the better option. Further, a SeqScan does sequential IO while a Index Only Scan does scattered IO - hence the SeqScan wins (especially because it cannot benefit from the index order!).
But how fast would it be, if the expression index could be used with GroupAggregate?
To answer the question how fast it could possibly get, I've added the areacode to the records table itself. Now we can create a straight index on areacode and try to use it with GroupAggregate without sort + Index Only Scan.
And it still does a SeqScan:
Obviously for the same reason as before (sequential IO vs. random IO). So let's "disable" SeqScan just to see the IOS:
So, we have finally got what we expected: IOS + GroupAggregate (not needing any memory, because of the pre-sorted data = pipelined execution). The execution time is the best we've seen so far.
Note that disabling the SeqScan is just done for demo purposes. In Production you would probably need to fiddle with the costing parameters.
Please also not that a part of this speedup is also because the substring expression is not evaluated during execution anymore. However, this is the performance you could get if PG would be able to use an expression index for IOS.
Although PG can currently not use an expression index for IOS, the statistics for the expression index are still useful for the optimizer to chose the HashAggregate over Sort+GroupAggregate. Once taken tha HashAgg performance is way better (7s). The proper IOS on an expression index would cut this execution time down to the half again.
Your alternative query using the sub-select in the select list might still be an option but doesn't consider
Creating the test data
Execution plan with original query
Execution plan with sub-query