Question

I am working on learning my SQL and optimization skills, and I am currently using the recently released Snapchat phone number list as my test data. This data consists of 2 tables: A records table containing 4.6 million phone number and username pairs, and a areacodes table containing a 3-digit area code from the phone number and a location name for it. I'm using postgresql for it, as that's the database I am currently using the most.

I am trying to produce, and optimize the production of, a table of all of the area codes listing how many phone numbers are in each one. My first try looked like this, using a left join to account for my guess that there might be area codes in the records table not in the area codes table:

 select ac.name, ac_cnt.areacode, ac_cnt.count from
 (select r_ac.areacode as areacode, count(*) as count from 
    (select substring(phone from 0 for 4) as areacode from records) r_ac group by areacode) as ac_cnt 
 left join areacodes as ac on ac_cnt.areacode = ac.areacode order by ac_cnt.count desc;

This worked, but it took about 27 seconds to run. Looking at the execution plan:

 Sort  (cost=948711.44..949736.70 rows=410101 width=56) (actual time=27161.412..27161.419 rows=76 loops=1)
   Sort Key: (count(*))
   Sort Method: quicksort  Memory: 31kB
   ->  Hash Left Join  (cost=835547.82..882443.51 rows=410101 width=56) (actual time=22519.462..27161.286 rows=76 loops=1)
         Hash Cond: (("substring"((records.phone)::text, 0, 4)) = (ac.areacode)::text)
         ->  GroupAggregate  (cost=835545.11..875243.53 rows=410101 width=11) (actual time=22519.404..27161.134 rows=76 loops=1)
               ->  Sort  (cost=835545.11..847069.16 rows=4609621 width=11) (actual time=22519.389..26544.989 rows=4609621 loops=1)
                     Sort Key: ("substring"((records.phone)::text, 0, 4))
                     Sort Method: external merge  Disk: 63080kB
                     ->  Seq Scan on records  (cost=0.00..89013.26 rows=4609621 width=11) (actual time=0.039..1230.948 rows=4609621 loops=1)
         ->  Hash  (cost=1.76..1.76 rows=76 width=20) (actual time=0.034..0.034 rows=76 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 4kB
               ->  Seq Scan on areacodes ac  (cost=0.00..1.76 rows=76 width=20) (actual time=0.007..0.015 rows=76 loops=1)
 Total runtime: 27171.353 ms

I get the not terribly surprising result that most of the time seems to be spent in sorting the entire records table based on the result of the substring function. I start reading this site some more, and thinking of ways I can do better. First, I create an index on the phone column in records - it's a string column, to handle the fact that the numbers all have the last 2 digits obfuscated as XX. Then I start writing some smaller queries to test the speed. I first try to get the count for a single area code using the following:

 select count(*) from records where substring(phone from 0 for 4) = '754';

With the execution plan:

  Aggregate  (cost=100594.93..100594.94 rows=1 width=0) (actual time=1016.151..1016.151 rows=1 loops=1)
   ->  Seq Scan on records  (cost=0.00..100537.32 rows=23048 width=0) (actual time=587.029..1015.407 rows=10131 loops=1)
         Filter: ("substring"((phone)::text, 0, 4) = '754'::text)
         Rows Removed by Filter: 4599490
 Total runtime: 1016.192 ms

Takes a full second to run, scans the table, and doesn't use the index at all. Not too surprising. I decide to try a range instead of a function to actually use the index:

select count(*) from records where phone between '7540000000' and '7549999999';
 Aggregate  (cost=8.45..8.46 rows=1 width=0) (actual time=21.511..21.511 rows=1 loops=1)
   ->  Index Only Scan using records_on_phone on records  (cost=0.43..8.45 rows=1 width=0) (actual time=0.096..20.750 rows=10131 loops=1)
         Index Cond: ((phone > '7540000000'::text) AND (phone < '7549999999'::text))
         Heap Fetches: 10131
 Total runtime: 21.573 ms

Now I get an index only scan, taking 22ms. That's more like it! I then rewrite the full query in this style:

select name, areacode, 
 (select count(*) from records where phone > areacodes.areacode || '0000000' 
  and phone < areacodes.areacode || '9999999') as count 
from areacodes order by count desc;

And get the execution plan:

  Sort  (cost=2392784.21..2392784.40 rows=76 width=20) (actual time=8504.098..8504.107 rows=76 loops=1)
   Sort Key: ((SubPlan 1))
   Sort Method: quicksort  Memory: 31kB
   ->  Seq Scan on areacodes  (cost=0.00..2392781.84 rows=76 width=20) (actual time=9.208..8503.955 rows=76 loops=1)
         SubPlan 1
           ->  Aggregate  (cost=31483.94..31483.95 rows=1 width=0) (actual time=111.890..111.890 rows=1 loops=76)
                 ->  Bitmap Heap Scan on records  (cost=592.68..31426.32 rows=23048 width=0) (actual time=100.770..107.540 rows=60651 loops=76)
                       Recheck Cond: (((phone)::text > ((areacodes.areacode)::text || '0000000'::text)) AND ((phone)::text < ((areacodes.areacode)::text || '9999999'::text)))
                       ->  Bitmap Index Scan on records_on_phone  (cost=0.00..586.92 rows=23048 width=0) (actual time=100.710..100.710 rows=60651 loops=76)
                             Index Cond: (((phone)::text > ((areacodes.areacode)::text || '0000000'::text)) AND ((phone)::text < ((areacodes.areacode)::text || '9999999'::text)))
 Total runtime: 8504.153 ms

Hmm, 8.5 seconds, not what I was expecting. I figured it could do this query in roughly 22ms*76 rows = 1.6s. It looks like the trouble is that Recheck Cond and Bitmap Heap scan. Seems that, after it runs the indexed query to get the count of phone numbers in that area code, it then rescans the records table to get the full row for those records. I'm not sure why it's doing that, as I'm not using any information about those rows, and the way the query is set up, I don't think I could easily put in any information about them if I wanted to. I've tried changing things around in several ways, but I can't get it to stop doing that time-consuming bitmap heap scan. Does anybody have any idea how I could change my query to prevent that?

Answer 1

Let's start with getting the same results as you did. Instead of snapchat data, I've just generated test data using generate_series (see Appendix below). I'm running on 9.3.2 and was able to get the same executionplans as you posted with the scripts mentioned in the Appendix.

Now, lets look at your questions:

Why does PG do a Bitmap Heap Scan (not Index Only Scan?)

First of all, you seem to expect to get the Bitmap Index Scan operation without Bitmap Heap Scan. Although that sounds logical if you'd expect an index only scan, it's not logical if you keep in mind what these Bitmap operations try to accomplish.

The Bitmap Index+Heap Scan operations are an optimization of the regular Index Scan: Instead of accessing the Heap right after fetching a row from the index, the Bitmap Index Scan completes the index lookup first, keeping track of all rows that might be interesting (in a, you guess it, bitmap). The Bitmap Heap Scan than accesses all the interesting heap pages sequentially. For that, it first sorts the bitmap. The aim is to reduce the random IO by using a little memory and CPU for the bitmap. The Bitmap Index+Heap Scan's make only sense if you access many rows. Further the Bitmap itself makes only sense if you intend to fetch data from the heap. Consequently, the Bitmap operations are not the right tool to implement index-only scans. Only the Index Only Scan operation actually implements index-only scans.

So, why doesn't it do an Index Only Scan?

Simple: Because you didn't VACUUM ANALYZE the table:

vacuum analyze records;

Right after that, the execution plan looks like this:

Sort  (cost=360661.70..360662.81 rows=445 width=11) (actual time=3869.051..3869.160 rows=445 loops=1)
 Sort Key: ((SubPlan 1))
 Sort Method: quicksort  Memory: 34kB
->  Seq Scan on areacodes  (cost=0.00..360642.12 rows=445 width=11) (actual time=8.336..3868.565 rows=445 loops=1)
     SubPlan 1
       ->  Aggregate  (cost=810.40..810.41 rows=1 width=0) (actual time=8.690..8.691 rows=1 loops=445)
             ->  Index Only Scan using records_on_phone on records  (cost=0.44..753.30 rows=22843 width=0) (actual time=0.141..7.064 rows=6470 loops=445)
                   Index Cond: ((phone > ((areacodes.areacode)::text || '0000000'::text)) AND (phone < ((areacodes.areacode)::text || '9999999'::text)))
                   Heap Fetches: 0
Total runtime: 3869.311 ms

The reason this works is two-fold: * Analyze - Analyzing the table after filling it provides correct statistics to the optimizer to make it's estimates. * Vacuum - Here it comes: PostgreSQL's Index-Only Scan only works on vacuumed tables. If you don't vacuum, even the Index Only Scan operation must visit the heap to check if the row is visible. This is just a limitation how PostgreSQL implements Indexes (without visibility information) and how they made Index Only Scans working nevertheless—if vacuumed ;)

Now let's continue with questions you didn't ask:

But what about the original query?

Considering that you have an index on phone wouldn't it be possible to use this index to avoid the sort operation? Unfortunately not, because you are not sorting on phone but on substring(phone, 0,4). Although both expressions yield the same order, the optimizer doesn't know this and can therefore not use the index to avoid the sort operation.

But that means nothing more than we need to create an index that has the order the group by needs — an expression index:

create index phone_on_ac on records (substring(phone, 0,4));

Now, the index order is in line with the order required by the sort operation and the execution plan looks like this:

 -- unchanged, still doing sort --

Hmm, let's analyze the table again, just because that should be done when adding expression indexes. Now the execution plan looks like this:

Sort  (cost=111548.20..111550.43 rows=894 width=47) (actual time=6904.681..6904.902 rows=900 loops=1)
  Sort Key: (count(*))
  Sort Method: quicksort  Memory: 52kB
  ->  Hash Left Join  (cost=111471.97..111504.38 rows=894 width=47) (actual time=6902.616..6903.989 rows=900 loops=1)
     Hash Cond: (("substring"((records.phone)::text, 0, 4)) = (ac.areacode)::text)
     ->  HashAggregate  (cost=111458.96..111470.13 rows=894 width=8) (actual time=6902.041..6902.670 rows=900 loops=1)
           ->  Seq Scan on records  (cost=0.00..88615.82 rows=4568626 width=8) (actual time=0.024..3825.362 rows=4568628 loops=1)
     ->  Hash  (cost=7.45..7.45 rows=445 width=11) (actual time=0.548..0.548 rows=445 loops=1)
           Buckets: 1024  Batches: 1  Memory Usage: 14kB
           ->  Seq Scan on areacodes ac  (cost=0.00..7.45 rows=445 width=11) (actual time=0.012..0.224 rows=445 loops=1)
Total runtime: 6905.132 ms

Although the new index isn't used, the sort operation has gone. Instead it uses a HashAggregation (was previously GroupAggregate).

Here is the fun part: the pure presence of the index (and new stats!) has influenced the execution plan and execution speed, even though the index wasn't used. But as a matter of fact, the statistics about the index were used during optimization.

Note the following difference between the execution plans:

GroupAggregate  (cost=781375.40..872747.92 *rows=4568626* width=8) (actual ... rows=900)
HashAggregate  (cost=111458.96..111470.13 *rows=894* width=8) (actual ... rows=900)

Without the index, and it's statistics, the optimize estimated that the group by operation will yield 4.5 million rows (essentially not grouping anything). However, as explain analyze shows in the actual part, it yields only 900 rows. However, after creating the index (and stats!) the optimizer expects 894 rows from the group by. In this case, the HashAggregate operation is more favorable. HashAggregate doesn't need to sort at all.

Wanna see the statistic that makes all the difference?

select *
 from pg_statistic
 join pg_class on (     pg_statistic.starelid = pg_class.oid
                    and pg_statistic.staattnum=1 -- should come from pg_attribute, lazy me
                  )
 where relname='phone_on_ac';

Note how the result of this query doesn't change by just creating the index, but by analyze after creating the index.

This problems might be more nicely solvable (just adding stats without the useless index) by a feature like Cross-Column Statistics. However, this feature doesn't exist yet, so it is hard to predict if it would be able to solve this problem too.

But why the heck does it still not use the index?

Here comes a very unfortunate answer: PostgreSQL, as of 9.3, can't use expression indexes for index only scans (ref).

That means, the query planner can vary only these options * SeqScan vs. Index Only Scan on the full phone index * HashAggragate vs. Sort+GroupAggregate.

Obviously, HashAggregate is the better option. Further, a SeqScan does sequential IO while a Index Only Scan does scattered IO - hence the SeqScan wins (especially because it cannot benefit from the index order!).

But how fast would it be, if the expression index could be used with GroupAggregate?

To answer the question how fast it could possibly get, I've added the areacode to the records table itself. Now we can create a straight index on areacode and try to use it with GroupAggregate without sort + Index Only Scan.

create index records_ac on records(areacode);

select ac.name, ac_cnt.areacode, ac_cnt.count from
(select areacode as areacode, count(*) as count from  records group by areacode) as ac_cnt
left join areacodes as ac on ac_cnt.areacode = ac.areacode order by ac_cnt.count desc;

And it still does a SeqScan:

 Sort  (cost=100124.88..100127.13 rows=899 width=19) (actual time=4742.925..4743.157 rows=900 loops=1)
 Sort Key: (count(*))
 Sort Method: quicksort  Memory: 52kB
  ->  Hash Left Join  (cost=100050.43..100080.77 rows=899 width=19) (actual time=4740.851..4742.264 rows=900 loops=1)
     Hash Cond: ((records.areacode)::text = (ac.areacode)::text)
     ->  HashAggregate  (cost=100037.42..100046.41 rows=899 width=4) (actual time=4740.279..4740.889 rows=900 loops=1)
           ->  Seq Scan on records  (cost=0.00..77194.28 rows=4568628 width=4) (actual time=0.043..1441.143 rows=4568628 loops=1)
     ->  Hash  (cost=7.45..7.45 rows=445 width=11) (actual time=0.547..0.547 rows=445 loops=1)
           Buckets: 1024  Batches: 1  Memory Usage: 14kB
           ->  Seq Scan on areacodes ac  (cost=0.00..7.45 rows=445 width=11) (actual time=0.012..0.222 rows=445 loops=1)
 Total runtime: 4743.384 ms

Obviously for the same reason as before (sequential IO vs. random IO). So let's "disable" SeqScan just to see the IOS:

set enable_seqscan = off;
 Sort  (cost=10000131572.45..10000131574.70 rows=899 width=19) (actual time=3508.713..3508.932 rows=900 loops=1)
 Sort Key: (count(*))
 Sort Method: quicksort  Memory: 52kB
 ->  Hash Left Join  (cost=10000000013.45..10000131528.35 rows=899 width=19) (actual time=5.158..3507.918 rows=900 loops=1)
     Hash Cond: ((records.areacode)::text = (ac.areacode)::text)
     ->  GroupAggregate  (cost=0.43..131493.98 rows=899 width=4) (actual time=4.620..3506.353 rows=900 loops=1)
           ->  Index Only Scan using records_ac on records  (cost=0.43..108641.85 rows=4568628 width=4) (actual time=0.050..1841.752 rows=4568628 loops=1)
                 Heap Fetches: 0
     ->  Hash  (cost=10000000007.45..10000000007.45 rows=445 width=11) (actual time=0.531..0.531 rows=445 loops=1)
           Buckets: 1024  Batches: 1  Memory Usage: 14kB
           ->  Seq Scan on areacodes ac  (cost=10000000000.00..10000000007.45 rows=445 width=11) (actual time=0.012..0.231 rows=445 loops=1)
Total runtime: 3509.155 ms

So, we have finally got what we expected: IOS + GroupAggregate (not needing any memory, because of the pre-sorted data = pipelined execution). The execution time is the best we've seen so far.

Note that disabling the SeqScan is just done for demo purposes. In Production you would probably need to fiddle with the costing parameters.

Please also not that a part of this speedup is also because the substring expression is not evaluated during execution anymore. However, this is the performance you could get if PG would be able to use an expression index for IOS.

Conclusion

Although PG can currently not use an expression index for IOS, the statistics for the expression index are still useful for the optimizer to chose the HashAggregate over Sort+GroupAggregate. Once taken tha HashAgg performance is way better (7s). The proper IOS on an expression index would cut this execution time down to the half again.

Your alternative query using the sub-select in the select list might still be an option but doesn't consider records without areacode mapping. If it is faster, than only because you miss many areacode mappings.

Appendix

Creating the test data

create table records (
    phone varchar(255) not null,
    areacode varchar(255) not null, -- for later demo
    username varchar(255) not null
);

insert into records
select gen, substring(gen::text, 0,4), 'user ' || gen
from generate_series('1234567', '78901234', 17) gen;

create table areacodes (
    areacode varchar(255) not null,
    name    varchar(255) not null
);

insert into areacodes
select gen, 'ac ' || gen
from generate_series(123, 567) gen;

create index records_on_phone on records (phone);

Execution plan with original query

Sort  (cost=1908248.88..1919670.45 rows=4568626 width=47) (actual time=51699.907..51700.126 rows=900 loops=1)
Sort Key: (count(*))
Sort Method: quicksort  Memory: 52kB
->  Hash Left Join  (cost=781388.42..981265.80 rows=4568626 width=47) (actual time=45865.108..51699.041 rows=900 loops=1)
     Hash Cond: (("substring"((records.phone)::text, 0, 4)) = (ac.areacode)::text)
     ->  GroupAggregate  (cost=781375.40..872747.92 rows=4568626 width=8) (actual time=45864.545..51697.405 rows=900 loops=1)
           ->  Sort  (cost=781375.40..792796.97 rows=4568626 width=8) (actual time=45858.621..49677.266 rows=4568628 loops=1)
                 Sort Key: ("substring"((records.phone)::text, 0, 4))
                 Sort Method: external merge  Disk: 62496kB
                 ->  Seq Scan on records  (cost=0.00..88615.82 rows=4568626 width=8) (actual time=0.513..3902.085 rows=4568628 loops=1)
     ->  Hash  (cost=7.45..7.45 rows=445 width=11) (actual time=0.542..0.542 rows=445 loops=1)
           Buckets: 1024  Batches: 1  Memory Usage: 14kB
           ->  Seq Scan on areacodes ac  (cost=0.00..7.45 rows=445 width=11) (actual time=0.013..0.230 rows=445 loops=1)
Total runtime: 51712.268 ms

Execution plan with sub-query

Sort  (cost=13981009.24..13981010.35 rows=445 width=11) (actual time=4352.651..4352.759 rows=445 loops=1)
 Sort Key: ((SubPlan 1))
 Sort Method: quicksort  Memory: 34kB
 ->  Seq Scan on areacodes  (cost=0.00..13980989.67 rows=445 width=11) (actual time=9.274..4351.934 rows=445 loops=1)
     SubPlan 1
       ->  Aggregate  (cost=31417.93..31417.94 rows=1 width=0) (actual time=9.775..9.776 rows=1 loops=445)
             ->  Bitmap Heap Scan on records  (cost=530.58..31360.82 rows=22843 width=0) (actual time=5.561..8.145 rows=6470 loops=445)
                   Recheck Cond: (((phone)::text > ((areacodes.areacode)::text || '0000000'::text)) AND ((phone)::text < ((areacodes.areacode)::text || '9999999'::text)))
                   ->  Bitmap Index Scan on records_on_phone  (cost=0.00..524.87 rows=22843 width=0) (actual time=5.543..5.543 rows=6470 loops=445)
                         Index Cond: (((phone)::text > ((areacodes.areacode)::text || '0000000'::text)) AND ((phone)::text < ((areacodes.areacode)::text || '9999999'::text)))
Total runtime: 4352.912 ms