I have a table of about 700k rows in Postgres 9.3.3, which has the following structure:

Columns:
 content_body  - text                        
 publish_date  - timestamp without time zone 
 published     - boolean

Indexes:
    "articles_pkey" PRIMARY KEY, btree (id)
    "article_text_gin" gin (article_text)
    "articles_publish_date_id_index" btree (publish_date DESC NULLS LAST, id DESC)

The query that I am making has a full text search query and a limit, as follows:

When I search for a string which is in my index with a limit and order in the query it is fast:

explain analyze select * from "articles" where article_text @@ plainto_tsquery('pg_catalog.simple', 'in_index') order by id limit 10;
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..1293.88 rows=10 width=1298) (actual time=2.073..9.837 rows=10 loops=1)
   ->  Index Scan using articles_pkey on articles  (cost=0.42..462150.49 rows=3573 width=1298) (actual time=2.055..9.711 rows=10 loops=1)
         Filter: (article_text @@ '''in_index'''::tsquery)
         Rows Removed by Filter: 611
 Total runtime: 9.952 ms

However if the string is not in the index it takes much longer:

explain analyze select * from "articles" where article_text @@ plainto_tsquery('pg_catalog.simple', 'not_in_index') order by id limit 10;
                                                                  QUERY PLAN                                                                   
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..1293.88 rows=10 width=1298) (actual time=5633.684..5633.684 rows=0 loops=1)
   ->  Index Scan using articles_pkey on articles  (cost=0.42..462150.49 rows=3573 width=1298) (actual time=5633.672..5633.672 rows=0 loops=1)
         Filter: (article_text @@ '''not_in_index'''::tsquery)
         Rows Removed by Filter: 796146
 Total runtime: 5633.745 ms

However if I remove the order clause it is fast for either case:

explain analyze select * from "articles" where article_text @@ plainto_tsquery('pg_catalog.simple', 'in_index')  limit 10;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=55.69..90.22 rows=10 width=1298) (actual time=7.748..7.853 rows=10 loops=1)
   ->  Bitmap Heap Scan on articles  (cost=55.69..12390.60 rows=3573 width=1298) (actual time=7.735..7.781 rows=10 loops=1)
         Recheck Cond: (article_text @@ '''in_index'''::tsquery)
         ->  Bitmap Index Scan on article_text_gin  (cost=0.00..54.80 rows=3573 width=0) (actual time=5.977..5.977 rows=8910 loops=1)
               Index Cond: (article_text @@ '''in_index'''::tsquery)
 Total runtime: 7.952 ms

explain analyze select * from "articles" where article_text @@ plainto_tsquery('pg_catalog.simple', 'not_in_index')  limit 10;
                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=55.69..90.22 rows=10 width=1298) (actual time=0.083..0.083 rows=0 loops=1)
   ->  Bitmap Heap Scan on articles  (cost=55.69..12390.60 rows=3573 width=1298) (actual time=0.065..0.065 rows=0 loops=1)
         Recheck Cond: (article_text @@ '''not_in_index'''::tsquery)
         ->  Bitmap Index Scan on article_text_gin  (cost=0.00..54.80 rows=3573 width=0) (actual time=0.047..0.047 rows=0 loops=1)
               Index Cond: (article_text @@ '''not_in_index'''::tsquery)
 Total runtime: 0.163 ms

Removing the limit clause has the same effect, although the in index query is noticably slower:

explain analyze select * from "articles" where article_text @@ plainto_tsquery('pg_catalog.simple', 'in_index') order by id;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=12601.46..12610.40 rows=3573 width=1298) (actual time=106.347..140.481 rows=8910 loops=1)
   Sort Key: id
   Sort Method: external merge  Disk: 12288kB
   ->  Bitmap Heap Scan on articles  (cost=55.69..12390.60 rows=3573 width=1298) (actual time=5.618..50.329 rows=8910 loops=1)
         Recheck Cond: (article_text @@ '''in_index'''::tsquery)
         ->  Bitmap Index Scan on article_text_gin  (cost=0.00..54.80 rows=3573 width=0) (actual time=4.243..4.243 rows=8910 loops=1)
               Index Cond: (article_text @@ '''in_index'''::tsquery)
 Total runtime: 170.987 ms

explain analyze select * from "articles" where article_text @@ plainto_tsquery('pg_catalog.simple', 'not_in_index') order by id;
                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=12601.46..12610.40 rows=3573 width=1298) (actual time=0.067..0.067 rows=0 loops=1)
   Sort Key: id
   Sort Method: quicksort  Memory: 25kB
   ->  Bitmap Heap Scan on articles  (cost=55.69..12390.60 rows=3573 width=1298) (actual time=0.044..0.044 rows=0 loops=1)
         Recheck Cond: (article_text @@ '''not_in_index'''::tsquery)
         ->  Bitmap Index Scan on article_text_gin  (cost=0.00..54.80 rows=3573 width=0) (actual time=0.026..0.026 rows=0 loops=1)
               Index Cond: (article_text @@ '''not_in_index'''::tsquery)
 Total runtime: 0.148 ms

The little I can deduce is that overall, a bitmap index scan+bitmap heap scan is overall better for my queries then an index scan. How can I tell the query planner to do that though?

asked Mar 09 '14 at 12:40

Mohan%20Krishnan's gravatar image

Mohan Krishnan
31113


One Answer:

Accuracy of statistics

First of all, I would start off looking at the statistics. It seems that they're not accurate and, perhaps, your indexes are missing them completely. Note, that in all the plans posted for all indexes estimated number of rows is rows=3573. And query without LIMIT clause have returned 2.5 more rows, then it was estimated.

Please, run VACUUM ANALYZE articles; and try your queries again.

Still, it may happen that stats will be off for the article_text_gin index related plans (not something I would expect for a 700k table, though). They're influenced by the article_text column in fact, so I would recommend increasing this column's statistics target (from default is 100):

ALTER TABLE articles ALTER article_text SET STATISTICS 500;

Don't set too high value, as it affects planning time.

Available memory

This line

Sort Method: external merge  Disk: 12288kB

in the first of the plans without LIMIT clause indicates, that your work_mem is low. Unfortunately, it is not one-to-one match between disk-based sorts and in-memory ones, later requiring more memory. I do not know the exact formula here, but try setting

SET work_mem TO '20MB';

and try your queries again.

Why “wrong” index is used?

Let's look at the initial plans. Planner expects, that within the first 3573 rows returned by any index, it'll find all 10 required results. Given that we're retrieving rows in the right order straight away (articles_pkey is used, which is influenced by the ORDER BY id clause), planner estimates this to take small time. This has bad side effects in cases when filtering conditions is always false (non existing values) or when data in the columns is highly correlated and one finds desired values after scanning the biggest part of the index.

This happens due to the fact, that planner always assumes uniform value distribution and, as stated above, in the case of inaccurate stats this leads to heavily underestimated costs. In situations when LIMIT is used parser is considering the starting cost of the node. Compare starting costs in

-> Index Scan using articles_pkey on articles  (cost=0.42..462150.49 rows=3573 width=1298)
-> Bitmap Heap Scan on articles  (cost=55.69..12390.60 rows=3573 width=1298) (actual time=7.735..7.781 rows=10 loops=1)

Total cost is great for BitmapHeapScan, yet starting one is way better for IndexScan. So planner decides, that if articles_pkey will be used, it'll filter out all the wanted 10 rows very quickly, as this access path starts producing rows almost instantly.

In all other situations BitmapIndexScan is a winner, thus it is used. Please, note, that this type of access is used in all cases, while I would expect IndexScan on articles_text_gin to be used for queries with LIMIT clause. This is another indicator, that your stats are not accurate — planner prefers BitmapIndexScans in cases when index is there but it lacks stats.

TL;DR

Actually, I have a feeling, that initially table had been created, populated and Primary Key had been introduced, then it was analyzed and later GIN index had been created, without refreshing table stats.

So my bets are — refreshing stats should help. If not, then you can always force planner to use the right indexes via WITH construct, as it is known to be an optimization barrier:

WITH src AS (
    SELECT * FROM "articles"
     WHERE article_text @@ plainto_tsquery('pg_catalog.simple', 'in_index')
)
SELECT * FROM src
ORDER BY id LIMIT 10;

Please, give it a try.

answered Sep 25 '14 at 22:26

Victor's gravatar image

Victor
21114