Hi,

I need to optimize some queries (PostgreSQL is used). One of them looks like the following:

SELECT UPPER(colors.name) AS filter_name, colors.code AS filter_color
  FROM "item_features"
 INNER JOIN "colors" ON "colors"."id" = "item_features"."color_id"
 WHERE (feature LIKE '%arbe:%') AND (item_id IN (566, 3254, .......))
 GROUP BY color_id, colors.name, colors.code, colors.sorting
 ORDER BY colors.sorting ASC

The query takes around 320 ms at the moment.

The result of EXPLAIN ANALYZE is:

Sort  (cost=38238.18..38238.29 rows=41 width=1040) (actual time=303.132..303.135 rows=17 loops=1)
  Sort Key: colors.sorting
  Sort Method: quicksort  Memory: 26kB
  ->  HashAggregate  (cost=38236.57..38237.09 rows=41 width=1040) (actual time=303.092..303.107 rows=17 loops=1)
      ->  Hash Join  (cost=1.38..38217.62 rows=1895 width=1040) (actual time=0.164..300.620 rows=1972 loops=1)
          Hash Cond: (item_features.color_id = colors.id)
          ->  Seq Scan on item_features  (cost=0.00..38189.77 rows=2006 width=4) (actual time=0.124..298.552 rows=1973 loops=1)
                Filter: (((feature)::text ~~ '%arbe:%'::text) AND (item_id = ANY ('{5666,3254,5129,4072,5848,5385,5506,3723,5087,5472,3479,5511,5635,4779,3754,5445,4901,5507,4569,4910,4122,5704,4382,5841,4054,3584,5849,2692,305,4584,5694,4582,5864,4258 (...)
                Rows Removed by Filter: 13162
          ->  Hash  (cost=1.17..1.17 rows=17 width=1040) (actual time=0.025..0.025 rows=17 loops=1)
                Buckets: 1024  Batches: 1  Memory Usage: 2kB
                ->  Seq Scan on colors  (cost=0.00..1.17 rows=17 width=1040) (actual time=0.003..0.011 rows=17 loops=1)
Total runtime: 303.282 ms

The numbers, i.e. the IDs in the IN (...) part, are arbitrary; they are generated at run time by the application.

I tried to optimize the query by using a temporary table (called tmp_ItemIds): I saved the IDs in the temporary table and changed the original query to:

SELECT ... FROM ... INNER JOIN ... ON ...
 WHERE ... AND (item_id IN (SELECT id FROM tmp_ItemIds))
 GROUP BY ...
 ORDER BY ... ASC
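
Spelled out, the rewritten query looks roughly like this (a sketch based on the original query above; the integer type of the id column in tmp_ItemIds is an assumption):

CREATE TEMPORARY TABLE tmp_ItemIds (id integer PRIMARY KEY);  -- column type assumed
INSERT INTO tmp_ItemIds (id) VALUES (5666), (3254) /* , ... */;

SELECT UPPER(colors.name) AS filter_name, colors.code AS filter_color
  FROM "item_features"
 INNER JOIN "colors" ON "colors"."id" = "item_features"."color_id"
 WHERE (feature LIKE '%arbe:%')
   AND (item_id IN (SELECT id FROM tmp_ItemIds))
 GROUP BY color_id, colors.name, colors.code, colors.sorting
 ORDER BY colors.sorting ASC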

The query takes around 50 ms now. I'm not sure if this is the best solution. Can somebody tell me how to optimize it in a better way, and whether using a temporary table is an appropriate solution?

The tables item_features and colors have the following structure. As you can see, two indexes are defined so far:

[screenshot: table structure of item_features and colors]

If I understand one of your articles right, it would theoretically help to change LIKE '%arbe%' to LIKE 'arbe%'?

Thanks, Chris

asked May 21 '14 at 20:59 by kid34

edited May 27 '14 at 12:02 by Markus Winand ♦♦


One Answer:

If I understand one of your articles right, it would theoretically help to change LIKE '%arbe%' to LIKE 'arbe%'?

Well, it would help performance, but it would also change the result. That statement was meant as application design advice, not tuning advice ;)

Looking at the execution plan you posted, it is clearly visible where the time is lost:

Seq Scan on item_features  (cost=0.00..38189.77 rows=2006 width=4) (actual time=0.124..298.552 rows=1973 loops=1)

So there is a full table scan on item_features; it takes 298.552 ms, which is almost all of the total execution time of 303.282 ms. Let's try to fix that.

There are two ways to approach this:

  1. Try to use an ordinary B-tree index. Since the LIKE filter cannot use a B-tree index (leading wildcard), that leaves only the other option (item_id first):

    CREATE INDEX ... ON item_features (item_id, feature varchar_pattern_ops);
    

    Additionally, adding the column color_id enables an index-only scan (see the sketch after this list).

  2. Use the LIKE expression as the primary access path with a special-purpose index. If you are running PostgreSQL 9.1 or later, trigram indexes are for you:

    CREATE INDEX ON item_features USING gin (feature gin_trgm_ops);
    

    Additionally, it might make sense to add a B-tree index on item_id (heading for a bitmap scan that combines both indexes). But that really depends on your data; see the sketch below.
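
Concretely, the two variants could look like this (a sketch; the index names are illustrative, while the table and column names are taken from the query above):

    -- Variant 1: composite B-tree index, item_id first,
    -- with color_id appended to enable an index-only scan
    CREATE INDEX item_features_item_feature_idx
        ON item_features (item_id, feature varchar_pattern_ops, color_id);

    -- Variant 2: trigram index to support the leading-wildcard LIKE
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE INDEX item_features_feature_trgm_idx
        ON item_features USING gin (feature gin_trgm_ops);
    -- optional B-tree on item_id so both indexes can be combined in a bitmap scan
    CREATE INDEX item_features_item_id_idx ON item_features (item_id);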

So, there are two different approaches; each might further benefit from an "on-top" optimization (index-only scan; second index). Your mileage will vary based on your data. Try these ideas and let us know how they worked out.

PS: I think there are better options than temp tables, but let us know if the above suggestions don't work for you.
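
One alternative along those lines, shown purely as an illustration and not necessarily what is meant here, is to inline the ID list as a VALUES row source instead of a temp table:

    SELECT UPPER(colors.name) AS filter_name, colors.code AS filter_color
      FROM item_features
     INNER JOIN colors ON colors.id = item_features.color_id
     INNER JOIN (VALUES (5666), (3254) /* , ... */) AS ids(item_id)
        ON ids.item_id = item_features.item_id
     WHERE feature LIKE '%arbe:%'
     GROUP BY color_id, colors.name, colors.code, colors.sorting
     ORDER BY colors.sorting ASC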

answered May 27 '14 at 12:17 by Markus Winand ♦♦