Question

I have some troubles understanding what happens with one of my queries.

Here is the table structures. There is a tiny dataset within (less than 5000 entries in each).

CREATE TABLE `tracker_artifact` (
  `id` int(11) NOT NULL auto_increment,
  PRIMARY KEY  (`id`),
);

CREATE TABLE `tracker_changeset` (
  `id` int(11) NOT NULL auto_increment,
  `artifact_id` int(11) NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `artifact_id` (`artifact_id`,`id`)
);

Then I have a query that looks like that:

SELECT c.id
FROM tracker_changeset AS c
   INNER JOIN tracker_artifact AS a ON (a.id = c.artifact_id)
WHERE a.id IN (928,933,934,...) 
AND c.id IN (1398,1432,...);

But the 2 ranges are quite big (~1900 values in each) and the query is taking a lot of time (8s) but I don't undestand why. Even explain is taking ages:

+----+-------------+-------+-------+---------------------+-------------+---------+--------------+------+--------------------------+
| id | select_type | table | type  | possible_keys       | key         | key_len | ref          | rows | Extra                    |
+----+-------------+-------+-------+---------------------+-------------+---------+--------------+------+--------------------------+
|  1 | SIMPLE      | a     | range | PRIMARY             | PRIMARY     | 4       | NULL         | 1909 | Using where; Using index | 
|  1 | SIMPLE      | c     | ref   | PRIMARY,artifact_id | artifact_id | 4       | a.id         |    1 | Using where; Using index | 
+----+-------------+-------+-------+---------------------+-------------+---------+--------------+------+--------------------------+
2 rows in set (3.33 sec)

2 questions arise:

What I'm doing wrong there
What is the impact of huge IN() statements?

Answer 1

Hi!

I was able to re-produce this problem on my side. My observation was that EXPLAIN takes about the same time as the execution.

This makes me believe the problem is the optimization step, not the execution. I don't know exactly why the long IN lists cause a optimizer problem. However, I've done one more test: I just took one long IN list and shorted the other one to just a view values. Result: quick responses (execution and explain). Another test: cut each list by 50%. Result: 4 times as fast as the original query. You see the pattern? It seems the optimizer does something for each value-pair! In other words, the optimizer does something 1900*1900 = 3.6 Million times.

So, that seems to be the reason it is slow — not the execution but the optimization.

In lack of a query plan cache, MySQL cannot re-use previously optimized execution plans.

What I'm doing wrong there

Well, I have a question: where do this long list come from? Please don't say "from the database". In that case, you should obviously re-write the query with joins or even sub-queries.

Otherwise, it is not exactly your fault. Still you could probably find a workaround.

Answer 2

Thanks for your answer. I progressed on my side and solved the issue. I share your conclusion about query optimizer issue.

What I've done is:

Run optimize table on both tables
Remove the second column on artifact_id index

Aside note: If I change the engine to innodb without any other changes, I don't have the problem (mysql 5.5). I'm not sure to understand why.

To answer your question, about "where the ids comes from", it comes from the DB ;) that pre-filter the results.

Actually, the snippet I pasted here is a tiny subset of a bigger query that already involve 15 joins. The query that generates the ids also have 15 joins. So I merge them I'll will have 30 joins in one query and I don't think MySQL will like him, am I wrong ?