Question

I have read you section on index and where on primary key. I don't see a mention of a select using an IN clause.
What I am confused about is how a statement of the type:

SELECT first_name, last_name FROM
employees WHERE employee_id IN ( A
really HUGE list of IDs)

The employee_id is the primary key. So how is this query handled? Do we have an access of leaf node (hence a table access) for each id part of the IN list? In this case could the implementation prefer to do a full table scan depending of the size of the IN list?
Could you please help explaining how the IN clause works and affects performance?

Answer 1

Hi!

There is a reason why IN is not covered in SQL Performance Explained: The execution varies from database to database.

As you have mentioned, one possibility is to use each ID for an index access (if there is a matching index). If these are many IDs it might not be the best thing to do and other options (including full table scan) could be preferred by the optimizer. In theory, it is up to the optimizer to decide. In practice the optimizers estimates are not always good enough to do a proper job here.

Just recently I've had a case with MySQL where the where clause has two IN lists of the form

WHERE ID1 IN (....) AND ID2 IN (....)

The one list had several hundred entries while the second just about 10. Guess what MySQL did? It did one index access per key-pair! That means it was accessing the index several thousand times.

One manual way to 'optimize' such cases is to add a redundant where clause that covers the complete IN range:

WHERE ID1 BETWEEN <min-id1> and <max-id2> AND ID1 IN (....) AND ID2 IN (....)

That might turn several thousand index seeks into one range scan for the entire possible range. However, it makes only sense if the IDs are rather closely stored in the index. Additionally you might need to obfuscate the ID1 IN list in case the optimizer insists on using it for an index seek.

To make a long story short: look at the execution plan, try to understand what the database does. Then visualize the index and think if there could be a better way to do it. Then try to change the statement so that you get the optimizer to take the path you want it to take (e.g. adding redundant where clauses).

Sometimes it is the best to access the index several (thousand) times nevertheless.