Question

Hello Everyone! First of all, thanks to Markus for the great book, I have never encountered such an simple explanation of all these SQL concepts.

I am still reading the book and the questions emerge one by one. Here are some of them:

My setup: Postgres, 1000000 rows:

INSERT INTO index_test(id,last_modified,value,item_type, text)
SELECT *
     , now()
     , floor(random()*10)
     , md5(random()::text)
     , md5(random()::text)
  FROM generate_series(1,1000000);

As you see, 'value' column has small distribution, while all item_type values seem to be different.

1) I cannot get what db is doing when looking for some predicate which is not indexed:

it takes the whole row, compares, stores the row if matches?
it just fetches the column we need to compare and stores the rowId of matching rows?
it takes the column which needs to compare and then, if necessary, fetches the columns it needs to return?

So, a good advice is to index all the fields we need to return from query if such an index is not slowering down the insertions too much?

2) Here are my considerations about Hash Joins: Hash table search is o(n) and when using the nested loops for inner join the second table search can be log(n) if the table is correspondingly indexed. So the nested loops algorithm is theoretically faster, but the difference emerges because we can load hash table into memory. If we could load one or both tables into memory - the nested algorithm would be faster. Am I right? Who decides whether we load the whole table into memory? Who decides to enable hash join or not? Can you provide me with an example of use-case for hash joins?

Accepted Answer

Usually, databases store all columns of a row together. There is no simple way to fetch just one column from many rows without reading the other rows (unless you have an index for that).

So, if you do something that is not supported by an index the DB will do a full table scan (Seq Scan in PostgreSQL) and checks for the predicates.

Regarding Hash-join: I seems your O(n) notation compares apples to oranges. When using the Big-O notation you have to be careful to know what "n" stands for. If you say "n" is one index-tree traversal for a nested loops join but also say "n" is one hash-value lookup, then both would look like equally performant. However, on index-tree traversal for a nested loops join needs many steps to traverse the tree--some of them cached, others not. On the other hand, the "n" for the hash join just needs one access which is always cached.

The book has an example for a hash join: http://use-the-index-luke.com/sql/join/hash-join-partial-objects Let me know if there is anything unclear in this example.

To know when to use a hash join the optimizer does a cost estimation for all varianants (hash join, nested loops and possible also sort/merge join) and finally selects the one with the smallest cost estimation.