Database joins 101 - introduction to join algorithms
This is the third post in a series of twelve in Tim Fords Entry-Level Content Challenge. In this blog post, I will explain the basics on join algorithms in SQL.
First of all, it is crucial to understand that joins in a SQL query are performed pairwise by a database engine. You can think of a join as a machine with components that take two input streams and return one output stream:
Even though a SQL query can have many joins:
INNER JOIN DimProductSubcategory ON DimProduct.ProductSubcategoryKey = DimProductSubcategory.ProductSubcategoryKey
INNER JOIN DimProductCategory ON DimProductSubcategory.ProductCategoryKey = DimProductCategory.ProductCategoryKey
INNER JOIN FactInternetSales ON DimProduct.ProductKey = FactInternetSales.ProductKey
this will (almost) always be executed as three joins, where the order of tables is determined by the query optimizer. If you look at an execution plan for the query (click the icon to include the actual execution plan and execute the query) in SSMS
you can see the join order reading from right to left:
From the execution plan, we can see that the query really consists of three “join machines.”
In theory there are 4*3*2 = 24 different ways the optimizer can decide to perform the joins in the query above. One of such a way is called a join order. For the example above, one possible join order is
(((FactInternetSales JOIN DimProduct) JOIN DimProductCategory)
and we can see from the execution plan, that the query optimizer chose a different join order, namely
(DimProduct JOIN (DimProductCategory JOIN DimProductSubcategory)))
The way a “join machine” performs the join is called a join algorithm. Relational databases typically implement three different join algorithms:
It is possible to tell the optimizer which algorithm it should use for joins in a query using join hints:
INNER LOOP JOIN DimProductSubcategory ON DimProduct.ProductSubcategoryKey = DimProductSubcategory.ProductSubcategoryKey
INNER MERGE JOIN DimProductCategory ON DimProductSubcategory.ProductCategoryKey = DimProductCategory.ProductCategoryKey
INNER HASH JOIN FactInternetSales ON DimProduct.ProductKey = FactInternetSales.ProductKey
Using join the hints above, the execution plan changes to this:
A merge join (sometimes also called sort-merge join) algorithm assumes the two input streams are sorted on the join key(s). The algorithm is very simple: it is a merge of two sorted lists. When inputs are sorted, then the algorithm runs in linear time of the sum of the data sizes of the inputs. If not, then the cost of sorting the inputs needs to be added to the cost.
A hash join first builds an in-memory lookup table for one input (typically the smallest of the two). Then for each element in the second input, a lookup is done in the hash table to search for a match. Hash joins only works for equijoins. The hash join performs great if one input is small (and fits in memory) and the other is very large.
The nested loop algorithm is the brute-force member of the family. For each element in the first input, the second input is scanned in a loop, to search for a match.
Each algorithm has its performance pros and cons, and the optimizer will choose an algorithm for each join in the query according to these.
Want to learn more?
In my opinion, the best source to understand execution plans, join orders and join algorithms is in Dan Tows book “SQL tuning”
Read a review of the book at this blog post: