SQL join flavors
Abstract
There is more to SQL joins than you might think. Let’s explore them a bit.
We’ll use two simple tables: companies
and jobs
they offer.
There are three completely fictional companies — Hoogle, Emazon and Neta — that offer a surprisingly small number of jobs:
jobs companies
┌────────┬─────────┬──────────────┐ ┌─────────┬───────────┐
│ job_id │ comp_id │ job_name │ │ comp_id │ comp_name │
├────────┼─────────┼──────────────┤ ├─────────┼───────────┤
│ 1 │ 10 │ Data Analyst │ │ 10 │ Hoogle │
│ 2 │ 20 │ Go Developer │ │ 20 │ Emazon │
│ 3 │ 20 │ ML Engineer │ │ 30 │ Neta │
│ 4 │ 99 │ UI Designer │ └─────────┴───────────┘
└────────┴─────────┴──────────────┘
Hoogle is interested in Data analysts. Emazon is hiring Go developers and ML engineers. Neta is not hiring at all. And some unknown company with id 99
is desperately looking for a UI designer.
Now let’s join!
Qualified join
“Qualified join” is a umbrella term for the most common types of joins: inner
, left
, right
and full
. If you haven’t heard of these or have forgotten what they do, I suggest you check out the SQL Cheat Sheet.
A qualified join connects records from two datasets into one, according to the matching criteria you specify. Here is how it looks like in general:
table [join-type] JOIN table join-specification
A table
is not necessarily a table per se. It can be a view, a subquery, or basically any table-like structure. But we’ll call it “table” for brevity.
A join-type
is one of the following:
Four join types ought to be enough for anybody.
inner
includes only matching rows from both tables.left
includes matching rows from both tables and non-matching rows from the left table.right
includes matching rows from both tables and non-matching rows from the right table.full
includes matching rows from both tables and non-matching rows from both tables.
There is also an optional outer
keyword for left, right, and full joins, which changes nothing (as if SQL wasn’t complex enough):
LEFT OUTER = LEFT
RIGHT OUTER = RIGHT
FULL OUTER = FULL
Oh, and the entire join-type
can also be omitted. By default, every join is an inner
join.
A join-specification
defines the matching criteria. It comes in two flavors.
Join “on” specific columns or “using” a common one? Make your choice, adventurer.
The first one uses the on
keyword, which you’ve probably seen a thousand times. For example, let’s select jobs with corresponding company names:
The second join-specification
flavor is less known. It employs the using
keyword and works only if the column we are joining on has the same name in both tables:
While we can use any comparison operator with on
, using
only checks for equality.
Nice!
Natural join
A natural join is like a qualified join with using
, except that we don’t specify matching column names at all:
A natural join finds all pairs of columns with the same name and uses them as join criteria.
Similar to using
, a natural join only checks for equality. And just like a qualified join, a natural join can be either inner (default), left, right, or full:
table NATURAL [join-type] JOIN table
A natural join is like an automatic, heartless “using”. Pretty unnatural, if you ask me.
Using natural joins is almost always a bad idea. Imagine we have a name
column in both tables:
Completely valid design. However, a natural join between jobs
and companies
would return an empty result, because it implicitly matches on the following condition:
jobs.comp_id = companies.comp_id and jobs.name = companies.name
So I’d take qualified join with using
over natural join any day.
Cross join
The third and final variation of join is a cross join, also known as a Cartesian join:
A cross join completely ignores column values. It joins every row from the left table (N rows) with every row from the right table (M rows), resulting in N×M rows.
A separate name for the least used type of join. That makes sense! Or not. Ask the SQL committee.
A cross join can be useful if you need to generate all pairs of values from two tables, such as all “size-color” combinations for a product. Not so useful in our jobs example, I suppose.
A cross join is equivalent to an inner join without matching criteria:
Sometimes you will see a cross join written this way:
While this query doesn’t use the join
keyword at all, it is exactly the same cross join.
Partitioned join
Didn’t I say earlier that there are only three variations of joins — qualified join, cross join, and natural join? Well, yes.
But here is a thing. Remember our qualified join specification?
table [join-type] JOIN table join-specification
According to the SQL standard, the table
here can be a partitioned join table (not related to table partitioning in any way!):
Let’s say we have products
and their sales
over time:
sales products
┌────┬────────────┬────────────┬──────────┐ ┌────┬───────┐
│ id │ sale_dt │ product_id │ quantity │ │ id │ name │
├────┼────────────┼────────────┼──────────┤ ├────┼───────┤
│ 1 │ 2023-06-01 │ 10 │ 30 │ │ 10 │ Alpha │
│ 2 │ 2023-06-01 │ 20 │ 60 │ │ 20 │ Beta │
│ 3 │ 2023-06-01 │ 30 │ 90 │ │ 30 │ Gamma │
│ 4 │ 2023-06-02 │ 20 │ 60 │ └────┴───────┘
│ 5 │ 2023-06-03 │ 10 │ 30 │
│ 6 │ 2023-06-03 │ 30 │ 90 │
└────┴────────────┴────────────┴──────────┘
Let’s select sales with corresponding product names:
So far so good. But what I’d like to see here is how each of the products performed on each day (when there was at least one sale). Including zero sales for some of the products:
┌────────────┬───────┬──────────┐
│ sale_dt │ name │ quantity │
├────────────┼───────┼──────────┤
│ 2023-06-01 │ Alpha │ 30 │
│ 2023-06-01 │ Beta │ 60 │
│ 2023-06-01 │ Gamma │ 90 │
│ 2023-06-02 │ Alpha │ 0 │
│ 2023-06-02 │ Beta │ 60 │
│ 2023-06-02 │ Gamma │ 0 │
│ 2023-06-03 │ Alpha │ 30 │
│ 2023-06-03 │ Beta │ 0 │
│ 2023-06-03 │ Gamma │ 90 │
└────────────┴───────┴──────────┘
There is no way I can accomplish this without additional tricks. Unless I use a partitioned join.
A partitioned join works independently within each partition.
A partitioned join tells the database engine to perform the join independently within each of the partitions we have defined for a table. So if we partition sales
by date:
Then the database engine will independently join sales and products for 2023-06-01
, then for 2023-06-02
, then for 2023-06-03
, and finally union the results. This will give us the exact result we wanted (except for null
s instead of zeros):
┌────────────┬───────┬──────────┐
│ sale_dt │ name │ quantity │
├────────────┼───────┼──────────┤
│ 2023-06-01 │ Alpha │ 30 │
│ 2023-06-01 │ Beta │ 60 │
│ 2023-06-01 │ Gamma │ 90 │
│ 2023-06-02 │ Alpha │ (null) │
│ 2023-06-02 │ Beta │ 60 │
│ 2023-06-02 │ Gamma │ (null) │
│ 2023-06-03 │ Alpha │ 30 │
│ 2023-06-03 │ Beta │ (null) │
│ 2023-06-03 │ Gamma │ 90 │
└────────────┴───────┴──────────┘
I have to say that this is a rather strange feature. I’m surprised it’s in the standard at all (probably something to do with Oracle having it). Other vendors besides Oracle have never implemented partitioned joins, and I can’t blame them.
Anyway, I just wanted you to know about it so you could share my burden of unnecessary SQL knowledge.
In case you were wondering (I know you weren’t), here’s how to solve the problem without partitioned joins:
Easy-peasy.
Lateral join
SQL is a strange beast. While the standard includes the partitioned join (which no one but Oracle supports), it also contains a more powerful and widely supported lateral join.
A lateral join (as opposed to a traditional join) allows you to use correlated subqueries. We’ll see what that means in a moment.
sales products
┌────┬────────────┬────────────┬──────────┐ ┌────┬───────┐
│ id │ sale_dt │ product_id │ quantity │ │ id │ name │
├────┼────────────┼────────────┼──────────┤ ├────┼───────┤
│ 1 │ 2023-06-01 │ 10 │ 30 │ │ 10 │ Alpha │
│ 2 │ 2023-06-01 │ 20 │ 60 │ │ 20 │ Beta │
│ 3 │ 2023-06-01 │ 30 │ 90 │ │ 30 │ Gamma │
│ 4 │ 2023-06-02 │ 20 │ 60 │ └────┴───────┘
│ 5 │ 2023-06-03 │ 10 │ 30 │
│ 6 │ 2023-06-03 │ 30 │ 90 │
└────┴────────────┴────────────┴──────────┘
Let’s go back to our products/sales example and see how each product sold on June 2nd:
Now let’s say we want to see product sales for all days (just like we did with the partitioned join). It would be nice to select distinct dates with a subquery and join it with the product/sales subquery from above:
The product/sales subquery references the sales date from the dates subquery. We can say that the product/sales subquery is correlated with the dates subquery.
Nah. We can’t reference the sale_dt
column provided by the preceding d
subquery in the ps
subquery. Unless we use a lateral join, which allows just that:
The d
subquery selects the dates, while the ps
subquery joins with it on the sale_dt
column to select the product sales for that date. All thanks to the lateral join. Nice!
You may be wondering about the on true
join condition. Thing is, the actual join on sale_dt
already happens inside the ps
subquery, so we don’t need to repeat it outside. Try changing on true
to d.sale_dt = ps.sale_dt
and see that nothing changes.
Lateral joins are supported by PostgreSQL, MySQL and Oracle. MS SQL Server does not support them per se, but provides the same functionality with cross apply
(= join lateral
) and outer apply
(= left join lateral
) syntax.
Summary
There are three variations of joins defined in the SQL standard:
- qualified join (= using specific match critera),
- natural join (= automatic match critera),
- cross join (= inner join with no specific match critera).
“SQL Standard in Pictures” would be an absolute hit, wouldn’t it?
A qualified join comes in four types: inner
, left
, right
, and full
.
It also provides on
and using
flavors of matching criteria.
Most vendors support them all, with the notable exception of MS SQL Server, which doesn’t know anything about using
and natural joins.
There are also table modifiers that change the join behavior:
lateral
, supported by PostgreSQL, MySQL and Oracle (and MS SQL Server with a different syntax). It allows correlated subqueries in thejoin
part of the query.partition by
, supported by Oracle only. It performs the join independently within each of the defined partitions.
Oh, and MySQL doesn’t support full
joins. Just because.
That’s it!