Common Table Expressions
We've learned how to nest subqueries inside an outer query. This can go arbitrarily deep, having subqueries inside subqueries inside subqueries etc. However, this can get difficult to read when there are 2 or more nested subqueries. There's a more readable way to write these using a common table expression (CTE).
This is defined with a WITH
clause, which takes a table expression and an alias.
You can take this query from the previous lesson…
SELECT
top_salaries.*,
employees.name
FROM (
SELECT
company,
max(salary) AS top_salary
FROM employees
GROUP BY company
) top_salaries
JOIN employees
ON top_salaries.company = employees.company
AND top_salaries.top_salary = employees.salary;
…and rewrite it with a common table expression like this:
WITH top_salaries AS (
SELECT
company,
max(salary) AS top_salary
FROM employees
GROUP BY company
)
SELECT
top_salaries.*,
employees.name
FROM top_salaries
JOIN employees
ON top_salaries.company = employees.company
AND top_salaries.top_salary = employees.salary;
company VARCHAR | top_salary INTEGER | name VARCHAR |
---|---|---|
Poogle | 300000 | Marianne Schamberger |
Wamazon | 277000 | Ollie Murphy |
Footbook | 254000 | Wilson Maggio |
Footbook | 254000 | Ronald Graham |
The WITH
clause goes at the top, before the SELECT
.
To access the results of the subquery, simply refer to its alias in your FROM
or JOIN
clause(s).
This produces the exact same result as the nested subquery in the previous section, but it flows a little better.
It reads more like how you would expect, top-to-bottom instead of inside-out.
The difference between the 2 styles is more pronounced when there are multiple CTEs.
as WITH
actually takes a list of table expressions.
WITH first_step AS (
SELECT …
FROM some_table
…
),
second_step AS (
SELECT …
FROM first_step
…
)
SELECT …
FROM second_step
…
These execute one-after-another in a sequence. The nested subquery equivalent would be something like this:
SELECT …
FROM (
SELECT …
FROM (
SELECT …
FROM some_table
…
) first_step
…
) second_step
…
The more steps you have, the more confusing the nested version is to read.
Why "Common" Table Expression?
CTEs aren't just an alternative style of writing nested subqueries. There is an important way in which CTEs differ. A CTE can be referenced in any table expression that appears later in the query, any number of times. So it isn't called "common" in the sense that it isn't rare, rather in a sense that it's open for common use by other parts of the query. This is not possible with nested subqueries. Let's consider an example with a calculation involving a few steps.
Suppose we want to generate several random 2D points in Cartesian coordinates, compute the center of this point cloud, then sort the points in the cloud by distance from the center. How can we do that?
First, we can use the random()
function to generate a random number between 0 and 1.
Calling this twice for the X and Y coordinates…
SELECT
random() AS x,
random() AS y;
x FLOAT | y FLOAT |
---|---|
0.3036018280906543 | 0.8431151472453315 |
That gives us 1 row representing a random point.
If we want a bunch of points, say 6, we can use the table function generate_series()
.
SELECT
random() AS x,
random() AS y
FROM generate_series(1, 6);
x FLOAT | y FLOAT |
---|---|
0.7314309408592212 | 0.0756531781870855 |
0.07742163737243568 | 0.5629862830538765 |
0.3649608187734379 | 0.8811242032912141 |
0.2023326162419421 | 0.5876014964526013 |
0.1401487172584197 | 0.4718419918909049 |
0.49421224292809485 | 0.17993594599096685 |
The random numbers will be different each time this query is actually executed.
We don't actually care about the numbers generated by generate_series
,
We just need a convenient way of generating a table with 6 rows.
We can then wrap that part in a WITH
clause, and compute the center point.
Remember that the actual points will be different from the last run.
WITH points AS (
SELECT
random() AS x,
random() AS y
FROM generate_series(1, 6)
)
SELECT
avg(x) AS x,
avg(y) AS y
FROM points;
x FLOAT | y FLOAT |
---|---|
0.45200754152700373 | 0.4810444892004078 |
We can move the center calculation inside the WITH
clause and do our final step.
Finally, we do a query to compute the distance of each point from the center, and sort them.
WITH points AS (
SELECT
random() AS x,
random() AS y
FROM generate_series(1, 6)
),
center AS (
SELECT
avg(x) AS x,
avg(y) AS y
FROM points
)
SELECT
points.x,
points.y,
center.x AS center_x,
center.y AS center_y,
sqrt((center.x - points.x) ^ 2 + (center.y - points.y) ^ 2) AS distance
FROM points
CROSS JOIN center
ORDER BY 5;
x FLOAT | y FLOAT | center_x FLOAT | center_y FLOAT | distance FLOAT |
---|---|---|---|---|
0.6285200195229974 | 0.5718006429906721 | 0.5322831646461991 | 0.5922003815541699 | 0.09837520810670237 |
0.6595578970423166 | 0.6603686800951409 | 0.5322831646461991 | 0.5922003815541699 | 0.14438065809683195 |
0.6402480171685254 | 0.35731275606257373 | 0.5322831646461991 | 0.5922003815541699 | 0.25851229369074114 |
0.8453215748267071 | 0.6459989393002448 | 0.5322831646461991 | 0.5922003815541699 | 0.3176276610496916 |
0.19580953572129745 | 0.565104352091552 | 0.5322831646461991 | 0.5922003815541699 | 0.33756287973432664 |
0.22424194359535066 | 0.7526169187848359 | 0.5322831646461991 | 0.5922003815541699 | 0.347308017879771 |
A
CROSS JOIN
is basically anOUTER JOIN
with noON
clause. It joins every row on the left-hand side with every row on the right. This is desirable here because thecenter
table has only 1 row. We'll revisit this in a later section.
This example makes it clear how CTEs are functionally different from nested subqueries.
This query has no nested equivalent because points
is referenced twice.