Common Table Expressions

We've learned how to nest subqueries inside an outer query. This can go arbitrarily deep, having subqueries inside subqueries inside subqueries etc. However, this can get difficult to read when there are 2 or more nested subqueries. There's a more readable way to write these using a common table expression (CTE).

This is defined with a WITH clause, which takes a table expression and an alias.

You can take this query from the previous lesson…

SELECT
  top_salaries.*,
  employees.name
FROM (
  SELECT
    company,
    max(salary) AS top_salary
  FROM employees
  GROUP BY company
) top_salaries
JOIN employees
  ON top_salaries.company = employees.company
  AND top_salaries.top_salary = employees.salary;

…and rewrite it with a common table expression like this:

WITH top_salaries AS (
  SELECT
    company,
    max(salary) AS top_salary
  FROM employees
  GROUP BY company
)
SELECT
  top_salaries.*,
  employees.name
FROM top_salaries
JOIN employees
  ON top_salaries.company = employees.company
  AND top_salaries.top_salary = employees.salary;

`company` `VARCHAR`	`top_salary` `INTEGER`	`name` `VARCHAR`
Poogle	300000	Marianne Schamberger
Wamazon	277000	Ollie Murphy
Footbook	254000	Wilson Maggio
Footbook	254000	Ronald Graham

The WITH clause goes at the top, before the SELECT. To access the results of the subquery, simply refer to its alias in your FROM or JOIN clause(s).

This produces the exact same result as the nested subquery in the previous section, but it flows a little better. It reads more like how you would expect, top-to-bottom instead of inside-out. The difference between the 2 styles is more pronounced when there are multiple CTEs. as WITH actually takes a list of table expressions.

WITH first_step AS (
  SELECT …
  FROM some_table
  …
),
second_step AS (
  SELECT …
  FROM first_step
  …
)
SELECT …
FROM second_step
…

These execute one-after-another in a sequence. The nested subquery equivalent would be something like this:

SELECT …
FROM (
  SELECT …
  FROM (
    SELECT …
    FROM some_table
    …
  ) first_step
  …
) second_step
…

The more steps you have, the more confusing the nested version is to read.

Why "Common" Table Expression?

CTEs aren't just an alternative style of writing nested subqueries. There is an important way in which CTEs differ. A CTE can be referenced in any table expression that appears later in the query, any number of times. So it isn't called "common" in the sense that it isn't rare, rather in a sense that it's open for common use by other parts of the query. This is not possible with nested subqueries. Let's consider an example with a calculation involving a few steps.

Suppose we want to generate several random 2D points in Cartesian coordinates, compute the center of this point cloud, then sort the points in the cloud by distance from the center. How can we do that?

First, we can use the random() function to generate a random number between 0 and 1. Calling this twice for the X and Y coordinates…

SELECT
  random() AS x,
  random() AS y;

`x` `FLOAT`	`y` `FLOAT`
0.3036018280906543	0.8431151472453315

That gives us 1 row representing a random point. If we want a bunch of points, say 6, we can use the table function generate_series().

SELECT
  random() AS x,
  random() AS y
FROM generate_series(1, 6);

`x` `FLOAT`	`y` `FLOAT`
0.7314309408592212	0.0756531781870855
0.07742163737243568	0.5629862830538765
0.3649608187734379	0.8811242032912141
0.2023326162419421	0.5876014964526013
0.1401487172584197	0.4718419918909049
0.49421224292809485	0.17993594599096685

The random numbers will be different each time this query is actually executed. We don't actually care about the numbers generated by generate_series, We just need a convenient way of generating a table with 6 rows.

We can then wrap that part in a WITH clause, and compute the center point. Remember that the actual points will be different from the last run.

WITH points AS (
  SELECT
    random() AS x,
    random() AS y
  FROM generate_series(1, 6)
)
SELECT
  avg(x) AS x,
  avg(y) AS y
FROM points;

`x` `FLOAT`	`y` `FLOAT`
0.45200754152700373	0.4810444892004078

We can move the center calculation inside the WITH clause and do our final step. Finally, we do a query to compute the distance of each point from the center, and sort them.

WITH points AS (
  SELECT
    random() AS x,
    random() AS y
  FROM generate_series(1, 6)
),
center AS (
  SELECT
    avg(x) AS x,
    avg(y) AS y
  FROM points
)
SELECT
  points.x,
  points.y,
  center.x AS center_x,
  center.y AS center_y,
  sqrt((center.x - points.x) ^ 2 + (center.y - points.y) ^ 2) AS distance
FROM points
CROSS JOIN center
ORDER BY 5;

`x` `FLOAT`	`y` `FLOAT`	`center_x` `FLOAT`	`center_y` `FLOAT`	`distance` `FLOAT`
0.6285200195229974	0.5718006429906721	0.5322831646461991	0.5922003815541699	0.09837520810670237
0.6595578970423166	0.6603686800951409	0.5322831646461991	0.5922003815541699	0.14438065809683195
0.6402480171685254	0.35731275606257373	0.5322831646461991	0.5922003815541699	0.25851229369074114
0.8453215748267071	0.6459989393002448	0.5322831646461991	0.5922003815541699	0.3176276610496916
0.19580953572129745	0.565104352091552	0.5322831646461991	0.5922003815541699	0.33756287973432664
0.22424194359535066	0.7526169187848359	0.5322831646461991	0.5922003815541699	0.347308017879771

A CROSS JOIN is basically an OUTER JOIN with no ON clause. It joins every row on the left-hand side with every row on the right. This is desirable here because the center table has only 1 row. We'll revisit this in a later section.

This example makes it clear how CTEs are functionally different from nested subqueries. This query has no nested equivalent because points is referenced twice.