Correlated Subqueries
What Are They? They are very similar to the subqueries we looked at in the previous section in Value Expressions, but with 1 major difference.
Suppose we are a growth analyst working on a blogging website similar to Medium called "Tedium".
For the report we're working on, we want to analyze user behavior after registration.
The users are in the bloggers
table and their posts are in blog_posts
.
column_name | type |
---|---|
id | INTEGER |
name | VARCHAR |
email | VARCHAR |
registered_at | TIMESTAMPTZ |
column_name | type | reference |
---|---|---|
id | INTEGER | NULL |
title | TEXT | NULL |
body | TEXT | NULL |
blogger_id | INTEGER | bloggers.id |
created_at | TIMESTAMPTZ | NULL |
We want to count the number of posts each blogger made within their first week of signing up,
i.e. the number of posts where the created_at
date is less than or equal to 7 days later than its blogger's registered_at
date.
First, we need to know how to determine whether a post fits in that 1-week window.
Note this window is different for every blogger because it depends on each blogger's registered_at
.
This involves doing arithmetic on dates.
Given a timestamp, you can add 7 days to it using an INTERVAL
.
You can do that with CAST
.
Alternatively, because we are casting from a string literal, we can also use the special casting shortcut for string literals.
SELECT
now(),
now() + CAST('7 days' AS INTERVAL) AS later,
now() + INTERVAL '7 days' AS later;
now TIMESTAMPTZ | later TIMESTAMPTZ | later TIMESTAMPTZ |
---|---|---|
2023-07-17 00:02:23.463093+00 | 2023-07-24 00:02:23.463093+00 | 2023-07-24 00:02:23.463093+00 |
There are a few ways to accomplish this.
First, we'll look at a method using a correlated subquery.
Much in the way we put a subquery in a WHERE
clause, we can also do this in the SELECT
clause.
As the SELECT
clause expects a list of value expressions, keep in mind the 1-row 1-column limitation!
Imagine this: first we select the bloggers, then for each blogger we count the number of posts they made within a week of signing up,
and put that number in a computed column.
For each row in this computed column, we query the posts
table to count the posts within the 7-day window.
Here is how we write the query for this solution:
SELECT
email,
(
SELECT count(*)
FROM blog_posts
WHERE blogger_id = b.id
AND created_at <= b.registered_at + CAST('7 days' AS INTERVAL)
)
FROM bloggers b;
email VARCHAR | count BIGINT |
---|---|
Audrey50@yahoo.com | 1 |
Kaela.Schuster47@hotmail.com | 0 |
Rowan_Bogisich40@hotmail.com | 1 |
Nyasia58@gmail.com | 1 |
Stephania92@gmail.com | 2 |
Samara_Yundt97@gmail.com | 1 |
Laura_Homenick70@gmail.com | 0 |
Jimmie_Weissnat@hotmail.com | 1 |
Rosie.Crist@yahoo.com | 0 |
Ben.Cummerata94@hotmail.com | 0 |
We select the email
column.
After that, enclosed in the ()
is the subquery.
Notice how the subquery has a couple references to the bloggers
table, aliased as b
, from the outer query: b.id
and b.registered_at
.
That's what makes this different.
These references are what makes this subquery a correlated subquery.
Recall that each expression in the SELECT
clause is evaluated once per row in the input table, which for the outer query is bloggers
.
Therefore, the subquery itself must be re-evaluated for every row, as b.id
and b.registered_at
are different for every row.
Correlated vs Uncorrelated
Subqueries are not "correlated" if they don't reference the table in the outer query.
An uncorrelated subquery will only run once and have its value copied over the rows.
There isn't really a reason to re-evaluate the subquery if its result isn't somehow specific to each row, so it can just be run once and copied.
To show this principle, we can demonstrate 2 queries that call the random()
function, which just returns a random number.
SELECT x, random()
FROM numbers;
x INTEGER | random FLOAT |
---|---|
1 | 0.701839262577556 |
2 | 0.892290287026106 |
3 | 0.1338628375881683 |
4 | 0.5050126244149102 |
5 | 0.6422605073503369 |
SELECT x, (SELECT random())
FROM numbers;
x INTEGER | random FLOAT |
---|---|
1 | 0.03842464364028131 |
2 | 0.03842464364028131 |
3 | 0.03842464364028131 |
4 | 0.03842464364028131 |
5 | 0.03842464364028131 |
In the first example, the random numbers are all different.
This means the random()
function was called 5 times.
Since the expression random()
appears in the SELECT
clause, it is evaluated once per row in the numbers
table.
In the second example, the numbers are all the same, meaning the random()
function was only called once.
It was called inside a subquery which is uncorrelated because it doesn't reference the bloggers
table.
Because it's uncorrelated, it is only run once and its value is copied to each row.
Efficiency
It turns out that correlated subqueries are a risk of being slow to execute, so their use is discouraged. This is because the subquery needs to re-execute for every row. If there are a lot of rows, this is a lot of computation which can be slow.
There's a better way to write this query using a LEFT JOIN
and a GROUP BY
instead.
SELECT
b.email,
count(p.*)
FROM bloggers b
LEFT JOIN blog_posts p
ON p.blogger_id = b.id
AND p.created_at < b.registered_at + CAST('7 day' AS INTERVAL)
GROUP BY b.email;
email VARCHAR | count BIGINT |
---|---|
Audrey50@yahoo.com | 1 |
Rosie.Crist@yahoo.com | 0 |
Jimmie_Weissnat@hotmail.com | 1 |
Stephania92@gmail.com | 2 |
Samara_Yundt97@gmail.com | 1 |
Ben.Cummerata94@hotmail.com | 0 |
Rowan_Bogisich40@hotmail.com | 1 |
Laura_Homenick70@gmail.com | 0 |
Kaela.Schuster47@hotmail.com | 0 |
Nyasia58@gmail.com | 1 |
We LEFT JOIN
so that we can preserve bloggers that didn't write any posts in that window.
To make sure we only join posts written in the 7 day window, we include the condition AND p.created_at < b.registered_at + CAST('7 day' AS INTERVAL)
in the ON
clause.
This produces an equivalent result, but it will be more efficient if the tables have a lot of rows.
WHERE Clauses
Correlated subqueries can appear in the WHERE
clause too.
Suppose we want to get a list of blogger's emails who wrote at least 1 post withing their first week of signing up.
SELECT email
FROM bloggers
WHERE 0 < (
SELECT count(*)
FROM blog_posts
WHERE blogger_id = bloggers.id
AND created_at < bloggers.registered_at + CAST('7 day' AS INTERVAL)
);
email VARCHAR |
---|
Audrey50@yahoo.com |
Rowan_Bogisich40@hotmail.com |
Nyasia58@gmail.com |
Stephania92@gmail.com |
Samara_Yundt97@gmail.com |
Jimmie_Weissnat@hotmail.com |
The same restriction here applies that the subquery must return a table of 1 row and 1 column.
This particular pattern of counting rows in a subquery and checking if it's more than 0 is quite common.
In fact, there's a shortcut operator for it: EXISTS
.
EXISTS
takes a table expression as input and outputs TRUE
(a value expression) if there is 1 or more rows in the result, FALSE
otherwise.
SELECT email
FROM bloggers
WHERE EXISTS (
SELECT *
FROM blog_posts
WHERE blogger_id = bloggers.id
AND created_at < bloggers.registered_at + CAST('7 day' AS INTERVAL)
);
email VARCHAR |
---|
Audrey50@yahoo.com |
Rowan_Bogisich40@hotmail.com |
Nyasia58@gmail.com |
Stephania92@gmail.com |
Samara_Yundt97@gmail.com |
Jimmie_Weissnat@hotmail.com |
Since EXISTS
returns a value expression, it is valid in a WHERE
clause.
Notice that we no longer need to compute the count inside the subquery.
When Should I Use Correlated Subqueries?
So correlated subqueries are inefficient if there are a lot of rows in the table. When should I use them? There are several scenarios when you should use them anyways.
1. The outer query selects from a small table.
If there aren't a lot of rows in the outer query, then efficiency may not be an issue.
2. The subquery doesn't select from a table.
In this case, the subquery is querying the table blog_posts
.
Sometimes, we use subqueries on a table function instead of a physical table like blog_posts
.
This is quite common when working with arrays and unnest()
, but we'll discuss that later.
These can be reasonably fast.
3. If there are are no other ways to do it.
Some examples of correlated subqueries can be rewritten as joins, but not all cases. The correlated subquery is kind of like a for-loop in other programming languages. For each row, run a query and make an attribute out of each result. If you truly need that type of logic, and you can't write it a different way, then use a correlated subquery.