Correlated Subqueries

What Are They? They are very similar to the subqueries we looked at in the previous section in Value Expressions, but with 1 major difference.

Suppose we are a growth analyst working on a blogging website similar to Medium called "Tedium". For the report we're working on, we want to analyze user behavior after registration. The users are in the bloggers table and their posts are in blog_posts.

Table Name:

bloggers

`column_name`	`type`
id	INTEGER
name	VARCHAR
email	VARCHAR
registered_at	TIMESTAMPTZ

Table Name:

blog_posts

`column_name`	`type`	`reference`
id	INTEGER	NULL
title	TEXT	NULL
body	TEXT	NULL
blogger_id	INTEGER	bloggers.id
created_at	TIMESTAMPTZ	NULL

We want to count the number of posts each blogger made within their first week of signing up, i.e. the number of posts where the created_at date is less than or equal to 7 days later than its blogger's registered_at date.

First, we need to know how to determine whether a post fits in that 1-week window. Note this window is different for every blogger because it depends on each blogger's registered_at. This involves doing arithmetic on dates. Given a timestamp, you can add 7 days to it using an INTERVAL. You can do that with CAST. Alternatively, because we are casting from a string literal, we can also use the special casting shortcut for string literals.

SELECT
  now(),
  now() + CAST('7 days' AS INTERVAL) AS later,
  now() + INTERVAL '7 days' AS later;

`now` `TIMESTAMPTZ`	`later` `TIMESTAMPTZ`	`later` `TIMESTAMPTZ`
2023-07-17 00:02:23.463093+00	2023-07-24 00:02:23.463093+00	2023-07-24 00:02:23.463093+00

There are a few ways to accomplish this. First, we'll look at a method using a correlated subquery. Much in the way we put a subquery in a WHERE clause, we can also do this in the SELECT clause. As the SELECT clause expects a list of value expressions, keep in mind the 1-row 1-column limitation!

Imagine this: first we select the bloggers, then for each blogger we count the number of posts they made within a week of signing up, and put that number in a computed column. For each row in this computed column, we query the posts table to count the posts within the 7-day window.

Here is how we write the query for this solution:

SELECT
  email,
  (
    SELECT count(*)
    FROM blog_posts
    WHERE blogger_id = b.id
      AND created_at <= b.registered_at + CAST('7 days' AS INTERVAL)
  )
FROM bloggers b;

`email` `VARCHAR`	`count` `BIGINT`
Audrey50@yahoo.com	1
Kaela.Schuster47@hotmail.com	0
Rowan_Bogisich40@hotmail.com	1
Nyasia58@gmail.com	1
Stephania92@gmail.com	2
Samara_Yundt97@gmail.com	1
Laura_Homenick70@gmail.com	0
Jimmie_Weissnat@hotmail.com	1
Rosie.Crist@yahoo.com	0
Ben.Cummerata94@hotmail.com	0

We select the email column. After that, enclosed in the () is the subquery. Notice how the subquery has a couple references to the bloggers table, aliased as b, from the outer query: b.id and b.registered_at. That's what makes this different. These references are what makes this subquery a correlated subquery. Recall that each expression in the SELECT clause is evaluated once per row in the input table, which for the outer query is bloggers. Therefore, the subquery itself must be re-evaluated for every row, as b.id and b.registered_at are different for every row.

Correlated vs Uncorrelated

Subqueries are not "correlated" if they don't reference the table in the outer query. An uncorrelated subquery will only run once and have its value copied over the rows. There isn't really a reason to re-evaluate the subquery if its result isn't somehow specific to each row, so it can just be run once and copied. To show this principle, we can demonstrate 2 queries that call the random() function, which just returns a random number.

SELECT x, random()
FROM numbers;

`x` `INTEGER`	`random` `FLOAT`
1	0.701839262577556
2	0.892290287026106
3	0.1338628375881683
4	0.5050126244149102
5	0.6422605073503369

SELECT x, (SELECT random())
FROM numbers;

`x` `INTEGER`	`random` `FLOAT`
1	0.03842464364028131
2	0.03842464364028131
3	0.03842464364028131
4	0.03842464364028131
5	0.03842464364028131

In the first example, the random numbers are all different. This means the random() function was called 5 times. Since the expression random() appears in the SELECT clause, it is evaluated once per row in the numbers table. In the second example, the numbers are all the same, meaning the random() function was only called once. It was called inside a subquery which is uncorrelated because it doesn't reference the bloggers table. Because it's uncorrelated, it is only run once and its value is copied to each row.

Efficiency

It turns out that correlated subqueries are a risk of being slow to execute, so their use is discouraged. This is because the subquery needs to re-execute for every row. If there are a lot of rows, this is a lot of computation which can be slow.

There's a better way to write this query using a LEFT JOIN and a GROUP BY instead.

SELECT
  b.email,
  count(p.*)
FROM bloggers b
LEFT JOIN blog_posts p
  ON p.blogger_id = b.id
  AND p.created_at < b.registered_at + CAST('7 day' AS INTERVAL)
GROUP BY b.email;

`email` `VARCHAR`	`count` `BIGINT`
Audrey50@yahoo.com	1
Rosie.Crist@yahoo.com	0
Jimmie_Weissnat@hotmail.com	1
Stephania92@gmail.com	2
Samara_Yundt97@gmail.com	1
Ben.Cummerata94@hotmail.com	0
Rowan_Bogisich40@hotmail.com	1
Laura_Homenick70@gmail.com	0
Kaela.Schuster47@hotmail.com	0
Nyasia58@gmail.com	1

We LEFT JOIN so that we can preserve bloggers that didn't write any posts in that window. To make sure we only join posts written in the 7 day window, we include the condition AND p.created_at < b.registered_at + CAST('7 day' AS INTERVAL) in the ON clause.

This produces an equivalent result, but it will be more efficient if the tables have a lot of rows.

WHERE Clauses

Correlated subqueries can appear in the WHERE clause too. Suppose we want to get a list of blogger's emails who wrote at least 1 post withing their first week of signing up.

SELECT email
FROM bloggers
WHERE 0 < (
  SELECT count(*)
  FROM blog_posts
  WHERE blogger_id = bloggers.id
    AND created_at < bloggers.registered_at + CAST('7 day' AS INTERVAL)
);

`email` `VARCHAR`
Audrey50@yahoo.com
Rowan_Bogisich40@hotmail.com
Nyasia58@gmail.com
Stephania92@gmail.com
Samara_Yundt97@gmail.com
Jimmie_Weissnat@hotmail.com

The same restriction here applies that the subquery must return a table of 1 row and 1 column. This particular pattern of counting rows in a subquery and checking if it's more than 0 is quite common. In fact, there's a shortcut operator for it: EXISTS. EXISTS takes a table expression as input and outputs TRUE (a value expression) if there is 1 or more rows in the result, FALSE otherwise.

SELECT email
FROM bloggers
WHERE EXISTS (
  SELECT *
  FROM blog_posts
  WHERE blogger_id = bloggers.id
    AND created_at < bloggers.registered_at + CAST('7 day' AS INTERVAL)
);

`email` `VARCHAR`
Audrey50@yahoo.com
Rowan_Bogisich40@hotmail.com
Nyasia58@gmail.com
Stephania92@gmail.com
Samara_Yundt97@gmail.com
Jimmie_Weissnat@hotmail.com

Since EXISTS returns a value expression, it is valid in a WHERE clause. Notice that we no longer need to compute the count inside the subquery.

When Should I Use Correlated Subqueries?

So correlated subqueries are inefficient if there are a lot of rows in the table. When should I use them? There are several scenarios when you should use them anyways.

1. The outer query selects from a small table.

If there aren't a lot of rows in the outer query, then efficiency may not be an issue.

2. The subquery doesn't select from a table.

In this case, the subquery is querying the table blog_posts. Sometimes, we use subqueries on a table function instead of a physical table like blog_posts. This is quite common when working with arrays and unnest(), but we'll discuss that later. These can be reasonably fast.

3. If there are are no other ways to do it.

Some examples of correlated subqueries can be rewritten as joins, but not all cases. The correlated subquery is kind of like a for-loop in other programming languages. For each row, run a query and make an attribute out of each result. If you truly need that type of logic, and you can't write it a different way, then use a correlated subquery.