Table Expressions

When we first learned about GROUP BY, we got top salaries at each company.

SELECT
  company,
  max(salary) AS top_salary
FROM employees
GROUP BY company;
company
VARCHAR
top_salary
INTEGER
Footbook
254000
Wamazon
277000
Poogle
300000

While interesting, we still cannot see who at the company has this salary. Recall that trying to select name in this query gives us this error:

SELECT
  company,
  name,
  max(salary) AS top_salary
FROM employees
GROUP BY company;
error: column "employees.name" must appear in the GROUP BY clause or be used in an aggregate function

This is because columns that don't appear in the GROUP BY clause can't be selected without being aggregated. Unfortunately, there is no aggregate function in standard SQL that we can use to get the right names, though some SQL vendors have added non-standard aggregate functions to do this.

So, we need another way. We need to take our result and add the name column to it afterwards. Imagine doing the aggregation first, taking the result, and joining the original employees table to it in order to add the name column.

company, (salary)top_salary employees company max SELECT AS FROM GROUP BY JOIN company name salary Poogle Lindsay 293000 Poogle François 131000 Poogle Jenny 162000 Wamazon Remmy 149000 Wamazon Milton 42000 company name salary Poogle Lindsay 293000 Poogle François 131000 Poogle Jenny 162000 Wamazon Remmy 149000 Wamazon Milton 42000 company top_salary Poogle 293000 Wamazon 149000 company top_salary name Poogle 293000 Lindsay Wamazon 149000 Remmy

That's the idea, but it isn't possible to write 1 single SELECT statement to aggregate and then join. Recall that the clause order has JOIN occurring before GROUP BY. We'll actually need 2 separate SELECT statements. How can we take the result of 1 SELECT statement and take it as an input to a another one?

Algebraic Closure

When discussing the relational model, we briefly mentioned that SQL queries output tables. The inputs and outputs are both tables. This is important because it means that conceptually, there is nothing preventing the output of 1 query from becoming the input to another (it's still a table after all). This property is referred to in Mathematics as "algebraic closure." Relational algebra is closed. An algebra being closed just means that the inputs and outputs of any operation are in the same set: tables. Inputs and outputs are tables, and there aren't any operations you can do to produce something that isn't a table.

So, the output of 1 query can become the input of another. How do we actually accomplish this in the code? Conceptually, it would look something like this…

SELECT
  top_salaries.*,
  employees.name
FROM (top_salary_query_result) top_salaries
JOIN employees
  ON top_salaries.company = employees.company
  AND top_salaries.top_salary = employees.salary;

Where (top_salary_query_result) would be the result table from our initial query getting the top salaries. That result is being given the alias top_salaries since the inputs need names when joining multiple tables.

Writing a Subquery

As it turns out, we can just replace (top_salary_query_result) with that first query. In general, you can put SELECT statements inside of the FROM clause instead of a table name…

SELECT
  top_salaries.*,
  employees.name
FROM (
  SELECT
    company,
    max(salary) AS top_salary
  FROM employees
  GROUP BY company
) top_salaries
JOIN employees
  ON top_salaries.company = employees.company
  AND top_salaries.top_salary = employees.salary;
company
VARCHAR
top_salary
INTEGER
name
VARCHAR
Poogle
300000
Marianne Schamberger
Wamazon
277000
Ollie Murphy
Footbook
254000
Wilson Maggio
Footbook
254000
Ronald Graham

Footbook has 2 names. This is because 2 employees are tied for the top salary.

This is called a "subquery" because 1 whole query is nested inside another. If you need, it is possible to nest a subquery inside a subquery inside a subquery etc.

Table Expressions vs Value Expressions

Let's add a bit more detail to the diagram showing the structure of a SELECT statement.

Clause Expression Value height_cm Keyword SELECT Statement Clause Expression Table daters Keyword FROM Clause Expression Value name = ‘Ben’ Keyword WHERE

Many clauses are a keyword plus an expression(s). But some expect a value expression and others expect a table expression. A value expression evaluates to a single value, like a number, string, boolean, etc. These can be columns names, math calculations, and most function calls. A table expression evaluates to a table instead of a single value. These include table names, subqueries, and even some functions that return tables.

The FROM clause is the first clause we've seen that takes a table expression. Another one is JOIN.

We could've accomplished this query by simply swapping the FROM and JOIN clauses.

SELECT
  top_salaries.*,
  employees.name
FROM employees
JOIN (
  SELECT
    company,
    max(salary) AS top_salary
  FROM employees
  GROUP BY company
) top_salaries
  ON top_salaries.company = employees.company
  AND top_salaries.top_salary = employees.salary;
company
VARCHAR
top_salary
INTEGER
name
VARCHAR
Poogle
300000
Marianne Schamberger
Wamazon
277000
Ollie Murphy
Footbook
254000
Wilson Maggio
Footbook
254000
Ronald Graham

There is actually another way to write this query without using a subquery. We'll look at that more in the section on window functions.

Table Functions

All the functions we've seen so far return values, and can be called wherever a value expression is expected. However, SQL also has table functions which are functions that return tables. Some standard ones include unnest and json_table, but we'll study those later when we look at denormalization. For now, check out Postgres' table function called generate_series for creating sequences of numbers or timestamps.

SELECT *
FROM generate_series(0, 10);
generate_series
INTEGER
0
1
2
3
4
5
6
7
8
9
10
SELECT day
FROM generate_series(
  CAST('2020-04-01' AS TIMESTAMP),
  CAST('2020-04-14' AS TIMESTAMP),
  CAST('1 day' AS INTERVAL)
) quarantine_period(day);
day
TIMESTAMP
2020-04-01 00:00:00
2020-04-02 00:00:00
2020-04-03 00:00:00
2020-04-04 00:00:00
2020-04-05 00:00:00
2020-04-06 00:00:00
2020-04-07 00:00:00
2020-04-08 00:00:00
2020-04-09 00:00:00
2020-04-10 00:00:00
2020-04-11 00:00:00
2020-04-12 00:00:00
2020-04-13 00:00:00
2020-04-14 00:00:00

They behave just like normal tables, but they are generated on the fly. You can give them aliases, join them with other tables, put them in subqueries, etc.

© 2022 Andrew Carlson. All rights reserved.