Table Expressions
When we first learned about GROUP BY
, we got top salaries at each company.
SELECT
company,
max(salary) AS top_salary
FROM employees
GROUP BY company;
company VARCHAR | top_salary INTEGER |
---|---|
Footbook | 254000 |
Wamazon | 277000 |
Poogle | 300000 |
While interesting, we still cannot see who at the company has this salary.
Recall that trying to select name
in this query gives us this error:
SELECT
company,
name,
max(salary) AS top_salary
FROM employees
GROUP BY company;
error: column "employees.name" must appear in the GROUP BY clause or be used in an aggregate function
This is because columns that don't appear in the GROUP BY
clause can't be selected without being aggregated.
Unfortunately, there is no aggregate function in standard SQL that we can use to get the right names,
though some SQL vendors have added non-standard aggregate functions to do this.
So, we need another way.
We need to take our result and add the name
column to it afterwards.
Imagine doing the aggregation first, taking the result, and joining the original employees
table to it in order to add the name
column.
That's the idea, but it isn't possible to write 1 single SELECT
statement to aggregate and then join.
Recall that the clause order has JOIN
occurring before GROUP BY
.
We'll actually need 2 separate SELECT
statements.
How can we take the result of 1 SELECT
statement and take it as an input to a another one?
Algebraic Closure
When discussing the relational model, we briefly mentioned that SQL queries output tables. The inputs and outputs are both tables. This is important because it means that conceptually, there is nothing preventing the output of 1 query from becoming the input to another (it's still a table after all). This property is referred to in Mathematics as "algebraic closure." Relational algebra is closed. An algebra being closed just means that the inputs and outputs of any operation are in the same set: tables. Inputs and outputs are tables, and there aren't any operations you can do to produce something that isn't a table.
So, the output of 1 query can become the input of another. How do we actually accomplish this in the code? Conceptually, it would look something like this…
SELECT
top_salaries.*,
employees.name
FROM (top_salary_query_result) top_salaries
JOIN employees
ON top_salaries.company = employees.company
AND top_salaries.top_salary = employees.salary;
Where (top_salary_query_result)
would be the result table from our initial query getting the top salaries.
That result is being given the alias top_salaries
since the inputs need names when joining multiple tables.
Writing a Subquery
As it turns out, we can just replace (top_salary_query_result)
with that first query.
In general, you can put SELECT
statements inside of the FROM
clause instead of a table name…
SELECT
top_salaries.*,
employees.name
FROM (
SELECT
company,
max(salary) AS top_salary
FROM employees
GROUP BY company
) top_salaries
JOIN employees
ON top_salaries.company = employees.company
AND top_salaries.top_salary = employees.salary;
company VARCHAR | top_salary INTEGER | name VARCHAR |
---|---|---|
Poogle | 300000 | Marianne Schamberger |
Wamazon | 277000 | Ollie Murphy |
Footbook | 254000 | Wilson Maggio |
Footbook | 254000 | Ronald Graham |
Footbook has 2 names. This is because 2 employees are tied for the top salary.
This is called a "subquery" because 1 whole query is nested inside another. If you need, it is possible to nest a subquery inside a subquery inside a subquery etc.
Table Expressions vs Value Expressions
Let's add a bit more detail to the diagram showing the structure of a SELECT
statement.
Many clauses are a keyword plus an expression(s). But some expect a value expression and others expect a table expression. A value expression evaluates to a single value, like a number, string, boolean, etc. These can be columns names, math calculations, and most function calls. A table expression evaluates to a table instead of a single value. These include table names, subqueries, and even some functions that return tables.
The FROM
clause is the first clause we've seen that takes a table expression.
Another one is JOIN
.
We could've accomplished this query by simply swapping the FROM
and JOIN
clauses.
SELECT
top_salaries.*,
employees.name
FROM employees
JOIN (
SELECT
company,
max(salary) AS top_salary
FROM employees
GROUP BY company
) top_salaries
ON top_salaries.company = employees.company
AND top_salaries.top_salary = employees.salary;
company VARCHAR | top_salary INTEGER | name VARCHAR |
---|---|---|
Poogle | 300000 | Marianne Schamberger |
Wamazon | 277000 | Ollie Murphy |
Footbook | 254000 | Wilson Maggio |
Footbook | 254000 | Ronald Graham |
There is actually another way to write this query without using a subquery. We'll look at that more in the section on window functions.
Table Functions
All the functions we've seen so far return values, and can be called wherever a value expression is expected.
However, SQL also has table functions
which are functions that return tables.
Some standard ones include unnest
and json_table
,
but we'll study those later when we look at denormalization.
For now, check out Postgres' table function called generate_series
for creating sequences of numbers or timestamps.
SELECT *
FROM generate_series(0, 10);
generate_series INTEGER |
---|
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
SELECT day
FROM generate_series(
CAST('2020-04-01' AS TIMESTAMP),
CAST('2020-04-14' AS TIMESTAMP),
CAST('1 day' AS INTERVAL)
) quarantine_period(day);
day TIMESTAMP |
---|
2020-04-01 00:00:00 |
2020-04-02 00:00:00 |
2020-04-03 00:00:00 |
2020-04-04 00:00:00 |
2020-04-05 00:00:00 |
2020-04-06 00:00:00 |
2020-04-07 00:00:00 |
2020-04-08 00:00:00 |
2020-04-09 00:00:00 |
2020-04-10 00:00:00 |
2020-04-11 00:00:00 |
2020-04-12 00:00:00 |
2020-04-13 00:00:00 |
2020-04-14 00:00:00 |
They behave just like normal tables, but they are generated on the fly. You can give them aliases, join them with other tables, put them in subqueries, etc.