What Are Windows?
Windows are a very different concept from everything we've done so far with SQL. Windowing gives each row visibility to the values of different rows in the same table.
Suppose you are a software engineer working at the International Olympic Committee.
You are tasked with analyzing data from the shot put event in the Olympic Games.
Each shot putter's throw is recorded in the shot_puts
table.
column_name | type |
---|---|
id | INTEGER |
athlete_id | INTEGER |
athlete_sex | VARCHAR |
distance_m | NUMERIC |
technique | VARCHAR |
Your first task is simple: rank the throws in the shot_puts
table.
Something like this…
rank BIGINT | id INTEGER | athlete_id INTEGER | athlete_sex VARCHAR | distance_m NUMERIC | technique VARCHAR |
---|---|---|---|---|---|
1 | 93514 | 7186 | male | 21.268 | glide |
2 | 20546 | 5519 | male | 20.208 | spin |
3 | 73198 | 1791 | male | 19.876 | glide |
4 | 92196 | 5933 | female | 19.273 | glide |
5 | 58542 | 4958 | male | 18.491 | spin |
…adding a rank
column to show the ranking of each throw among all the other competitors.
How would we do that?
When computing the rank for each row, we need some way of comparing it to other rows in the same table.
However, we don't have a way to do that yet.
Whenever we compute columns in the SELECT
clause, we're basically just "mapping" each row in the source table(s) into each row of the output table,
(unless aggregations are involved, and those won't help us here.)
That is where window functions come in.
Window Functions
Window functions, a.k.a "analytic functions", are functions you can call in the SELECT
clause, where you specify a window.
The window is another input to the function, in addition to the normal arguments between the parentheses.
Unlike the normal arguments, windows can "see" or access the other rows in the same table.
Calling window functions looks like function_name() OVER (…)
,
a function call followed by an OVER
clause which defines the window.
The function_name
must be 1 of a specific set of functions that can work as window functions.
List of all window functions.
You can Postgres' full list of window functions here.
The first one we'll learn is rank()
.
It does what you'd intuit, generates a ranking number based on the comparison with the other rows.
The window definition in the OVER
clause tells it which rows to rank, and how to rank them.
SELECT
rank() OVER (ORDER BY distance_m DESC),
*
FROM shot_puts
ORDER BY 1;
rank BIGINT | id INTEGER | athlete_id INTEGER | athlete_sex VARCHAR | distance_m NUMERIC | technique VARCHAR |
---|---|---|---|---|---|
1 | 93514 | 7186 | male | 21.268 | glide |
2 | 20546 | 5519 | male | 20.208 | spin |
3 | 73198 | 1791 | male | 19.876 | glide |
4 | 92196 | 5933 | female | 19.273 | glide |
5 | 58542 | 4958 | male | 18.491 | spin |
6 | 38980 | 8852 | female | 18.485 | glide |
7 | 92127 | 3749 | female | 18.316 | glide |
8 | 84190 | 1121 | female | 17.721 | glide |
9 | 50726 | 8639 | female | 17.553 | glide |
10 | 85215 | 7900 | male | 16.229 | glide |
The most basic window is simply OVER ()
, which means the window function "sees" the whole table in an unspecified order.
However, we are using rank()
here, and calling rank() OVER ()
will produce a ranking in a random order, which is meaningless.
Therefore, we specify the window ordering by OVER (ORDER BY distance_m DESC)
, so that an order is defined.
This gives our ranking the meaning we want it to have: a ranking of these throws from most to least distance.
As we can see, window functions run once per row and generate 1 value per row.
This is how every shot put in the table has its own rank.
They can take inputs from the current row and the window for that row.
Note that the value of the window can be different for each row,
though it is the same for all rows in this example: a view of the entire shot_puts
table ordered by distance.
We'll look at some window functions which do not depend on ordering later. For now, let's look at another challenge with ordered windows.
Fetching Values From Other Rows
Our next engineering task is to produce a table showing the difference in distance between each athlete and the 1 a rank behind them.
It's simple to find a difference in distances by subtraction, but what are we subtracting?
Each athlete's throw distance is in the distance_m
column.
We also need to get the throw distance of the athlete behind them and subtract it.
This means each row needs to refer to the previous row in the ordering, e.g. the 2nd row can access the 1st.
This is where window functions come in again.
We saw that the rank()
function lets us generate a single number based on the window order,
but doesn't let us refer to any specific row in the window.
For this, we can use lag()
instead.
SELECT
*,
distance_m - lag(distance_m) OVER (ORDER BY distance_m) AS distance_diff_m
FROM shot_puts
ORDER BY distance_m;
id INTEGER | athlete_id INTEGER | athlete_sex VARCHAR | distance_m NUMERIC | technique VARCHAR | distance_diff_m NUMERIC |
---|---|---|---|---|---|
85215 | 7900 | male | 16.229 | glide | NULL |
50726 | 8639 | female | 17.553 | glide | 1.324 |
84190 | 1121 | female | 17.721 | glide | 0.168 |
92127 | 3749 | female | 18.316 | glide | 0.595 |
38980 | 8852 | female | 18.485 | glide | 0.169 |
58542 | 4958 | male | 18.491 | spin | 0.006 |
92196 | 5933 | female | 19.273 | glide | 0.782 |
73198 | 1791 | male | 19.876 | glide | 0.603 |
20546 | 5519 | male | 20.208 | spin | 0.332 |
93514 | 7186 | male | 21.268 | glide | 1.060 |
When computing the output for each row, the value inside the lag()
function is computed for the previous row.
Like rank()
, lag()
is sensitive to its window ordering, so ORDER BY
is used.
The
ORDER BY
inside anOVER
clause applies only to that specific window, not to the result table as a whole. For ordering the output table, a separateORDER BY
clause is still needed.
The complement to lag()
is lead()
, which looks at the next row.
We could've used that here to get an equivalent result if we reverse the window order.
SELECT
*,
distance_m - lead(distance_m) OVER (ORDER BY distance_m DESC) AS distance_diff_m
FROM shot_puts
ORDER BY distance_m DESC;
id INTEGER | athlete_id INTEGER | athlete_sex VARCHAR | distance_m NUMERIC | technique VARCHAR | distance_diff_m NUMERIC |
---|---|---|---|---|---|
93514 | 7186 | male | 21.268 | glide | 1.060 |
20546 | 5519 | male | 20.208 | spin | 0.332 |
73198 | 1791 | male | 19.876 | glide | 0.603 |
92196 | 5933 | female | 19.273 | glide | 0.782 |
58542 | 4958 | male | 18.491 | spin | 0.006 |
38980 | 8852 | female | 18.485 | glide | 0.169 |
92127 | 3749 | female | 18.316 | glide | 0.595 |
84190 | 1121 | female | 17.721 | glide | 0.168 |
50726 | 8639 | female | 17.553 | glide | 1.324 |
85215 | 7900 | male | 16.229 | glide | NULL |
If you want to lag or lead by 2, 3, 4, n rows, etc. you can pass a second argument to do that, e.g. lag(distance_m, 2)
.
There are other window functions
for referencing specific rows within the window, like first_value()
, last_value()
, and nth_value()
.