initial commit

This commit is contained in:
Marty Oehme 2024-01-24 20:42:11 +01:00
commit 4da221d82c
Signed by: Marty
GPG key ID: EDBF2ED917B2EF6A
6 changed files with 1542 additions and 0 deletions

225
.gitignore vendored Normal file

@@ -0,0 +1,225 @@
# Created by https://www.toptal.com/developers/gitignore/api/-f,python,linux,vim,quarto,markdown,jupyternotebooks,sql
# Edit at https://www.toptal.com/developers/gitignore?templates=-f,python,linux,vim,quarto,markdown,jupyternotebooks,sql
### JupyterNotebooks ###
# gitignore template for Jupyter Notebooks
# website: http://jupyter.org/
.ipynb_checkpoints
*/.ipynb_checkpoints/*
# IPython
profile_default/
ipython_config.py
# Remove previous ipynb_checkpoints
# git rm -r .ipynb_checkpoints/
### Linux ###
*~
# temporary files which can be created if a process still has a handle open of a deleted file
.fuse_hidden*
# KDE directory preferences
.directory
# Linux trash folder which might appear on any partition or disk
.Trash-*
# .nfs files are created when an open file is removed but is still being accessed
.nfs*
### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
# IPython
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
### Python Patch ###
# Poetry local configuration file - https://python-poetry.org/docs/configuration/#local-configuration
poetry.toml
# ruff
.ruff_cache/
# LSP config files
pyrightconfig.json
### Vim ###
# Swap
[._]*.s[a-v][a-z]
!*.svg # comment out if you don't need vector files
[._]*.sw[a-p]
[._]s[a-rt-v][a-z]
[._]ss[a-gi-z]
[._]sw[a-p]
# Session
Session.vim
Sessionx.vim
# Temporary
.netrwhist
# Auto-generated tag files
tags
# Persistent undo
[._]*.un~
# End of https://www.toptal.com/developers/gitignore/api/-f,python,linux,vim,quarto,markdown,jupyternotebooks,sql
/.quarto/
/output/

8
_quarto-blog.yml Normal file

@@ -0,0 +1,8 @@
project:
  type: default
  output-dir: /home/marty/projects/hosting/webpage/src/content/blog/2024-01-24-select-star-sql
format:
  hugo-md:
    preserve-yaml: true
    code-fold: false

21
_quarto.yml Normal file

@@ -0,0 +1,21 @@
project:
  type: default
  output-dir: output
format:
  html:
    code-fold: false
    echo: true
  pdf:
    echo: true # since we want to see the code in this case
    papersize: A4
    geometry:
      - left=2cm
      - right=2.5cm
      - top=2.5cm
      - bottom=2.5cm
    indent: true
    linestretch: 1.5
    fontfamily: lmodern
    fontsize: "12"
    pdf-engine: tectonic

616
data/tx_deathrow_full.csv Normal file

File diff suppressed because one or more lines are too long

BIN
data/tx_deathrow_full.db Normal file

Binary file not shown.

672
index.qmd Normal file

@@ -0,0 +1,672 @@
---
title: "Select Star SQL"
subtitle: "SQL Introduction follow-along"
description: "SQL Introduction follow-along"
author:
- Marty Oehme
date: today
pubDate: "2024-01-24T09:10:23+01:00"
weight: 10
toc: true
tags:
- sql
---
## Quick Introduction
I have recently been brushing up on my SQL skills,
because I firmly believe they are one of the most fundamental knowledge areas if you do anything to do with data.
The following content follows along with the excellent [Select Star SQL](https://selectstarsql.com/) introductory resource to SQL.
We will go through all four chapters, and follow most of the examples and challenges given.
Sometimes, however, we will swerve a little and approach things slightly differently,
or with a different query twist -
just because my mind obviously works differently than that of Zi Chong Kao who wrote the book.
In general, there is slightly more of a focus here on the actual SQL statements,
the knowledge they impart, and perhaps more explicit wording around caveats and limits in the queries.
Less of a focus is given to the meaning of the data under analysis,
though we of course still use the same dataset, and attempt to draw the same conclusions from similar queries.
In other words:
If you want to learn (and be excited by the possibilities and challenges of) SQL, go to [Select Star SQL](https://selectstarsql.com/).
If you have already done it and want a refresher, this page will give you a quick overview separated by topical chapter.
Or use it as inspiration to read-along like I did.
This document uses quarto under the hood, and we will experiment a little with the R/SQL interactions along the way.
It is also my first attempt to really integrate quarto publications into this blog,
hopefully easing the process for the future.
For now, we just load the database (as an SQLite file) and show which table we have,
using some R packages:
```{r}
suppressPackageStartupMessages(library(tidyverse)) # suppressing conflict warnings in outputs
library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "data/tx_deathrow_full.db")
as.data.frame(dbListTables(con))
```
That seems reasonable! We have a single table in the dataset, called `deathrow`.
This is perhaps a good moment to mention that the tables and columns do *not* have the exact same names as they do in the website's interactive follow-along boxes.
Instead, they carry the column names the author gave the full data set he made available as a csv download; I have not changed any of them.
<details>
<summary>A list of all column headers in the `deathrow` table.</summary>
0|Execution|TEXT\
1|Date of Birth|TEXT\
2|Date of Offence|TEXT\
3|Highest Education Level|TEXT\
4|Last Name|TEXT\
5|First Name|TEXT\
6|TDCJ Number|TEXT\
7|Age at Execution|TEXT\
8|Date Received|TEXT\
9|Execution Date|TEXT\
10|Race|TEXT\
11|County|TEXT\
12|Eye Color|TEXT\
13|Weight|TEXT\
14|Height|TEXT\
15|Native County|TEXT\
16|Native State|TEXT\
17|Last Statement|TEXT\
</details>
Perhaps, for a future version, it would be an interesting data cleaning exercise to rename them to match the book before following along, as sketched below.
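A minimal sketch of what such a renaming could look like in SQLite (deliberately not executed in this document, so the data stays untouched for the rest of the post; `ex_age` and `first_name` are what I assume the book's boxes call these columns):
```sql
-- hypothetical cleaning step, not run here (requires SQLite 3.25+):
-- rename the csv-derived headers to (what I assume are) the book's column names
ALTER TABLE deathrow RENAME COLUMN "Age at Execution" TO ex_age;
ALTER TABLE deathrow RENAME COLUMN "First Name" TO first_name;
```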
## Chapter 1 - Individual operations {#sec-individual}
For now, to test the database connection, we simply print three inmate names:
We do so by doing a `SELECT` for just three rows from the overall table.
```{sql connection=con}
-- SELECT * FROM deathrow LIMIT 3;
SELECT "First Name", "Last Name" FROM deathrow LIMIT 3;
```
Everything seems to be working smoothly, and we can even directly create `sql` code chunks which connect to our database.
Neat!
The star indeed selects 'everything' (somewhat like globbing), which means we `SELECT` *everything* from our deathrow table, but limit the number of rows to three.
In other words, `SELECT` is more oriented towards columns, with `LIMIT` (and later e.g. `WHERE`) instead filtering out rows.
The 'star' selection remains theoretical in this document, unfortunately,
with `quarto` (and `knitr`) as our rendering engines.
Doing this selection here would just end in word-spaghetti since we have *too many columns* to display nicely.
Instead, we limit ourselves to just a couple.
You can still see the 'star' selection as a comment above the query we actually ran, however.
For now, we `SELECT` from our table.
We do not have to, though.
Selections at their most basic simply evaluate the expressions passed in and return their results,
even if we do not `SELECT` `FROM` a specific table:
```{sql connection=con}
SELECT 50 + 2, 51 / 2, 51 / 2.0;
```
This example also reflects the float/integer differences in SQL:
If you only work with integers, the result will be an integer.
To work with floating point numbers at least one of the involved numbers has to be a float.
Often this is accomplished by just multiplying with `* 1.0` at some point in the query.
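As a quick, non-executed illustration of that coercion trick:
```sql
SELECT 51 / 2;        -- 25:   both operands are integers, so the result is truncated
SELECT 51 * 1.0 / 2;  -- 25.5: multiplying by 1.0 promotes the calculation to floats
```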
Having filtered on *columns* above with `SELECT`, let us now filter on *rows* with the `WHERE` block:
```{sql connection=con}
SELECT "Last Name", "First Name"
FROM deathrow
WHERE "Age at Execution" <= 25;
```
In this case, we filter for all inmates who were 25 or younger at execution (only very few, due to the lengthy process).
`WHERE` blocks take an expression which results in a boolean truth value (in this case 'is it smaller than or equal to?'),
and every row for which the expression is true is included in the rest of the query.
This bit is also important, as `WHERE` operations will generally run before other operations such as `SELECT` or aggregations, as we will see in the next chapter.
As a last bit to notice, the order of the `SELECT`ed columns also matters:
it is the order in which the columns will appear in the resulting table.
Here, we switched first and last names compared to the previous query.
While numeric comparisons are one thing, we can of course filter on text as well.
A single equals sign generally accomplishes conditional comparison:
```{sql connection=con}
SELECT "First Name", "Last Name"
FROM deathrow
WHERE "Last Name" = "Jones";
```
Of course, string comparison needs some leeway to account for small differences in the thing to be searched for.
For example, names could have additions ('Sr.', 'Jr.') or enumeration ('John II') or even simple misspellings.
We can use `LIKE` to help with that for string comparisons:
```{sql connection=con}
SELECT "First Name", "Last Name", "Execution"
FROM deathrow
WHERE "First Name" LIKE 'Raymon_'
AND "Last Name" LIKE '%Landry%';
```
It allows a few wildcards in your queries to accomplish this:
`_` matches exactly one character,
while `%` matches any number of characters.
Both only operate in the place they are put, so that `%Landry` and `Landry%` can be different comparisons.
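A small, non-executed sketch of how the placement of the wildcard changes the match:
```sql
SELECT 'Landry Jr.' LIKE 'Landry%';  -- 1: '%' allows anything *after* 'Landry'
SELECT 'Landry Jr.' LIKE '%Landry';  -- 0: the string does not *end* in 'Landry'
SELECT 'Raymond'    LIKE 'Raymon_';  -- 1: '_' matches exactly one character, the 'd'
```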
Above you can also see that we can combine multiple conditions with logical operators.
The precedence order goes `NOT`, then `AND`, then `OR`, so that the following:
```{sql connection=con}
SELECT 0 AND 0 OR 1;
```
Returns `1`.
It reads '0 and 0', which results in a 0; and *then* it reads '0 or 1' which ultimately results in a 1.
To re-organize them into the precedence we require, simply use parentheses:
```{sql connection=con}
SELECT 0 AND (0 OR 1);
```
The parenthesized `OR` clause is now looked at before the `AND` clause, resulting in a 0.
As a capstone for Chapter 1,
and to show that we do *not* need the column we filter on (with `WHERE`) as a `SELECT`ed one in the final table,
we will select a specific statement:
```{sql connection=con}
SELECT "Last Statement"
FROM deathrow
WHERE "Last Name" = 'Beazley';
```
## Chapter 2 - Aggregations {#sec-aggregation}
We can use aggregators (such as `COUNT`, `AVG` or `SUM`) to consolidate information from *multiple* rows of input.
```{sql connection=con}
SELECT COUNT("Last Statement") FROM deathrow WHERE "Last Statement" != '';
```
Here, we diverge slightly from the book:
Whereas in the book `NULL`s mark missing statements, this is not so for our dataset.
Since we did no cleaning after importing from csv, empty statements were imported as empty strings and not as `NULL` objects.
Since `COUNT` skips `NULL` objects (treating them as a 'non-count' if you want) but does count empty strings, we additionally have to select only those rows where the statement is not an empty string.
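To illustrate the difference on a few made-up values (a self-contained sketch, not run against the actual table):
```sql
-- one empty string, one NULL, one real statement
WITH s("Last Statement") AS (VALUES (''), (NULL), ('I am innocent'))
SELECT
  COUNT("Last Statement"),                           -- 2: NULL is skipped, '' is counted
  COUNT(CASE WHEN "Last Statement" != '' THEN 1 END) -- 1: only the real statement remains
FROM s;
```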
We can count the overall number of entries (i.e. the denominator for later seeing which share of inmates proclaimed innocence) by `COUNT`ing a column which we know to have a value for each entry:
```{sql connection=con}
SELECT COUNT("ex_age") FROM deathrow;
```
Or, even better, as the book shows, use `COUNT(*)` to count rows regardless of any single column's values. Since we can't have completely `NULL` rows (such a row would simply not exist), we will definitely get the overall count back this way:
```{sql connection=con}
SELECT COUNT(*) FROM deathrow;
```
Now, let's see how we can combine this with conditional `CASE WHEN` statements.
Let's count the inmates from 'Harris' and from 'Bexar' county:
```{sql connection=con}
SELECT
SUM(CASE WHEN "County"='Harris' THEN 1 ELSE 0 END),
SUM(CASE WHEN "County"='Bexar' THEN 1 ELSE 0 END)
FROM deathrow;
```
We can try to find the number of inmates who were over 50 years old at the time of their execution:
```{sql connection=con}
SELECT COUNT(*) FROM deathrow WHERE "Age at Execution" > 50;
```
You can see that the `WHERE` block filters *before* we run aggregations.
That is useful since we first reduce the number of entries we have to consider before any further operations.
Now, we can practice working with conditions, counts and cases for the same goal:
Finding all instances where inmates did *not* give a last statement.
First, we will do so again with a `WHERE` block:
```{sql connection=con}
SELECT COUNT(*) FROM deathrow WHERE "Last Statement" = '';
```
Then, we can do the same, but using a `COUNT` and `CASE WHEN` blocks:
```{sql connection=con}
SELECT
COUNT(CASE WHEN "Last Statement" = '' THEN 1 ELSE NULL END)
FROM deathrow;
```
This is a little worse performance-wise.
Whereas in the first attempt (using `WHERE`) we first filter the overall table down to relevant entries and then count them, here we go through the whole table to count each (or 'not-count' those we designate with `NULL`).
Lastly, if we had cleaned the data correctly before using it here (especially designated empty strings as `NULL` objects), we could use `COUNT` blocks only:
```{sql connection=con}
SELECT
COUNT(*) - COUNT("Last Statement")
FROM deathrow;
```
You can see, however, that this results in `0` entries --- we count *all* entries both times since nothing is `NULL`.
While somewhat contrived here, it should point to the fact that it is generally better to have clean `NULL`-designated data before working with it.
However, this way of counting would also be the least performant of the three, with all rows being aggregated with counts twice.
In other words, the exercise showed us three things:
- there are many ways to skin an SQL query
- correct `NULL` designations can be important during cleaning
- performant operations should generally filter before aggregating
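As an aside, here is a minimal sketch of what such `NULL` designation could look like (deliberately not executed here, so the rest of the post keeps working on the raw import):
```sql
-- hypothetical cleaning step: turn empty-string statements into proper NULLs
UPDATE deathrow
SET "Last Statement" = NULL
WHERE "Last Statement" = '';

-- or, without touching the data, NULLIF() does the conversion on the fly
SELECT COUNT(*) - COUNT(NULLIF("Last Statement", '')) FROM deathrow;
```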
There are more aggregation functions, such as `MIN`, `MAX`, `AVG`.[^docs]
[^docs]: The book recommends documentation such as [SQLite](http://sqlite.org), [W3 Schools](https://www.w3schools.com/sql/default.asp) and, of course, Stack Overflow.
```{sql connection=con}
SELECT
MIN("Age at Execution"),
MAX("Age at Execution"),
AVG("Age at Execution")
FROM deathrow;
```
We can also combine functions, running one on the results of another (just like composing function calls in e.g. python).
Here is an example calculating the average length of statements for cases where a statement exists, using the scalar `LENGTH` function inside the `AVG` aggregation:
```{sql connection=con}
SELECT
AVG(LENGTH("Last Statement"))
FROM deathrow
WHERE "Last Statement" != '';
```
Another aggregation is `DISTINCT` which works somewhat like the program `uniq` on Unix systems:
```{sql connection=con}
SELECT DISTINCT("County") FROM deathrow;
```
It presents all the options (or 'categories' if you think of your data as categorical) that are represented in a column, or the output of another aggregation.
On its face it is less of an *aggregation* function as the book remarks, since it does not output "a single number".
But since it 'aggregates' the contents of multiple rows into its output I would still very much classify it as such.
Finally, let's look at what happens with 'misshapen' queries:
<!-- ::: {#lst-strange-query} -->
```{sql connection=con}
SELECT "First Name", COUNT(*) FROM deathrow;
```
<!-- A strange query -->
<!-- ::: -->
Here, we make a query which is indeed very strange:
The `COUNT` aggregation wants to output a single aggregated number, while the single column selection `"First Name"` wants to output each individual row.
What happens?
The database goes easy on us and does not error out; it takes the aggregation as a guide that we should only receive a single row back and picks an arbitrary entry from the first names.
In the book, this is the last entry ('Charlie'), though it does warn that not all databases return the same.
This is reflected in the SQLite query here in fact returning the *first* entry ('Christopher Anthony') instead.
The lesson is to *not* rely on unclear operations like this, but to be explicit if, say, we indeed want to grab the last entry of something:
```{sql connection=con}
SELECT "First Name" FROM deathrow ORDER BY ROWID DESC LIMIT 1;
```
Since SQLite does not come with a convenient `LAST` aggregator (some other databases do),
we need to work around it by reversing the order of the table based on its `ROWID` (which increases for each entry).
Thus, the highest `ROWID` is the last entry.
Having reversed it, we can limit the output to a single row to arrive back at 'Charlie'.[^SOlast]
[^SOlast]: This is cribbed from the very nice tips on grabbing the last SQLite entry on [this Stack Overflow](https://stackoverflow.com/q/24494182) question.
However, this operation (as far as I know for now) is no longer compatible with *also* aggregating a row count like we did above.
So, for the final query in this aggregation chapter, let's see the share of inmates which insisted on their innocence even during the last statement:
```{sql connection=con}
SELECT
1.0 * COUNT(CASE WHEN "Last Statement" LIKE '%innocent%' THEN 1 ELSE NULL END) / (COUNT(*)) * 100
FROM deathrow;
```
We can see that over 5% of the people proclaimed their innocence even during their final statement.
Of course, this method of simple string matching has some issues:
If somebody uses other phrasings (the book mentions the example of 'not guilty'), we undercount, so our number is a lower bound of proclamations.
We also do not know in what way people used the word --- what if they instead proclaimed *not* being innocent?
In that case our count would rather be higher than the true number.
At the same time, we do not know the thoughts of the people not giving last statements at all.
Perhaps, it would also make sense to compare only the number of people who did give a statement (but did not mention their innocence) with those who did, which would put us on the lower bound again.
We can see this behavior if we just show a sub-section of the statements:
```{sql connection=con}
SELECT "First Name", "Last Statement"
FROM deathrow
WHERE "Last Statement" LIKE '%innocent%'
LIMIT 4;
```
While Preston, Jonathan and Keith do protest their innocence to the warden, their loved ones or the world,
Jeffrey instead mentions 'innocent kids'.
Now, he could include himself in this category but we do not know for sure.
So, this concludes a chapter about *aggregating*:
operating with functions on multiple rows in the dataset, allowing study of more system-level behavior in the data.
## Chapter 3 - Grouping and Nesting {#sec-grouping}
After dealing with individual rows (Ch 1) and aggregations (Ch 2), we will now do some data *organization* based on specific rows or columns -
a sort of mish-mash between keeping multiple rows in the output like in the first case and doing operations on them like in the latter.
The chapter begins by looking at a visualization of the data, which shows a long right tail (or right skew, as I know it to be called).
We will ultimately end the chapter by taking a look at the percentage of executions each county contributed, to investigate this skew.
Grouping can be accomplished with the `GROUP BY` block:[^semicolon]
```{sql connection=con}
SELECT
"County",
COUNT(*) AS county_executions
FROM deathrow
GROUP BY "County"
;
```
[^semicolon]: You can see that I have started putting the final semicolon on its own separate line.
It is a technique I have seen being used by 'Alex the Analyst' on YouTube and which I am quite keen on replicating.
For me, it serves two purposes:
First, for some SQL dialects, the closing semicolon is actually required and this makes it harder to forget it.
Second, it provides a visually striking *close* of the query for myself as well, which may be useful once getting into more dense queries.
This is reminiscent of the 'misshapen' query above; however, there is a key difference:
even with aggregations around them, the column(s) *being grouped on* (called 'grouping columns') are allowed to output multiple rows.
Let's first do another quick grouping query to see how it can work.
We'll try to find the most common last names:
```{sql connection=con}
SELECT "Last Name", COUNT(*) AS count
FROM deathrow
GROUP BY "Last Name"
ORDER BY "count" DESC
;
```
The code above also makes use of *aliasing*, with an `AS <new-name>` block with which you can provide an efficient short-hand or new name for `SELECT` outputs.
I believe it also works for the outputs of e.g. `JOIN` operations.
Let's now have a breakdown of executions with and without a last statement by county:
```{sql connection=con}
SELECT
"Last Statement" IS NOT '' AS has_last_statement,
"County",
COUNT(*)
FROM deathrow
GROUP BY "County", "has_last_statement"
;
```
The order in which you group by here matters!
In this case, we first order by county and then statements -
all 'Anderson' inmates appear first, then all 'Aransas', at some point the 'Bell' county cases, both those with and without statement, before 'Bexar' county, and so on.
Had we put the groupings the other way around,
we would first have all `has_last_statement = 0` entries, from 'Anderson' county to 'Wood' county last, and then repeat the same for all counts of cases with statements.
Also, we can of course manually influence this order.
Using the `ORDER BY` block we can choose a column to order by, so regardless of the grouping sequence we can make sure to, for example, order on counties.
We can also sort by a different column altogether, such as the counts, which would then require a naming alias.
Using `ORDER BY "Column" DESC` we can reverse the sorting.
Here is the same example from above implementing most of these ideas:
```{sql connection=con}
SELECT
"Last Statement" IS NOT '' AS has_last_statement,
"County",
COUNT(*) AS number
FROM deathrow
GROUP BY "has_last_statement", "County"
ORDER BY "number" DESC
;
```
We already know from @sec-aggregation that `WHERE` blocks will take place before any aggregation.
The same is true for groupings - `WHERE` will always execute (and thus filter) before `GROUP BY` executes.
The following counts, for each county, the number of executed inmates who were at least 50:
```{sql connection=con}
SELECT
"County",
COUNT(*) AS number
FROM deathrow
WHERE "Age at Execution" >= 50
GROUP BY "County"
;
```
We do not select the age column for further consideration, but since `WHERE` runs before all other operations we also do not need it.
But what if we want to filter on the *outputs* of grouping or aggregation functions?
The `HAVING` block solves that.
The following shows the counties in which more than two inmates aged 50 or older were executed:
```{sql connection=con}
SELECT
"County",
COUNT(*) AS number
FROM deathrow
WHERE "Age at Execution" >= 50
GROUP BY "County"
HAVING "number" > 2
ORDER BY "number" DESC
;
```
As one interesting fact for possibly more advanced queries:
`GROUP BY` blocks do *not* need the columns on which they group to be in the `SELECT` block!
Generally this does not make a lot of sense - when we group by county but do not show the county in the output, the result just looks like fairly arbitrary groupings,
but there will invariably be situations where this knowledge is useful.
```{sql connection=con}
SELECT "County"
FROM deathrow
GROUP BY "County"
;
```
This exactly mirrors the `SELECT DISTINCT` aggregation, but is accomplished with grouping instead.
Many ways to skin a query again!
Now let's pivot a little and look at query nesting.
Since we sometimes will want to run one query leading into another (e.g. to compute percentages),
we have to have some way to integrate them with one another.
We do so through *nested queries*, demarcated with `(parentheses)` within another query.
Let's see how we utilize this to select the inmate with the longest last statement:
```{sql connection=con}
SELECT "First Name", "Last Name"
FROM deathrow
WHERE LENGTH("Last Statement") =
(
SELECT MAX(LENGTH("Last Statement"))
FROM deathrow
)
;
```
It looks a little cumbersome, but essentially we filter for the row whose statement length is (exactly) the length of the longest statement,
which we computed beforehand in a separate sub-query.
Why do we need a nested query here?
The book itself explains it most succinctly:
> nesting is necessary here because in the WHERE clause,
as the computer is inspecting a row to decide if its last statement is the right length,
it can't look outside to figure out the maximum length across the entire dataset.
We have to find the maximum length separately and feed it into the clause.
We will now attempt to do the same to find the percentage of all executions contributed by each county:
```{sql connection=con}
SELECT
"County",
100.0 * COUNT(*) / (
SELECT COUNT(*)
FROM deathrow
) as percentage
FROM deathrow
GROUP BY "County"
ORDER BY "percentage" DESC
;
```
It follows the same concept:
We need to invoke a nested query because our original query is already restricted to a sub-group of all rows and we **cannot get out of it within that query**.
Instead, we invoke another query 'before' the original, which still has access to all rows, and create our own nested aggregation.
The output of that then feeds into the original query.
Quite clever, a little cumbersome, and presumably the origin of quite a few headaches in writing elegant SQL queries.
## Chapter 4 - Joins and Dates {#sec-joining}
Before we look at joins, let's look at handling dates (as in, the data type) in SQL.
While we have a couple of columns of reasonably formatted dates (ISO-8601), that doesn't mean we can automatically use them as such.
To make use of such nice and clean data we should use operations that specifically [make use of dates](https://www.sqlite.org/lang_datefunc.html) for their calculations.
```{sql connection=con}
SELECT
julianday('1993-08-10') - julianday('1989-07-07') as day_diff
;
```
The `julianday()` function will transform our ISO-compatible dates into timestamp floats on which we can operate like usual, in this case subtracting the latter from the former to get the period of time between them.
Like the unix timestamp, the Julian day counts from a specific point in time as 0 continuously upwards,
only that it counts from 4714 B.C.E. (not 1970) and per day, not per second.
Anything below a single day is fractional.
Half a day's difference would thus show up as `0.5`, which makes the unit perhaps more suited to larger time differences.
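A quick, self-contained illustration of those fractional days (not executed here):
```sql
SELECT julianday('1989-07-07 12:00') - julianday('1989-07-07 00:00');  -- 0.5, half a day
SELECT julianday('1989-07-08')       - julianday('1989-07-07');        -- 1.0, a full day
```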
Now we will join a table with itself, only shifted by a row.
This makes it necessary to prepare the data for the 'other' table first,
adding one to the column we are going to join on.
```{sql connection=con}
SELECT
"Execution" + 1 AS ex_number,
"Execution Date" AS prev_ex_date
FROM deathrow
WHERE "Execution" < 553
```
We could perhaps use `julianday` comparisons, but since we have access to the execution numbers and they are consecutive, we can instead use them like an ID and just shift them one up.
Now we want to put data from one row into *another* row and neither aggregations nor groups can help us out.
Instead, to gain access to data from other rows (whether in the same or another table) we use the `JOIN` block.
There is an `INNER JOIN` (the default), a `LEFT JOIN`, a `RIGHT JOIN` and an `OUTER JOIN` block.
**The different joins only differ in how they handle unmatched rows.**
With an inner join, any unmatched rows are dropped completely (essentially intersection merge),
with an outer join unmatched rows are preserved completely from both tables (union merge),
with a left join unmatched rows from the *left* (i.e. `FROM XY`) table are preserved,
and with a right join unmatched rows from the *right* (i.e. `JOIN XY`) table are preserved.
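To make that concrete, here is a small sketch on two throwaway tables (not part of our dataset, and not executed in this document):
```sql
-- ids 1 and 2 on the left, ids 2 and 3 on the right: only id 2 matches
WITH lhs(id, a) AS (VALUES (1, 'a1'), (2, 'a2')),
     rhs(id, b) AS (VALUES (2, 'b2'), (3, 'b3'))
SELECT lhs.id, a, b
FROM lhs
LEFT JOIN rhs ON lhs.id = rhs.id;
-- the left join keeps id 1 with a NULL for b; an inner join would drop that row,
-- while a right or full outer join (SQLite 3.39+) would additionally keep id 3
```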
The shifted selection from before prepares our table to be joined with 'itself', lining up the (shifted) rows we want to combine.
Of course, we do not have to do the shift in that selection already; I find it more intuitive to do it in the value comparison instead.
We end up with the following query:
```{sql connection=con}
SELECT
"Execution Date",
prev_ex_date AS "Previous Execution",
JULIANDAY("Execution Date") - JULIANDAY(prev_ex_date) AS "Difference in Days"
FROM deathrow
JOIN (
SELECT
"Execution" AS ex_number,
"Execution Date" AS prev_ex_date
FROM deathrow
WHERE "Execution" < 553
) AS previous
ON deathrow."Execution" = previous.ex_number + 1
ORDER BY "Difference in Days" DESC
LIMIT 10
;
```
This shows us the ten longest timeframes in which no executions occurred.
You can see we do the shifting in the `ON` block itself, leading to the more natural reading of
'join the tables on execution number being the same as the previous execution number plus one'.
For my brain this is more easily comprehensible as a row-shift.
Otherwise, we only select the execution number
(though we only need it for the shift operation and drop it in the outer selection)
and the execution date which is the one important column we are looking for.
We also do not include Execution number 553 (the largest execution number) since there will be no newer execution to join it with in the dataset.
The resulting table would be no different if we included it, however.
Remember we are doing an `INNER JOIN`, which drops any non-matching rows by default.
Such a 'self join' is a common technique to **grab information from other rows** from the same table.
This is already quite an advanced query!
The book plots a graph here, which I will not replicate for the moment.
However, it shows that roughly up to 1993 there was a lower overall execution rate,
with two clearly visible larger hiatuses afterwards.
Let's focus on those two hiatuses and limit the data to executions from 1994 onward.
As a last thing, let's make it a little more elegant by making the original ('previous') table query way simpler:
```{sql connection=con}
SELECT
previous."Execution Date" AS "Beginning of period",
deathrow."Execution Date" AS "End of period",
JULIANDAY(deathrow."Execution Date") - JULIANDAY(previous."Execution Date") AS "Difference in Days"
FROM deathrow
JOIN deathrow AS previous
ON deathrow."Execution" = previous."Execution" + 1
WHERE DATE(deathrow."Execution Date") > DATE('1994-01-01')
ORDER BY "Difference in Days" DESC
LIMIT 10
;
```
We can also see much more clearly what the book is talking about, with big stays of execution occurring in 1996-1997 as well as 2007-2008.
A little more wordy per line but overall a lot more elegant.
And mostly enabled due to putting the row 'shift' into the `ON` block itself.
However, it definitely highlights the importance of good name-aliasing and of the `JOIN ON` block.
We are now equipped to grab data from multiple rows, multiple tables and rearrange them as we see necessary.
## Quick Conclusion
So, this should already give a rough mental model for the *kinds* of operations to be done with SQL.
We can operate on the contents of a single row,
we can aggregate the contents of many rows resulting in a single-row output,
we can group by columns which allows us to aggregate into multiple rows,
and we can work with the contents of *other* rows by joining tables.
We have learned to *query* data but not to create or manipulate data,
i.e. working with side-effects.
We also have not learned about the concepts of `window` functions or common table expressions.
These are additional operations to get to know,
but of course it is also important to get an overall broader view onto the concepts and mental mapping of SQL itself.
The book closes with a call-to-challenge, with an additional dataset to cut your teeth on.