initial commit
commit 4da221d82c
6 changed files with 1542 additions and 0 deletions
.gitignore (vendored, new file)
@@ -0,0 +1,225 @@
# Created by https://www.toptal.com/developers/gitignore/api/-f,python,linux,vim,quarto,markdown,jupyternotebooks,sql
# Edit at https://www.toptal.com/developers/gitignore?templates=-f,python,linux,vim,quarto,markdown,jupyternotebooks,sql

### JupyterNotebooks ###
# gitignore template for Jupyter Notebooks
# website: http://jupyter.org/

.ipynb_checkpoints
*/.ipynb_checkpoints/*

# IPython
profile_default/
ipython_config.py

# Remove previous ipynb_checkpoints
# git rm -r .ipynb_checkpoints/

### Linux ###
*~

# temporary files which can be created if a process still has a handle open of a deleted file
.fuse_hidden*

# KDE directory preferences
.directory

# Linux trash folder which might appear on any partition or disk
.Trash-*

# .nfs files are created when an open file is removed but is still being accessed
.nfs*

### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook

# IPython

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

### Python Patch ###
# Poetry local configuration file - https://python-poetry.org/docs/configuration/#local-configuration
poetry.toml

# ruff
.ruff_cache/

# LSP config files
pyrightconfig.json

### Vim ###
# Swap
[._]*.s[a-v][a-z]
!*.svg  # comment out if you don't need vector files
[._]*.sw[a-p]
[._]s[a-rt-v][a-z]
[._]ss[a-gi-z]
[._]sw[a-p]

# Session
Session.vim
Sessionx.vim

# Temporary
.netrwhist
# Auto-generated tag files
tags
# Persistent undo
[._]*.un~

# End of https://www.toptal.com/developers/gitignore/api/-f,python,linux,vim,quarto,markdown,jupyternotebooks,sql

/.quarto/
/output/

_quarto-blog.yml (new file)
@@ -0,0 +1,8 @@
project:
  type: default
  output-dir: /home/marty/projects/hosting/webpage/src/content/blog/2024-01-24-select-star-sql

format:
  hugo-md:
    preserve-yaml: true
    code-fold: false

_quarto.yml (new file)
@@ -0,0 +1,21 @@
project:
  type: default
  output-dir: output

format:
  html:
    code-fold: false
    echo: true
  pdf:
    echo: true # since we want to see the code in this case
    papersize: A4
    geometry:
      - left=2cm
      - right=2.5cm
      - top=2.5cm
      - bottom=2.5cm
    indent: true
    linestretch: 1.5
    fontfamily: lmodern
    fontsize: "12"
    pdf-engine: tectonic

data/tx_deathrow_full.csv (new file, 616 lines)
File diff suppressed because one or more lines are too long.

data/tx_deathrow_full.db (new binary file)
Binary file not shown.

index.qmd (new file)
@@ -0,0 +1,672 @@
---
title: "Select Star SQL"
subtitle: "SQL Introduction follow-along"
description: "SQL Introduction follow-along"
author:
  - Marty Oehme
date: today
pubDate: "2024-01-24T09:10:23+01:00"
weight: 10
toc: true
tags:
  - sql
---

## Quick Introduction

I have recently been brushing up on my SQL skills,
because I firmly believe they are among the most fundamental for anyone who works with data.

The following content follows along with the excellent [Select Star SQL](https://selectstarsql.com/) introductory resource to SQL.
We will go through all four chapters, and follow most of the examples and challenges given.
Sometimes, however, we will swerve a little and approach things slightly differently,
or with a different query twist -
just because my mind obviously works differently from that of Zi Chong Kao, who wrote the book.

In general, the focus here lies slightly more on the actual SQL statements,
the knowledge they impart, and more explicit notes on the limits of each query.

Less focus is given to the meaning of the data under analysis,
though we of course still use the same dataset, and attempt to draw the same conclusions from similar queries.
In other words:
If you want to learn (and be excited by the possibilities and challenges of) SQL, go to [Select Star SQL](https://selectstarsql.com/).
If you have already done it and want a refresher, this page will give you a quick overview separated by topical chapter.
Or use it as inspiration to read along like I did.

This document uses quarto under the hood, and we will experiment a little with the R-SQL interactions along the way.
It is also my first real attempt to integrate quarto publications into this blog,
which should ease the process in the future.
For now, we just load the database (as an SQLite file) and show which tables we have,
using some R packages:

```{r}
suppressPackageStartupMessages(library(tidyverse)) # suppressing conflict warnings in outputs
library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "data/tx_deathrow_full.db")
as.data.frame(dbListTables(con))
```

That seems reasonable! We have a single table in the dataset, called `deathrow`.

This is perhaps a good moment to mention that the tables and columns do *not* have the exact same names as they do in the website's interactive follow-along boxes.
Instead, they carry the column names the author gave the full data set he made available as a csv download; I have not changed any of them.

<details>
<summary>A list of all column headers in the `deathrow` table.</summary>
0|Execution|TEXT\
1|Date of Birth|TEXT\
2|Date of Offence|TEXT\
3|Highest Education Level|TEXT\
4|Last Name|TEXT\
5|First Name|TEXT\
6|TDCJ Number|TEXT\
7|Age at Execution|TEXT\
8|Date Received|TEXT\
9|Execution Date|TEXT\
10|Race|TEXT\
11|County|TEXT\
12|Eye Color|TEXT\
13|Weight|TEXT\
14|Height|TEXT\
15|Native County|TEXT\
16|Native State|TEXT\
17|Last Statement|TEXT\
</details>
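
If you want to generate such a list yourself, SQLite exposes the table schema through a `PRAGMA` query; this chunk is my own addition, not part of the book:

```{sql connection=con}
-- list every column of the deathrow table with its declared type
PRAGMA table_info(deathrow);
```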

Perhaps, for a future version, it might be an interesting data cleaning experiment to actually rename them to the book's names before following along.

## Chapter 1 - Individual operations {#sec-individual}

For now, to test the database connection, we simply print three inmate names.
We do so with a `SELECT` for just three rows from the overall table.

```{sql connection=con}
-- SELECT * FROM deathrow LIMIT 3;
SELECT "First Name", "Last Name" FROM deathrow LIMIT 3;
```

Everything seems to be working smoothly, and we can even directly create `sql` code chunks which connect to our database.
Neat!
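
Quarto's `sql` chunks are essentially a thin layer over the DBI connection; as a sketch (my addition, not from the book), the same query can be issued from the R side directly:

```{r}
# run the identical query from R; returns a data.frame
dbGetQuery(con, 'SELECT "First Name", "Last Name" FROM deathrow LIMIT 3')
```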

The star indeed selects 'everything' (somewhat like globbing), which means we `SELECT` *everything* from our deathrow table, but limit the number of rows to three.
In other words, `SELECT` is more oriented towards columns, with `LIMIT` (and later e.g. `WHERE`) instead filtering out rows.

The 'star' selection remains theory in this document, unfortunately,
with `quarto` (and `knitr`) as our rendering engines.
Doing this selection here would just end in word-spaghetti, since we have *too many columns* to display nicely.
Instead, we limit ourselves to just a couple.
You can see the 'star' selection as a comment above the query we ran, however.

For now, we `SELECT` from our table.
We do not have to.
At their most basic, selections simply evaluate the expressions passed to them and return the results,
even if we do not `SELECT` `FROM` a specific table:

```{sql connection=con}
SELECT 50 + 2, 51 / 2, 51 / 2.0;
```

This example also reflects the float/integer differences in SQL:
If you only work with integers, the result will be an integer.
To work with floating point numbers, at least one of the involved numbers has to be a float.
Often this is accomplished by just multiplying by `1.0` at some point in the query.
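
A quick toy sketch of that trick, reusing the division from above:

```{sql connection=con}
-- multiplying by 1.0 first promotes the whole calculation to floats
SELECT 51 / 2 AS int_division, 51 * 1.0 / 2 AS float_division;
```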

Having filtered on *columns* above with `SELECT`, let us now filter on *rows* with the `WHERE` block:

```{sql connection=con}
SELECT "Last Name", "First Name"
FROM deathrow
WHERE "Age at Execution" <= 25;
```

In this case, we filter for all inmates executed at 25 or younger (only very few, due to the lengthy legal process).
`WHERE` blocks take an expression which results in a boolean truth value (in this case 'is it smaller than or equal to?'),
and every row for which the expression is true is included in the rest of the query.

This bit is also important, as `WHERE` operations will generally run before other operations such as `SELECT` or aggregations, as we will see in the next chapter.

As a last bit to notice, the order of `SELECT` targets also matters:
it is the order in which the columns will appear in the resulting table.
Here, we switched first and last names compared to the last table query.

While numeric comparisons are one thing, we can of course filter on text as well.
A single equals sign generally accomplishes conditional comparison:

```{sql connection=con}
SELECT "First Name", "Last Name"
FROM deathrow
WHERE "Last Name" = 'Jones';
```

Of course, string comparison needs some leeway to account for small differences in the thing to be searched for.
For example, names could have additions ('Sr.', 'Jr.') or enumeration ('John II') or even simple misspellings.
We can use `LIKE` to help with that for string comparisons:

```{sql connection=con}
SELECT "First Name", "Last Name", "Execution"
FROM deathrow
WHERE "First Name" LIKE 'Raymon_'
AND "Last Name" LIKE '%Landry%';
```

It allows a few wildcards in your queries to accomplish this:
`_` matches exactly one character,
while `%` matches any number of characters.
Both only operate in the place they are put, so that `%Landry` and `%Landry%` are different comparisons.
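
A small sketch of the wildcard placements, matching against literal strings instead of the table (my own illustration):

```{sql connection=con}
SELECT
  'Landry'     LIKE 'Landr_'   AS one_char_wildcard, -- 1: _ fills exactly one character
  'Landry Jr.' LIKE '%Landry%' AS anywhere,          -- 1: surrounded, matches anywhere
  'Landry Jr.' LIKE '%Landry'  AS at_the_end_only    -- 0: the string does not end in 'Landry'
;
```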

Above you can also see that we can logically concatenate query parts.
The precedence order goes `NOT`, then `AND`, then `OR`, so that the following:

```{sql connection=con}
SELECT 0 AND 0 OR 1;
```

Returns `1`.
It reads '0 and 0', which results in a 0; and *then* it reads '0 or 1', which ultimately results in a 1.
To re-organize them into the precedence we require, simply use parentheses:

```{sql connection=con}
SELECT 0 AND (0 OR 1);
```

The parenthesized `OR` clause is now evaluated before the `AND` clause, resulting in a 0.

As a capstone for Chapter 1,
and to show that we do *not* need the column we filter on (with `WHERE`) as a `SELECT`ed one in the final table,
we will select a specific statement:

```{sql connection=con}
SELECT "Last Statement"
FROM deathrow
WHERE "Last Name" = 'Beazley';
```

## Chapter 2 - Aggregations {#sec-aggregation}

We can use aggregators (such as `COUNT`, `SUM` or `AVG`) to consolidate information from *multiple* rows of input.

```{sql connection=con}
SELECT COUNT("Last Statement") FROM deathrow WHERE "Last Statement" != '';
```

Here, we diverge slightly from the book:
Whereas there, NULLs mark missing statements, this is not so for our dataset.

Since we did no cleaning after importing from csv, empty statements were imported as empty strings and not as `NULL` objects.
And since `COUNT` skips `NULL` objects (treats them as a 'non-count', if you want), we additionally have to select only those rows where the statement is not an empty string.
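
To see that `COUNT` behavior in isolation, here is a tiny sketch of my own:

```{sql connection=con}
-- an empty string is still a value and gets counted; NULL does not
SELECT COUNT('') AS counts_empty_string, COUNT(NULL) AS skips_null;
```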

We can count the overall number of entries (i.e. the denominator for later seeing which share of inmates proclaimed innocence) by `COUNT`ing a column which we know to have a value for each entry:

```{sql connection=con}
SELECT COUNT("Age at Execution") FROM deathrow;
```

Or, even better as the book shows, use `COUNT(*)` to count rows across all columns. Since we can't have completely `NULL` rows (such a row would just not exist), we will definitely get an overall count back this way:

```{sql connection=con}
SELECT COUNT(*) FROM deathrow;
```

Now, let's see how we can combine this with conditional `CASE WHEN` statements.

Let's count all inmates from 'Harris' or 'Bexar' county:

```{sql connection=con}
SELECT
  SUM(CASE WHEN "County"='Harris' THEN 1 ELSE 0 END),
  SUM(CASE WHEN "County"='Bexar' THEN 1 ELSE 0 END)
FROM deathrow;
```

We can try to find the number of inmates who were over 50 years old at the point of execution:

```{sql connection=con}
SELECT COUNT(*) FROM deathrow WHERE "Age at Execution" > 50;
```

You can see that the `WHERE` block selects *before* we run aggregations.
That is useful since we first reduce the number of entries we have to consider before further operations.

Now, we can practice working with conditions, counts and cases for the same goal:
finding all instances where inmates did *not* give a last statement.

First, we will do so again with a `WHERE` block:

```{sql connection=con}
SELECT COUNT(*) FROM deathrow WHERE "Last Statement" = '';
```

Then, we can do the same, but using `COUNT` and `CASE WHEN` blocks:

```{sql connection=con}
SELECT
  COUNT(CASE WHEN "Last Statement" = '' THEN 1 ELSE NULL END)
FROM deathrow;
```

This is a little worse performance-wise.
Whereas in the first attempt (using `WHERE`) we first filter the overall table down to relevant entries and then count them, here we go through the whole table to count each entry (or 'not-count' those we designate with `NULL`).

Lastly, if we had cleaned the data correctly before using it here (especially designating empty strings as `NULL` objects), we could use `COUNT` blocks only:

```{sql connection=con}
SELECT
  COUNT(*) - COUNT("Last Statement")
FROM deathrow;
```

You can see, however, that this results in `0` entries --- we count *all* entries both times since nothing is `NULL`.
While somewhat contrived here, it should point to the fact that it is generally better to have clean `NULL`-designated data before working with it.

However, this way of counting would also be the least performant of the three, with all rows being aggregated with counts twice.
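
As a sketch of how we could still get this variant to work without re-importing, SQLite's `NULLIF` function can turn empty strings into `NULL` objects on the fly (my addition, not part of the book's exercise):

```{sql connection=con}
-- NULLIF(x, y) returns NULL when x = y, so empty statements drop out of COUNT
SELECT
  COUNT(*) - COUNT(NULLIF("Last Statement", ''))
FROM deathrow;
```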

In other words, the exercise showed us three things:

- there are many ways to skin an SQL query
- correct `NULL` designations can be important during cleaning
- performant operations should generally filter before aggregating

There are more aggregation functions, such as `MIN`, `MAX`, `AVG`.[^docs]

[^docs]: The book recommends documentation such as [SQLite](http://sqlite.org), [W3 Schools](https://www.w3schools.com/sql/default.asp) and, of course, Stack Overflow.

```{sql connection=con}
SELECT
  MIN("Age at Execution"),
  MAX("Age at Execution"),
  AVG("Age at Execution")
FROM deathrow;
```

We can also combine functions, running one on the results of another (just like composing function outputs in e.g. python).
Here is an example calculating the average length of statements for cases where a statement exists, using the `LENGTH` function inside the `AVG` aggregation:

```{sql connection=con}
SELECT
  AVG(LENGTH("Last Statement"))
FROM deathrow
WHERE "Last Statement" != '';
```

Another aggregation is `DISTINCT`, which works somewhat like the program `uniq` on Unix systems:

```{sql connection=con}
SELECT DISTINCT("County") FROM deathrow;
```

It presents all the options (or 'categories', if you think of your data as categorical) that are represented in a column, or in the output of another aggregation.

On its face it is less of an *aggregation* function, as the book remarks, since it does not output "a single number".
But since it 'aggregates' the contents of multiple rows into its output, I would still very much classify it as such.
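
`DISTINCT` also composes nicely with real aggregations; counting the distinct counties, for example (my own addition):

```{sql connection=con}
-- number of different counties appearing in the table
SELECT COUNT(DISTINCT "County") FROM deathrow;
```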

Finally, let's look at what happens with 'misshapen' queries:

<!-- ::: {#lst-strange-query} -->

```{sql connection=con}
SELECT "First Name", COUNT(*) FROM deathrow;
```

<!-- A strange query -->

<!-- ::: -->

Here, we make a query which is indeed very strange:
The `COUNT` aggregation wants to output a single aggregated number, while the single column selection `"First Name"` wants to output each individual row.

What happens?
The database goes easy on us and does not error out; it uses the aggregation as a guide that we only receive a single row back, and picks an entry from the first names.

In the book, this is the last entry ('Charlie'), though it does warn that not all databases return the same.
This is reflected in the SQLite query here in fact returning the *first* entry ('Christopher Anthony') instead.

The lesson is to *not* rely on unclear operations like this, but to be explicit if, say, we indeed want to grab the last entry of something:

```{sql connection=con}
SELECT "First Name" FROM deathrow ORDER BY ROWID DESC LIMIT 1;
```

Since SQLite does not come with a convenient `LAST` aggregator (some other databases do),
we need to work around it by reversing the order of the table based on its `ROWID` (which increases with each entry).
Thus, the highest `ROWID` is the last entry.
Having reversed the order, we can limit the output to a single row to arrive back at 'Charlie'.[^SOlast]

[^SOlast]: This is cribbed from the very nice tips on grabbing the last SQLite entry on [this Stack Overflow](https://stackoverflow.com/q/24494182) question.

However, this operation (as far as I know for now) is not compatible anymore with *also* aggregating on a row count like we did above.

So, for the final query in this aggregation chapter, let's see the share of inmates who insisted on their innocence even during the last statement:

```{sql connection=con}
SELECT
  1.0 * COUNT(CASE WHEN "Last Statement" LIKE '%innocent%' THEN 1 ELSE NULL END) / (COUNT(*)) * 100
FROM deathrow;
```

We can see that over 5% of the people proclaimed their innocence even during their final statement.

Of course, this method of simple string matching has some issues:
If somebody uses other phrasings (the book mentions the example of 'not guilty'), we only have a lower bound of proclamations.
We also do not know in what way people used the word --- what if they instead proclaimed *not* being innocent?
In that case we would rather have a higher bound than the true number.

At the same time, we do not know the thoughts of the people not giving last statements at all.
Perhaps it would also make sense to compare only the number of people who did give a statement (but did not mention their innocence) with those who did, which would put us on the lower bound again.

We can see this behavior if we just show a sub-section of the statements:

```{sql connection=con}
SELECT "First Name", "Last Statement"
FROM deathrow
WHERE "Last Statement" LIKE '%innocent%'
LIMIT 4;
```

While Preston, Jonathan and Keith do protest their innocence to the warden, their loved ones or the world,
Jeffrey instead mentions 'innocent kids'.
Now, he could include himself in this category, but we do not know for sure.

So, this concludes a chapter about *aggregating*:
operating with functions on multiple rows in the dataset, allowing study of more system-level behavior in the data.

## Chapter 3 - Grouping and Nesting {#sec-grouping}

After dealing with individual rows (Ch 1) and aggregations (Ch 2), we will now do some data *organization* based on specific rows or columns -
a sort of mish-mash between keeping multiple rows in the output like in the first case and doing operations on them like in the latter.

The chapter begins by looking at a visualization of the data, which shows a strong long right tail (or right skew, as I know it).
We will ultimately end the chapter by taking a look at the percentage breakdown of executions each county contributed, to investigate this skew.

Grouping can be accomplished with the `GROUP BY` block:[^semicolon]

```{sql connection=con}
SELECT
  "County",
  COUNT(*) AS county_executions
FROM deathrow
GROUP BY "County"
;
```

[^semicolon]: You can see that I have started putting the final semicolon on its own separate line.
It is a technique I have seen used by 'Alex the Analyst' on YouTube and which I am quite keen on replicating.
For me, it serves two purposes:
First, for some SQL dialects the closing semicolon is actually required, and this makes it harder to forget.
Second, it provides a visually striking *close* of the query for myself as well, which may be useful once we get into denser queries.

This reminds us a lot of the 'misshapen' query above; however, there is a key difference:
Even when doing aggregations around them, the column(s) *being grouped on* are allowed to output multiple rows (they are called 'grouping columns').

Let's first do another quick grouping query to see how it can work.
We'll try to find the most common last names:

```{sql connection=con}
SELECT "Last Name", COUNT(*) AS count
FROM deathrow
GROUP BY "Last Name"
ORDER BY "count" DESC
;
```

The code above also makes use of *aliasing*, with an `AS <new-name>` block with which you can provide an efficient short-hand or new name for `SELECT` outputs.
I believe it also works for the outputs of e.g. `JOIN` operations.

Let's now have a breakdown of executions with and without a last statement by county:

```{sql connection=con}
SELECT
  "Last Statement" IS NOT '' AS has_last_statement,
  "County",
  COUNT(*)
FROM deathrow
GROUP BY "County", "has_last_statement"
;
```

The order in which you group by here matters!
In this case, we first order by county and then statements -
all 'Anderson' inmates appear first, then all 'Aransas', at some point the 'Bell' county cases, both those with and without statement, before 'Bexar' county, and so on.
Had we the groupings the other way around,
we would first have all `has_last_statement = 0` entries, from 'Anderson' county to 'Wood' county last, and then the same repeats for all counts of cases with statements.

We can, of course, also manually influence this order.
Using the `ORDER BY` block we can choose a column to order by, so regardless of the grouping sequence we can make sure to, for example, order on counties.
We can also sort by a different column altogether, such as the counts, which then requires a naming alias.
Using `ORDER BY "Column" DESC` we can reverse the sorting.
Here is the same example from above implementing most of these ideas:

```{sql connection=con}
SELECT
  "Last Statement" IS NOT '' AS has_last_statement,
  "County",
  COUNT(*) AS number
FROM deathrow
GROUP BY "has_last_statement", "County"
ORDER BY "number" DESC
;
```

We already know from @sec-aggregation that `WHERE` blocks take place before any aggregation.
The same is true for groupings - `WHERE` will always execute (and thus filter) before `GROUP BY` executes.

The following counts, for each county, the number of executed inmates who were at least 50:

```{sql connection=con}
SELECT
  "County",
  COUNT(*) AS number
FROM deathrow
WHERE "Age at Execution" >= 50
GROUP BY "County"
;
```

We do not select the age column for further consideration, but since `WHERE` runs before all other operations we also do not need to.
But what if we want to filter on the *outputs* of grouping or aggregation functions?
The `HAVING` block solves that.

The following shows the counties in which more than two inmates aged 50 or older were executed:

```{sql connection=con}
SELECT
  "County",
  COUNT(*) AS number
FROM deathrow
WHERE "Age at Execution" >= 50
GROUP BY "County"
HAVING "number" > 2
ORDER BY "number" DESC
;
```

As one interesting fact for possibly more advanced queries:
`GROUP BY` blocks do *not* need the columns on which they group to be in the `SELECT` block!
Generally this does not make a lot of sense - when we group by county but do not see counties, the output just seems like fairly weird groupings -
but there will invariably be situations where this knowledge is useful.

```{sql connection=con}
SELECT "County"
FROM deathrow
GROUP BY "County"
;
```

This exactly mirrors the `SELECT DISTINCT` aggregation, but is accomplished with grouping instead.
Many ways to skin a query again!

Now let's pivot a little and look at query nesting.
Since we will sometimes want to run one query leading into another (e.g. to compute percentages),
we have to have some way to integrate them with one another.
We do so through *nested queries*, demarcated with `(parentheses)` within another query.

Let's see how we utilize this to select the inmate with the longest last statement:

```{sql connection=con}
SELECT "First Name", "Last Name"
FROM deathrow
WHERE LENGTH("Last Statement") =
(
  SELECT MAX(LENGTH("Last Statement"))
  FROM deathrow
)
;
```

It looks a little cumbersome, but essentially we first filter on the row whose statement length is (exactly) the length of the longest statement,
previously queried as a separate sub-query.

Why do we need a nested query here?
The book itself explains it most succinctly:

> nesting is necessary here because in the WHERE clause,
as the computer is inspecting a row to decide if its last statement is the right length,
it can’t look outside to figure out the maximum length across the entire dataset.
We have to find the maximum length separately and feed it into the clause.

We will now attempt to do the same to find the percentage of all executions contributed by each county:

```{sql connection=con}
SELECT
  "County",
  100.0 * COUNT(*) / (
    SELECT COUNT(*)
    FROM deathrow
  ) AS percentage
FROM deathrow
GROUP BY "County"
ORDER BY "percentage" DESC
;
```

It follows the same concept:
We need to invoke a nested query because our original query is already fixated on a sub-group of all rows and we **cannot get out of it within our query**.
Instead, we invoke another query 'before' the original which still has access to all rows, and create our own nested aggregation.
The output of that then feeds into the original query.

Quite clever, a little cumbersome, and presumably the origin of quite a few headaches in writing elegant SQL queries.

## Chapter 4 - Joins and Dates {#sec-joining}

Before we look at joins, let's look at handling dates (as in, the data type) in SQL.
While we have a couple of columns of reasonably formatted dates (ISO-8601), that doesn't mean we can automatically use them as such.
To make use of such nice and clean data we should use operations that specifically [make use of dates](https://www.sqlite.org/lang_datefunc.html) for their calculations.

```{sql connection=con}
SELECT
  julianday('1993-08-10') - julianday('1989-07-07') AS day_diff
;
```

The `julianday()` function will transform our ISO-compatible dates into timestamp floats on which we can operate like usual, in this case subtracting the latter from the former to get the period of time between them.
Like the unix timestamp, the Julian day counts from a specific point in time as 0 continuously upwards,
only that it counts from 4714 B.C.E. (not 1970) and per-day, not per-second.
Anything below a single day is fractional.
Half a day's difference would thus be a `0.5` difference, making it perhaps more useful for working with larger time differences.
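
A minimal illustration of that fractional behavior (my own toy query):

```{sql connection=con}
-- a gap of exactly half a day comes back as 0.5
SELECT julianday('2000-01-02') - julianday('2000-01-01 12:00:00') AS half_day;
```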

Now we will join a table with itself, only shifted by a row.
This makes it necessary to prepare the data for the 'other' table first,
adding one to the column we are going to join on.

```{sql connection=con}
SELECT
  "Execution" + 1 AS ex_number,
  "Execution Date" AS prev_ex_date
FROM deathrow
WHERE "Execution" < 553
;
```

We could perhaps use `julianday` comparisons, but since we have access to the execution numbers and they are rolling, we can instead use them like an ID and just shift it one up.

Now we want to put data from one row into *another* row, and neither aggregations nor groups can help us out.
Instead, to gain access to data from other rows (whether in the same or another table) we use the `JOIN` block.
There is an `INNER JOIN` (the default), a `LEFT JOIN`, a `RIGHT JOIN` and an `OUTER JOIN` block.

**The different joins only differ in how they handle unmatched rows.**
With an inner join, any unmatched rows are dropped completely (essentially an intersection merge),
with an outer join unmatched rows are preserved completely from both tables (union merge),
with a left join unmatched rows from the *left* (i.e. `FROM XY`) table are preserved,
with a right join unmatched rows from the *right* (`JOIN XY`) table.
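
A minimal sketch of the difference, built on two throwaway rowsets from CTEs (my own toy example, not from the book):

```{sql connection=con}
-- id 1 has no partner in b: an INNER JOIN would drop it,
-- while the LEFT JOIN keeps it with a NULL on the right side
WITH a(id) AS (VALUES (1), (2)),
     b(id) AS (VALUES (2), (3))
SELECT a.id AS a_id, b.id AS b_id
FROM a
LEFT JOIN b ON a.id = b.id
;
```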

This prepares our table to be used to join 'itself', adding those (shifted) rows together which we want.
Of course, we do not have to shift in our selection already, and I find it more intuitive to do so in the value comparison.
We end up with the following query:

```{sql connection=con}
SELECT
  "Execution Date",
  prev_ex_date AS "Previous Execution",
  JULIANDAY("Execution Date") - JULIANDAY(prev_ex_date) AS "Difference in Days"
FROM deathrow
JOIN (
  SELECT
    "Execution" AS ex_number,
    "Execution Date" AS prev_ex_date
  FROM deathrow
  WHERE "Execution" < 553
) AS previous
ON deathrow."Execution" = previous.ex_number + 1
ORDER BY "Difference in Days" DESC
LIMIT 10
;
```

This shows us the top ten timeframes in which no executions occurred.
You can see we do the shifting in the `ON` block itself, leading to the more natural reading of
'join the tables on execution number being the same as the previous execution number plus one'.
For my brain this is more easily comprehensible as a row-shift.
Otherwise, we only select the execution number
(though we only need it for the shift operation and drop it in the outer selection)
and the execution date, which is the one important column we are looking for.

We also do not include execution number 553 (the largest execution number), since there is no newer execution to join it with in the dataset.
The resulting table would not be different if we left it in, however.
Remember we are doing an `INNER JOIN`, which drops any non-matching rows by default.

Such a 'self join' is a common technique to **grab information from other rows** of the same table.
This is already quite an advanced query!

The book plots a graph here, which I will not replicate for the moment.
However, it shows that roughly pre-1993 there was a lower overall execution count,
with two clearly visible larger hiatuses afterwards.

Let's focus on those two hiatuses and limit the data to executions from 1994 onwards.
As a last thing, let's make it a little more elegant by making the original ('previous') table query way simpler:

```{sql connection=con}
SELECT
  previous."Execution Date" AS "Beginning of period",
  deathrow."Execution Date" AS "End of period",
  JULIANDAY(deathrow."Execution Date") - JULIANDAY(previous."Execution Date") AS "Difference in Days"
FROM deathrow
JOIN deathrow AS previous
ON deathrow."Execution" = previous."Execution" + 1
WHERE DATE(deathrow."Execution Date") > DATE('1994-01-01')
ORDER BY "Difference in Days" DESC
LIMIT 10
;
```

We can now also see much more clearly what the book is talking about, with big stays of execution occurring in 1996-1997, as well as 2007-2008.

A little more wordy per line, but overall a lot more elegant.
And mostly enabled by putting the row 'shift' into the `ON` block itself.
The importance of good name-aliasing and `JOIN ON` blocks is definitely highlighted, however.
We are now equipped to grab data from multiple rows and multiple tables, and to rearrange them as we see necessary.

## Quick Conclusion

So, this should already give a rough mental model for the *kinds* of operations to be done with SQL.

We can operate on the contents of a single row,
we can aggregate the contents of many rows resulting in a single-row output,
we can group by columns which allows us to aggregate into multiple rows,
and we can work with the contents of *other* rows by joining tables.
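
As a closing sketch of my own, one toy query can touch all four of these at once:

```{sql connection=con}
-- row filter, self join (row shift), grouping and aggregation in one query
SELECT
  deathrow."County",
  COUNT(*) AS n,                                          -- aggregation
  AVG(JULIANDAY(deathrow."Execution Date")
      - JULIANDAY(previous."Execution Date")) AS avg_gap  -- days since previous execution
FROM deathrow
JOIN deathrow AS previous                                 -- join on other rows
  ON deathrow."Execution" = previous."Execution" + 1
WHERE deathrow."Last Statement" != ''                     -- row filter
GROUP BY deathrow."County"                                -- grouping
ORDER BY n DESC
LIMIT 5
;
```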

We have learned to *query* data, but not to create or manipulate data,
i.e. to work with side effects.
We also have not learned about the concepts of `window` functions or common table expressions.

These are additional operations to get to know,
but of course it is also important to get an overall broader view of the concepts and mental mapping of SQL itself.
The book closes with a call-to-challenge, with an additional dataset to cut your teeth on.