initial commit

commit 4da221d82c
6 changed files with 1542 additions and 0 deletions

.gitignore (vendored, new file, 225 lines)
@@ -0,0 +1,225 @@
# Created by https://www.toptal.com/developers/gitignore/api/-f,python,linux,vim,quarto,markdown,jupyternotebooks,sql
# Edit at https://www.toptal.com/developers/gitignore?templates=-f,python,linux,vim,quarto,markdown,jupyternotebooks,sql

### JupyterNotebooks ###
# gitignore template for Jupyter Notebooks
# website: http://jupyter.org/

.ipynb_checkpoints
*/.ipynb_checkpoints/*

# IPython
profile_default/
ipython_config.py

# Remove previous ipynb_checkpoints
# git rm -r .ipynb_checkpoints/

### Linux ###
*~

# temporary files which can be created if a process still has a handle open of a deleted file
.fuse_hidden*

# KDE directory preferences
.directory

# Linux trash folder which might appear on any partition or disk
.Trash-*

# .nfs files are created when an open file is removed but is still being accessed
.nfs*

### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook

# IPython

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

### Python Patch ###
# Poetry local configuration file - https://python-poetry.org/docs/configuration/#local-configuration
poetry.toml

# ruff
.ruff_cache/

# LSP config files
pyrightconfig.json

### Vim ###
# Swap
[._]*.s[a-v][a-z]
!*.svg # comment out if you don't need vector files
[._]*.sw[a-p]
[._]s[a-rt-v][a-z]
[._]ss[a-gi-z]
[._]sw[a-p]

# Session
Session.vim
Sessionx.vim

# Temporary
.netrwhist
# Auto-generated tag files
tags
# Persistent undo
[._]*.un~

# End of https://www.toptal.com/developers/gitignore/api/-f,python,linux,vim,quarto,markdown,jupyternotebooks,sql

/.quarto/
/output/
_quarto-blog.yml (new file, 8 lines)
@@ -0,0 +1,8 @@
project:
  type: default
  output-dir: /home/marty/projects/hosting/webpage/src/content/blog/2024-01-24-select-star-sql

format:
  hugo-md:
    preserve-yaml: true
    code-fold: false
_quarto.yml (new file, 21 lines)
@@ -0,0 +1,21 @@
project:
  type: default
  output-dir: output

format:
  html:
    code-fold: false
    echo: true
  pdf:
    echo: true # since we want to see the code in this case
    papersize: A4
    geometry:
      - left=2cm
      - right=2.5cm
      - top=2.5cm
      - bottom=2.5cm
    indent: true
    linestretch: 1.5
    fontfamily: lmodern
    fontsize: "12"
    pdf-engine: tectonic
data/tx_deathrow_full.csv (new file, 616 lines)
File diff suppressed because one or more lines are too long

data/tx_deathrow_full.db (new binary file)
Binary file not shown.

index.qmd (new file, 672 lines)
@@ -0,0 +1,672 @@
---
title: "Select Star SQL"
subtitle: "SQL Introduction follow-along"
description: "SQL Introduction follow-along"
author:
  - Marty Oehme
date: today
pubDate: "2024-01-24T09:10:23+01:00"
weight: 10
toc: true
tags:
  - sql
---

## Quick Introduction

I have recently been brushing up on my SQL skills,
because I firmly believe they are one of the most fundamental knowledge areas for anyone who does anything with data.

The following content follows along with the excellent [Select Star SQL](https://selectstarsql.com/) introductory resource.
We will go through all four chapters, and follow most of the examples and challenges given.
Sometimes, however, we will swerve a little and approach things slightly differently,
or with a different query twist -
just because my mind obviously works differently than that of Zi Chong Kao, who wrote the book.

In general, there is slightly more of a focus on the actual SQL statements,
the knowledge imparted, and perhaps more explicit wording regarding notes and limits in the queries.

Less of a focus is given to the meaning of the data under analysis,
though we of course still use the same dataset, and attempt to draw the same conclusions from similar queries.
In other words:
if you want to learn (and be excited by the possibilities and challenges of) SQL, go to [Select Star SQL](https://selectstarsql.com/).
If you have already done it and want a refresher, this page will give you a quick overview separated by topical chapter.
Or use it as inspiration to read along like I did.

This document uses quarto under the hood, and we will experiment a little with the R-SQL interactions along the way.
It is also my first attempt to really integrate quarto publications into this blog,
in an attempt to ease the process for the future.
For now, we just load the database (as an SQLite file) and show which table we have,
using some R packages:

```{r}
suppressPackageStartupMessages(library(tidyverse)) # suppressing conflict warnings in outputs
library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "data/tx_deathrow_full.db")
as.data.frame(dbListTables(con))
```

That seems reasonable! We have a single table in the dataset, called `deathrow`.

This is perhaps a good moment to mention that the tables and columns do *not* have the exact same names as they do in the website's interactive follow-along boxes.
Instead, they carry the column names that the author gave them for the full data set he made available as a csv download; I have not changed any of them.

<details>
<summary>A list of all column headers in the `deathrow` table.</summary>
0|Execution|TEXT\
1|Date of Birth|TEXT\
2|Date of Offence|TEXT\
3|Highest Education Level|TEXT\
4|Last Name|TEXT\
5|First Name|TEXT\
6|TDCJ Number|TEXT\
7|Age at Execution|TEXT\
8|Date Received|TEXT\
9|Execution Date|TEXT\
10|Race|TEXT\
11|County|TEXT\
12|Eye Color|TEXT\
13|Weight|TEXT\
14|Height|TEXT\
15|Native County|TEXT\
16|Native State|TEXT\
17|Last Statement|TEXT\
</details>

Perhaps, for a future version, it might be an interesting data cleaning experiment to actually change them to the same names before following the book.

## Chapter 1 - Individual operations {#sec-individual}

For now, to test the database connection, we simply print three inmate names.
We do so by doing a `SELECT` for just three rows from the overall table:

```{sql connection=con}
-- SELECT * FROM deathrow LIMIT 3;
SELECT "First Name", "Last Name" FROM deathrow LIMIT 3;
```

Everything seems to be working smoothly, and we can even directly create `sql` code chunks which connect to our database.
Neat!

The star indeed selects 'everything' (somewhat like globbing), which means we `SELECT` *everything* from our deathrow table, but limit the number of rows to three.
In other words, `SELECT` is more oriented towards columns, with `LIMIT` (and later e.g. `WHERE`) instead filtering out rows.

The 'star' selection remains only a theory in this document, unfortunately,
with `quarto` (and `knitr`) as our rendering engines.
Doing this selection here would just end in word-spaghetti, since we have *too many columns* to display nicely.
Instead, we limit ourselves to just a couple.
You can see the 'star' selection as a comment underneath the one we did, however.

For now, we `SELECT` from our table.
We do not have to.
Selections at their most basic simply evaluate and return the expressions passed in,
even if we do not `SELECT` `FROM` a specific table:

```{sql connection=con}
SELECT 50 + 2, 51 / 2, 51 / 2.0;
```

This example also reflects the float/integer differences in SQL:
if you only work with integers, the result will be an integer.
To work with floating point numbers, at least one of the involved numbers has to be a float.
Often this is accomplished by just multiplying with `* 1.0` at some point in the query.
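
As a quick scalar check of that trick (no table needed; the literals are arbitrary):

```{sql connection=con}
SELECT 51 / 2, 51 * 1.0 / 2;  -- integer division versus float division
```

The first result stays truncated to an integer, while the `* 1.0` promotes the second to a float.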

Having filtered on *columns* above with `SELECT`, let us now filter on *rows* with the `WHERE` block:

```{sql connection=con}
SELECT "Last Name", "First Name"
FROM deathrow
WHERE "Age at Execution" <= 25;
```

In this case, we filter for all executions aged 25 and under (only very few, due to the lengthy process).
`WHERE` blocks take an expression which results in a boolean truth value (in this case 'is it smaller than or equal to?'),
and every row for which the result is true is included in the rest of the query.

This bit is also important, as `WHERE` operations will generally run before other operations such as `SELECT` or aggregations, as we will see in the next chapter.

As a last bit to notice, the order of `SELECT` expressions also matters:
it is the order in which the columns will appear in the resulting table.
Here, we switched first and last names compared to the last table query.

While numeric comparisons are one thing, we can of course filter on text as well.
A single equals sign generally accomplishes the conditional comparison:

```{sql connection=con}
SELECT "First Name", "Last Name"
FROM deathrow
WHERE "Last Name" = 'Jones';
```

(Note the single quotes around the string literal 'Jones'; double quotes are reserved for identifiers such as our column names.)

Of course, string comparison needs some leeway to account for small differences in the thing being searched for.
For example, names could have additions ('Sr.', 'Jr.'), enumerations ('John II') or even simple misspellings.
We can use `LIKE` to help with that for string comparisons:

```{sql connection=con}
SELECT "First Name", "Last Name", "Execution"
FROM deathrow
WHERE "First Name" LIKE 'Raymon_'
  AND "Last Name" LIKE '%Landry%';
```

It allows a few wildcards in your queries to accomplish this:
`_` matches exactly one character,
while `%` matches any number of characters.
Both only operate in the place they are put, so that `%Landry` and `Landry%` are different comparisons.
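
We can watch that placement at work on bare string literals, outside any table:

```{sql connection=con}
SELECT
  'Landry Jr.' LIKE '%Landry',  -- 0: the string would have to *end* in 'Landry'
  'Landry Jr.' LIKE 'Landry%';  -- 1: the string only has to *start* with 'Landry'
```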

Above you can also see that we can logically chain query parts.
The precedence order goes `NOT`, then `AND`, then `OR`, so that the following:

```{sql connection=con}
SELECT 0 AND 0 OR 1;
```

returns `1`.
It reads '0 and 0' first, which results in a 0; and *then* it reads '0 or 1', which ultimately results in a 1.
To re-organize the parts into the precedence we require, simply use parentheses:

```{sql connection=con}
SELECT 0 AND (0 OR 1);
```

The parenthesized `OR` clause is now evaluated before the `AND` clause, resulting in a 0.
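
`NOT` binding tightest can be checked the same way; in the first expression it applies only to the first 0:

```{sql connection=con}
SELECT NOT 0 AND 0, NOT (0 AND 0);  -- 0 and 1
```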

As a capstone for Chapter 1,
and to show that we do *not* need the column we filter on (with `WHERE`) among the `SELECT`ed ones in the final table,
we will select a specific statement:

```{sql connection=con}
SELECT "Last Statement"
FROM deathrow
WHERE "Last Name" = 'Beazley';
```

## Chapter 2 - Aggregations {#sec-aggregation}

We can use aggregators (such as `COUNT`, `AVG` or `SUM`) to consolidate information from *multiple* rows of input.

```{sql connection=con}
SELECT COUNT("Last Statement") FROM deathrow WHERE "Last Statement" != '';
```

Here, we diverge slightly from the book:
whereas there, NULLs are used where no statement exists, this is not so for our dataset.

Since we did no cleaning after importing from the csv, empty statements were imported as empty strings and not as `NULL` objects.
Since `COUNT` only skips `NULL` objects (as a 'non-count', if you want), we additionally have to select only those rows where the statement is not an empty string.
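
A tiny inline illustration of that difference (the two-row table is built on the fly):

```{sql connection=con}
SELECT COUNT(x), COUNT(*)
FROM (SELECT '' AS x UNION ALL SELECT NULL);
-- 1 and 2: the empty string counts, the NULL does not
```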

We can count the overall number of entries (i.e. the denominator for later seeing which share of inmates proclaimed innocence) by `COUNT`ing a column which we know to have a value for each entry:

```{sql connection=con}
SELECT COUNT("Age at Execution") FROM deathrow;
```

Or, even better as the book shows, use `COUNT(*)` to count over all columns at once. Since we cannot have completely `NULL` rows (the row would simply not exist), we will definitely get an overall count back this way:

```{sql connection=con}
SELECT COUNT(*) FROM deathrow;
```

Now, let's see how we can combine this with conditional `CASE WHEN` statements.

Let's count all inmates from 'Harris' or 'Bexar' county:

```{sql connection=con}
SELECT
  SUM(CASE WHEN "County"='Harris' THEN 1 ELSE 0 END),
  SUM(CASE WHEN "County"='Bexar' THEN 1 ELSE 0 END)
FROM deathrow;
```

We can try to find the number of inmates who were over 50 years old at their point of execution:

```{sql connection=con}
SELECT COUNT(*) FROM deathrow WHERE "Age at Execution" > 50;
```

You can see that the `WHERE` block filters *before* we run aggregations.
That is useful, since we first reduce the number of entries to consider before any further operations.

Now we can practice working with conditions, counts and cases towards the same goal:
finding all instances where inmates did *not* give a last statement.

First, we will do so again with a `WHERE` block:

```{sql connection=con}
SELECT COUNT(*) FROM deathrow WHERE "Last Statement" = '';
```

Then we can do the same, but using `COUNT` and `CASE WHEN` blocks:

```{sql connection=con}
SELECT
  COUNT(CASE WHEN "Last Statement" = '' THEN 1 ELSE NULL END)
FROM deathrow;
```

This is a little worse performance-wise.
Whereas in the first attempt (using `WHERE`) we first filter the overall table down to the relevant entries and then count them, here we go through the whole table to count each entry (or 'not-count' those we designate with `NULL`).

Lastly, if we had cleaned the data correctly before using it here (especially designating empty strings as `NULL` objects), we could use `COUNT` blocks only:

```{sql connection=con}
SELECT
  COUNT(*) - COUNT("Last Statement")
FROM deathrow;
```

You can see, however, that this results in `0` entries --- we count *all* entries both times, since nothing is `NULL`.
While somewhat contrived here, it should point to the fact that it is generally better to have clean `NULL`-designated data before working with it.

However, this way of counting would also be the least performant of the three, with all rows being aggregated with counts twice.

In other words, the exercise showed us three things:

- there are many ways to skin an SQL query
- correct `NULL` designations can be important during cleaning
- performant operations should generally filter before aggregating

There are more aggregation functions, such as `MIN`, `MAX`, `AVG`.[^docs]

[^docs]: The book recommends documentation such as [SQLite](http://sqlite.org), [W3 Schools](https://www.w3schools.com/sql/default.asp) and, of course, Stack Overflow.

```{sql connection=con}
SELECT
  MIN("Age at Execution"),
  MAX("Age at Execution"),
  AVG("Age at Execution")
FROM deathrow;
```

We can also combine functions, running one on the results of another (just like combining function outputs in e.g. python).
Here is an example calculating the average length of statements, for cases where a statement exists, by feeding the scalar `LENGTH` function into the `AVG` aggregation:

```{sql connection=con}
SELECT
  AVG(LENGTH("Last Statement"))
FROM deathrow
WHERE "Last Statement" != '';
```

Another keyword is `DISTINCT`, which works somewhat like the program `uniq` on Unix systems:

```{sql connection=con}
SELECT DISTINCT("County") FROM deathrow;
```

It presents all the options (or 'categories', if you think of your data as categorical) that are represented in a column, or in the output of another aggregation.

On its face it is less of an *aggregation* function, as the book remarks, since it does not output "a single number".
But since it 'aggregates' the contents of multiple rows into its output, I would still very much classify it as such.
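
Since its output can feed into other aggregations, a common combination is counting the distinct values, e.g. the number of different counties represented:

```{sql connection=con}
SELECT COUNT(DISTINCT "County") FROM deathrow;
```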

Finally, let's look at what happens with 'misshapen' queries:

<!-- ::: {#lst-strange-query} -->

```{sql connection=con}
SELECT "First Name", COUNT(*) FROM deathrow;
```

<!-- A strange query -->

<!-- ::: -->

Here we make a query which is indeed very strange:
the `COUNT` aggregation wants to output a single aggregated number, while the single column selection `"First Name"` wants to output each individual row.

What happens?
The database goes easy on us and does not error out; it uses the aggregation as a guide that we only receive a single row back, and picks an arbitrary entry from the first names.

In the book, this is the last entry ('Charlie'), though it does warn that not all databases return the same.
This is reflected in the SQLite query here in fact returning the *first* entry ('Christopher Anthony') instead.

The lesson is to *not* rely on unclear operations like this but to be explicit if, say, we indeed want to grab the last entry of something:

```{sql connection=con}
SELECT "First Name" FROM deathrow ORDER BY ROWID DESC LIMIT 1;
```

Since SQLite does not come with a convenient `LAST` aggregator (some other databases do),
we need to work around it by reversing the order of the table based on its `ROWID` (which increases for each entry).
Thus, the highest `ROWID` belongs to the last entry.
Having reversed the order, we can limit the output to a single row to arrive back at 'Charlie'.[^SOlast]

[^SOlast]: This is cribbed from the very nice tips on grabbing the last SQLite entry in [this Stack Overflow](https://stackoverflow.com/q/24494182) question.

However, this operation (as far as I know for now) is no longer compatible with *also* aggregating a row count like we did above.
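
One workaround I can think of is to run the two as independent sub-queries inside a single `SELECT` (a peek at the nesting covered in the next chapter), which sidesteps the limitation rather than solving it:

```{sql connection=con}
SELECT
  (SELECT "First Name" FROM deathrow ORDER BY ROWID DESC LIMIT 1) AS last_entry,
  (SELECT COUNT(*) FROM deathrow) AS total_count;
```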

So, for the final query in this aggregation chapter, let's see the share of inmates who insisted on their innocence even in their last statement:

```{sql connection=con}
SELECT
  1.0 * COUNT(CASE WHEN "Last Statement" LIKE '%innocent%' THEN 1 ELSE NULL END) / COUNT(*) * 100
FROM deathrow;
```

We can see that over 5% of the people proclaimed their innocence even during their final statement.

Of course, this method of simple string matching has some issues:
if somebody used other phrasings (the book mentions the example of 'not guilty'), we only capture a lower bound of proclamations.
We also do not know in what way people used the word --- what if they instead proclaimed *not* being innocent?
In that case we would rather have an upper bound than the true number.

At the same time, we do not know the thoughts of the people who gave no last statement at all.
Perhaps it would also make sense to compare only the people who did give a statement but did not mention their innocence with those who did, which would put us on the lower bound again.

We can see this behavior if we just show a sub-section of the statements:

```{sql connection=con}
SELECT "First Name", "Last Statement"
FROM deathrow
WHERE "Last Statement" LIKE '%innocent%'
LIMIT 4;
```

While Preston, Jonathan and Keith do protest their innocence to the warden, their loved ones or the world,
Jeffrey instead mentions 'innocent kids'.
Now, he could include himself in this category, but we do not know for sure.

So, this concludes a chapter about *aggregating*:
operating with functions on multiple rows in the dataset, allowing study of more system-level behavior in the data.

## Chapter 3 - Grouping and Nesting {#sec-grouping}

After dealing with individual rows (Ch 1) and aggregations (Ch 2), we will now do some data *organization* based on specific rows or columns -
a sort of mish-mash between keeping multiple rows in the output, as in the first case, and doing operations on them, as in the latter.

The chapter begins by looking at a visualization of the data, which shows a long right tail (or right skew, as I know it being called).
We will ultimately end the chapter by looking at the percentage of executions each county contributed, to investigate this skew.

Grouping can be accomplished with the `GROUP BY` block:[^semicolon]

```{sql connection=con}
SELECT
  "County",
  COUNT(*) AS county_executions
FROM deathrow
GROUP BY "County"
;
```

[^semicolon]: You can see that I have started putting the final semicolon on its own separate line.
It is a technique I have seen used by 'Alex the Analyst' on YouTube and which I am quite keen on replicating.
For me, it serves two purposes:
first, some SQL dialects actually require the closing semicolon, and this makes it harder to forget.
Second, it provides a visually striking *close* of the query for myself as well, which may be useful once queries get more dense.

This reminds us a lot of the 'misshapen' query above; however, there is a key difference:
even with aggregations around them, the column(s) *being grouped on* (called 'grouping columns') are allowed to output multiple rows.

Let's first do another quick grouping query to see how it can work.
We'll try to find the most common last names:

```{sql connection=con}
SELECT "Last Name", COUNT(*) AS count
FROM deathrow
GROUP BY "Last Name"
ORDER BY "count" DESC
;
```

The code above also makes use of *aliasing*: with an `AS <new-name>` block you can provide an efficient short-hand or new name for `SELECT` outputs.
I believe it also works for the outputs of e.g. `JOIN` operations.
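
Aliasing also applies to table names, which becomes handy once joins enter the picture; a small sketch with our single table:

```{sql connection=con}
SELECT d."First Name", d."Last Name"
FROM deathrow AS d
LIMIT 1;
```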

Let's now have a breakdown of executions with and without a last statement by county:

```{sql connection=con}
SELECT
  "Last Statement" IS NOT '' AS has_last_statement,
  "County",
  COUNT(*)
FROM deathrow
GROUP BY "County", "has_last_statement"
;
```

The order in which you group by matters here!
In this case, we first group by county and then by statements -
all 'Anderson' inmates appear first, then all 'Aransas', at some point the 'Bell' county cases, both those with and without statement, before 'Bexar' county, and so on.
Had we grouped the other way around,
we would first have all `has_last_statement = 0` entries, from 'Anderson' county to 'Wood' county, and then the same would repeat for all counts of cases with statements.

We can of course also manually influence this order.
Using the `ORDER BY` block we can choose a column by which to order, so regardless of the grouping sequence we can make sure to, for example, order by county.
We can also sort by a different column altogether, such as the counts, which then requires a naming alias.
Using `ORDER BY "Column" DESC` we can reverse the sorting.
Here is the same example from above implementing most of these ideas:

```{sql connection=con}
SELECT
  "Last Statement" IS NOT '' AS has_last_statement,
  "County",
  COUNT(*) AS number
FROM deathrow
GROUP BY "has_last_statement", "County"
ORDER BY "number" DESC
;
```

We already know from @sec-aggregation that `WHERE` blocks take effect before any aggregation.
The same is true for groupings - `WHERE` will always execute (and thus filter) before `GROUP BY` does.

The following counts, for each county, the number of executed inmates who were at least 50:

```{sql connection=con}
SELECT
  "County",
  COUNT(*) AS number
FROM deathrow
WHERE "Age at Execution" >= 50
GROUP BY "County"
;
```

We do not select the age column for further consideration, but since `WHERE` runs before all other operations we also do not need to.
But what if we want to filter on the *outputs* of grouping or aggregation functions?
The `HAVING` block solves that.

The following shows the counties in which more than two inmates aged 50 or older were executed:

```{sql connection=con}
SELECT
  "County",
  COUNT(*) AS number
FROM deathrow
WHERE "Age at Execution" >= 50
GROUP BY "County"
HAVING "number" > 2
ORDER BY "number" DESC
;
```

As one interesting fact for possibly more advanced queries:
`GROUP BY` blocks do *not* need the columns on which they group to be in the `SELECT` block!
Generally this does not make a lot of sense - when we group by county but do not display the county, the groupings just seem fairly arbitrary -
but there will invariably be situations where this knowledge is useful.

```{sql connection=con}
SELECT "County"
FROM deathrow
GROUP BY "County"
;
```

This exactly mirrors the `SELECT DISTINCT` query from earlier, but is accomplished with grouping instead.
Many ways to skin a query again!

Now let's pivot a little and look at query nesting.
Since we will sometimes want to run one query feeding into another (e.g. to compute percentages),
we need some way to integrate them with one another.
We do so through *nested queries*, demarcated with `(parentheses)` within another query.

Let's see how we utilize this to select the inmate with the longest last statement:
```{sql connection=con}
|
||||
SELECT "First Name", "Last Name"
|
||||
FROM deathrow
|
||||
WHERE LENGTH("Last Statement") =
|
||||
(
|
||||
SELECT MAX(LENGTH("Last Statement"))
|
||||
FROM deathrow
|
||||
)
|
||||
;
|
||||
```

It looks a little cumbersome, but essentially we filter for the row whose statement length is (exactly) the length of the longest statement,
computed beforehand in a separate sub-query.

Why do we need a nested query here?
The book itself explains it most succinctly:

> nesting is necessary here because in the WHERE clause,
> as the computer is inspecting a row to decide if its last statement is the right length,
> it can’t look outside to figure out the maximum length across the entire dataset.
> We have to find the maximum length separately and feed it into the clause.

We will now attempt the same to find the percentage of all executions contributed by each county:

```{sql connection=con}
SELECT
  "County",
  100.0 * COUNT(*) / (
    SELECT COUNT(*)
    FROM deathrow
  ) AS percentage
FROM deathrow
GROUP BY "County"
ORDER BY "percentage" DESC
;
```

It follows the same concept:
we need a nested query because our original query is already fixated on a sub-group of all rows and we **cannot get out of it within our query**.
Instead, we invoke another query 'before' the original, which still has access to all rows, and create our own nested aggregation.
The output of that then feeds into the original query.

Quite clever, a little cumbersome, and presumably the origin of quite a few headaches in writing elegant SQL queries.

## Chapter 4 - Joins and Dates {#sec-joining}

Before we look at joins, let's look at handling dates (as in, the data type) in SQL.
While we have a couple of columns of reasonably formatted dates (ISO-8601), that doesn't mean we can automatically use them as such.
To make use of such nice and clean data we should use operations that specifically [make use of dates](https://www.sqlite.org/lang_datefunc.html) for their calculations.

```{sql connection=con}
SELECT
  julianday('1993-08-10') - julianday('1989-07-07') AS day_diff
;
```

The `julianday()` function will transform our ISO-compatible dates into timestamp floats on which we can operate as usual, in this case subtracting the latter from the former to get the period of time between them.
Like the Unix timestamp, the Julian day counts continuously upwards from a specific point in time as 0,
only that it counts from 4714 B.C.E. (not 1970) and per day, not per second.
Anything below a single day is fractional.
Half a day's difference would thus show up as `0.5`, making it perhaps more useful for larger time differences.
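
To see the fractional behaviour in isolation, a self-contained sketch using date literals (noon on the 2nd minus midnight on the 1st should come out as `1.5` days):

```{sql connection=con}
SELECT
  -- a time-of-day component yields a fractional Julian day
  julianday('2000-01-02 12:00') - julianday('2000-01-01') AS day_diff
;
```
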

Now we will join a table with itself, only shifted by a row.
This makes it necessary to prepare the data for the 'other' table first,
adding one to the column we are going to join on.

```{sql connection=con}
SELECT
  "Execution" + 1 AS ex_number,
  "Execution Date" AS prev_ex_date
FROM deathrow
WHERE "Execution" < 553
;
```

We could perhaps use `julianday` comparisons, but since we have access to the execution numbers, and they are consecutive, we can instead use them like an ID and just shift it up by one.

Now we want to put data from one row into *another* row, and neither aggregations nor groups can help us out.
Instead, to gain access to data from other rows (whether in the same or another table) we use the `JOIN` block.
There is an `INNER JOIN` (the default), a `LEFT JOIN`, a `RIGHT JOIN` and an `OUTER JOIN` block.

**The different joins only differ in how they handle unmatched rows.**
With an inner join, any unmatched rows are dropped completely (essentially an intersection merge),
with an outer join unmatched rows from both tables are preserved (a union merge),
with a left join unmatched rows from the *left* (i.e. `FROM XY`) table are preserved,
and with a right join unmatched rows from the *right* (`JOIN XY`) table are preserved.

This prepares our table to be used to join 'itself', putting together those (shifted) rows which we want.
Of course, we do not have to shift in our selection already, and I find it more intuitive to do so in the value comparison.
We end up with the following query:

```{sql connection=con}
SELECT
  "Execution Date",
  prev_ex_date AS "Previous Execution",
  JULIANDAY("Execution Date") - JULIANDAY(prev_ex_date) AS "Difference in Days"
FROM deathrow
JOIN (
  SELECT
    "Execution" AS ex_number,
    "Execution Date" AS prev_ex_date
  FROM deathrow
  WHERE "Execution" < 553
) AS previous
ON deathrow."Execution" = previous.ex_number + 1
ORDER BY "Difference in Days" DESC
LIMIT 10
;
```

This shows us the top ten timeframes in which no executions occurred.
You can see we do the shifting in the `ON` block itself, leading to the more natural reading of
'join the tables on execution number being the same as the previous execution number plus one'.
For my brain this is more easily comprehensible as a row-shift.
Otherwise, we only select the execution number
(though we only need it for the shift operation and drop it in the outer selection)
and the execution date, which is the one important column we are looking for.

We also do not include execution number 553 (the largest execution number), since there will be no newer execution to join it with in the dataset.
The resulting table would be no different if we left it in, however:
remember we are doing an `INNER JOIN`, which drops any non-matching rows by default.

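
To make the unmatched-row behaviour visible, a hedged sketch: swapping in a `LEFT JOIN` (and dropping the sub-query's `WHERE` filter) should preserve the very first execution, which has no predecessor, with a `NULL` in its 'previous' column:

```{sql connection=con}
SELECT
  "Execution Date",
  prev_ex_date AS "Previous Execution"
FROM deathrow
-- LEFT JOIN keeps every deathrow row, matched or not
LEFT JOIN (
  SELECT
    "Execution" AS ex_number,
    "Execution Date" AS prev_ex_date
  FROM deathrow
) AS previous
ON deathrow."Execution" = previous.ex_number + 1
ORDER BY deathrow."Execution" ASC
LIMIT 1
;
```
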
Such a 'self join' is a common technique to **grab information from other rows** of the same table.
This is already quite an advanced query!

The book plots a graph here, which I will not replicate for the moment.
However, it shows that roughly pre-1993 there was a lower overall execution count,
with two clearly visible larger hiatuses afterwards.

Let's focus on those two hiatuses and limit the data to executions from 1994 onwards.
As a last thing, let's make it a little more elegant by simplifying the query for the original ('previous') table considerably:

```{sql connection=con}
SELECT
  previous."Execution Date" AS "Beginning of period",
  deathrow."Execution Date" AS "End of period",
  JULIANDAY(deathrow."Execution Date") - JULIANDAY(previous."Execution Date") AS "Difference in Days"
FROM deathrow
JOIN deathrow AS previous
ON deathrow."Execution" = previous."Execution" + 1
WHERE DATE(deathrow."Execution Date") > DATE('1994-01-01')
ORDER BY "Difference in Days" DESC
LIMIT 10
;
```

We can also see much more clearly what the book is talking about, with big stays of execution occurring in 1996-1997 as well as 2007-2008.

A little wordier per line, but overall a lot more elegant.
And mostly enabled by putting the row 'shift' into the `ON` block itself.
It definitely highlights the importance of good name-aliasing and `JOIN ON` blocks.
We are now equipped to grab data from multiple rows and multiple tables, and rearrange them as we see necessary.

## Quick Conclusion

So, this should already give a rough mental model for the *kinds* of operations to be done with SQL.

We can operate on the contents of a single row,
we can aggregate the contents of many rows into a single-row output,
we can group by columns, which allows us to aggregate into multiple rows,
and we can work with the contents of *other* rows by joining tables.

We have learned to *query* data but not to create or manipulate data,
i.e. to work with side-effects.
We also have not learned about the concepts of `window` functions or common table expressions.

These are additional operations to get to know,
but of course it is also important to get an overall broader view of the concepts and mental mapping of SQL itself.
The book closes with a call-to-challenge, with an additional dataset to cut your teeth on.