initial commit

commit 4da221d82c
6 changed files with 1542 additions and 0 deletions

.gitignore (vendored, new file, 225 lines)
@@ -0,0 +1,225 @@
# Created by https://www.toptal.com/developers/gitignore/api/-f,python,linux,vim,quarto,markdown,jupyternotebooks,sql
# Edit at https://www.toptal.com/developers/gitignore?templates=-f,python,linux,vim,quarto,markdown,jupyternotebooks,sql

### JupyterNotebooks ###
# gitignore template for Jupyter Notebooks
# website: http://jupyter.org/

.ipynb_checkpoints
*/.ipynb_checkpoints/*

# IPython
profile_default/
ipython_config.py

# Remove previous ipynb_checkpoints
# git rm -r .ipynb_checkpoints/

### Linux ###
*~

# temporary files which can be created if a process still has a handle open of a deleted file
.fuse_hidden*

# KDE directory preferences
.directory

# Linux trash folder which might appear on any partition or disk
.Trash-*

# .nfs files are created when an open file is removed but is still being accessed
.nfs*

### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook

# IPython

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

### Python Patch ###
# Poetry local configuration file - https://python-poetry.org/docs/configuration/#local-configuration
poetry.toml

# ruff
.ruff_cache/

# LSP config files
pyrightconfig.json

### Vim ###
# Swap
[._]*.s[a-v][a-z]
!*.svg # comment out if you don't need vector files
[._]*.sw[a-p]
[._]s[a-rt-v][a-z]
[._]ss[a-gi-z]
[._]sw[a-p]

# Session
Session.vim
Sessionx.vim

# Temporary
.netrwhist
# Auto-generated tag files
tags
# Persistent undo
[._]*.un~

# End of https://www.toptal.com/developers/gitignore/api/-f,python,linux,vim,quarto,markdown,jupyternotebooks,sql

/.quarto/
/output/
_quarto-blog.yml (new file, 8 lines)
@@ -0,0 +1,8 @@
project:
  type: default
  output-dir: /home/marty/projects/hosting/webpage/src/content/blog/2024-01-24-select-star-sql

format:
  hugo-md:
    preserve-yaml: true
    code-fold: false
_quarto.yml (new file, 21 lines)
@@ -0,0 +1,21 @@
project:
  type: default
  output-dir: output

format:
  html:
    code-fold: false
    echo: true
  pdf:
    echo: true # since we want to see the code in this case
    papersize: A4
    geometry:
      - left=2cm
      - right=2.5cm
      - top=2.5cm
      - bottom=2.5cm
    indent: true
    linestretch: 1.5
    fontfamily: lmodern
    fontsize: "12"
    pdf-engine: tectonic
data/tx_deathrow_full.csv (new file, 616 lines)
File diff suppressed because one or more lines are too long

data/tx_deathrow_full.db (new binary file)
Binary file not shown.

index.qmd (new file, 672 lines)
@@ -0,0 +1,672 @@
---
title: "Select Star SQL"
subtitle: "SQL Introduction follow-along"
description: "SQL Introduction follow-along"
author:
  - Marty Oehme
date: today
pubDate: "2024-01-24T09:10:23+01:00"
weight: 10
toc: true
tags:
  - sql
---

## Quick Introduction

I have recently been brushing up on my SQL skills,
because I firmly believe they are one of the most fundamental knowledge areas for anyone who does anything with data.

The following content follows along with the excellent [Select Star SQL](https://selectstarsql.com/) introductory resource.
We will go through all four chapters, and follow most of the examples and challenges given.
Sometimes, however, we will swerve a little and approach things slightly differently,
or with a different query twist -
just because my mind obviously works differently than that of Zi Chong Kao, who wrote the book.

In general, there is slightly more of a focus on the actual SQL statements,
the knowledge imparted, and perhaps more explicit wording regarding notes and limits in the queries.

Less of a focus is given to the meaning of the data under analysis,
though we of course still use the same dataset, and attempt to draw the same conclusions from similar queries.
In other words:
if you want to learn (and be excited by the possibilities and challenges of) SQL, go to [Select Star SQL](https://selectstarsql.com/).
If you have already done it and want a refresher, this page will give you a quick overview separated by topical chapter.
Or use it as inspiration to read along like I did.

This document uses quarto under the hood, and we will experiment a little with the R-SQL interactions along the way.
It is also my first attempt to really integrate quarto publications into this blog,
in an attempt to ease the process for the future.
For now, we just load the database (as an SQLite file) and show which table we have,
using some R packages:

```{r}
suppressPackageStartupMessages(library(tidyverse)) # suppressing conflict warnings in outputs
library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "data/tx_deathrow_full.db")
as.data.frame(dbListTables(con))
```

That seems reasonable! We have a single table in the dataset, called `deathrow`.

This is perhaps a good moment to mention that the tables and columns do *not* have the exact same names as they do in the website's interactive follow-along boxes.
Instead, they carry the column names that the author gave them for the full data set he made available as a csv download; I have not changed any of them.

<details>
<summary>A list of all column headers in the `deathrow` table.</summary>
0|Execution|TEXT\
1|Date of Birth|TEXT\
2|Date of Offence|TEXT\
3|Highest Education Level|TEXT\
4|Last Name|TEXT\
5|First Name|TEXT\
6|TDCJ Number|TEXT\
7|Age at Execution|TEXT\
8|Date Received|TEXT\
9|Execution Date|TEXT\
10|Race|TEXT\
11|County|TEXT\
12|Eye Color|TEXT\
13|Weight|TEXT\
14|Height|TEXT\
15|Native County|TEXT\
16|Native State|TEXT\
17|Last Statement|TEXT\
</details>

Perhaps, for a future version, it might be an interesting data cleaning experiment to actually change them to the same names before following the book.

## Chapter 1 - Individual operations {#sec-individual}

For now, to test the database connection, we simply print three inmate names.
We do so by doing a `SELECT` for just three rows from the overall table:

```{sql connection=con}
-- SELECT * FROM deathrow LIMIT 3;
SELECT "First Name", "Last Name" FROM deathrow LIMIT 3;
```

Everything seems to be working smoothly, and we can even directly create `sql` code chunks which connect to our database.
Neat!

The star indeed selects 'everything' (somewhat like globbing), which means we `SELECT` *everything* from our deathrow table, but limit the number of rows to three.
In other words, `SELECT` is more oriented towards columns, with `LIMIT` (and later e.g. `WHERE`) instead filtering out rows.

The 'star' selection remains only a theory in this document, unfortunately,
with `quarto` (and `knitr`) as our rendering engines.
Doing this selection here would just end in word-spaghetti, since we have *too many columns* to display nicely.
Instead, we limit ourselves to just a couple.
You can see the 'star' selection as a comment underneath the one we did, however.

For now, we `SELECT` from our table.
We do not have to.
Selections at their most basic simply evaluate and return the expressions passed in,
even if we do not `SELECT` `FROM` a specific table:

```{sql connection=con}
SELECT 50 + 2, 51 / 2, 51 / 2.0;
```

This example also reflects the float/integer differences in SQL:
if you only work with integers, the result will be an integer.
To work with floating point numbers, at least one of the involved numbers has to be a float.
Often this is accomplished by just multiplying with `* 1.0` at some point in the query.
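
As a quick scalar check of that trick (no table needed; the literals are arbitrary):

```{sql connection=con}
SELECT 51 / 2, 51 * 1.0 / 2;  -- integer division versus float division
```

The first result stays truncated to an integer, while the `* 1.0` promotes the second to a float.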

Having filtered on *columns* above with `SELECT`, let us now filter on *rows* with the `WHERE` block:

```{sql connection=con}
SELECT "Last Name", "First Name"
FROM deathrow
WHERE "Age at Execution" <= 25;
```

In this case, we filter for all executions aged 25 and under (only very few, due to the lengthy process).
`WHERE` blocks take an expression which results in a boolean truth value (in this case 'is it smaller than or equal to?'),
and every row for which the result is true is included in the rest of the query.

This bit is also important, as `WHERE` operations will generally run before other operations such as `SELECT` or aggregations, as we will see in the next chapter.

As a last bit to notice, the order of `SELECT` expressions also matters:
it is the order in which the columns will appear in the resulting table.
Here, we switched first and last names compared to the last table query.

While numeric comparisons are one thing, we can of course filter on text as well.
A single equals sign generally accomplishes the conditional comparison:

```{sql connection=con}
SELECT "First Name", "Last Name"
FROM deathrow
WHERE "Last Name" = 'Jones';
```

(Note the single quotes around the string literal 'Jones'; double quotes are reserved for identifiers such as our column names.)

Of course, string comparison needs some leeway to account for small differences in the thing being searched for.
For example, names could have additions ('Sr.', 'Jr.'), enumerations ('John II') or even simple misspellings.
We can use `LIKE` to help with that for string comparisons:

```{sql connection=con}
SELECT "First Name", "Last Name", "Execution"
FROM deathrow
WHERE "First Name" LIKE 'Raymon_'
  AND "Last Name" LIKE '%Landry%';
```

It allows a few wildcards in your queries to accomplish this:
`_` matches exactly one character,
while `%` matches any number of characters.
Both only operate in the place they are put, so that `%Landry` and `Landry%` are different comparisons.
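
We can watch that placement at work on bare string literals, outside any table:

```{sql connection=con}
SELECT
  'Landry Jr.' LIKE '%Landry',  -- 0: the string would have to *end* in 'Landry'
  'Landry Jr.' LIKE 'Landry%';  -- 1: the string only has to *start* with 'Landry'
```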

Above you can also see that we can logically chain query parts.
The precedence order goes `NOT`, then `AND`, then `OR`, so that the following:

```{sql connection=con}
SELECT 0 AND 0 OR 1;
```

returns `1`.
It reads '0 and 0' first, which results in a 0; and *then* it reads '0 or 1', which ultimately results in a 1.
To re-organize the parts into the precedence we require, simply use parentheses:

```{sql connection=con}
SELECT 0 AND (0 OR 1);
```

The parenthesized `OR` clause is now evaluated before the `AND` clause, resulting in a 0.
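
`NOT` binding tightest can be checked the same way; in the first expression it applies only to the first 0:

```{sql connection=con}
SELECT NOT 0 AND 0, NOT (0 AND 0);  -- 0 and 1
```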

As a capstone for Chapter 1,
and to show that we do *not* need the column we filter on (with `WHERE`) among the `SELECT`ed ones in the final table,
we will select a specific statement:

```{sql connection=con}
SELECT "Last Statement"
FROM deathrow
WHERE "Last Name" = 'Beazley';
```

## Chapter 2 - Aggregations {#sec-aggregation}

We can use aggregators (such as `COUNT`, `AVG` or `SUM`) to consolidate information from *multiple* rows of input.

```{sql connection=con}
SELECT COUNT("Last Statement") FROM deathrow WHERE "Last Statement" != '';
```

Here, we diverge slightly from the book:
whereas there, NULLs are used where no statement exists, this is not so for our dataset.

Since we did no cleaning after importing from the csv, empty statements were imported as empty strings and not as `NULL` objects.
Since `COUNT` only skips `NULL` objects (as a 'non-count', if you want), we additionally have to select only those rows where the statement is not an empty string.
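
A tiny inline illustration of that difference (the two-row table is built on the fly):

```{sql connection=con}
SELECT COUNT(x), COUNT(*)
FROM (SELECT '' AS x UNION ALL SELECT NULL);
-- 1 and 2: the empty string counts, the NULL does not
```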

We can count the overall number of entries (i.e. the denominator for later seeing which share of inmates proclaimed innocence) by `COUNT`ing a column which we know to have a value for each entry:

```{sql connection=con}
SELECT COUNT("Age at Execution") FROM deathrow;
```

Or, even better as the book shows, use `COUNT(*)` to count over all columns at once. Since we cannot have completely `NULL` rows (the row would simply not exist), we will definitely get an overall count back this way:

```{sql connection=con}
SELECT COUNT(*) FROM deathrow;
```

Now, let's see how we can combine this with conditional `CASE WHEN` statements.

Let's count all inmates from 'Harris' or 'Bexar' county:

```{sql connection=con}
SELECT
  SUM(CASE WHEN "County"='Harris' THEN 1 ELSE 0 END),
  SUM(CASE WHEN "County"='Bexar' THEN 1 ELSE 0 END)
FROM deathrow;
```

We can try to find the number of inmates who were over 50 years old at their point of execution:

```{sql connection=con}
SELECT COUNT(*) FROM deathrow WHERE "Age at Execution" > 50;
```

You can see that the `WHERE` block filters *before* we run aggregations.
That is useful, since we first reduce the number of entries to consider before any further operations.

Now we can practice working with conditions, counts and cases towards the same goal:
finding all instances where inmates did *not* give a last statement.

First, we will do so again with a `WHERE` block:

```{sql connection=con}
SELECT COUNT(*) FROM deathrow WHERE "Last Statement" = '';
```

Then we can do the same, but using `COUNT` and `CASE WHEN` blocks:

```{sql connection=con}
SELECT
  COUNT(CASE WHEN "Last Statement" = '' THEN 1 ELSE NULL END)
FROM deathrow;
```

This is a little worse performance-wise.
Whereas in the first attempt (using `WHERE`) we first filter the overall table down to the relevant entries and then count them, here we go through the whole table to count each entry (or 'not-count' those we designate with `NULL`).

Lastly, if we had cleaned the data correctly before using it here (especially designating empty strings as `NULL` objects), we could use `COUNT` blocks only:

```{sql connection=con}
SELECT
  COUNT(*) - COUNT("Last Statement")
FROM deathrow;
```

You can see, however, that this results in `0` entries --- we count *all* entries both times, since nothing is `NULL`.
While somewhat contrived here, it should point to the fact that it is generally better to have clean `NULL`-designated data before working with it.

However, this way of counting would also be the least performant of the three, with all rows being aggregated with counts twice.

In other words, the exercise showed us three things:

- there are many ways to skin an SQL query
- correct `NULL` designations can be important during cleaning
- performant operations should generally filter before aggregating

There are more aggregation functions, such as `MIN`, `MAX`, `AVG`.[^docs]

[^docs]: The book recommends documentation such as [SQLite](http://sqlite.org), [W3 Schools](https://www.w3schools.com/sql/default.asp) and, of course, Stack Overflow.

```{sql connection=con}
SELECT
  MIN("Age at Execution"),
  MAX("Age at Execution"),
  AVG("Age at Execution")
FROM deathrow;
```

We can also combine functions, running one on the results of another (just like combining function outputs in e.g. python).
Here is an example calculating the average length of statements, for cases where a statement exists, by feeding the scalar `LENGTH` function into the `AVG` aggregation:

```{sql connection=con}
SELECT
  AVG(LENGTH("Last Statement"))
FROM deathrow
WHERE "Last Statement" != '';
```

Another keyword is `DISTINCT`, which works somewhat like the program `uniq` on Unix systems:

```{sql connection=con}
SELECT DISTINCT("County") FROM deathrow;
```

It presents all the options (or 'categories', if you think of your data as categorical) that are represented in a column, or in the output of another aggregation.

On its face it is less of an *aggregation* function, as the book remarks, since it does not output "a single number".
But since it 'aggregates' the contents of multiple rows into its output, I would still very much classify it as such.
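
Since its output can feed into other aggregations, a common combination is counting the distinct values, e.g. the number of different counties represented:

```{sql connection=con}
SELECT COUNT(DISTINCT "County") FROM deathrow;
```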

Finally, let's look at what happens with 'misshapen' queries:

<!-- ::: {#lst-strange-query} -->

```{sql connection=con}
SELECT "First Name", COUNT(*) FROM deathrow;
```

<!-- A strange query -->

<!-- ::: -->

Here we make a query which is indeed very strange:
the `COUNT` aggregation wants to output a single aggregated number, while the single column selection `"First Name"` wants to output each individual row.

What happens?
The database goes easy on us and does not error out; it uses the aggregation as a guide that we only receive a single row back, and picks an arbitrary entry from the first names.

In the book, this is the last entry ('Charlie'), though it does warn that not all databases return the same.
This is reflected in the SQLite query here in fact returning the *first* entry ('Christopher Anthony') instead.

The lesson is to *not* rely on unclear operations like this but to be explicit if, say, we indeed want to grab the last entry of something:

```{sql connection=con}
SELECT "First Name" FROM deathrow ORDER BY ROWID DESC LIMIT 1;
```

Since SQLite does not come with a convenient `LAST` aggregator (some other databases do),
we need to work around it by reversing the order of the table based on its `ROWID` (which increases for each entry).
Thus, the highest `ROWID` belongs to the last entry.
Having reversed the order, we can limit the output to a single row to arrive back at 'Charlie'.[^SOlast]

[^SOlast]: This is cribbed from the very nice tips on grabbing the last SQLite entry in [this Stack Overflow](https://stackoverflow.com/q/24494182) question.

However, this operation (as far as I know for now) is no longer compatible with *also* aggregating a row count like we did above.
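
One workaround I can think of is to run the two as independent sub-queries inside a single `SELECT` (a peek at the nesting covered in the next chapter), which sidesteps the limitation rather than solving it:

```{sql connection=con}
SELECT
  (SELECT "First Name" FROM deathrow ORDER BY ROWID DESC LIMIT 1) AS last_entry,
  (SELECT COUNT(*) FROM deathrow) AS total_count;
```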

So, for the final query in this aggregation chapter, let's see the share of inmates who insisted on their innocence even in their last statement:

```{sql connection=con}
SELECT
  1.0 * COUNT(CASE WHEN "Last Statement" LIKE '%innocent%' THEN 1 ELSE NULL END) / COUNT(*) * 100
FROM deathrow;
```

We can see that over 5% of the people proclaimed their innocence even during their final statement.

Of course, this method of simple string matching has some issues:
if somebody used other phrasings (the book mentions the example of 'not guilty'), we only capture a lower bound of proclamations.
We also do not know in what way people used the word --- what if they instead proclaimed *not* being innocent?
In that case we would rather have an upper bound than the true number.

At the same time, we do not know the thoughts of the people who gave no last statement at all.
Perhaps it would also make sense to compare only the people who did give a statement but did not mention their innocence with those who did, which would put us on the lower bound again.

We can see this behavior if we just show a sub-section of the statements:

```{sql connection=con}
SELECT "First Name", "Last Statement"
FROM deathrow
WHERE "Last Statement" LIKE '%innocent%'
LIMIT 4;
```

While Preston, Jonathan and Keith do protest their innocence to the warden, their loved ones or the world,
Jeffrey instead mentions 'innocent kids'.
Now, he could include himself in this category, but we do not know for sure.

So, this concludes a chapter about *aggregating*:
operating with functions on multiple rows in the dataset, allowing study of more system-level behavior in the data.

## Chapter 3 - Grouping and Nesting {#sec-grouping}

After dealing with individual rows (Ch 1) and aggregations (Ch 2), we will now do some data *organization* based on specific rows or columns -
a sort of mish-mash between keeping multiple rows in the output, as in the first case, and doing operations on them, as in the latter.

The chapter begins by looking at a visualization of the data, which shows a long right tail (or right skew, as I know it being called).
We will ultimately end the chapter by looking at the percentage of executions each county contributed, to investigate this skew.

Grouping can be accomplished with the `GROUP BY` block:[^semicolon]

```{sql connection=con}
SELECT
  "County",
  COUNT(*) AS county_executions
FROM deathrow
GROUP BY "County"
;
```

[^semicolon]: You can see that I have started putting the final semicolon on its own separate line.
It is a technique I have seen used by 'Alex the Analyst' on YouTube and which I am quite keen on replicating.
For me, it serves two purposes:
first, some SQL dialects actually require the closing semicolon, and this makes it harder to forget.
Second, it provides a visually striking *close* of the query for myself as well, which may be useful once queries get more dense.

This reminds us a lot of the 'misshapen' query above; however, there is a key difference:
even with aggregations around them, the column(s) *being grouped on* (called 'grouping columns') are allowed to output multiple rows.

Let's first do another quick grouping query to see how it can work.
We'll try to find the most common last names:

```{sql connection=con}
SELECT "Last Name", COUNT(*) AS count
FROM deathrow
GROUP BY "Last Name"
ORDER BY "count" DESC
;
```

The code above also makes use of *aliasing*: with an `AS <new-name>` block you can provide an efficient short-hand or new name for `SELECT` outputs.
I believe it also works for the outputs of e.g. `JOIN` operations.
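
Aliasing also applies to table names, which becomes handy once joins enter the picture; a small sketch with our single table:

```{sql connection=con}
SELECT d."First Name", d."Last Name"
FROM deathrow AS d
LIMIT 1;
```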

Let's now have a breakdown of executions with and without a last statement by county:

```{sql connection=con}
SELECT
  "Last Statement" IS NOT '' AS has_last_statement,
  "County",
  COUNT(*)
FROM deathrow
GROUP BY "County", "has_last_statement"
;
```

The order in which you group by matters here!
In this case, we first group by county and then by statements -
all 'Anderson' inmates appear first, then all 'Aransas', at some point the 'Bell' county cases, both those with and without statement, before 'Bexar' county, and so on.
Had we grouped the other way around,
we would first have all `has_last_statement = 0` entries, from 'Anderson' county to 'Wood' county, and then the same would repeat for all counts of cases with statements.

We can of course also manually influence this order.
Using the `ORDER BY` block we can choose a column by which to order, so regardless of the grouping sequence we can make sure to, for example, order by county.
We can also sort by a different column altogether, such as the counts, which then requires a naming alias.
Using `ORDER BY "Column" DESC` we can reverse the sorting.
Here is the same example from above implementing most of these ideas:

```{sql connection=con}
SELECT
  "Last Statement" IS NOT '' AS has_last_statement,
  "County",
  COUNT(*) AS number
FROM deathrow
GROUP BY "has_last_statement", "County"
ORDER BY "number" DESC
;
```

We already know from @sec-aggregation that `WHERE` blocks take effect before any aggregation.
The same is true for groupings - `WHERE` will always execute (and thus filter) before `GROUP BY` does.

The following counts, for each county, the number of executed inmates who were at least 50:

```{sql connection=con}
SELECT
  "County",
  COUNT(*) AS number
FROM deathrow
WHERE "Age at Execution" >= 50
GROUP BY "County"
;
```

We do not select the age column for further consideration, but since `WHERE` runs before all other operations we also do not need to.
But what if we want to filter on the *outputs* of grouping or aggregation functions?
The `HAVING` block solves that.

The following shows the counties in which more than two inmates aged 50 or older were executed:

```{sql connection=con}
SELECT
  "County",
  COUNT(*) AS number
FROM deathrow
WHERE "Age at Execution" >= 50
GROUP BY "County"
HAVING "number" > 2
ORDER BY "number" DESC
;
```

As one interesting fact for possibly more advanced queries:
`GROUP BY` blocks do *not* need the columns on which they group to be in the `SELECT` block!
Generally this does not make a lot of sense - when we group by county but do not display the county, the groupings just seem fairly arbitrary -
but there will invariably be situations where this knowledge is useful.

```{sql connection=con}
SELECT "County"
FROM deathrow
GROUP BY "County"
;
```

This exactly mirrors the `SELECT DISTINCT` query from earlier, but is accomplished with grouping instead.
Many ways to skin a query again!

Now let's pivot a little and look at query nesting.
Since we will sometimes want to run one query feeding into another (e.g. to compute percentages),
we need some way to integrate them with one another.
We do so through *nested queries*, demarcated with `(parentheses)` within another query.

Let's see how we utilize this to select the inmate with the longest last statement:
```{sql connection=con}
|
||||
SELECT "First Name", "Last Name"
|
||||
FROM deathrow
|
||||
WHERE LENGTH("Last Statement") =
|
||||
(
|
||||
SELECT MAX(LENGTH("Last Statement"))
|
||||
FROM deathrow
|
||||
)
|
||||
;
|
||||
```

It looks a little cumbersome, but essentially we filter for the row whose statement length is (exactly) the length of the longest statement,
computed beforehand in a separate sub-query.

Why do we need a nested query here?
The book itself explains it most succinctly:

> nesting is necessary here because in the WHERE clause,
> as the computer is inspecting a row to decide if its last statement is the right length,
> it can’t look outside to figure out the maximum length across the entire dataset.
> We have to find the maximum length separately and feed it into the clause.

We will now attempt the same to find the percentage of all executions contributed by each county:

```{sql connection=con}
SELECT
  "County",
  100.0 * COUNT(*) / (
    SELECT COUNT(*)
    FROM deathrow
  ) AS percentage
FROM deathrow
GROUP BY "County"
ORDER BY "percentage" DESC
;
```

It follows the same concept:
we need a nested query because our original query is already fixated on a sub-group of all rows and we **cannot get out of it within our query**.
Instead, we invoke another query 'before' the original, which still has access to all rows, and create our own nested aggregation.
The output of that then feeds into the original query.

Quite clever, a little cumbersome, and presumably the origin of quite a few headaches in writing elegant SQL queries.

## Chapter 4 - Joins and Dates {#sec-joining}

Before we look at joins, let's look at handling dates (as in, the data type) in SQL.
While we have a couple of columns of reasonably formatted dates (ISO-8601), that doesn't mean we can automatically use them as such.
To make use of such nice and clean data we should use operations that specifically [make use of dates](https://www.sqlite.org/lang_datefunc.html) for their calculations.

```{sql connection=con}
SELECT
  julianday('1993-08-10') - julianday('1989-07-07') AS day_diff
;
```

The `julianday()` function will transform our ISO-compatible dates into timestamp floats on which we can operate as usual, in this case subtracting the latter from the former to get the period of time between them.
Like the Unix timestamp, the Julian day counts continuously upwards from a specific point in time as 0,
only that it counts from 4714 B.C.E. (not 1970) and per day, not per second.
Anything below a single day is fractional.
Half a day's difference would thus show up as `0.5`, making it perhaps more useful for larger time differences.
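
To see the fractional behaviour in isolation, a self-contained sketch using date literals (noon on the 2nd minus midnight on the 1st should come out as `1.5` days):

```{sql connection=con}
SELECT
  -- a time-of-day component yields a fractional Julian day
  julianday('2000-01-02 12:00') - julianday('2000-01-01') AS day_diff
;
```
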

Now we will join a table with itself, only shifted by a row.
This makes it necessary to prepare the data for the 'other' table first,
adding one to the column we are going to join on.

```{sql connection=con}
SELECT
  "Execution" + 1 AS ex_number,
  "Execution Date" AS prev_ex_date
FROM deathrow
WHERE "Execution" < 553
;
```

We could perhaps use `julianday` comparisons, but since we have access to the execution numbers, and they are consecutive, we can instead use them like an ID and just shift it up by one.

Now we want to put data from one row into *another* row, and neither aggregations nor groups can help us out.
Instead, to gain access to data from other rows (whether in the same or another table) we use the `JOIN` block.
There is an `INNER JOIN` (the default), a `LEFT JOIN`, a `RIGHT JOIN` and an `OUTER JOIN` block.

**The different joins only differ in how they handle unmatched rows.**
With an inner join, any unmatched rows are dropped completely (essentially an intersection merge),
with an outer join unmatched rows from both tables are preserved (a union merge),
with a left join unmatched rows from the *left* (i.e. `FROM XY`) table are preserved,
and with a right join unmatched rows from the *right* (`JOIN XY`) table are preserved.

This prepares our table to be used to join 'itself', putting together those (shifted) rows which we want.
Of course, we do not have to shift in our selection already, and I find it more intuitive to do so in the value comparison.
We end up with the following query:

```{sql connection=con}
SELECT
  "Execution Date",
  prev_ex_date AS "Previous Execution",
  JULIANDAY("Execution Date") - JULIANDAY(prev_ex_date) AS "Difference in Days"
FROM deathrow
JOIN (
  SELECT
    "Execution" AS ex_number,
    "Execution Date" AS prev_ex_date
  FROM deathrow
  WHERE "Execution" < 553
) AS previous
ON deathrow."Execution" = previous.ex_number + 1
ORDER BY "Difference in Days" DESC
LIMIT 10
;
```

This shows us the top ten timeframes in which no executions occurred.
You can see we do the shifting in the `ON` block itself, leading to the more natural reading of
'join the tables on execution number being the same as the previous execution number plus one'.
For my brain this is more easily comprehensible as a row-shift.
Otherwise, we only select the execution number
(though we only need it for the shift operation and drop it in the outer selection)
and the execution date, which is the one important column we are looking for.

We also do not include execution number 553 (the largest execution number), since there will be no newer execution to join it with in the dataset.
The resulting table would be no different if we left it in, however:
remember we are doing an `INNER JOIN`, which drops any non-matching rows by default.

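
To make the unmatched-row behaviour visible, a hedged sketch: swapping in a `LEFT JOIN` (and dropping the sub-query's `WHERE` filter) should preserve the very first execution, which has no predecessor, with a `NULL` in its 'previous' column:

```{sql connection=con}
SELECT
  "Execution Date",
  prev_ex_date AS "Previous Execution"
FROM deathrow
-- LEFT JOIN keeps every deathrow row, matched or not
LEFT JOIN (
  SELECT
    "Execution" AS ex_number,
    "Execution Date" AS prev_ex_date
  FROM deathrow
) AS previous
ON deathrow."Execution" = previous.ex_number + 1
ORDER BY deathrow."Execution" ASC
LIMIT 1
;
```
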
Such a 'self join' is a common technique to **grab information from other rows** of the same table.
This is already quite an advanced query!

The book plots a graph here, which I will not replicate for the moment.
However, it shows that roughly pre-1993 there was a lower overall execution count,
with two clearly visible larger hiatuses afterwards.

Let's focus on those two hiatuses and limit the data to executions from 1994 onwards.
As a last thing, let's make it a little more elegant by simplifying the query for the original ('previous') table considerably:

```{sql connection=con}
SELECT
  previous."Execution Date" AS "Beginning of period",
  deathrow."Execution Date" AS "End of period",
  JULIANDAY(deathrow."Execution Date") - JULIANDAY(previous."Execution Date") AS "Difference in Days"
FROM deathrow
JOIN deathrow AS previous
ON deathrow."Execution" = previous."Execution" + 1
WHERE DATE(deathrow."Execution Date") > DATE('1994-01-01')
ORDER BY "Difference in Days" DESC
LIMIT 10
;
```

We can also see much more clearly what the book is talking about, with big stays of execution occurring in 1996-1997 as well as 2007-2008.

A little wordier per line, but overall a lot more elegant.
And mostly enabled by putting the row 'shift' into the `ON` block itself.
It definitely highlights the importance of good name-aliasing and `JOIN ON` blocks.
We are now equipped to grab data from multiple rows and multiple tables, and rearrange them as we see necessary.

## Quick Conclusion

So, this should already give a rough mental model for the *kinds* of operations to be done with SQL.

We can operate on the contents of a single row,
we can aggregate the contents of many rows into a single-row output,
we can group by columns, which allows us to aggregate into multiple rows,
and we can work with the contents of *other* rows by joining tables.

We have learned to *query* data but not to create or manipulate data,
i.e. to work with side-effects.
We also have not learned about the concepts of `window` functions or common table expressions.

These are additional operations to get to know,
but of course it is also important to get an overall broader view of the concepts and mental mapping of SQL itself.
The book closes with a call-to-challenge, with an additional dataset to cut your teeth on.