# Nushell analysis

A little while ago I was experimenting with the `datalad` program in order
to create reproducible and traceable datasets to work with in data analysis.

One of my learning projects resulted in the creation of
[`ds-voidlinux-popcorn`](https://git.martyoeh.me/datasci/ds-voidlinux-popcorn),
a dataset taking the statistics from
[voidlinux popcorn](https://popcorn.voidlinux.org/) and transforming them into
an easily parseable csv-based dataset.

I then used the data for further analysis, the result of which is a long-form
article which will hopefully be published here soon.

But until then, I thought, let's have some more learning fun while we're at it
and do a little exploratory data analysis with
[`nushell`](https://nushell.sh).

In a nutshell, nushell is a unix-like shell environment which transports
structured data through its pipes, making some data exploration processes
relatively painless which would be much more involved in traditional shells
like `zsh` or `bash`.[^powershell]

[^powershell]:
    In a way, `nushell` combines the concise syntax and movement concepts of
    unix-y shells like `bash` with the internal data model concepts of
    `powershell`, while giving it all a little sheen and gloss.

While working with `json` data directly in a shell traditionally required
calling out to external programs like [`jq`](https://github.com/jqlang/jq) or
[`ijq`](https://sr.ht/~gpanders/ijq/), [`yq`](https://github.com/kislyuk/yq) or
[`zq`](https://zed.brimdata.io/docs/commands/zq), with `nushell` this is all
handled internally, exposing the same data selection and filtering interface to
you whether you work with `json` files, `csv` or `tsv`, `yaml` or `toml`, or an
increasing number of additional formats handled through
[plugins](https://github.com/nushell/awesome-nu).
For any of these formats, simply `open` a file, or pipe data into a `from`
command, and nushell creates a data structure it understands out of it ---
usually a table.
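
For instance (a quick sketch with a hypothetical `config.toml` and some inline
json, not part of the dataset we will use below):

```nu
# a toml file becomes a record we can drill into
open config.toml | get package.name

# raw text piped through a `from` command becomes structured data as well
'{"distro": "void", "installs": 3}' | from json | get installs
```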

At the same time, the language and underlying (functional) approach of
`nushell` are radically different from traditional shells, so it takes some
time to re-learn the syntax and mental models required for fast and flexible
data analysis.

Which is where we circle back to the purpose of this article:
let's take the data I prepared and run it through some `nushell` functions to
explore and learn about both at the same time.

Note:\ I am by no means an authority on `nushell` or functional programming.
In fact, I cobbled many of the following examples together from other people's
samples, the [`nushell` book](http://www.nushell.sh/book/) and the official
[cookbook](http://www.nushell.sh/cookbook/).
If I made a mistake anywhere or there are better ways to accomplish what I did,
_don't hesitate_ to reach out and let me know!

## Loading the data

Since the data exists as `csv` files in a `datalad`-prepared dataset, it is
easy to integrate into my new analysis.

For my purposes, I'll simply create a new `datalad` (dataset) project.
If you don't have the program installed, you can use a temporary version with
the `uv` python package manager, by substituting the `datalad ...` command in
any of the below code snippets with `uvx datalad ...`.[^datalad]
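
For example, the dataset creation step further down would then simply become:

```sh
# same as `datalad create ...`, just run through a temporary uv environment
uvx datalad create -c yoda analysis-popcorn-nushell
```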

[^datalad]:
    Using `datalad` is not strictly necessary to get at the actual data, but it
    does make it easier.
    Basically, the program is a wrapper around `git-annex` which takes care of
    some of the plumbing to let you concentrate on the data analysis itself.
    Any of the below operations can also be accomplished with `git` and
    `git-annex` commands, but I am not exploring those as `datalad` is not the
    focus of the article at hand.

Then we can simply clone the existing dataset as a _sub_-dataset into this new
project, and download the whole output directory from it.

```sh
datalad create -c yoda analysis-popcorn-nushell
cd analysis-popcorn-nushell
mkdir input

datalad clone --dataset . https://git.martyoeh.me/datasci/ds-voidlinux-popcorn input/popcorn
```

The `input/` directory creation and path cloning is not strictly necessary; I
just like to have any input data for the current project inside a directory
named `input/`, and any output data (unsurprisingly) in a directory named
`output/`.
If you'd like, you can just as well clone the dataset into the repository root
instead.
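
That alternative would look something like this (an untested sketch, adjust the
target path to taste):

```sh
# clone as a subdataset directly at the repository root
datalad clone --dataset . https://git.martyoeh.me/datasci/ds-voidlinux-popcorn popcorn
```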

Lastly, we have to actually _grab_ the data (so far we only cloned pointers to
the actual files) so we can work with it locally:

```sh
datalad get input/popcorn/output/*
```

This may take a moment, as the dataset is currently a couple hundred MB in size.
Now we are ready to take a look with `nushell`:

```nu
open input/popcorn/output/files.csv | first 5
```

This should show you a table with the first 5 files contained in the 'raw'
popcorn dataset, along with their filesize and last modification time.[^firstfive]

[^firstfive]: For your own analysis it of course makes more sense to work on the full file; I am just restricting it to the first five rows for easier display in this article. For the more involved data grouping below, I often switch to the full file as well.

| date | filename | mtime | filesize |
| --- | --- | --- | --- |
| 2018-05-09 | 2018-05-09.json | 1759244947.9668756 | 6258 |
| 2018-05-10 | 2018-05-10.json | 1759244947.9668756 | 48567 |
| 2018-05-11 | 2018-05-11.json | 1759244947.9678757 | 56027 |
| 2018-05-12 | 2018-05-12.json | 1759244947.9678757 | 43233 |
| 2018-05-13 | 2018-05-13.json | 1759244947.9678757 | 52045 |

Let's do a few quick example cleaning steps to dip our toes into the water
before moving on to more intricate grouping and aggregation steps.

The `date` column is already nicely cleaned to display a simple `YYYY-mm-dd`
day format, but it is still only read as a `string` column.

```nu
open input/popcorn/output/files.csv | first 5 | into datetime date
```

| date | filename | mtime | filesize |
| --- | --- | --- | --- |
| Wed, 9 May 2018 00:00:00 +0200 (7 years ago) | 2018-05-09.json | 1759244947.9668756 | 6258 |
| Thu, 10 May 2018 00:00:00 +0200 (7 years ago) | 2018-05-10.json | 1759244947.9668756 | 48567 |
| Fri, 11 May 2018 00:00:00 +0200 (7 years ago) | 2018-05-11.json | 1759244947.9678757 | 56027 |
| Sat, 12 May 2018 00:00:00 +0200 (7 years ago) | 2018-05-12.json | 1759244947.9678757 | 43233 |
| Sun, 13 May 2018 00:00:00 +0200 (7 years ago) | 2018-05-13.json | 1759244947.9678757 | 52045 |

Your display might diverge a bit from mine here, but we can see that `nushell`
has parsed the string into a full `datetime` type, which we can work with much
more easily in later operations.
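
For example (just a sketch, the exact cut-off date does not matter), a proper
`datetime` column can be compared against date literals directly:

```nu
# only keep rows from 2020 onwards, made possible by the datetime conversion
open input/popcorn/output/files.csv | into datetime date | where date >= 2020-01-01 | first 5
```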

Let's also fix the other columns, starting with `filesize`:

```nu
open input/popcorn/output/files.csv | last 5 | into datetime date | into filesize filesize
```

These are now also nicely recognized as type `filesize` and shown in a more
human-readable format (note that I switched to the `last 5` rows here, which is
why the dates jump to the most recent entries):

| date | filename | mtime | filesize |
| --- | --- | --- | --- |
| Sun, 16 Nov 2025 00:00:00 +0100 (4 days ago) | 2025-11-16.json | 1763636518.5792813 | 417.5 kB |
| Mon, 17 Nov 2025 00:00:00 +0100 (3 days ago) | 2025-11-17.json | 1763636518.7392821 | 419.3 kB |
| Tue, 18 Nov 2025 00:00:00 +0100 (2 days ago) | 2025-11-18.json | 1763636518.8012826 | 407.9 kB |
| Wed, 19 Nov 2025 00:00:00 +0100 (2 days ago) | 2025-11-19.json | 1763636519.7272873 | 416.5 kB |
| Thu, 20 Nov 2025 00:00:00 +0100 (14 hours ago) | 2025-11-20.json | 1763636519.874288 | 407.7 kB |

Lastly, let's fix the `mtime` column.
This is a little more involved, as we have the data in a unix timestamp format
with sub-second accuracy (everything after the dot).
Since the values resemble a `float` type, they are automatically parsed by
`nushell` as such.
That is generally useful, but in our case we'll have to do an intermediate
conversion step, since `into datetime` can only understand `int` and `string`
types.
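
In isolation, that detour looks like this (a sketch on a single literal value
rather than the whole column):

```nu
# float -> string -> datetime, parsing seconds plus fractional seconds
1759244947.9668756 | into string | into datetime --format "%s%.f"
```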

So, here's the final conversion command, taking care of all the relevant
columns:

```nu
open input/popcorn/output/files.csv | last 5 |
into datetime date |
into filesize filesize |
into string mtime |
into datetime --format "%s%.f" mtime
```

| date | filename | mtime | filesize |
| --- | --- | --- | --- |
| Sun, 16 Nov 2025 00:00:00 +0100 (4 days ago) | 2025-11-16.json | Thu, 20 Nov 2025 11:01:58 +0000 (2 hours ago) | 417.5 kB |
| Mon, 17 Nov 2025 00:00:00 +0100 (3 days ago) | 2025-11-17.json | Thu, 20 Nov 2025 11:01:58 +0000 (2 hours ago) | 419.3 kB |
| Tue, 18 Nov 2025 00:00:00 +0100 (2 days ago) | 2025-11-18.json | Thu, 20 Nov 2025 11:01:58 +0000 (2 hours ago) | 407.9 kB |
| Wed, 19 Nov 2025 00:00:00 +0100 (2 days ago) | 2025-11-19.json | Thu, 20 Nov 2025 11:01:59 +0000 (2 hours ago) | 416.5 kB |
| Thu, 20 Nov 2025 00:00:00 +0100 (14 hours ago) | 2025-11-20.json | Thu, 20 Nov 2025 11:01:59 +0000 (2 hours ago) | 407.7 kB |

We have converted all the necessary columns and could now work with them as
needed.
In the code snippet above you can also see that we can easily create multi-line
commands in `nushell`, without any of the magic `\`-escaping of more
traditional shells.
I will make use of this for the longer commands to follow.

## Weekly rhythm

Looking at the available data, one question instantly popped into my mind:
when do most people interact with the repository?

Often, when it comes to download patterns, there are weekend dips or weekday
peaks --- in other words, more people interacting during the week than on the
weekends.
But since I presume most people run Void Linux as their personal distribution,
I could just as well see this pattern _not_ showing up in the dataset.

Let's find out!
We look at the number of unique installations interacting with the repository,
and start by creating a new column which keeps track of the weekday of each
row's `date`.

We can then group the results by the new `weekday` column, and aggregate the
number of unique installs by averaging them.

```nu
open input/popcorn/output/unique_installs.csv |
into datetime date |
insert weekday { |row| $row.date | format date "%u" } |
group-by --to-table weekday | update items { |row| $row.items.unique | math avg } |
sort -n
```

Lastly, we sort by the _numeric_ weekday key (`-n`) and have a table which
shows us the average number of unique installs per weekday:

| weekday | items |
| --- | --- |
| 1 | 72.07 |
| 2 | 73.35 |
| 3 | 72.97 |
| 4 | 72.99 |
| 5 | 72.55 |
| 6 | 72.79 |
| 7 | 72.29 |

Indeed, there is very little variation between the weekdays (Mon-Fri, 1-5) and
the weekend (Sat-Sun, 6-7).
In fact, the only day on which repository interactions rise a little seems to
be Tuesday, which is surprising.

Well, let's corroborate this with my own statistics!
I use [`atuin`](https://atuin.sh/) to track my shell history,
which can be queried with `atuin history list`.

```nu
atuin history list --print0 --format "{time} ||| {duration} ||| {directory} ||| {host} ||| {user} ||| {exit} ||| {command}" |
split row (char nul) |
split column " ||| " |
rename time duration directory host user exit command |
compact command |
where command starts-with "sudo xbps-install -Su" |
into datetime time | insert weekday { |row| $row.time | format date "%u" } |
group-by --to-table weekday | update items { |row| $row.items | length } |
sort -n
```

This is quite a bit more advanced a command, so let's quickly break it down a
little.
First, we format the output lines and print them null-separated (to be able to
parse multi-line history entries).
We can then split this output along the `nul` separators to find rows,
and along the custom-inserted formatting markers (`|||`) to find columns.

```nu
atuin history list --print0 --format "{time} ||| {duration} ||| {directory} ||| {host} ||| {user} ||| {exit} ||| {command}" |
split row (char nul) |
split column " ||| "
```

This gives us a basic table which we can now refine a little:
first, we rename the columns according to the format fields we just used, so
they carry the same names.
Then we can use the `compact` command to filter out any rows which do _not_
have a `command` entry (essentially dropping all null values),
as those would trip up our row filters later on.[^nullfilter]

[^nullfilter]: And it _did_ trip me up for quite a while when crafting the pipeline. Now I know the usefulness of the `compact` command, but just for the record: if you ever receive a `cannot find column 'command'` error even though that column should be there, it may mean the column contains null values which still have to be filtered out. In our case, those rows did not have a `command` entry since we parsed the 'empty' command output earlier. Now I know.
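
Just to see `compact` in isolation, here is a toy table with made-up values
(not part of the actual pipeline):

```nu
# the second row has a null command and gets dropped by `compact`
[[command]; ["sudo xbps-install -Su"] [null]] | compact command
```

Only the row with a non-null `command` survives.
With that sorted, our pipeline so far looks like this: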

```nu
atuin history list --print0 --format "{time} ||| {duration} ||| {directory} ||| {host} ||| {user} ||| {exit} ||| {command}" |
split row (char nul) |
split column " ||| " |
rename time duration directory host user exit command |
compact command
```

Now we have a table containing our complete command invocation history ---
a very long table, which is less of a mess to display with another
`| first 100` filter or similar appended.

To see when we updated our system, we can now filter this for any invocation of
the `sudo xbps-install -Su` command, and we'll have all our personal system
update commands in a table with their exact dates.
Lastly, we just apply the same grouping and aggregation method we already used
on the popcorn history above, to arrive back at the full pipeline:

```nu
atuin history list --print0 --format "{time} ||| {duration} ||| {directory} ||| {host} ||| {user} ||| {exit} ||| {command}" |
split row (char nul) |
split column " ||| " |
rename time duration directory host user exit command |
compact command |
where command starts-with "sudo xbps-install -Su" |
into datetime time | insert weekday { |row| $row.time | format date "%u" } |
group-by --to-table weekday | update items { |row| $row.items | length } |
sort -n
```

By the way, this sort of re-use is exactly one of the positives I envision from
my `nushell` usage ---
it doesn't really matter whether the data comes from a `json` dataset, `csv`
files, or command output.
As long as you can wrangle the data structure into a vaguely similar shape,
you can filter and aggregate using the same standardized commands.

This pipeline leaves us with the following output:

| weekday | items |
| --- | --- |
| 1 | 2 |
| 2 | 19 |
| 3 | 7 |
| 4 | 8 |
| 5 | 5 |
| 6 | 4 |

How interesting ---
my personal update usage reflects _exactly_ the little Tuesday peak we saw for
the global dataset, only much more pronounced.
I am not sure why Tuesday seems to be my preferred update day throughout my
usage of Void Linux.

Another thing I could see from my personal history is that I am indeed a 'lazy
updater', sometimes letting a month or more slip by between running updates on
my machine.[^lazyupdates]
Curiously, I can also glean from the list above that I have _never_ updated my
system on a Sunday.

[^lazyupdates]: It is one of the reasons why I switched from Arch Linux to Void Linux, in fact. While both provide rolling updates, I did not need the constant bleeding edge for all packages on my system, and running an update on Arch Linux after a little while always presented one with (literally) hundreds of package updates, often after only a couple of weeks.
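
If you are curious how one could check those gaps, here is a rough sketch (not
part of the original analysis; it trims the `atuin` format down to the two
fields we need and measures the time between consecutive updates):

```nu
# list the update commands with their timestamps, then compute the duration
# between each consecutive pair of updates
atuin history list --print0 --format "{time} ||| {command}" |
split row (char nul) |
split column " ||| " |
rename time command |
compact command |
where command starts-with "sudo xbps-install -Su" |
into datetime time |
sort-by time |
get time |
window 2 |
each { |pair| ($pair | last) - ($pair | first) }
```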

## Kernel longevity