Nushell analysis

A little while ago I started experimenting with the datalad program in order to create reproducible and traceable datasets to work with in data analysis.

One of my learning projects resulted in the creation of ds-voidlinux-popcorn, which takes the statistics from voidlinux popcorn and transforms them into an easily parseable csv-based dataset.

I then used the data for further analysis, the result of which is a long-form article that will hopefully be published here soon.

But I also thought, until then, let's have some further learning fun while we're at it and do a little exploratory data analysis with nushell.

In a nutshell, nushell is a unix-like shell environment which transports structured data through its pipes, making some data exploration processes relatively painless which are much more involved in more traditional shells like zsh or bash.1

In a way, nushell combines the concise syntax and movement concepts of unix-y shells like bash with the internal data model concepts of powershell, while giving it all a little sheen and gloss.

While working with json data directly in a shell traditionally required calling out to external programs like jq or ijq, yq or zq, nushell handles all of this internally, exposing the same data selection and filtering interface whether you work with json files, csv or tsv, yaml or toml, or an increasing number of additional formats handled through plugins. For any of these formats, simply open a file, or pipe data into a from command, and nushell creates a data structure it understands out of it --- usually a table.
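
For instance, with some hypothetical files (none of these are part of the dataset used below):

# 'open' picks a parser based on the file extension and returns structured data
open packages.csv | first 3
open config.toml | get some.key
# text piped in from elsewhere can be parsed explicitly with a 'from' command
cat events.json | from json | where level == "error"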

At the same time, the language and the underlying (functional) approach of nushell are radically different from traditional shells, so it takes some time to re-learn the syntax and mental models required for fast and flexible data analysis.

Which is where we circle back to the purpose of this article: let's take the data I prepared and run it through some nushell functions to explore and learn about both at the same time.

Note: I am by no means an authority on nushell or functional programming. In fact, I cobbled many of the following examples together from other people's samples, the nushell book and the official cookbook. If I made a mistake anywhere or there are better ways to accomplish what I did, don't hesitate to reach out and let me know!

Loading the data

Since the data exists as csv files in a datalad-prepared dataset, it is easy to integrate into my new analysis.

For my purposes, I'll simply create a new datalad (dataset) project. If you don't have the program installed, you can use a temporary version with the uv python package manager, by substituting the datalad ... command in any of the below code snippets with uvx datalad ....2
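
For example, a quick way to check that the temporary setup works before running the real commands:

# run datalad through uv's ephemeral tool runner instead of a system-wide install
uvx datalad --version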

Using datalad is not strictly necessary to get at the actual data but it does make it easier. Basically the program is a wrapper around git-annex which takes care of some of the plumbing to let you concentrate on the data analysis itself. Any of the below operations can also be accomplished with git and git-annex commands, but I am not exploring those as datalad is not the focus of the article at hand.
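
For the curious, here is a rough, untested sketch of what that could look like with plain git and git-annex; the exact steps depend on how the dataset repository and its annex are set up:

# roughly equivalent, without datalad's subdataset bookkeeping
git clone https://git.martyoeh.me/datasci/ds-voidlinux-popcorn input/popcorn
cd input/popcorn
git annex get output/*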

Then we can simply clone the existing dataset as a sub-dataset into this new project, and download the whole output directory from it.

datalad create -c yoda analysis-popcorn-nushell
cd analysis-popcorn-nushell
mkdir input

datalad clone --dataset . https://git.martyoeh.me/datasci/ds-voidlinux-popcorn input/popcorn

Creating the input/ directory and cloning into that path is not strictly necessary; I just like to have any input data for the current project inside a directory named input/, and any output data (unsurprisingly) in a directory named output/. If you'd like, you can just as well clone the dataset into the repository root instead.

Lastly, we have to actually grab the data (so far we only cloned pointers to the actual files) so we can work with it locally:

datalad get input/popcorn/output/*

This may take a moment, as the dataset is currently a couple hundred MB in size. Now we are ready to take a look with nushell:

open input/popcorn/output/files.csv | first 5

This should show you a table with 5 rows of the files contained in the 'raw' popcorn dataset, along with their filesize and last modification time.3

| date | filename | mtime | filesize |
| --- | --- | --- | --- |
| 2018-05-09 | 2018-05-09.json | 1759244947.9668756 | 6258 |
| 2018-05-10 | 2018-05-10.json | 1759244947.9668756 | 48567 |
| 2018-05-11 | 2018-05-11.json | 1759244947.9678757 | 56027 |
| 2018-05-12 | 2018-05-12.json | 1759244947.9678757 | 43233 |
| 2018-05-13 | 2018-05-13.json | 1759244947.9678757 | 52045 |

Let's do a few quick example cleaning steps to dip our toes in the water before moving on to the more intricate grouping and aggregation steps.

The date column is already nicely cleaned to display a simple YYYY-mm-dd day format, but it is still only read in as a string column.

open input/popcorn/output/files.csv | first 5 | into datetime date

| date | filename | mtime | filesize |
| --- | --- | --- | --- |
| Wed, 9 May 2018 00:00:00 +0200 (7 years ago) | 2018-05-09.json | 1759244947.9668756 | 6258 |
| Thu, 10 May 2018 00:00:00 +0200 (7 years ago) | 2018-05-10.json | 1759244947.9668756 | 48567 |
| Fri, 11 May 2018 00:00:00 +0200 (7 years ago) | 2018-05-11.json | 1759244947.9678757 | 56027 |
| Sat, 12 May 2018 00:00:00 +0200 (7 years ago) | 2018-05-12.json | 1759244947.9678757 | 43233 |
| Sun, 13 May 2018 00:00:00 +0200 (7 years ago) | 2018-05-13.json | 1759244947.9678757 | 52045 |

Your display might diverge a bit from mine here, but we can see that nushell has parsed the string into a full datetime type, which we can work with much more easily in later operations. Let's also fix the other columns, starting with filesize:

open input/popcorn/output/files.csv | last 5 | into datetime date | into filesize filesize

These are now also nicely recognized as type filesize and shown in a more human-readable format:

| date | filename | mtime | filesize |
| --- | --- | --- | --- |
| Sun, 16 Nov 2025 00:00:00 +0100 (4 days ago) | 2025-11-16.json | 1763636518.5792813 | 417.5 kB |
| Mon, 17 Nov 2025 00:00:00 +0100 (3 days ago) | 2025-11-17.json | 1763636518.7392821 | 419.3 kB |
| Tue, 18 Nov 2025 00:00:00 +0100 (2 days ago) | 2025-11-18.json | 1763636518.8012826 | 407.9 kB |
| Wed, 19 Nov 2025 00:00:00 +0100 (2 days ago) | 2025-11-19.json | 1763636519.7272873 | 416.5 kB |
| Thu, 20 Nov 2025 00:00:00 +0100 (14 hours ago) | 2025-11-20.json | 1763636519.874288 | 407.7 kB |

Lastly, let's fix the mtime column. This is a little more involved, as the data is in unix timestamp format with sub-second accuracy (everything after the dot). Since the values resemble floats, nushell automatically parses them as such. That is generally useful, but in our case we'll have to do an intermediate conversion step, since into datetime only understands int and string types.
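
As a quick single-value illustration of that two-step conversion (the number is just the first mtime value from the table above):

# into datetime does not take floats, so we go float -> string -> datetime
1759244947.9668756 | into string | into datetime --format "%s%.f"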

So, here's the final conversion command, taking care of all the relevant columns:

open input/popcorn/output/files.csv | last 5 |
  into datetime date |
  into filesize filesize |
  into string mtime |
  into datetime --format "%s%.f" mtime

| date | filename | mtime | filesize |
| --- | --- | --- | --- |
| Sun, 16 Nov 2025 00:00:00 +0100 (4 days ago) | 2025-11-16.json | Thu, 20 Nov 2025 11:01:58 +0000 (2 hours ago) | 417.5 kB |
| Mon, 17 Nov 2025 00:00:00 +0100 (3 days ago) | 2025-11-17.json | Thu, 20 Nov 2025 11:01:58 +0000 (2 hours ago) | 419.3 kB |
| Tue, 18 Nov 2025 00:00:00 +0100 (2 days ago) | 2025-11-18.json | Thu, 20 Nov 2025 11:01:58 +0000 (2 hours ago) | 407.9 kB |
| Wed, 19 Nov 2025 00:00:00 +0100 (2 days ago) | 2025-11-19.json | Thu, 20 Nov 2025 11:01:59 +0000 (2 hours ago) | 416.5 kB |
| Thu, 20 Nov 2025 00:00:00 +0100 (14 hours ago) | 2025-11-20.json | Thu, 20 Nov 2025 11:01:59 +0000 (2 hours ago) | 407.7 kB |

We have converted all the necessary columns and could now work with them as needed. In the code snippet above you can also see that we can easily write multi-line commands in nushell, without any of the magic backslash-escaping of more traditional shells. I will make use of this for the longer commands to follow.

Weekly rhythm

Looking at the available data, one question instantly popped into my mind: when do most people interact with the repository?

Often, when it comes to download patterns, there are weekend dips or weekday peaks --- in other words, more people interacting during the week than on the weekends. But since I presume most people run Void Linux as their personal distribution, I could also see this pattern not showing up in the dataset.

Let's find out! Looking at the number of unique installations interacting with the repository, we start by creating a new column which holds the weekday of each row's date.

We can then group the results by the new weekday column, and aggregate the number of unique downloads by averaging them.

open input/popcorn/output/unique_installs.csv |
  into datetime date |
  insert weekday { |row| $row.date | format date "%u" } |
  group-by --to-table weekday | update items { |row| $row.items.unique | math avg } |
  sort -n

Lastly, we sort with natural ordering (-n), so the weekday strings come out in numeric order, and we get a table showing the average number of unique installations per weekday:

| weekday | items |
| --- | --- |
| 1 | 72.07 |
| 2 | 73.35 |
| 3 | 72.97 |
| 4 | 72.99 |
| 5 | 72.55 |
| 6 | 72.79 |
| 7 | 72.29 |

Indeed, there is very little variation between the weekdays (Mon-Fri, 1-5) and the weekends (Sat-Sun, 6-7). In fact, the only day on which repository interactions rise a little seems to be Tuesday, which is surprising.

Well, let's corroborate this with my own statistics! I use atuin to track my shell history, which can be queried with atuin history list.

atuin history list --print0 --format "{time} ||| {duration} ||| {directory} ||| {host} ||| {user} ||| {exit} ||| {command}" |
  split row (char nul) |
  split column " ||| " |
  rename time duration directory host user exit command |
  compact command |
  where command starts-with "sudo xbps-install -Su" |
  into datetime time | insert weekday { |row| $row.time | format date "%u" } |
  group-by --to-table weekday | update items { |row| $row.items | length } |
  sort -n

This is quite a bit more advanced a command, so let's quickly break it down a little. First, we have atuin format each history entry with our custom field markers and emit the entries null-separated (--print0), so that multi-line history entries do not break the row parsing. We can then split this output along the nul separators to get rows, and along the custom-inserted formatting markers (|||) to get columns.

atuin history list --print0 --format "{time} ||| {duration} ||| {directory} ||| {host} ||| {user} ||| {exit} ||| {command}" |
  split row (char nul) |
  split column " ||| "

This gives us a basic table which we can now refine a little: first, we rename the columns to match the format fields we just used. Then we use the compact command to drop any rows which do not have a command entry (essentially dropping all null values), as these would trip up our row filters later on.4

atuin history list --print0 --format "{time} ||| {duration} ||| {directory} ||| {host} ||| {user} ||| {exit} ||| {command}" |
  split row (char nul) |
  split column " ||| " |
  rename time duration directory host user exit command |
  compact command

Now we have a table containing our complete command invocation history --- a very long table, which may be less messy to display with an additional | first 100 filter or similar.

To see when we updated our system, we can now filter this for any invocation of the sudo xbps-install -Su command, and we'll have all our personal system update commands in a table with their exact dates. Lastly, we apply the same grouping and aggregation we already used on the popcorn data above, arriving back at the full pipeline:

atuin history list --print0 --format "{time} ||| {duration} ||| {directory} ||| {host} ||| {user} ||| {exit} ||| {command}" |
  split row (char nul) |
  split column " ||| " |
  rename time duration directory host user exit command |
  compact command |
  where command starts-with "sudo xbps-install -Su" |
  into datetime time | insert weekday { |row| $row.time | format date "%u" } |
  group-by --to-table weekday | update items { |row| $row.items | length } |
  sort -n

By the way, this sort of re-use is exactly one of the positives I envision from my nushell usage --- it doesn't really matter whether the data comes from a json dataset, csv files, or a command output. As long as you can wrangle the data structure into vaguely similar shape, you can filter and aggregate using the same standardized commands.
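
To make that concrete, here is a small sketch (entirely my own naming, not part of the original pipelines) of how the shared grouping step could be pulled out into a custom command that works on any table with a datetime column:

# hypothetical helper: count rows per weekday of the given datetime column
def count-per-weekday [column: string] {
  insert weekday { |row| $row | get $column | format date "%u" } |
  group-by --to-table weekday |
  update items { |row| $row.items | length } |
  sort -n
}

# both the popcorn csv and the atuin table then expose the same interface to it,
# passing 'date' or 'time' as the column argument:
open input/popcorn/output/unique_installs.csv | into datetime date | count-per-weekday date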

Run in full, the atuin pipeline above leaves us with the following output:

| weekday | items |
| --- | --- |
| 1 | 2 |
| 2 | 19 |
| 3 | 7 |
| 4 | 8 |
| 5 | 5 |
| 6 | 4 |

How interesting --- my personal update usage reflects the little peak we saw for the global dataset exactly on Tuesday, only much more so. I am not sure why Tuesday seems to be my preferred update day throughout my usage of Void Linux.

Another thing I could see from my personal history is that I am indeed a 'lazy updater', sometimes letting a month or more slip between running updates on my machine.5 Curiously, I can also glean from the list above that I have indeed never updated my system on a Sunday.
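
As a rough sketch of how one could quantify that laziness: assuming $updates is a hypothetical variable holding the atuin pipeline from above up to and including the into datetime time step (before the grouping), the gaps between consecutive update runs fall out of a sliding window:

# subtracting consecutive timestamps gives the gap between update runs;
# math max then picks the longest update drought
$updates | get time | sort | window 2 |
  each { |pair| ($pair | last) - ($pair | first) } |
  math max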

Kernel longevity


  1.
  2.
  3. For your analysis it will of course make more sense to work on the full file; I am just restricting it to the first five rows for easier display in this article. For the more involved data grouping below, I often switch to the full file as well.

  4. And it did trip me up for quite a while when crafting the pipeline. Now I know the usefulness of the compact command, but just for the record: if you ever receive a cannot find column 'command' error even though that column should be there, it may mean the column contains null values which still have to be filtered out. In our case, those rows did not have a command entry since we parsed the 'empty' command earlier. Now I know.

  5. It is one of the reasons why I switched from Arch Linux to Void Linux, in fact. While both provide rolling updates, I did not need the constant bleeding edge for all packages on my system, and running an update after a little while on Arch Linux always presented me with (literally) hundreds of package updates, often after only a couple of weeks.