Nushell analysis
A little while ago I was experimenting with the datalad program in order
to create reproducible and traceable datasets to work with in data analysis.
One of my learning projects resulted in the creation of
ds-voidlinux-popcorn,
a dataset taking the statistics from
voidlinux popcorn and transforming them into
an easily parseable csv-based dataset.
The data show daily statistics of the Void Linux repositories, with unique daily users, the Linux kernel versions present, and individual package installation break-downs. I used the data for some further analysis, the result of which is a long-form article which will hopefully be published here soon.
But I also thought ---
until then ---
let's have some further fun learning while we're at it,
and do a little exploratory data analysis with nushell.
In a nutshell, nushell is a unix-like shell environment which transports
structured data through its pipes, making some data exploration processes
relatively painless that are much more involved in more traditional shells like
zsh or bash.1
In a way, nushell combines the concise syntax and movement concepts of unix-y
shells like bash with the internal data model concepts of powershell, while
giving it all a little sheen and gloss.
While working with json data directly in a shell traditionally required
calling out to external programs like jq or
ijq, yq or
zq, with nushell this is all
handled internally, exposing the same data selection and filtering interface to
you whether you work with json files, csv or tsv, yaml or toml, or an
increasing number of additional formats handled through
plugins.
For any of these formats, simply open a file, or pipe data into a from
command, and nushell creates a data structure it understands out of it ---
usually a table.
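As a quick illustration --- the file names here are just placeholders:
open stats.json                  # parsed directly into structured data
open --raw stats.csv | from csv  # or read the raw text and pipe it through a from command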
At the same time, the language and the underlying (functional) approach of nushell
are radically different from those of traditional shells, so it takes some time to
re-learn the syntax and mental models required for fast and flexible data
analysis.
Which is where we circle back to the purpose of this article:
let's take the data I prepared and run it through some nushell functions to
explore and learn about both at the same time.
Note: I am by no means an authority on nushell or functional programming.
In fact I cobbled many of the following examples together from other people's
samples, the nushell book and the official
cookbook.
If I made a mistake anywhere, or there are better ways to accomplish what I did,
don't hesitate to reach out and let me know!
Loading the data
Since the data exists as csv files in a datalad-prepared dataset, it is easy to
integrate into my new analysis.
For my purposes, I'll simply create a new datalad (dataset) project.
If you don't have the program installed, you can use a temporary version with
the uv python package manager, by substituting the datalad ... command in
any of the below code snippets with uvx datalad ....2
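For example, the project creation command from below would then become (assuming uv itself is installed):
uvx datalad create -c yoda analysis-popcorn-nushell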
Using datalad is not strictly necessary to get at the actual data but it does
make it easier.
Basically the program is a wrapper around git-annex which takes care of some
of the plumbing to let you concentrate on the data analysis itself.
Any of the below operations can also be accomplished with git and git-annex
commands, but I am not exploring those as datalad is not the focus of the
article at hand.
Then we can simply clone the existing dataset as a sub-dataset into this new project, and download the whole output directory from it.
datalad create -c yoda analysis-popcorn-nushell
cd analysis-popcorn-nushell
mkdir input
datalad clone --dataset . https://git.martyoeh.me/datasci/ds-voidlinux-popcorn input/popcorn
The input/ directory creation and path cloning is not necessary, I just like
to have any input data for the current project inside a directory named
input/, and any output data (unsurprisingly) in a directory named output/.
If you'd like you can clone the dataset just as well into the repository root
instead.
Lastly, we have to actually grab the data (so far we only cloned pointers to the actual files) so we can work with it locally:
datalad get input/popcorn/output/*
This may take a moment, as the dataset is currently a couple hundred MB in size.
Now we are ready to take a look with nushell:
open input/popcorn/output/files.csv | first 5
This should show you a table with 5 rows of the files contained in the 'raw' popcorn dataset, along with their filesize and last modification time.3
| date | filename | mtime | filesize |
|---|---|---|---|
| 2018-05-09 | 2018-05-09.json | 1759244947.9668756 | 6258 |
| 2018-05-10 | 2018-05-10.json | 1759244947.9668756 | 48567 |
| 2018-05-11 | 2018-05-11.json | 1759244947.9678757 | 56027 |
| 2018-05-12 | 2018-05-12.json | 1759244947.9678757 | 43233 |
| 2018-05-13 | 2018-05-13.json | 1759244947.9678757 | 52045 |
Let's do a few quick example cleaning steps to dip our toes into the water before moving on to more intricate grouping and aggregation.
The date column is already nicely cleaned to display a simple YYYY-mm-dd day format,
but it is still only read in as a plain string column.
open input/popcorn/output/files.csv | first 5 | into datetime date
| date | filename | mtime | filesize |
|---|---|---|---|
| Wed, 9 May 2018 00:00:00 +0200 (7 years ago) | 2018-05-09.json | 1759244947.9668756 | 6258 |
| Thu, 10 May 2018 00:00:00 +0200 (7 years ago) | 2018-05-10.json | 1759244947.9668756 | 48567 |
| Fri, 11 May 2018 00:00:00 +0200 (7 years ago) | 2018-05-11.json | 1759244947.9678757 | 56027 |
| Sat, 12 May 2018 00:00:00 +0200 (7 years ago) | 2018-05-12.json | 1759244947.9678757 | 43233 |
| Sun, 13 May 2018 00:00:00 +0200 (7 years ago) | 2018-05-13.json | 1759244947.9678757 | 52045 |
Your display might diverge a bit from mine here now,
but we can see that nushell has parsed the string into a full datetime type which we can work with much more easily in later operations.
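Just as a quick illustration (the cutoff date here is arbitrary), the proper datetime type lets us filter rows by date directly:
open input/popcorn/output/files.csv | into datetime date | where date >= ('2020-01-01' | into datetime) | first 3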
Let's also fix the other columns, starting with filesize (note that I'm switching to the last five rows here):
open input/popcorn/output/files.csv | last 5 | into datetime date | into filesize filesize
These are now also nicely recognized as type filesize and shown in a more human-readable format:
| date | filename | mtime | filesize |
|---|---|---|---|
| Sun, 16 Nov 2025 00:00:00 +0100 (4 days ago) | 2025-11-16.json | 1763636518.5792813 | 417.5 kB |
| Mon, 17 Nov 2025 00:00:00 +0100 (3 days ago) | 2025-11-17.json | 1763636518.7392821 | 419.3 kB |
| Tue, 18 Nov 2025 00:00:00 +0100 (2 days ago) | 2025-11-18.json | 1763636518.8012826 | 407.9 kB |
| Wed, 19 Nov 2025 00:00:00 +0100 (2 days ago) | 2025-11-19.json | 1763636519.7272873 | 416.5 kB |
| Thu, 20 Nov 2025 00:00:00 +0100 (14 hours ago) | 2025-11-20.json | 1763636519.874288 | 407.7 kB |
Lastly, let's fix the mtime column.
This is a little more involved, as we have the data in a unix timestamp format with sub-second accuracy (everything after the dot).
Since the values resemble a float type, nushell automatically parses them as such.
Generally useful, but in our case we'll have to do an intermediate conversion step, since into datetime only understands int and string types.
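To see what this does, here is the same conversion applied to a single value, taking the first mtime from the table above:
1759244947.9668756 | into string | into datetime --format "%s%.f"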
So, here's the final conversion command, taking care of all the relevant columns:
open input/popcorn/output/files.csv | last 5 |
into datetime date |
into filesize filesize |
into string mtime |
into datetime --format "%s%.f" mtime
| date | filename | mtime | filesize |
|---|---|---|---|
| Sun, 16 Nov 2025 00:00:00 +0100 (4 days ago) | 2025-11-16.json | Thu, 20 Nov 2025 11:01:58 +0000 (2 hours ago) | 417.5 kB |
| Mon, 17 Nov 2025 00:00:00 +0100 (3 days ago) | 2025-11-17.json | Thu, 20 Nov 2025 11:01:58 +0000 (2 hours ago) | 419.3 kB |
| Tue, 18 Nov 2025 00:00:00 +0100 (2 days ago) | 2025-11-18.json | Thu, 20 Nov 2025 11:01:58 +0000 (2 hours ago) | 407.9 kB |
| Wed, 19 Nov 2025 00:00:00 +0100 (2 days ago) | 2025-11-19.json | Thu, 20 Nov 2025 11:01:59 +0000 (2 hours ago) | 416.5 kB |
| Thu, 20 Nov 2025 00:00:00 +0100 (14 hours ago) | 2025-11-20.json | Thu, 20 Nov 2025 11:01:59 +0000 (2 hours ago) | 407.7 kB |
We have converted all the necessary columns and could now work with them as needed.
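For instance, with filesize now being a proper type, summing up the total size of these five files is straightforward (purely as an illustration):
open input/popcorn/output/files.csv | last 5 |
into filesize filesize |
get filesize | math sum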
In the code snippet above you can also see that we can easily create multi-line commands in nushell,
without any of the magic \-escaping of more traditional shells.
I will make use of this for the longer commands to follow.
Weekly rhythm
Looking at the available data, one question that instantly popped into my mind is: when do most people interact with the repository?
Often, when it comes to download patterns, there are weekend dips or weekday peaks --- in other words, more people interacting during the week than on the weekends. But since I presume most people run Void Linux as their personal distribution, I could see this pattern not showing up in this dataset.
Let's find out!
To look at the number of unique installations interacting with the repository,
we start by creating a new column which keeps track of the weekday of each row's date.
We can then group the results by the new weekday column, and aggregate the number of unique downloads by averaging them.
open input/popcorn/output/unique_installs.csv |
into datetime date |
insert weekday { |row| $row.date | format date "%u" } |
group-by --to-table weekday | update items { |row| $row.items.unique | math avg } |
sort -n
Lastly, we sort the rows (-n for natural sorting, so the string weekdays come out in numeric order) and get a table which shows us the average downloads per weekday:
| weekday | items |
|---|---|
| 1 | 72.07 |
| 2 | 73.35 |
| 3 | 72.97 |
| 4 | 72.99 |
| 5 | 72.55 |
| 6 | 72.79 |
| 7 | 72.29 |
Indeed, there is very little variation between the week days (Mon-Fri, 1-5) and the weekends (Sat-Sun, 6-7). In fact, the only day on which repository interactions rise a little seems to be Tuesday, which is surprising.
Well, let's corroborate this with my own statistics!
I use atuin to track my shell history,
which can be queried with atuin history list.
atuin history list --print0 --format "{time} ||| {duration} ||| {directory} ||| {host} ||| {user} ||| {exit} ||| {command}" |
split row (char nul) |
split column " ||| " |
rename time duration directory host user exit command |
compact command |
where command starts-with "sudo xbps-install -Su" |
into datetime time | insert weekday { |row| $row.time | format date "%u" } |
group-by --to-table weekday | update items { |row| $row.items | length } |
sort -n
This command is quite a bit more advanced, so let's quickly break it down a little.
First, we format the output lines and print them null-separated (so that multiline history entries can be parsed correctly).
We can then split this output along the nul separators to find the rows,
and along the custom-inserted formatting markers (|||) to find the columns.
atuin history list --print0 --format "{time} ||| {duration} ||| {directory} ||| {host} ||| {user} ||| {exit} ||| {command}" |
split row (char nul) |
split column " ||| " |
This gives us a basic table which we can now refine a little:
First, we rename the columns to match the format string we just used.
Then we can use the compact command to filter out any rows which do not have a command column entry
(drop all null-values, essentially),
as this would trip up our row filters later on.4
atuin history list --print0 --format "{time} ||| {duration} ||| {directory} ||| {host} ||| {user} ||| {exit} ||| {command}" |
split row (char nul) |
split column " ||| " |
rename time duration directory host user exit command |
compact command
Now we have a table containing our complete command invocation history ---
a very long table, which is less messy to display with another | first 100 filter or similar.
To see when we updated our system, we can now filter this for any invocation of the sudo xbps-install -Su command,
and we'll have all our personal system update commands in a table with their exact dates.
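Appended to the pipeline we built so far, that filter step looks like this:
atuin history list --print0 --format "{time} ||| {duration} ||| {directory} ||| {host} ||| {user} ||| {exit} ||| {command}" |
split row (char nul) |
split column " ||| " |
rename time duration directory host user exit command |
compact command |
where command starts-with "sudo xbps-install -Su"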
Lastly, we just do the same grouping and aggregation method we already applied to the popcorn history above,
to arrive back at the full pipeline:
atuin history list --print0 --format "{time} ||| {duration} ||| {directory} ||| {host} ||| {user} ||| {exit} ||| {command}" |
split row (char nul) |
split column " ||| " |
rename time duration directory host user exit command |
compact command |
where command starts-with "sudo xbps-install -Su" |
into datetime time | insert weekday { |row| $row.time | format date "%u" } |
group-by --to-table weekday | update items { |row| $row.items | length } |
sort -n
By the way, this sort of re-use is exactly one of the positives I envision from my nushell usage ---
it doesn't really matter whether the data comes from a json dataset, csv files, or a command output.
As long as you can wrangle the data structure into vaguely similar shape,
you can filter and aggregate using the same standardized commands.
This pipeline leaves us with the following output:
| weekday | items |
|---|---|
| 1 | 2 |
| 2 | 19 |
| 3 | 7 |
| 4 | 8 |
| 5 | 5 |
| 6 | 4 |
How interesting --- my personal update usage reflects exactly the little Tuesday peak we saw in the global dataset, only much more strongly. I am not sure why Tuesday seems to be my preferred update day throughout my usage of Void Linux.
Another thing I could see from my personal history is that I am indeed a 'lazy updater', sometimes letting a month or more slip between running updates on my machine.5 Curiously, I can also glean from the list above that I have never updated my system on a Sunday.
Kernel longevity
Another question that I find quite interesting is this: How long were the various kernel versions in use? Or, more precisely, which versions have the longest 'life-spans' in the repository, and which the shortest?
But first, let's investigate the overall download numbers per kernel.
For this we'll use the kernels.csv file, so let's take a look.
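Peeking at the tail end of the file gives a first impression (the number of rows shown here is arbitrary):
open input/popcorn/output/kernels.csv | last 8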
| date | kernel | downloads |
|---|---|---|
| 2025-11-20 | 6.17.7_1 | 6 |
| 2025-11-20 | 6.17.8-tkg-bore-alderlake_1 | 1 |
| 2025-11-20 | 6.17.8-tkg-bore-zen_1 | 1 |
| 2025-11-20 | 6.17.8_1 | 12 |
| 2025-11-20 | 6.6.111_1 | 1 |
| 2025-11-20 | 6.6.116_1 | 3 |
| 2025-11-20 | 6.6.65_1 | 1 |
| 2025-11-20 | 6.6.87.2-microsoft-standard-WSL2 | 1 |
This file is almost perfectly usable as-is, but I am only interested in the actual kernel versions,
so only the first three version components (e.g. 6.17.7).
I don't care about the void-internal release version (the _1),
nor the weird custom-compiled kernels people are using (e.g. tkg-bore-alderlake_1).
But since I also don't want to drop those rows from the data outright,
we'll do a little regex string substitution:
mkdir output
open input/popcorn/output/kernels.csv |
update kernel { str replace --regex '^(\d+\.\d+\.\d+).*' "$1"} |
group-by --to-table kernel |
save output/kernels_standardized.json
Here we remove anything that is not part of the version string by essentially replacing the whole line with just the version itself. This process takes a while for the over 57,000 lines contained in the file, so I am saving an intermediate output version that I'll use for the next steps.
We'll start by summing up the absolute numbers of kernel uses per version, keeping only the versions with the highest totals (the top five of which are shown below):
open output/kernels_standardized.json | update items { $in.downloads | math sum } | sort-by items | last 10
This shows us that:
| kernel | items |
|---|---|
| 6.1.31 | 1340 |
| 5.8.18 | 1674 |
| 6.12.41 | 1744 |
| 5.13.19 | 2500 |
| 6.3.13 | 2624 |
The kernel that was run the most in terms of absolute numbers was kernel version 6.3.13, with 5.13.19 coming up relatively closely behind. The other kernels are trailing somewhat further behind, with the next kernel having almost 1,000 fewer uses.
But I originally wanted to know about the longest-lived kernels in these data, so how do we extract that?
We'll take the grouped json file and do a similar aggregation as up above,
except creating a new column for the first (math min) and last (math max) appearance of each kernel version.
Then we can take those two and,
since they are of type datetime,
simply subtract one from the other to get the total duration that the respective kernel appeared in the data.
open output/kernels_standardized.json |
insert first { $in.items.date | math min } |
insert last { $in.items.date | math max } |
reject items |
into datetime first last |
insert delta {$in.last - $in.first } |
sort-by delta |
last 10
By sorting on the delta value and keeping the last ten entries, we have essentially filtered for the longest-lived kernel versions, leaving us with the following:
| kernel | first | last | delta |
|---|---|---|---|
| 6.1.6 | Mon, 16 Jan 2023 00:00:00 +0100 (2 years ago) | Sat, 5 Apr 2025 00:00:00 +0200 (7 months ago) | 115wk 4day 23hr |
| 4.19.59 | Wed, 17 Jul 2019 00:00:00 +0200 (6 years ago) | Fri, 28 Jan 2022 00:00:00 +0100 (3 years ago) | 132wk 2day 1hr |
| 5.10.9 | Fri, 22 Jan 2021 00:00:00 +0100 (4 years ago) | Tue, 12 Sep 2023 00:00:00 +0200 (2 years ago) | 137wk 3day 23hr |
| 5.19.14 | Thu, 13 Oct 2022 00:00:00 +0200 (3 years ago) | Tue, 5 Aug 2025 00:00:00 +0200 (3 months ago) | 146wk 5day |
| 5.15.36 | Fri, 29 Apr 2022 00:00:00 +0200 (3 years ago) | Mon, 3 Mar 2025 00:00:00 +0100 (8 months ago) | 148wk 3day 1hr |
| 5.13.8 | Fri, 6 Aug 2021 00:00:00 +0200 (4 years ago) | Fri, 2 Aug 2024 00:00:00 +0200 (a year ago) | 156wk |
| 5.13.10 | Sat, 14 Aug 2021 00:00:00 +0200 (4 years ago) | Sun, 15 Sep 2024 00:00:00 +0200 (a year ago) | 161wk 1day |
| 5.12.13 | Sat, 26 Jun 2021 00:00:00 +0200 (4 years ago) | Mon, 23 Sep 2024 00:00:00 +0200 (a year ago) | 169wk 2day |
| 5.11.22 | Fri, 21 May 2021 00:00:00 +0200 (4 years ago) | Sun, 22 Sep 2024 00:00:00 +0200 (a year ago) | 174wk 2day |
| 5.2.13 | Sat, 7 Sep 2019 00:00:00 +0200 (6 years ago) | Sun, 7 Sep 2025 00:00:00 +0200 (2 months ago) | 313wk 1day |
We can see that especially kernel version 5 was long-lived, with version 5.2.13 being in use for just over 6 years. The exact nature of the time frame (September 7 to September 7) makes me think this may be some sort of automated installation.
Without skipping ahead too much, this makes sense to me looking at the wider picture,
as the popcorn statistics gathering was introduced in the middle of kernel 4's existence,
and we are not yet anywhere near the end of the kernel 6 life-span,
so version 5 probably had the most opportunity to have long-running installations.
Top packages
Lastly, let's answer one more question: Which packages have the highest median daily installation counts across the whole period?
This will be a little easier again ---
we have all the necessary ingredients in the packages.csv file.
And with the tools we used so far,
it shouldn't be hard to create a pipeline which:
groups on the package column,
then aggregates the package count using the median method,
and finally sorts by the result of this aggregation.
open input/popcorn/output/packages.csv | group-by --to-table package | update items { $in.count | math median } | sort-by items
Of course, we'll have to be a little more careful while building this pipeline, and definitely resort to filtering like | first 1000 or similar during development,
since constantly running over 17 million lines through the pipeline would otherwise be a little too much for the machine (at least, definitely for my machine with 8GB of RAM).
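For example, a sampled variant for iterating on the pipeline (the 1000-row cutoff is arbitrary) could look like this:
open input/popcorn/output/packages.csv | first 1000 | group-by --to-table package | update items { $in.count | math median } | sort-by items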
In fact, running the full command completely saturated my memory and made heavy use of swap so it wouldn't crash from running out. Of course, with so much swapping it also massively slowed down, so the above command took a little over 13 minutes to complete on my system.
Here's the result of all that number crunching:
| package | items |
|---|---|
| smartmontools | 25.0 |
| psmisc | 25.0 |
| base-system | 26.0 |
| ntfs-3g | 26 |
| void-repo-multilib | 27 |
| xorg-minimal | 28.0 |
| lvm2 | 29 |
| unzip | 29 |
| base-devel | 30 |
| void-repo-nonfree | 31.0 |
| neofetch | 33.0 |
| lm_sensors | 35 |
| zip | 42 |
| xmirror | 42 |
| socklog-void | 48.0 |
So, what does that tell us? I think there are a few interesting observations to be made here.
First, remember that we are looking at the median number of installations of each package over the whole time period. So even if a package was slow to get going, with a few days of only a single user, it can still show up here. Similarly, however, if a package had one or multiple periods of intense use but is more erratic in its overall usage pattern, this will not be reflected here.
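To make that concrete with a toy example: one short burst of popularity barely moves the median, while it pulls the average up considerably:
[1 1 1 1 50] | math median    # => 1
[1 1 1 1 50] | math avg       # => 10.8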
Second, we are looking at the number of installations,
so the daily report of who has this package installed on their system.
The most-installed package here is socklog-void, which makes sense, as it is the main package suggested in the documentation.
The high prevalence of xmirror is a little more surprising to me,
though it is, once again, the suggested method of changing your installation's repository mirrors.
zip being ahead of both base-system and base-devel is somewhat amusing to me,
as is the latter also being ahead of the former.
But overall I think this distribution of packages makes sense, as they all describe long-lived utility programs which any user of a distro may find useful (as opposed to more focused programs such as design software like gimp or text editors like neovim).
With one curious exception:
neofetch sits in the 5th spot of the list,
which is a giant surprise to me.
Personally, I don't use a fetch-like program,
as I think it just adds clutter to the terminal.
But I am truly surprised at the number of people who have it installed. I suppose it makes sense in a way: you install it once, for a showcase or to check your system at a glance, and then never uninstall it since it is just so unobtrusive. Still, it surprises me greatly.
Conclusion
This was a fun first excursion into the package statistics of Void Linux. As I said at the outset, I hope to have a more detailed article out at some point which looks at some of the changes over time a little more visually.
And I think it also really shows the power ---
and the limitations ---
of nushell.
I could quickly switch between a multitude of data sources,
and my data cleaning and transformation tools remained the same.
The mental model behind the operations is also much more akin to data-oriented workflows and tools like Python Pandas, SQL, or even R,
which I think is a boon when first introducing the idea of using the shell to more data-oriented folks.
However, I also stumbled onto the edges of what is possible with the shell on my machine.
There may be approaches that make use of data streaming which I haven't discovered,
but running transformations on the giant packages data nearly brought my machine to its knees,
and for me this would currently still be much better accomplished with Python Polars.
In conclusion, use nushell for the right purposes:
the quick turn-around of exploring medium-sized datasets,
or taking a first look into parts of large datasets,
while always staying flexible and having the full power of an interactive shell at your fingertips.
Once you wrap your head around the more functional approach to how data streams through your pipelines (and I'm still in the process of doing so),
it just becomes plain fun to explore all manner of datasets.
- ↩︎
- ↩︎
-
For your own analysis it will of course make more sense to work on the full file; I am just restricting it to the first five rows for easier display in this article. For the more involved data grouping below, I often switch to the full file as well. ↩︎
-
And it did trip me up for quite a while when crafting the pipeline. Now I know the usefulness of the
compact command, but just for the record: if you ever receive a cannot find column 'command' error even though that column should be there, it may mean the column contains null values which still have to be filtered out. In our case, those rows did not have the column entry since we parsed the 'empty' command earlier. Now I know. ↩︎
-
It is one of the reasons why I switched from Arch Linux to Void Linux, in fact. While both provide rolling updates, I did not need the constant bleeding edge for all packages on my system, and running an update after a little while on Arch Linux always presented one with (literally) hundreds of package updates, often already after only a couple weeks. ↩︎