Add top packages and conclusion

Marty Oehme 2025-11-20 17:57:41 +01:00
parent 2f6a7c9af6
commit 053dbc397d
Signed by: Marty
GPG key ID: 4E535BC19C61886E

index.md

@@ -415,3 +415,105 @@ Without skipping ahead too much, this makes sense to me looking at the wider pic
as the `popcorn` statistics gathering was introduced in the middle of kernel 4's existence,
and we are not yet anywhere near the end of the kernel 6 life-span,
so version 5 probably had the most opportunity to have long-running installations.
## Top packages
Lastly, let's answer one more question:
Which packages have the highest _median_ daily installation counts across the whole period?
This will be a little easier again ---
we have all the necessary ingredients in the `packages.csv` file.
And with the tools we used so far,
it shouldn't be hard to create a pipeline which:
groups on the `package` column,
then aggregates the package `count` using the `median` method,
and finally sorts by the result of this aggregation.
```nu
open input/popcorn/output/packages.csv | group-by --to-table package | update items { $in.count | math median } | sort-by items
```
We'll have to be a little more careful while building this pipeline and _definitely_ limit the input with something like `| first 1000` or similar,
since constantly running over 17 million lines through the pipeline is a little too much for the machine otherwise (at least for my machine with 8GB of RAM).
In fact, running the full command completely saturated my memory and spilled heavily into swap so it wouldn't crash from running out.
All that swapping also massively slowed down the process, so the above command took a little over 13 minutes to complete on my system.
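
To make the "build it on a small sample first" idea concrete, here is a minimal sketch of the same pipeline run on a handful of rows.
The inline records and their values are made up for illustration and only stand in for a slice of `packages.csv`; with the real file, the `$sample` part would instead be something like `open input/popcorn/output/packages.csv | first 1000`.

```nu
# Hypothetical stand-in for a few rows of packages.csv (values are made up).
let sample = [
    {package: "zip", count: 40}
    {package: "zip", count: 44}
    {package: "unzip", count: 29}
    {package: "zip", count: 42}
    {package: "unzip", count: 31}
]

# Same three steps as the full pipeline: group, aggregate with the median, sort.
let medians = (
    $sample
    | group-by --to-table package
    | update items { $in.count | math median }
    | sort-by items
)

# Show the per-package medians, smallest first.
$medians
```

Since the sample is tiny, each step can be inspected on its own (for example by cutting the pipeline off after `group-by`) before the limit comes off for the full run.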
Here's the result of all that number crunching:
| package | items |
| --- | --- |
| smartmontools | 25.0 |
| psmisc | 25.0 |
| base-system | 26.0 |
| ntfs-3g | 26 |
| void-repo-multilib | 27 |
| xorg-minimal | 28.0 |
| lvm2 | 29 |
| unzip | 29 |
| base-devel | 30 |
| void-repo-nonfree | 31.0 |
| neofetch | 33.0 |
| lm_sensors | 35 |
| zip | 42 |
| xmirror | 42 |
| socklog-void | 48.0 |
So, what does that tell us?
I think there are a few interesting observations to be made here.
First, remember that we are looking at the _median_ number of installations of packages over the _whole_ time period.
So, even if a package was slow to get going with a few days of only having a single user,
it shows up here.
Similarly, however, if a package had one or multiple periods of intense use but is more erratic in its overall usage pattern,
this will not be reflected here.
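
A tiny sketch of why short bursts barely register in a median, using made-up daily counts for a single hypothetical package:

```nu
# Hypothetical daily counts: mostly steady, with a two-day burst of heavy use.
let daily = [3 3 4 3 90 95 3]

# The median is the middle value of the sorted counts, so the burst is ignored.
let median = ($daily | math median)

# The average, by contrast, is pulled strongly upwards by the two burst days.
let mean = ($daily | math avg)

{median: $median, mean: $mean}
```

Here the median stays at the typical value of 3 while the average climbs to almost 29, which is why erratic packages with brief peaks do not rank highly in the table above.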
Second, we are looking at the _number of installations_,
so the daily report of who has this package installed on their system.
The most-installed package here is `socklog-void`, which makes sense as it is the main logging package suggested in the [documentation](https://docs.voidlinux.org/config/services/logging.html).
The high prevalence of `xmirror` is a little more surprising to me,
though it is, once again, the [suggested method](https://docs.voidlinux.org/xbps/repositories/mirrors/changing.html?highlight=xmirror#xmirror) of changing your installation's repository mirrors.
`zip` being ahead of both `base-system` and `base-devel` is somewhat amusing to me,
as is the latter also being ahead of the former.
But overall I think this distribution of packages makes sense, as they all describe long-lived utility programs which _any_ user of a distro may find useful (as opposed to more focused programs such as design software like `gimp` or text editors like `neovim`).
With one curious exception:
`neofetch` is in 5th place among all packages,
which is a giant surprise to me.
Personally, I don't use a `fetch`-like program,
as I think it just adds clutter to the terminal.
But I am truly surprised at the number of people having it installed.
I suppose it makes sense in the way of installing it once,
for a show-case or to check your system at a glance,
but then not uninstalling it since it is just so unobtrusive.
Nevertheless, this surprises me greatly.
## Conclusion
This was a fun first excursion into the package statistics of Void Linux.
As I said at the outset, I hope to have a more detailed article out at some point which looks at some of the changes over time a little more visually,
but this was a lot of fun.
And I think it also really shows the power ---
and the limitations ---
of `nushell`.
I could quickly switch between a multitude of data sources,
and my data cleaning and transformation tools remained the same.
The mental model behind operations is also much more akin to data-oriented workflows and tools like `Python Pandas`, `SQL` or even `R`,
which I think is a boon when first introducing the idea of using the shell to more data-oriented folks.
However, I also stumbled onto the edges of what is possible with the shell on my machine.
There may be approaches that make use of data streaming which I haven't discovered,
but running transformations on the giant packages dataset nearly brought my machine to its knees,
and for me this is currently still much better accomplished with `Python Polars`.
In conclusion, use `nushell` for the right purposes:
the quick turn-around of exploring medium-sized datasets,
or taking a first look into parts of large datasets,
while always staying flexible and having the full power of an interactive shell at your fingertips.
Once you wrap your head around the more functional approach to how data streams through your pipelines (and I'm still in the process of doing so),
it just becomes plain _fun_ to explore all manner of datasets.