Add top packages and conclusion

2025-11-20 17:57:41 +01:00 · 053dbc397d
commit 053dbc397d
parent 2f6a7c9af6
1 changed files with 102 additions and 0 deletions
--- a/index.md
+++ b/index.md
@@ -415,3 +415,105 @@ Without skipping ahead too much, this makes sense to me looking at the wider pic
 as the `popcorn` statistics gathering was introduced in the middle of kernel 4's existence,
 and we are not yet anywhere near the end of the kernel 6 life-span,
 so version 5 probably had the most opportunity to have long-running installations.
+
+## Top packages
+
+Lastly, let's answer one more question:
+Which packages have the highest _median_ daily installation counts across the whole period?
+
+This will be a little easier again, since
+we have all the necessary ingredients in the `packages.csv` file.
+And with the tools we used so far,
+it shouldn't be hard to create a pipeline which:
+groups on the `package` column,
+then aggregates the package `count` using the `median` method,
+and finally sorts by the result of this aggregation.
+
+```nu
+open input/popcorn/output/packages.csv | group-by --to-table package | update items { $in.count | math median } | sort-by items
+```
+
+Of course, we'll have to be a little more careful while building this pipeline and _definitely_ resort to filtering like `| first 1000` or similar,
+since constantly running over 17 million lines through the pipeline would otherwise be a little too much for the machine (at least for mine, with 8GB of RAM).
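+
+As a rough sketch, a trimmed-down version for iterating on the pipeline could look like the following.
+The cut-off of 1000 rows is an arbitrary choice, and the medians it produces only help with checking the _shape_ of the output, not the actual numbers.
+
+```nu
+# Same pipeline as above, but only the first 1000 rows reach the grouping step.
+# The row limit is an assumption for quick experiments; remove it for the real run.
+open input/popcorn/output/packages.csv | first 1000 | group-by --to-table package | update items { $in.count | math median } | sort-by items
+```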
+
+In fact, running the full command completely saturated my memory and leaned heavily on swap to keep from crashing.
+Of course, with so much swapping this also massively slowed down the process, so the above command took a little over 13 minutes to complete on my system.
+
+Here's the result of all that number crunching:
+
+| package | items |
+| --- | --- |
+| smartmontools | 25.0 |
+| psmisc | 25.0 |
+| base-system | 26.0 |
+| ntfs-3g | 26 |
+| void-repo-multilib | 27 |
+| xorg-minimal | 28.0 |
+| lvm2 | 29 |
+| unzip | 29 |
+| base-devel | 30 |
+| void-repo-nonfree | 31.0 |
+| neofetch | 33.0 |
+| lm_sensors | 35 |
+| zip | 42 |
+| xmirror | 42 |
+| socklog-void | 48.0 |
+
+So, what does that tell us?
+I think there are a few interesting observations to be made here.
+
+First, remember that we are looking at the _median_ number of installations of packages over the _whole_ time period.
+So, even if a package was slow to get going with a few days of only having a single user,
+it shows up here.
+Similarly, however, if a package had one or multiple periods of intense use but is more erratic in its overall usage pattern,
+this will not be reflected here.
+
+Second, we are looking at the _number of installations_,
+that is, the daily reports of who has a given package installed on their system.
+The most-installed package here is `socklog-void`, which makes sense, as it is the main suggested package in the [documentation](https://docs.voidlinux.org/config/services/logging.html).
+The high prevalence of `xmirror` is a little more surprising to me,
+though it is, once again, the [suggested method](https://docs.voidlinux.org/xbps/repositories/mirrors/changing.html?highlight=xmirror#xmirror) of changing your installation's repository mirrors.
+
+`zip` being ahead of both `base-system` and `base-devel` is somewhat amusing to me,
+as is the latter also being ahead of the former.
+
+But overall I think this distribution of packages makes sense, as they all describe long-lived utility programs which _any_ user of a distro may find useful (as opposed to more focused programs such as design software like `gimp` or text editors like `neovim`).
+With one curious exception:
+`neofetch` sits in 5th place,
+which is a giant surprise to me.
+
+Personally, I don't use a `fetch`-like program,
+as I think it just adds clutter to the terminal.
+
+But I am truly surprised at the number of people having it installed.
+I suppose it makes sense if you install it once,
+for a show-case or to check your system at a glance,
+but then never uninstall it since it is just so unobtrusive.
+Nevertheless, this surprises me greatly.
+
+## Conclusion
+
+This was a fun first excursion into the package statistics of Void Linux.
+As I said at the outset, I hope to have a more detailed article out at some point which looks at some of the changes over time a little more visually,
+but this was a lot of fun.
+
+And I think it also really shows the power ---
+and the limitations ---
+of `nushell`.
+I could quickly switch between a multitude of data sources,
+and my data cleaning and transformation tools remained the same.
+
+The mental model behind its operations is also much closer to data-oriented workflows and tools like `Python Pandas`, `SQL` or even `R`,
+which I think is a boon when first introducing the idea of using the shell to more data-oriented folks.
+
+However, I also stumbled onto the edges of what is possible with the shell on my machine.
+There may be approaches that make use of data streaming which I haven't discovered,
+but running transformations on the giant packages dataset nearly brought my machine to its knees,
+and, for me, is currently still much better accomplished with `Python Polars`.
+
+In conclusion, use `nushell` for the right purposes:
+the quick turn-around of exploring medium-sized datasets,
+or taking a first look into parts of large datasets,
+while always staying flexible and having the full power of an interactive shell at your fingertips.
+Once you wrap your head around the more functional approach to how data streams through your pipelines (and I'm still in the process of doing so),
+it just becomes plain _fun_ to explore all manner of datasets.