---
title: "Voidlinux popcorn"
subtitle: "Analysis of voidlinux package and kernel statistics"
toc: true
---
This notebook analyses the daily package repository statistics files,
colloquially known as 'popcorn' files, that are generated by the Void Linux
tool [`PopCorn`](https://voidlinux.org/packages/?arch=x86_64&q=popcorn) and
uploaded by users who have opted in to share them.

The tool collects a variety of statistics from the user, which it anonymizes
(you only get a UUID, to ensure your stats are only uploaded once, and
that's it) and mixes with all other daily uploaded statistics on the server.
The result is an overview of the package and kernel ecosystem of the Void Linux
distribution collected from a multitude of unique installations.

Before we get into the pretty graphs I would like to point out that you too can
participate in this collection, and thereby help the distribution maintainers
know which packages are more relevant or, perhaps more importantly, less relevant
for end users.

To upload your own statistics anonymously, simply `xbps-install PopCorn` and
enable its service to run in the background with
`ln -s /etc/sv/popcorn /var/service`.
```{python}
# | echo: false
import os
from typing import Any, Awaitable, Mapping

import lets_plot as lp
import polars as pl
from lets_plot import LetsPlot
from marimo import Cell


# Run a marimo cell and return its output and definitions.
# Asynchronous cells are not supported here.
def run_cell(cell: Cell) -> tuple[Any, Mapping[str, Any]]:
    ret = cell.run()
    if isinstance(ret, Awaitable):
        raise NotImplementedError
    else:
        output, defs = ret
    return (output, defs)


# Figure dimensions come from Quarto's environment, with fallbacks.
fig_width, fig_height = (
    int(os.getenv("QUARTO_FIG_WIDTH") or 7),
    int(os.getenv("QUARTO_FIG_HEIGHT") or 5),
)


# Run a plot cell and apply the common figure size.
def pplot(cell: Cell) -> Any:
    outp, _ = run_cell(cell)
    return (
        outp
        # + lp.flavor_darcula()
        + lp.ggsize(width=fig_width * 1000, height=fig_height * 1000)
    )


LetsPlot.setup_html()
```
## Daily statistics file size

Before we take an in-depth look at the packages, let's quickly get an overview
of the stats files themselves. These are the raw `JSON` files that we are using
throughout the rest of the article, reporting installed packages and kernels,
and their respective versions.

A look at the overall file size of each daily statistics file over time does
not necessarily reveal changes in the absolute use of packages (e.g. `rsync`
being installed more or less often). Whether a package has been installed once
or 100 times, the file size does not change drastically. Instead it increases
much more when both `rsync` and `rclone` are installed, or when a variety of
different versions of one package are installed. The same goes for different
versions of the kernel.

An increase in file size here mainly suggests an increase in the 'breadth' of
files on offer in the repository, whether that be a _wider variety_ of packages
or more different package _versions_ that people are interested in, and that
the community chooses to use.

So while the overall number of packages and installations --- we'll get to those
--- gives a general estimate of the interest in the distribution, we see at a
glance a more 'user-choice'-aligned view of how many aisles the buffet offers,
and how many of those the users are eating from.
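To make the point about breadth versus counts a little more concrete, here is a minimal sketch using only the Python standard library. The three dictionaries are made-up stand-ins that mimic just the `Packages` part of a daily file (the real files contain more fields), so the byte counts are purely illustrative.

```python
import json

# Hypothetical miniature reports; only a "Packages" mapping is mimicked here.
few_installs = {"Packages": {"rsync": 1}}
many_installs = {"Packages": {"rsync": 100}}
more_breadth = {"Packages": {"rsync": 1, "rclone": 1}}


def report_size(report: dict) -> int:
    """Return the size in bytes of a report serialized as compact JSON."""
    return len(json.dumps(report, separators=(",", ":")).encode("utf-8"))


size_few = report_size(few_installs)
size_many = report_size(many_installs)
size_breadth = report_size(more_breadth)

# A hundred times more installs only adds two digits to the file...
print(size_many - size_few)
# ...while one additional distinct package adds a whole new key-value pair.
print(size_breadth - size_few)
```

The same reasoning applies to the per-version lists: every additional version in use adds a new key, while additional installs of an existing version only change a number.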
```{python}
# | echo: true
from notebooks.popcorn import plt_filesize
pplot(plt_filesize)
```
As we can see, the difference over time is massive. Especially early on,
between 2019 and the start of 2021, the number of different packages and
package versions used grew rapidly, with the pace picking up once again
starting in 2023.

From a reported file size of around 50 KB in the first days, we had easily
tripled to over 150 KB per daily report before the end of 2019. Nowadays we
have reached just about 400 KB per daily report, roughly 8 times the size at
the beginning in 2018.

There are a few outlier days with a size of 0 KB on the server, which we had to
remove from the data. In all likelihood, those days were not reported correctly
or there was some kind of issue on the backend, so the stats for those days are
lost.

<!-- TODO: is this still true? -->

We take a look at the missing days,
among other things, at the end of this article.

There are also a few days where the modification date of the file does not
correspond to the represented statistical date, but those are kept. This rather
points to certain times when the files have been moved on the backend,
recreated externally, or otherwise externally modified, but does not mean the
data contained are bad.

Let's not over-emphasize the results here: it is a very coarse view of the
packages and kernels underneath. Still, the shape of the curve above is a good
indication of what awaits us in other stats: 2019 to 2021 were massive years
of growth with some slow-down in 2021. From 2022 onwards the growth picks back
up again, if at a more mellow pace.

## Package statistics

Now that we have an idea of how the overall reported sizes in the distribution
have changed over time, let's focus on the actual package statistics.

The popcorn files contain two pieces of information we're interested in: the
number of installs per package (e.g. how many people have `rsync` installed)
and the number of unique installs (i.e. how many people provide their
statistics). We will look at both of these in turn.
```{python}
from notebooks.popcorn import plt_weekly_packages
pplot(plt_weekly_packages)
```
The number of packages installed overall rises strongly until early 2021, when
it stagnates a little before rising more slowly again afterwards. The pattern
strongly mirrors the curve we saw before for the daily file size. There is a
curious dip visible in the data in early 2021, which seems to say fewer packages
were installed during most of 2022 compared to 2021.

The graph above traces the _absolute_ number of package installations for each
week during the data collection period. That is, a simple sum of the number
of all currently installed packages for each day.

So to figure out one possible reason for the dip, let's turn to the daily
unique uploads, in which we can see a similar pattern, though even more strongly
pronounced.
```{python}
from notebooks.popcorn import plt_unique_installs
pplot(plt_unique_installs)
```
This graph similarly shows the _absolute_ number, this time of unique Void
Linux installations counted for each day. Of course, these are only the
_reported_ installations (since we don't know about unreported ones), as can be
seen in the overall small number of between 100 and 120 installations.

Unique installations rise sharply until early 2020. Then they not only stagnate
but shrink for the next three years. It is only from early 2023 onwards that
the numbers recover and begin rising again slowly.

We also have one day, 05 July 2024, which has _significantly_ fewer unique
uploads (only 36) than all the other days around it. It shows up in the graph
as a single week dipping down in 2024, but would look more egregious in a
daily accumulation. I have no clue if something happened to data collection or
everybody collectively decided to leave their PC offline just for that day, but
the numbers are back to normal the day after.[^independence-day]

[^independence-day]:
    I suppose one interpretation would be people taking their
    4th of July celebrations very seriously, and thus not being present in the
    statistics for the day after. However, I am not sure if this would reflect so
    strongly in data collection, and it additionally presupposes that the data
    collected predominantly stem from the United States. Lastly, one would
    expect this to have a similar effect every year if that were the case.

This curve also goes some way to explaining the dip in overall package
installations seen previously. When there are fewer people uploading their daily
statistics, the absolute number of package installations will be somewhat
reduced as a result, unless for some reason the remaining people all of a
sudden start having many more packages installed.

So this could be one reason for the dip in reported package ownership. The
decrease in daily reports maps relatively cleanly onto the dip in absolute
packages, and makes sense from a conceptual standpoint: fewer reports mean
fewer overall reported packages.

Next, let's verify that hunch by actually looking at the installed packages
_per user_ for each day.
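The per-user figure is simply the total number of reported package installations on a day divided by the number of unique installations reporting that day. Here is a minimal sketch of that normalization, assuming a record shaped like the daily files (`UniqueInstalls` and `Packages` keys, see the appendix) and using made-up numbers. The helper `packages_per_user` is hypothetical, and the notebook itself may compute this differently.

```python
# Made-up daily report in the shape of a PopCorn file (values are illustrative).
daily_report = {
    "UniqueInstalls": 2,
    "Packages": {"PopCorn": 2, "alsa-utils": 2, "acpi": 1, "xurls": 1},
}


def packages_per_user(report: dict) -> float:
    """Average number of installed packages per reporting installation."""
    total_installs = sum(report["Packages"].values())
    unique_installs = report["UniqueInstalls"]
    if unique_installs == 0:
        # Days without any report cannot be normalized.
        raise ValueError("no unique installs reported for this day")
    return total_installs / unique_installs


print(packages_per_user(daily_report))
```

Dividing by the number of reports is what removes the effect of fewer people uploading: the absolute total can drop while the per-user average stays flat.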
```{python}
from notebooks.popcorn import plt_pkg_relative
pplot(plt_pkg_relative)
```
Combining both previous stats to look at the installed packages at a more
individual level per user, I think we see our hunch confirmed. There is no
similarly strong dip in the relative package ownership as there was in the
absolute package numbers.

Indeed, with the exception of a short, more rapid increase in individual package
ownership in 2019, we see a much more stable increase in per-user packages than
in the absolute numbers, and no similarly big slump over three years.

Instead we see different patterns of rises and dips. Both at the beginning of
2020 and at the beginning of 2024, we can see first a strong rise and then an
equally strong fall in the average number of user-owned packages in short
succession.
This could point to one of multiple options:

Perhaps users have been collectively trying out more new packages over the
end-of-year holidays or with the start of the new year. 'New year, new workflow'
could presumably be something a few users decide to do, and this may be a
reflection of it. Or, equally, using the new year to learn new software. The
subsequent dip would then mean an end to the period of trying out new stuff, or
adopting the new packages and dropping the old ones.

At the same time, with the relatively limited absolute number of installations,
it is also quite likely that the representation is skewed by a single user or a
couple of users having a much larger package ownership than everybody else. This
may signify new users checking out Void Linux and downloading a large variety
of packages in the process.

Perhaps a similar pattern is visible in the higher number of packages per user
in 2019. With even fewer unique daily reports (between 20 and 60 for the year),
differences in single users' package counts show up much more drastically in this
graph. So, one possibility for the rapid decrease followed by a more linear
increase is the 'balancing' of package ownership across the (wider) reported
community.

Want to know how many packages you currently have installed? Find out with a
quick `xbps-query -m | wc -l` to count all your explicitly installed packages.
I currently have 234, so I'm below average for this cohort (indeed, I am much
more of an average 2021 Void Linux kid, it appears).

<!-- TODO: still accurate? -->
For an additional breakdown of the absolute numbers of packages on systems by
weekday and month of the year instead of over time, have a look at the Appendix
below.

An interesting trend is visible toward the end of the timeline window, with a
rapid decline in package numbers per user starting in early 2025. It is too
early to clearly see if this is just variability or an actual trend in the
data, but it is very interesting to see.

Beyond pure installation numbers, let's also take a look at the actual
packages which take the top-installed spots on users' systems.

<!-- TODO: perhaps the pre-made ISOs play a role, especially Feb2024? no hang on feb 2025 -->
```{python}
from notebooks.popcorn import plt_top_packages
pplot(plt_top_packages)
```
The top packages are, unsurprisingly,
the `base-system` and `xtools` packages, followed by `wget`, `htop` and
`rsync`.[^popcorn-removal]

[^popcorn-removal]:
    I have removed the PopCorn package itself from the data.
    Funnily enough, since _everybody_ who is represented in the data has to have
    PopCorn installed or the data wouldn't be collected in the first place, if we
    extrapolate from the collected data naively this means more people have PopCorn
    installed than the base-system. Of course, viewed over the majority of Void
    Linux installations this is hogwash. We have the absolute numbers and only
    around 150 people ever have PopCorn installed. But it nicely represents some of
    the danger of over-interpreting the results before us without also reflecting
    on sample bias.

In my opinion the list of top packages reflects the technical audience of Void
Linux and does not hold too many surprises. Almost everyone uses `socklog` and
most people have the `nonfree` repo enabled. `firefox` is the most installed
browser, and everyone at least has `alsa-utils` installed, even if they're not
using `alsa` as their primary sound provider.

I am somewhat surprised by the prevalence of `git`, though this package is in
turn required by many others, among them some of the other top packages such as
`xtools`, so it does make sense.

Along with the prevalence of `tmux` (even above `zip`!), it does once again
speak to the technical nature of Void Linux users, however, at least for those
opting into data collection.

Almost everyone keeps the `base-system` package installed but, importantly,
not _everyone_. The package is not represented for each installation, with a
sizable chunk of people having removed it.

Lastly, I am also pleasantly surprised by the appearance of `gimp` in the top
20 packages.

The 'rarest' 20 packages show a snapshot of packages which have been installed
by _someone_ but only that single someone. In other words, there are quite a
few packages which nobody in the sample has installed, but those are not
represented here. Instead, the rare packages tend to show those that somebody
built themselves, or only tried briefly. They provide more of a snapshot of the
kind of custom shenanigans users get up to within the `xbps` package system, or
could be viewed as a potential 'wishlist' of packages not yet officially
available.

Let's turn to the 'distribution' of package installations.
```{python}
from notebooks.popcorn import plt_package_distribution
pplot(plt_package_distribution)
```
Visualized above is the package installation frequency (or density) distribution.
On the Y-axis we see the number of packages while on the X-axis we see the number of installations.
What this means is that we see _how often_ packages tend to be installed,
and where the majority of packages is grouped.[^density-approximation]

[^density-approximation]:
    In the package density count above, since we are
    accumulating over the absolute numbers of all installations of all users, the
    overall high numbers are really _high_, i.e. above 150,000. Since we are
    sorting the package counts into a finite number of bins to make visualizing it
    possible, the lowest bin overshoots the 0-mark and we get an estimation of
    negative installation counts. Of course, this is not possible: no package in the
    data has been installed a negative number of times --- to my knowledge!

_Many_ packages are installed 0 to 10 times.
Some packages are installed above 10 times,
fewer yet above 100 times,
and so on,
and this distribution is what we see here.
```{python}
from notebooks.popcorn import plt_top_packages
_, defs = plt_top_packages.run() # pyright: ignore
df_pkg_dl = defs["df_pkg_dl"]


def get_num(df: pl.LazyFrame) -> int:
    # Number of non-null values in the first column of the lazy frame.
    return df.count().collect(engine="streaming").item(0, 0)


one_ten_installs = df_pkg_dl.sort("count", descending=False).filter(
    (pl.col("count") >= 1) & (pl.col("count") < 10)
)
ten_twenty_installs = df_pkg_dl.sort("count", descending=False).filter(
    (pl.col("count") >= 10) & (pl.col("count") < 20)
)
twenty_thirty = df_pkg_dl.sort("count", descending=False).filter(
    (pl.col("count") >= 20) & (pl.col("count") < 30)
)
thirty_plus = df_pkg_dl.sort("count", descending=False).filter((pl.col("count") >= 30))
```
To be more precise with the numbers:
There are `{python} f"{get_num(one_ten_installs):,}"` packages which have between one
and nine installations in the data, `{python} f"{get_num(ten_twenty_installs):,}"`
packages with between ten and 19 installations, and
`{python} f"{get_num(twenty_thirty):,}"` packages with between 20 and 29 installations.
`{python} f"{get_num(thirty_plus):,}"` packages have 30 or more installations.

For now, these are the explorations I have done for the package data collected.
I think it is interesting to see, especially the evolution of package installations over time,
and per user,
as well as getting a glimpse of the most used packages in the sample.

But there are yet more things to explore in the statistics overall.

## Kernel Analysis

Beyond package numbers, the data also encapsulate information about the Linux
kernels used by Void Linux users.
The files report the exact kernel version users are running, including the major version,
minor versions, and any suffixes as well.

For example, there are many reports containing the `4.19.0-9-amd64` kernel, or
some containing the `6.1.53-1-lts` kernel, or `6.11.2-asahi-6.11.2-1_4`. These
are 'extraordinary' kernels in my opinion, and they do not follow clear naming
patterns. For the purposes of the following visualizations any such suffixes
have been cut off, looking only at the versioning of the main kernels
themselves.

Let's start by looking at the prevalence of the different major versions.
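Cutting off the suffixes mentioned above can be sketched with a small regular expression. This is only an illustration of the idea, assuming every kernel string starts with dot-separated numbers. The helper name `split_kernel_version` is hypothetical and the notebook's actual parsing may well differ.

```python
import re
from typing import Optional

# Leading "major.minor[.patch]" part of a kernel string; anything after it
# (e.g. "-9-amd64" or "_1") is treated as a suffix and ignored.
KERNEL_RE = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+))?")


def split_kernel_version(kernel: str) -> Optional[tuple[int, int, int]]:
    """Return (major, minor, patch), or None if the string does not parse."""
    match = KERNEL_RE.match(kernel)
    if match is None:
        return None
    major, minor, patch = match.groups()
    # A missing patch level (as in "6.1") is counted as 0.
    return int(major), int(minor), int(patch or 0)


examples = ["4.19.0-9-amd64", "6.1.53-1-lts", "6.11.2-asahi-6.11.2-1_4", "4.16.6_1"]
for name in examples:
    print(name, "->", split_kernel_version(name))
```

Only the first element of the returned tuple is needed for the major version comparison, the rest is kept in case a finer breakdown is of interest.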
```{python}
from notebooks.popcorn import plt_kernel_versions
pplot(plt_kernel_versions)
```
This is an accumulation of the three major versions used during the collected timeline,
over the _whole_ time as absolute numbers.

When looking at the kernel versions used, we see a very strong jump between major kernel version
4 and major kernel version 5, with version 4 being significantly less prevalent in the data.

Of course, this makes sense from a release standpoint: kernel version 5.0 was
released in March 2019, less than a year after the start of data collection.[^kernel-releases]
Additionally, as we established above, this was also the time of the fewest
unique data reports, so the absolute number of kernel 4 reports is even
smaller.

[^kernel-releases]:
    Data collection began in May 2018.
    All information on the kernel release timelines is taken
    from the nicely comprehensive _Linux Kernel Version History_ Wikipedia page:
    <https://en.wikipedia.org/wiki/Linux_kernel_version_history>.

Kernel version 5 still provides the dominant share of reported kernel versions,
but just barely. This makes sense since major version 6.0 was released in October 2022.
It has thus been roughly three and a half years of version 5 being the latest kernel,
and almost exactly three years of version 6 being the latest kernel.

Again, we have to keep the curve of unique installations in mind for absolute numbers like these:
Kernel 5 was released right as the massive increase in unique Void Linux installation reports happened,
and kernel 6 right after the report slump happened.
This, in all likelihood, accounts for the slight imbalance between the numbers,
and will shift over the coming months.
```{python}
from notebooks.popcorn import df_kernel_v99
_, defs = df_kernel_v99.run() # pyright: ignore
kernel_df_v99 = defs["kernel_df_v99"]
v99_num_rows = f"{kernel_df_v99.select(pl.len()).item()}"
# Single quotes inside the f-strings keep them valid on Python versions before 3.12.
v99_start_date = f"{kernel_df_v99.select('date').row(0)[0]}"
v99_end_date = f"{kernel_df_v99.select('date').row(-1)[0]}"
```
Just like with kernel suffixes, for this analysis we also had to exclude
`{python} v99_num_rows` rows which were apparently from the
future --- as they were running variations of major kernel version 99. In all
likelihood there is a custom compiled kernel version out there which reports its own
major version as 99. The strange version starts appearing on
`{python} v99_start_date` and shows up all the way until
`{python} v99_end_date`.

Let's turn to the actual adoption of kernels over time in the next visualization.
```{python}
from notebooks.popcorn import plt_kernel_timeline
pplot(plt_kernel_timeline)
```

```{python}
from datetime import date
from notebooks.popcorn import plt_kernel_timeline
_, defs = plt_kernel_timeline.run() # pyright: ignore
weekly_kernel_df = defs["weekly_kernel_df"]

last_kernel4: date = weekly_kernel_df.filter(pl.col("major_ver") == "4")[-1][
    "date"
].item()
first_kernel5: date = weekly_kernel_df.filter(pl.col("major_ver") == "5")[0][
    "date"
].item()
last_kernel5: date = weekly_kernel_df.filter(pl.col("major_ver") == "5")[-1][
    "date"
].item()
```
A timeline analysis of the prevalent kernels in the data shows that new major
kernel versions are adopted relatively rapidly, with the majority of switches
occurring at roughly the same time.

This change is especially stark between major kernel versions 5 and 6, which
seem to have traded places in usage almost overnight. A reasonable speculation
for this rapid switch is that the `linux` kernel meta-package was pointed at
the new version at that time, so each update pulled the new kernel.

The first time that major version 5 of the kernel shows up is on
`{python} f"{first_kernel5}"`. From here, it took a long time for the last of the version 4
kernels to disappear. Interestingly, this roughly coincides with the big switch
between major version 5 and 6. The last time a major version 4 kernel is seen is on
`{python} f"{last_kernel4}"`, while the last major version 5 kernels still pop up. It would
seem, then, that the people still running kernel version 4 used the opportunity
of everybody switching to the stable version of 6 to also upgrade their
machines.

If we cautiously extrapolate a little from the data we have, it would seem
reasonable that the last remnants of kernel version 5 may be disappearing
around May or June 2026. A lot of course depends on the upstream kernel release
windows and the stability of the releases themselves. But barring any major
upheavals in the kernel releases (of a magnitude like the removal of
[bcachefs](https://en.wikipedia.org/wiki/Bcachefs)) or major stability issues,
this seems a reasonable assumption to me.

## Conclusion

That's it for the main look at the packages and kernel versions in use in the
Void Linux community, currently and in the past.

There are of course more observations to be made.
One that still interests me is the development of the dominant packages over time ---
were the top packages relatively static or did they evolve from others?

<!-- Another interesting analysis that comes to mind is the... [TODO:] -->

## Appendix: Odds and Ends

The above graphics are the main ones that I think could be useful, entertaining, or somewhere in between.
However, when exploring data, many more visualizations come to light.
Most of them are a little more 'boring' than the ones selected above,
but may still be of interest for technical deep-dives or more specific investigations.
They are collected here, in my pseudo-appendix to the main article.

### The PopCorn files

Let's have a closer look at the provided PopCorn statistics files themselves.

The files consist of a long list of packages which have been reported to the
central server that day, along with the number of package instances. The number
of unique installations from which these statistics are derived is given as a
sum. The files also contain the same list once again, but separated by
specifically installed versions of packages.

So if somebody has v0.9.1 and somebody else v0.9.3 instead, this list counts the
number of both packages with their versions separately. Another count is the
number of different kernels that have been used on that day, with their exact
kernel name including major version, minor version and any suffix.
```json
{
  "UniqueInstalls": 2,
  "Packages": {
    "ImageMagick": 1,
    "PopCorn": 2,
    "acpi": 1,
    "alsa-utils": 2,
    ...
    "xurls": 1,
    "youtube-dl": 1
  },
  "Versions": {
    "ImageMagick": {
      "6.9.9.40_1": 1
    },
    "PopCorn": {
      "0.2_1": 2
    },
    "acpi": {
      "1.7_3": 1
    },
    "alsa-utils": {
      "1.1.5_1": 1,
      "1.1.6_1": 1
    },
    ...
  },
  "XuKernel": {
    "4.16.6_1": 1
  }
}
```
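Reading such a file into flat rows is straightforward with the standard library. The sketch below assumes the structure shown above and uses a shortened, made-up report in which the elided entries are simply left out. The function name `flatten_report` is hypothetical, and the conversion actually used for this analysis may work differently.

```python
import json

# Shortened stand-in for one daily file, following the structure shown above.
raw = """
{
  "UniqueInstalls": 2,
  "Packages": {"PopCorn": 2, "acpi": 1, "alsa-utils": 2},
  "Versions": {
    "PopCorn": {"0.2_1": 2},
    "acpi": {"1.7_3": 1},
    "alsa-utils": {"1.1.5_1": 1, "1.1.6_1": 1}
  },
  "XuKernel": {"4.16.6_1": 1}
}
"""


def flatten_report(text: str) -> dict[str, list[tuple]]:
    """Turn one report into sorted lists of rows, one list per section."""
    report = json.loads(text)
    return {
        # One row per package: (package, count).
        "packages": sorted(report["Packages"].items()),
        # One row per package version: (package, version, count).
        "versions": sorted(
            (pkg, version, count)
            for pkg, versions in report["Versions"].items()
            for version, count in versions.items()
        ),
        # One row per kernel string: (kernel, count).
        "kernels": sorted(report["XuKernel"].items()),
    }


rows = flatten_report(raw)
for table, entries in rows.items():
    print(table, entries)
```

Adding the report date to each row would then allow stacking the rows of all days into one long table.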
When grouped by package and aggregated over all days, this results in a
table. For example, the following is the table for the package count list:
```{python}
from notebooks.popcorn import tab_pkg
outp, _ = tab_pkg.run() # pyright: ignore
outp
```
When taking a look at the file sizes of the PopCorn report files, we did so for
each day individually above. But we can also look at the cumulative growth
instead: here we just add up all the files reported so far for each day, and
show the resulting growth line.
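Such a running total is a plain cumulative sum. A minimal sketch with made-up daily sizes in KB (the real values come from the report files themselves):

```python
from itertools import accumulate

# Made-up daily file sizes in KB, ordered by date.
daily_sizes_kb = [50, 52, 55, 60]

# Each entry is the sum of all daily sizes up to and including that day.
cumulative_kb = list(accumulate(daily_sizes_kb))

print(cumulative_kb)
```

The last entry of the cumulative list is the total size of all reports collected so far.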
```{python}
from notebooks.popcorn import plt_filesize_cumulative
outp, _ = plt_filesize_cumulative.run() # pyright: ignore
outp
```
A cumulative view gives a less granular look at the individual daily changes but
provides a more macro-level view of how big the statistics have grown to be overall.
We can see that, as each individually reported day adds up to 400 KB nowadays, the
cumulative size is up to almost 700 MB currently.

### Packages monthwise and per weekday

Let's also look at the packages installed on systems for different time slices.
We'll start with a look at the packages per weekday.
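Grouping by weekday only needs the date of each daily report. Here is a rough sketch with made-up totals, assuming one (date, total packages) pair per day. Whether the notebook averages or sums per weekday is not stated here, so taking the mean is just one plausible choice.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

WEEKDAYS = [
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
]

# Made-up (report date, total reported package installations) pairs.
daily_totals = [
    (date(2024, 1, 1), 1000),  # a Monday
    (date(2024, 1, 2), 1100),  # a Tuesday
    (date(2024, 1, 8), 1200),  # the following Monday
    (date(2024, 1, 9), 1300),  # the following Tuesday
]

# Collect all totals that fall on the same weekday (weekday() is 0 for Monday).
by_weekday: dict[str, list[int]] = defaultdict(list)
for day, total in daily_totals:
    by_weekday[WEEKDAYS[day.weekday()]].append(total)

# Average per weekday, so weekdays with more observations are not inflated.
weekday_means = {name: mean(values) for name, values in by_weekday.items()}

print(weekday_means)
```

The same pattern works for months by keying on `day.month` instead of the weekday.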
```{python}
from notebooks.popcorn import plt_weekday_packages
pplot(plt_weekday_packages)
```
There is no significant difference between the individual weekdays, as we would
expect. It would be strange to have a specific day on which everybody decides to
install or uninstall packages.

That said, there is some slight variation, with Wednesdays generally having
slightly fewer total packages to boast than other days, especially Tuesdays, which
are slightly above the curve.

Let's just imagine everybody gets bored on Tuesday, installs a new package and
drops it again by Wednesday, along with a slew of other packages. Try-out
Tuesdays and Wastebin Wednesdays if you will.

Alright, but let's also take a look at the package numbers per month instead.
```{python}
from notebooks.popcorn import plt_month_packages
pplot(plt_month_packages)
```
Here we can see a bit more variation. First it is important to note that I have
removed the first months of 2018 prior to October from the analysis and cut off any
days after September 2025, to have only full years represented and avoid any
months being present more often than others.[^months-removed]

[^months-removed]: I chose to remove the first couple of months in the data, rather than
the most recent months, as fewer people were reporting data then, thus we have less
of a loss. Additionally, I presume people are more interested in current
statistics than older ones, just generally.

It is quite surprising to me just how much variation is visible in the results:
months from October to February have markedly fewer packages than the spring
and summer months. Are people generally more willing to use and try out new
packages in the summer? Alternatively, were any of the major usage dips taking
place during winter, while the increases in usage occurred more toward summer?

I have not delved deep into the interpretation of these questions, but it may
be interesting to do so. The last option, of course, is that the data itself,
the data collection or the analysis contains an error that I am not aware of.

### Missing days and dates

There are some missing days in the statistics.
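Finding such gaps amounts to comparing the dates that have a report against every calendar day in the covered range. A small sketch with made-up dates, using a hypothetical helper `missing_days` rather than the notebook's actual implementation:

```python
from datetime import date, timedelta

# Made-up set of days for which a report file exists.
reported_days = {
    date(2019, 1, 28),
    date(2019, 1, 29),
    date(2019, 2, 1),
}


def missing_days(days: set[date]) -> list[date]:
    """Return every calendar day between the first and last report that has no report."""
    if not days:
        return []
    first, last = min(days), max(days)
    span = (last - first).days
    every_day = (first + timedelta(days=offset) for offset in range(span + 1))
    return [day for day in every_day if day not in days]


print(missing_days(reported_days))
```

Note that this only finds gaps inside the covered range. Days missing before the first or after the last report cannot be detected this way.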
```{python}
from notebooks.popcorn import tab_missing_days

outp, _ = tab_missing_days.run() # pyright: ignore
outp
```
These missing days primarily occur at the end of January 2019, and
throughout 2025. However, with over 2600 days where the statistics _are_
available, these rows represent an insignificant issue for the overall data.

It would seem there was some kind of issue collecting or storing the collected
data at that point in 2019, which means a few days in a row are missing. This
skews absolute numbers for that week downwards, as well as any weekly averages
relying on this date range.

However, no significant visual differences stem from this fact, which is why it
is not called out in the main article. As it is, it is an interesting fact and,
were this a more rigorous investigation, perhaps worth taking into account
as biasing the result, but for our purposes not too bad.

### The code and the data

All the data used in the previous sections originally comes from
<https://popcorn.voidlinux.org>. The collected data, in CSV form, is available
from [this repository](https://git.martyoeh.me/datasci/ds-voidlinux-popcorn).

If you want to take a closer look at the functions creating the plots and
tables above, they are all available in [this repository]() in the
`notebooks/popcorn.py` file. It is the first project I have mostly written
using [marimo](https://github.com/marimo-team/marimo) instead of Jupyter, and I
have to say I really enjoy its workflow.

To figure out which function creates which plot, just look up the function name
in the relevant cell imported in the `popcorn.qmd` Quarto file in the
repository root, and search for it in the marimo notebook file.

Feel free to use any files or parts of this analysis for your own purposes. All
my own content, including this analysis, is released under
[CC BY-NC 4.0](http://creativecommons.org/licenses/by-nc/4.0/).

If you have any ideas for further analysis, don't hesitate to let me know.
If you spot any errors or there are other issues, of course also let me know.

I assume most people reading this will already be familiar with and using
[Void Linux](https://voidlinux.org/), but if not and any of this piques your
interest, feel free to take it for a spin. Don't forget to install
and enable
[PopCorn](https://github.com/void-linux/void-packages/tree/master/srcpkgs/PopCorn)
so that you too can contribute to the future statistics above ;-)