---
title: "Voidlinux popcorn"
subtitle: "Analysis of voidlinux package and kernel statistics"
toc: true
---

This notebook analyses the daily package repository statistics files,
colloquially known as 'popcorn' files, that are generated by the Void Linux
tool [`PopCorn`](https://voidlinux.org/packages/?arch=x86_64&q=popcorn) and
uploaded by users who have opted in to share with it.

The tool collects a variety of statistics from the user which it anonymizes
(you only get a UUID, to ensure your stats are only uploaded once, and
that's it) and mixes with all other daily uploaded statistics on the server.
The result is an overview of the package and kernel ecosystem of the Void Linux
distribution collected from a multitude of unique installations.

Before we get into the pretty graphs I would like to point out that you too can
participate in this collection, and thereby help the distribution maintainers
know which packages are more relevant or, perhaps more important, less relevant
for end users.

To upload your own statistics anonymously, simply `xbps-install PopCorn` and
enable its service to run in the background with `ln -s /etc/sv/popcorn
/var/service`.

```{python}
# | echo: false
import os
from typing import Any, Awaitable, Mapping

import lets_plot as lp
import polars as pl
from lets_plot import LetsPlot
from marimo import Cell


def run_cell(cell: Cell) -> tuple[Any, Mapping[str, Any]]:
    ret = cell.run()
    if isinstance(ret, Awaitable):
        raise NotImplementedError
    else:
        output, defs = ret
    return (output, defs)


fig_width, fig_height = (
    int(os.getenv("QUARTO_FIG_WIDTH") or 7),
    int(os.getenv("QUARTO_FIG_HEIGHT") or 5),
)


def pplot(cell: Cell) -> Any:
    outp, _ = run_cell(cell)
    return (
        outp
        + lp.flavor_darcula()
        + lp.ggsize(width=fig_width * 1000, height=fig_height * 1000)
    )


LetsPlot.setup_html()
```

## Daily statistics file size

Before we take an in-depth look at the packages, let's quickly get an overview
of the stats files themselves. These are the raw `JSON` files that we are using
throughout the rest of the article, reporting installed packages and kernels,
and their respective versions.

A look at the overall file size for each of the daily statistics files over
time does not necessarily reveal changes in the absolute use of packages (e.g.
`rsync` being installed more or less often). Whether a package has been
installed once or 100 times, the file size barely changes. Instead it grows
much more noticeably when both `rsync` and `rclone` are installed, or when a
variety of different versions of one package are installed. The same holds for
different versions of the kernel.

An increase in file size here mainly suggests an increase in the 'breadth' of
files on offer in the repository, whether that be a _wider variety_ of packages
or more different package _versions_ that people are interested in, and that
the community chooses to use.

So while the overall amount of packages and installations --- we'll get to those
--- gives a general estimate of the interest in the distribution, we see at a
glance a more 'user-choice'-aligned view of how many aisles the buffet offers,
and how many of those the users are eating from.
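
To make the 'breadth over counts' argument a little more tangible, here is a
minimal sketch with made-up numbers. It assumes a report is serialized as
compact JSON with a `Packages` mapping, as in the excerpt shown in the appendix;
the helper `report_size` is purely illustrative and not part of PopCorn.

```python
import json


def report_size(packages: dict[str, int]) -> int:
    """Size in bytes of a hypothetical report serialized as compact JSON."""
    text = json.dumps({"Packages": packages}, separators=(",", ":"))
    return len(text.encode("utf-8"))


# Made-up counts: the same package installed once vs. 100 times.
few_installs = report_size({"rsync": 1})
many_installs = report_size({"rsync": 100})
# A second package adds a whole new key/value pair to the file.
two_packages = report_size({"rsync": 100, "rclone": 100})

print("more installs add:", many_installs - few_installs, "bytes")
print("a new package adds:", two_packages - many_installs, "bytes")
```

A hundredfold increase in installs only adds a couple of digits, while every
additional package name or version string adds a full entry.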

```{python}
# | echo: true
from notebooks.popcorn import plt_filesize
pplot(plt_filesize)
```

As we can see, the difference over time is massive. Especially early on,
between 2019 and the start of 2021, the amount of different packages and
package versions used grew rapidly, with the pace also picking up once again
starting 2023.

From a reported file size of around 50KB in the first days, we had easily
tripled the size to over 150KB per daily report before the end of 2019.
Nowadays we have reached just about 400KB daily report size, roughly 8 times
the size at the beginning in 2018.

There are a few outlier days with a size of 0 KB on the server, which we had to
remove from the data. In all likelihood, those days were not reported correctly
or there was some kind of issue on the backend so the stats for those days are
lost.

<!-- TODO: is this still true? -->

We take a look at the missing days
among other things at the end of this article.

There are also a few days where the modification date of the file does not
correspond to the statistical date it represents, but those are kept. This more
likely points to times when the files were moved on the backend, recreated, or
otherwise modified externally, and does not mean the data they contain are
bad.

Let's not over-emphasize the results here: it is a very coarse view on the
packages and kernels underneath. Still, the shape of the curve above is a good
indication for what awaits us in other stats: 2019 to 2021 were massive years
of growth with some slow-down in 2021. From 2022 onwards the growth picks back
up again, if at a more mellow pace.

## Package statistics

Now that we have an idea of how the overall reported sizes in the distribution
have changed over time, let's focus on the actual package statistics.

The popcorn files contain two main pieces of information: the number of
installs per package (e.g. how many people have `rsync` installed) and the
number of unique installs (i.e. how many people provide their statistics). We
will look at both of these in turn.

```{python}
from notebooks.popcorn import plt_weekly_packages
pplot(plt_weekly_packages)
```

The number of packages overall strongly rises until early 2021,
when it stagnates a little before rising more slowly again afterwards.
The pattern strongly mirrors the curve we saw before for the daily filesize.

Turning to the daily unique uploads, we can see a similar pattern, though even
more strongly pronounced.

```{python}
from notebooks.popcorn import plt_unique_installs
pplot(plt_unique_installs)
```

Unique installations rise sharply until early 2020. Then they not only stagnate
but shrink for the next three years. It is only in early 2023 that the numbers
recover and begin rising again slowly.

We also have one day on 05 July 2024 which has significantly fewer unique
uploads (36 only) than all the other days around it. I have no clue if
something happened to data collection or everybody collectively decided to
leave their PC offline just for that day, but the numbers are back to normal
the day after.[^independence-day]

[^independence-day]:
    I suppose one interpretation would be people taking their
    4th of July celebrations very seriously, and thus not being present in the
    statistics for the day after. However, I am not sure if this would reflect so
    strongly in data collection, and it additionally pre-supposes the data
    collected predominantly stemming from the United States. Lastly, one would
    suppose this having a similar effect every year if that was the case.

This curve also goes some way to explaining the dip in overall package
installations previously. When there are fewer people uploading their daily
statistics the absolute number of package installations will be somewhat
reduced as a result, unless for some reason the remaining people all of a
sudden start having many more packages installed.

Let's check that out next, by actually looking at the installed packages _per
user_ for each day.

```{python}
from notebooks.popcorn import plt_pkg_relative
pplot(plt_pkg_relative)
```

Combining both stats to look at the installed packages at a more individual
level per user, we see this confirmed. There is no similarly strong dip for the
relative package ownership as there was for the absolute package numbers.

Indeed, with the exception of a brief, more rapid increase in individual package
ownership in 2019, we see a much steadier increase in per-user packages than in
the absolute numbers, and no similarly big slump over three years.
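
The per-user figure is simply the ratio of the two numbers in each daily
report. As a rough sketch, assuming the `UniqueInstalls` and `Packages` keys
from the file excerpt in the appendix (the two example days and the helper
name are made up):

```python
def packages_per_user(report: dict) -> float:
    """Average number of installed packages per reporting installation."""
    unique = report["UniqueInstalls"]
    if unique == 0:
        raise ValueError("report has no unique installs")
    return sum(report["Packages"].values()) / unique


# Made-up daily reports: half the uploaders, but the same habits per user.
busy_day = {"UniqueInstalls": 4, "Packages": {"rsync": 4, "git": 3, "tmux": 1}}
quiet_day = {"UniqueInstalls": 2, "Packages": {"rsync": 2, "git": 1, "tmux": 1}}

print(packages_per_user(busy_day), packages_per_user(quiet_day))
```

The absolute package total halves between the two days, while the per-user
average stays the same, which is the effect described above.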

Instead we see different patterns of rises and dips. Both in the beginning of
2020, and the beginning of 2024, we can see first a strong rise and then an
equally strong fall in the average number of user-owned packages at once.
This could point to one of multiple options:

Perhaps users have been collectively trying out more new packages over the end
of year holidays or with the start of the new year. New year, new workflow
could presumably be something a few users decide to do, and this may be a
reflection of it. Or, equally, using the new year to learn new software. The
subsequent dip would then mean an end to the period of trying out new stuff, or
adopting the new packages and dropping the old ones.

At the same time, with the relatively limited absolute number of installations,
it is also quite likely that the representation is skewed by a single user or a
couple users having a much larger package ownership than everybody else. This
may signify new users checking out Void Linux and downloading a large variety
of packages in the process.

<!-- TODO: still accurate? -->

For a breakdown of the absolute numbers of packages on systems by weekday and
month of the year instead of over time, see the Appendix below.

An interesting trend is visible toward the end of the timeline window, with a
rapid decline in package numbers per user. It is too early to clearly see
if this is just variability or an actual trend in the data.

Beyond pure installation numbers, let's take a look at the actual top-installed
packages on users' systems.

<!-- TODO: perhaps the pre-made ISOs play a role, especially Feb2024? no hang on feb 2025 -->

```{python}
from notebooks.popcorn import plt_top_packages
pplot(plt_top_packages)
```

The top packages are unsurprisingly
the `base-system` and `xtools` packages, followed by `wget`, `htop` and
`rsync`.[^popcorn-removal]

[^popcorn-removal]:
    I have removed the PopCorn package itself from the data.
    Funnily enough, since _everybody_ who is represented in the data has to have
    PopCorn installed or the data wouldn't be collected in the first place, if we
    extrapolate from the collected data naively this means more people have PopCorn
    installed than the base-system. Of course, viewed over the majority of Void
    Linux installations this is hogwash. We have the absolute numbers and only
    around 150 people ever have PopCorn installed. But it nicely represents some of
    the danger of over-interpreting the results before us without also reflecting
    on sample bias.

In my opinion the list of top packages reflects the technical audience of Void
Linux and does not hold too many surprises. Almost everyone uses `socklog` and
most people have the `nonfree` repo enabled. `firefox` is the most installed
browser, and everyone at least has `alsa-utils` installed, even if they're not
using `alsa` as their primary sound provider.

I am somewhat surprised by the prevalence of `git`, though this package is in
turn required by many others. Among them some of the other top packages such as
`xtools`, so it does make sense.

Along with the prevalence of `tmux` (even above `zip`!), it does once again
speak to the technical nature of Void Linux users, however, at least for those
opting into data collection.

Almost everyone keeps the `base-system` package installed but, importantly,
not _everyone_. The package is not represented for each installation, with a
sizable chunk of people having removed it.

Lastly, I am also pleasantly surprised by the appearance of `gimp` in the top
20 packages.

The 'rarest' 20 packages show a snapshot of packages which have been installed
by _someone_ but only that single someone. In other words, there are quite a
few packages which nobody in the sample has installed but those are not
represented here. Instead, the rare packages tend to show those that somebody
built themselves, or only tried briefly. They provide more of a snapshot of the
kind of custom shenanigans users get up to within the `xbps` package system, or
could be viewed as a potential 'wishlist' of packages not yet officially
available.

Let's turn to the 'distribution' of package installations.

```{python}
from notebooks.popcorn import plt_package_distribution
pplot(plt_package_distribution)
```

Visualized above is the package installation frequency (or density) distribution.
On the Y-axis we see the number of packages while on the X-axis we see the number of installations.
What this means is that we see _how often_ packages tend to be installed,
and where the majority of packages are grouped.[^density-approximation]

[^density-approximation]:
    In the package density count above, since we are
    accumulating over the absolute numbers of all installations of all users, the
    overall high numbers are really _high_, i.e. above 150,000. Since we are
    sorting the package counts into a finite number of bins to make visualizing it
    possible, the lowest bin overshoots the 0-mark and we get an estimation of
    negative installation counts. Of course, this is not possible: no package in
    the data has been installed a negative number of times, at least to my knowledge!

_Many_ packages are installed 0 to 10 times.
Some packages are installed above 10 times,
fewer yet above 100 times,
and so on,
and this distribution is what we see here.
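
The idea of grouping packages by how often they are installed can be sketched
with a small order-of-magnitude histogram. The install counts below are made up
and the helper `magnitude_bucket` is only for illustration; the plot above uses
a density estimate over the real data instead.

```python
from collections import Counter


def magnitude_bucket(installs: int) -> str:
    """Label the power-of-ten range that an install count falls into."""
    if installs < 10:
        return "0-9"
    lower = 10 ** (len(str(installs)) - 1)
    return f"{lower}-{lower * 10 - 1}"


# Made-up install counts, not taken from the real data.
install_counts = {"pkg-a": 3, "pkg-b": 7, "pkg-c": 12, "pkg-d": 99, "pkg-e": 150}

# Count how many packages land in each bucket.
histogram = Counter(magnitude_bucket(n) for n in install_counts.values())
print(histogram)
```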

```{python}
from notebooks.popcorn import plt_top_packages
_, defs = plt_top_packages.run() # pyright: ignore
df_pkg_dl = defs["df_pkg_dl"]
def get_num(df: pl.LazyFrame) -> int:
    return df.count().collect(engine="streaming").item(0, 0)

one_ten_installs = df_pkg_dl.sort("count", descending=False).filter(
    (pl.col("count") >= 1) & (pl.col("count") < 10)
)
ten_twenty_installs = df_pkg_dl.sort("count", descending=False).filter(
    (pl.col("count") >= 10) & (pl.col("count") < 20)
)
twenty_thirty = df_pkg_dl.sort("count", descending=False).filter(
    (pl.col("count") >= 20) & (pl.col("count") < 30)
)
thirty_plus = df_pkg_dl.sort("count", descending=False).filter((pl.col("count") >= 30))
```

To be more precise with the numbers:
There are `{python} f"{get_num(one_ten_installs):,}"` packages which have between one
and nine installations in the data, `{python} f"{get_num(ten_twenty_installs):,}"`
packages with between 10 and 19 installations, and
`{python} f"{get_num(twenty_thirty):,}"` packages with between 20 and 29 installations.
`{python} f"{get_num(thirty_plus):,}"` packages have 30 or more installations.

For now, these are the explorations I have done for the package data collected.
I think it is interesting to see, especially the evolution of package installations over time,
and per user,
as well as getting a glimpse of the most used packages in the sample.

But there are yet more things to explore in the statistics overall.

## Kernel Analysis

Beyond package numbers, the data also encapsulate information about the Linux
kernels used by Void Linux users.
The files report the exact kernel version users are running, including the major version,
minor versions, and any suffixes as well.

For example, there are many reports containing the `4.19.0-9-amd64` kernel, or
some containing the `6.1.53-1-lts` kernel, or `6.11.2-asahi-6.11.2-1_4`. These
are 'extraordinary' kernels in my opinion, and they do not follow clear naming
patterns. For the purposes of the following visualizations any such suffixes
have been cut off, looking only at the versioning of the main kernels
themselves.
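
Cutting off the suffixes could look roughly like the following. This is a
sketch and not the exact code used in the notebook: the regular expression and
the helper `parse_kernel` are my own assumptions about what counts as the 'main'
version, namely the leading `major.minor[.patch]` numbers.

```python
import re
from typing import Optional

# Leading "major.minor" with an optional ".patch"; everything after is a suffix.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+))?")


def parse_kernel(name: str) -> Optional[tuple[int, int, int]]:
    """Return (major, minor, patch) for a kernel string, or None if unparsable."""
    match = _VERSION_RE.match(name)
    if match is None:
        return None
    major, minor, patch = match.groups()
    # A missing patch level is treated as 0.
    return int(major), int(minor), int(patch or 0)


for kernel in ["4.19.0-9-amd64", "6.1.53-1-lts", "6.11.2-asahi-6.11.2-1_4"]:
    print(kernel, "->", parse_kernel(kernel))
```

Grouping by the first element of the returned tuple then yields the major
versions used in the following plots.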

Let's start by looking at the prevalence of the different major versions.

```{python}
from notebooks.popcorn import plt_kernel_versions
pplot(plt_kernel_versions)
```

This is an accumulation of the three major versions used during the collected timeline,
over the _whole_ time as absolute numbers.

When looking at the kernel versions used, we see a very strong jump between major kernel version
4 and major kernel version 5, with version 4 being significantly less prevalent in the data.

Of course, this makes sense from a release standpoint: kernel version 5.0 was
released in March 2019, less than a year after the start of data collection.[^kernel-releases]
Additionally, as we established above, this was also the time of the fewest
unique data reports, so the absolute amount of kernel 4 reports is even
smaller.

[^kernel-releases]:
    Data collection began in May 2018.
    All information on the kernel release timelines is taken
    from the nicely comprehensive _Linux Kernel Version History_ Wikipedia page:
    <https://en.wikipedia.org/wiki/Linux_kernel_version_history>.

Kernel version 5 still accounts for the largest share of reported kernel versions,
but just barely. This makes sense since major version 6.0 was released in October 2022.
Version 5 was thus the latest kernel for roughly three and a half years,
and version 6 has been the latest kernel for almost exactly three years.

Again, we have to keep the curve of unique installations in mind for absolute numbers like these:
Kernel 5 was released right as the massive increase in unique Void Linux installation reports happened,
and kernel 6 right after the report slump happened.
This, in all likelihood, accounts for the slight imbalance between the numbers,
and will shift over the coming months.

```{python}
from notebooks.popcorn import df_kernel_v99
_, defs = df_kernel_v99.run() # pyright: ignore
kernel_df_v99 = defs["kernel_df_v99"]
v99_num_rows = f"{kernel_df_v99.select(pl.len()).item()}"
# str() instead of nested double quotes in f-strings, which need Python 3.12+
v99_start_date = str(kernel_df_v99.select("date").row(0)[0])
v99_end_date = str(kernel_df_v99.select("date").row(-1)[0])
```

Just like with kernel suffixes, for this analysis we also had to exclude
`{python} v99_num_rows` rows which were apparently from the
future --- as they were running variations of major kernel version 99. In all
likelihood there is a custom compiled kernel version out there which reports its own
major version as 99. The strange version starts appearing on
`{python} v99_start_date` and shows up all the way until
`{python} v99_end_date`.

Let's turn to the actual adoption of kernels over time in the next visualization.

```{python}
from notebooks.popcorn import plt_kernel_timeline
pplot(plt_kernel_timeline)
```

```{python}
from datetime import date
from notebooks.popcorn import plt_kernel_timeline
_, defs = plt_kernel_timeline.run() # pyright: ignore
weekly_kernel_df = defs["weekly_kernel_df"]

last_kernel4: date = weekly_kernel_df.filter(pl.col("major_ver") == "4")[-1][
    "date"
].item()
first_kernel5: date = weekly_kernel_df.filter(pl.col("major_ver") == "5")[0][
    "date"
].item()
last_kernel5: date = weekly_kernel_df.filter(pl.col("major_ver") == "5")[-1][
    "date"
].item()
```

A timeline analysis of the prevalent kernels in the data shows that new major
kernel versions are adopted relatively rapidly and with the majority of switches
occurring at roughly the same time.

This change is especially stark between major kernel versions 5 and 6, which
seem to have traded places in usage almost overnight. A reasonable speculation
for this rapid switch is that the `linux` kernel meta-package was pointed at
the new version at that time, so each update pulled the new kernel.

The first time that major version 5 of the kernel shows up is on
`{python} first_kernel5`. From here, it took a long time for the last of the
version 4 kernels to disappear. Interestingly, this roughly coincides with the
big switch between major versions 5 and 6. The last time a major version 4
kernel is seen is on `{python} last_kernel4`, while the last major version 5
kernels still pop up. It would seem, then, that the people still running kernel
version 4 used the opportunity of everybody switching to the stable version of
6 to also upgrade their machines.

If we cautiously extrapolate a little from the data we have, it would seem
reasonable that the last remnants of kernel version 5 may be disappearing
around May or June 2026. A lot of course depends on the upstream kernel release
windows and the stability of the releases themselves. But barring any major
upheavals in the kernel releases (of a magnitude like the removal of
[bcachefs](https://en.wikipedia.org/wiki/Bcachefs)) or major stability issues,
this seems a reasonable assumption to me.

## Appendix: Odds and Ends

The above graphics are the main ones that I think could be useful, entertaining, or somewhere in between.
However, when exploring data, many more visualizations come to light.
Most of them are a little more 'boring' than the ones selected above,
but may still be of interest for technical deep-dives or more specific investigations.
They are collected here, in my pseudo-appendix to the main article.

### The PopCorn files

Let's have a closer look at the provided PopCorn statistics files themselves.

The files consist of a long list of packages which have been reported to the
central server that day, along with the number of package instances. The number
of unique installations from which these statistics are derived is given as a
sum. The files also contain the same list once again, but separated by
specifically installed versions of packages.

So if somebody has v0.9.1 and somebody else v0.9.3 instead, this list counts the
number of both packages with their versions separately. Another count is the
number of different kernels that have been used on that day, with their exact
kernel name including major version, minor version and any suffix.

```json
{
  "UniqueInstalls": 2,
  "Packages": {
    "ImageMagick": 1,
    "PopCorn": 2,
    "acpi": 1,
    "alsa-utils": 2,
  ...
    "xurls": 1,
    "youtube-dl": 1
  },
  "Versions": {
    "ImageMagick": {
      "6.9.9.40_1": 1
    },
    "PopCorn": {
      "0.2_1": 2
    },
    "acpi": {
      "1.7_3": 1
    },
    "alsa-utils": {
      "1.1.5_1": 1,
      "1.1.6_1": 1
    },
  ...
  },
  "XuKernel": {
    "4.16.6_1": 1
  }
}
```
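
To give an idea of how such a file can be turned into rows, here is a small
sketch. It works on a shortened copy of the excerpt above, written out as a
Python dictionary; the helper `version_rows` is hypothetical. Whether the
per-version counts always add up to the per-package counts in the real files is
an assumption on my part. In the excerpt, at least, they do.

```python
from collections import defaultdict
from typing import Iterator

# Shortened copy of the excerpt above.
report = {
    "UniqueInstalls": 2,
    "Packages": {"PopCorn": 2, "acpi": 1, "alsa-utils": 2},
    "Versions": {
        "PopCorn": {"0.2_1": 2},
        "acpi": {"1.7_3": 1},
        "alsa-utils": {"1.1.5_1": 1, "1.1.6_1": 1},
    },
    "XuKernel": {"4.16.6_1": 1},
}


def version_rows(report: dict) -> Iterator[tuple[str, str, int]]:
    """Flatten the nested Versions mapping into (package, version, count) rows."""
    for package, versions in report["Versions"].items():
        for version, count in versions.items():
            yield package, version, count


rows = list(version_rows(report))

# Summing the version rows per package should reproduce the Packages mapping.
totals: dict[str, int] = defaultdict(int)
for package, _version, count in rows:
    totals[package] += count

print(rows)
print(dict(totals) == report["Packages"])
```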

When grouped by package and aggregated over all days, this results in a
table. For example, the following is the table for the package count list:

```{python}
from notebooks.popcorn import tab_pkg
outp, _ = tab_pkg.run() # pyright: ignore
outp
```

When we took a look at the file sizes of the PopCorn report files above, we did
so for each day individually. But we can also look at the cumulative growth
instead: here we just add up the sizes of all the files reported so far for
each day, and show the resulting growth line.

```{python}
from notebooks.popcorn import plt_filesize_cumulative
outp, _ = plt_filesize_cumulative.run() # pyright: ignore
outp
```

A cumulative view gives a less granular look at the individual daily changes but
provides a more macro-level view on how big the statistics have grown to be overall.
We can see that, as each individually reported day adds up to 400KB nowadays, the
cumulative size is up to almost 700MB currently.
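
The cumulative line is nothing more than a running total over the daily sizes.
A tiny sketch with made-up sizes in KB (the notebook itself computes this on the
real file sizes):

```python
from itertools import accumulate

# Made-up daily report sizes in KB, in chronological order.
daily_kb = [50, 52, 55, 60]

# Each entry is the sum of all daily sizes up to and including that day.
cumulative_kb = list(accumulate(daily_kb))
print(cumulative_kb)
```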

### Packages monthwise and per weekday

Let's also look at the packages installed on systems for different time slices.
We'll start with a look at the packages per weekday.

```{python}
from notebooks.popcorn import plt_weekday_packages
pplot(plt_weekday_packages)
```

There is no significant difference between the individual weekdays, as we would
expect. It seems strange to have a specific day on which everybody decides to
install or uninstall new packages.

That said, there is some slight variation, with Wednesdays generally having a
few fewer total packages to boast than other days, especially Tuesdays which
are slightly above the curve.

Let's just imagine everybody gets bored on Tuesday, installs a new package and
drops it again by Wednesday, along with a slew of other packages. Try-out
Tuesdays and Wastebin Wednesdays if you will.

Alright, but let's also take a look at the package numbers per month instead.

```{python}
from notebooks.popcorn import plt_month_packages
pplot(plt_month_packages)
```

Here we can see a bit more variation. First it is important to note that I have
removed the first months of 2018 prior to October from the analysis and cut off
any days after September 2025, to have only full years represented and avoid any
months being present more often than others.[^months-removed]

[^months-removed]: I chose to remove the first couple of months in the data, rather than
  the most recent months, as fewer people were reporting data back then and thus we
  lose less. Additionally, I presume people are more interested in current
  statistics than older ones, just generally.
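
Why the trimming matters can be checked with a quick count of how often each
calendar month occurs in a given window. The sketch below compares the trimmed
window (October 2018 through September 2025) with an untrimmed one starting in
May 2018, when data collection began; the helper `months_between` is made up for
this illustration.

```python
from collections import Counter
from datetime import date
from typing import Iterator


def months_between(start: date, end: date) -> Iterator[tuple[int, int]]:
    """Yield (year, month) pairs from start's month through end's month."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield year, month
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


trimmed = Counter(m for _, m in months_between(date(2018, 10, 1), date(2025, 9, 30)))
untrimmed = Counter(m for _, m in months_between(date(2018, 5, 1), date(2025, 9, 30)))

print("trimmed:", sorted(trimmed.items()))
print("untrimmed:", sorted(untrimmed.items()))
```

In the trimmed window every month occurs equally often, while the untrimmed
window counts May through September once more than the remaining months.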

It is quite surprising to me just how much variation is visible in the results:
months from October to February have markedly fewer packages than the spring
and summer months. Are people generally more willing to use and try out new
packages in the summer? Alternatively, were any of the major usage dips taking
place during winter, while the increases in usage occurred more toward summer?

I have not delved deep into the interpretation of these questions, but it may
be interesting to do so. The last option, of course, is that the data itself,
the data collection or analysis contains an error that I am not aware of.

### Missing days and dates

There are some missing days in the statistics.

```{python}
from notebooks.popcorn import tab_missing_days

outp, _ = tab_missing_days.run() # pyright: ignore
outp
```

These missing days primarily occur at the end of January 2019 and
throughout 2025. However, with over 2600 days where the statistics _are_
available, these rows represent an insignificant issue for the overall data.

It would seem there was some kind of issue collecting or storing the collected
data at that point in 2019, which means a few days in a row are missing. This
skews absolute numbers for that week downwards, as well as any weekly averages
relying on this date-range.
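
Finding such gaps boils down to comparing the dates we have against a complete
calendar between the first and the last report. A minimal sketch, with made-up
example dates and a hypothetical helper `missing_days` (the notebook's own table
is built from the real file list):

```python
from datetime import date, timedelta


def missing_days(present: set[date]) -> list[date]:
    """Return all dates between the earliest and latest present day that are absent."""
    if not present:
        return []
    start, end = min(present), max(present)
    span = (end - start).days + 1
    all_days = (start + timedelta(days=offset) for offset in range(span))
    return [day for day in all_days if day not in present]


# Made-up example: two days in a row are missing.
reported = {date(2019, 1, 28), date(2019, 1, 29), date(2019, 2, 1)}
print(missing_days(reported))
```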

However, no significant visual differences stem from this fact, which is why it
is not called out in the main article. As it stands, it is an interesting fact
and, were this a more rigorous investigation, perhaps worth taking into account
as biasing the result, but for our purposes it is not too bad.

## Outline

- intro
- filesize
- unique installations reported from
- packages -> perhaps find new subcategories
  - global
  - relative (pkg/unique)
  - top packages
  - rare packages?
  - install distribution
- packages per time unit (find clever title, e.g. 'accumulated packages')
  - per year?
  - weekday
  - month of year (combine with weekday?)
- kernels
  - overall kernel version installations
  - kernels over time

- misc
  - missing days
  - moved days

- things we can't see (limitations)
  - packages on offer in the repositories
    - this could shed light on the bumps of users and relative package ownership