---
title: "Voidlinux popcorn"
subtitle: "Analysis of voidlinux package and kernel statistics"
---
This notebook analyses the daily package repository statistics files,
colloquially known as 'popcorn' files, that are generated by the Void Linux
tool [`PopCorn`](https://voidlinux.org/packages/?arch=x86_64&q=popcorn) and
uploaded by users who have opted in to share with it.
The tool collects a variety of statistics from the user, anonymizes them
(you only get a UUID to ensure your stats are only uploaded once, and
that's it), and mixes them with all other statistics uploaded to the server that day.
The result is an overview of the package and kernel ecosystem of the Void Linux
distribution collected from a multitude of unique installations.
Before we get into the pretty graphs I would like to point out that you too can
participate in this collection, and thereby help the distribution maintainers
know which packages are more relevant or, perhaps more important, less relevant
for end users.
To upload your own statistics anonymously, simply `xbps-install PopCorn` and
enable its service to run in the background with `ln -s /etc/sv/popcorn
/var/service`.
```{python}
# | echo: false
import os
from typing import Any, Awaitable, Mapping

import lets_plot as lp
import polars as pl
from lets_plot import LetsPlot
from marimo import Cell


def run_cell(cell: Cell) -> tuple[Any, Mapping[str, Any]]:
    ret = cell.run()
    if isinstance(ret, Awaitable):
        raise NotImplementedError
    else:
        output, defs = ret
    return (output, defs)


fig_width, fig_height = (
    int(os.getenv("QUARTO_FIG_WIDTH") or 7),
    int(os.getenv("QUARTO_FIG_HEIGHT") or 5),
)


def pplot(cell: Cell) -> Any:
    outp, _ = run_cell(cell)
    return (
        outp
        + lp.flavor_darcula()
        + lp.ggsize(width=fig_width * 1000, height=fig_height * 1000)
    )


LetsPlot.setup_html()
```
## Daily statistics file size
Before we take an in-depth look at the packages, let's quickly get an overview
of the stats files themselves. These are the raw `JSON` files that we are using
throughout the rest of the article, reporting installed packages and kernels,
and their respective versions.
The overall file size of the daily statistics files over time does not so
much reveal changes in the absolute use of packages (e.g. `neovim` being
installed more or less often). Whether a package has been installed once or
100 times, the file size barely changes. It grows far more when both `neovim`
and `emacs` are installed, or when a variety of different versions of one
package are installed. The same goes for different versions of the kernel.
An increase in file size here mainly suggests an increase in the 'breadth' of
files on offer in the repository, whether that be a _wider variety_ of packages
or more different package _versions_ that people are interested in, and that
the community chooses to use.
So while the overall amount of packages and installations --- we'll get to those
--- gives a general estimate of the interest in the distribution, we see at a
glance a more 'user-choice'-aligned view of how many aisles the buffet offers,
and how many of those the users are eating from.
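To make this concrete, here is a minimal sketch using hand-made toy reports. It is not real PopCorn data, and it assumes a compact JSON encoding, which may differ from what the server actually writes. It compares how the serialized size reacts to a higher install count versus an additional package name.

```python
import json


def report_size(packages: dict[str, int]) -> int:
    """Byte size of a toy report holding only a package-count mapping."""
    text = json.dumps({"Packages": packages}, separators=(",", ":"))
    return len(text.encode("utf-8"))


# Same package, installed on 1 machine versus 100 machines.
one_install = report_size({"neovim": 1})
many_installs = report_size({"neovim": 100})

# A second package name adds a whole new key to the file.
two_packages = report_size({"neovim": 100, "emacs": 100})

print(many_installs - one_install)   # growth from more installs
print(two_packages - many_installs)  # growth from more variety
```

The first difference is only the extra digits of the count, while the second includes a full package name plus its count.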
```{python}
# | echo: true
from notebooks.popcorn import plt_filesize
pplot(plt_filesize)
```
As we can see, the difference over time is massive. Especially early on, between 2019 and the
start of 2021, the number of different packages and package versions in use grew rapidly, with the
pace picking up once again starting in 2023.
From a reported file size of around 50 kB in the first days before the end of
2019, we have easily tripled the file size to over 150 kB per daily report.
Nowadays we have reached just about 400 kB per daily report, roughly 8
times that initial size.
There are a few outlier days with a size of 0 kB on the server, which we had to
remove from the data. In all likelihood, those days were not reported correctly
or there was some kind of issue on the backend so the stats for those days are
lost.
<!-- TODO: is this still true? -->
We take a look at the missing days
among other things at the end of this article.
There are also a few days where the modification date of the file does not
correspond to the represented statistical date but those are kept. This rather
points to certain times when the files have been moved on the backend,
recreated externally, or otherwise externally modified but does not mean the
data contained are bad.
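The two rules above (drop empty files, keep files that were merely touched later) can be sketched as follows. The `StatsFile` record and its field names are hypothetical and only serve the illustration, and the dates and sizes are made up.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class StatsFile:
    named_date: date     # day the file claims to describe (hypothetical field)
    modified_date: date  # filesystem modification date (hypothetical field)
    size_bytes: int


def usable(files: list[StatsFile]) -> list[StatsFile]:
    """Drop empty files; keep everything else, even if it was touched later."""
    return [f for f in files if f.size_bytes > 0]


def moved(files: list[StatsFile]) -> list[StatsFile]:
    """Files whose modification date differs from the day they describe."""
    return [f for f in files if f.modified_date != f.named_date]


files = [
    StatsFile(date(2020, 1, 1), date(2020, 1, 1), 51_200),
    StatsFile(date(2020, 1, 2), date(2020, 1, 2), 0),       # empty: dropped
    StatsFile(date(2020, 1, 3), date(2020, 6, 9), 52_000),  # touched later: kept
]

kept = usable(files)
flagged = moved(kept)
```

Keeping the two checks separate means a date mismatch is only ever reported, never used as a reason to discard data.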
Let's not over-emphasize the results here: it is a very coarse view on the
packages and kernels underneath. Still, the shape of the curve above is a good
indication for what awaits us in other stats: 2019 to 2021 were massive years
of growth with some slow-down in 2021. From 2022 onwards the growth picks back
up again, if at a more mellow pace.
## Package statistics
Now that we have an idea of how the overall interest in the distribution has changed over time,
let's look at the actual package statistics.
The popcorn files contain two main pieces of information: the number of installs per package
(e.g. how many people have rsync installed) and the number of unique installs (i.e. unique
machines providing statistics). We will look at both of these in turn.
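As a rough sketch of how the two numbers relate, the snippet below takes a tiny hand-written report in the shape of the daily files and derives the total number of reported installs, the average per machine, and each package's share of machines. The variable names are illustrative and not taken from the notebook.

```python
# Toy report; real files list far more packages.
report = {
    "UniqueInstalls": 2,
    "Packages": {"ImageMagick": 1, "PopCorn": 2, "acpi": 1, "alsa-utils": 2},
}

unique = report["UniqueInstalls"]

# Absolute view: how many package installs were reported in total.
total_installs = sum(report["Packages"].values())

# Relative view: average number of reported packages per machine.
per_machine = total_installs / unique

# Share of reporting machines that have each package installed.
share = {name: n / unique for name, n in report["Packages"].items()}
```

Dividing by the number of unique installs is what makes days with different numbers of reporting machines comparable.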
```{python}
from notebooks.popcorn import plt_weekly_packages
pplot(plt_weekly_packages)
```
```{python}
from notebooks.popcorn import plt_pkg_relative
pplot(plt_pkg_relative)
```
The number of packages installed across all machines increases strongly over time.
```{python}
from notebooks.popcorn import plt_weekday_packages
pplot(plt_weekday_packages)
```
```{python}
from notebooks.popcorn import plt_month_packages
pplot(plt_month_packages)
```
```{python}
from notebooks.popcorn import plt_top_packages
pplot(plt_top_packages)
```
```{python}
from notebooks.popcorn import plt_package_distribution
pplot(plt_package_distribution)
```
```{python}
from notebooks.popcorn import plt_top_packages

_, defs = plt_top_packages.run()
df_pkg_dl = defs["df_pkg_dl"]


def get_num(df: pl.LazyFrame) -> int:
    return df.count().collect(engine="streaming").item(0, 0)


one_ten_installs = df_pkg_dl.sort("count", descending=False).filter(
    (pl.col("count") >= 1) & (pl.col("count") < 10)
)
ten_twenty_installs = df_pkg_dl.sort("count", descending=False).filter(
    (pl.col("count") >= 10) & (pl.col("count") < 20)
)
twenty_thirty = df_pkg_dl.sort("count", descending=False).filter(
    (pl.col("count") >= 20) & (pl.col("count") < 30)
)
thirty_plus = df_pkg_dl.sort("count", descending=False).filter((pl.col("count") >= 30))
```
There are `python f"{get_num(one_ten_installs):,}"` packages which have between one
and nine installations in the data, `python f"{get_num(ten_twenty_installs):,}"`
packages between 10 and 19 installations, and
`python f"{get_num(twenty_thirty):,}"` packages between 20 and 29 installations.
`python f"{get_num(thirty_plus):,}"` packages have 30 or more installations.
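The filters above use half-open ranges, so a package with exactly 10 installs lands in the second group, not the first. A minimal plain-Python sketch of the same bucketing, using made-up package names and counts:

```python
from collections import Counter


def bucket(count: int) -> str:
    """Label an install count with its half-open bucket."""
    if count < 1:
        return "none"
    if count < 10:
        return "1-9"
    if count < 20:
        return "10-19"
    if count < 30:
        return "20-29"
    return "30+"


# Hypothetical install counts, not taken from the data.
counts = {"pkg-a": 1, "pkg-b": 9, "pkg-c": 10, "pkg-d": 29, "pkg-e": 30, "pkg-f": 512}

per_bucket = Counter(bucket(n) for n in counts.values())
```

Checking the boundary values (9, 10, 29, 30) is a quick way to confirm that no count is dropped or counted twice.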
## Kernel Analysis
```{python}
from notebooks.popcorn import plt_kernel_versions
pplot(plt_kernel_versions)
```
When looking at the kernel versions used, we see a very strong jump between major kernel version
4 and major kernel version 5.
For this analysis we had to exclude {kernel_df_v99.select(pl.len()).item()} rows which were
apparently from the future, as they were running variations of major kernel version 99. In all
likelihood there is a custom kernel version out there which reports its own major version as 99.
The strange version starts appearing on {kernel_df_v99.select("date").row(0)[0]} and shows up
all the way until {kernel_df_v99.select("date").row(-1)[0]}.
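One way such an exclusion could work is sketched below, assuming the major version is simply the number before the first dot of the kernel string. The helper name and the sample counts are hypothetical; only the `4.16.6_1` string appears in the statistics files shown in this article.

```python
def major_version(kernel: str) -> int:
    """Return the leading number of a kernel string such as '4.16.6_1'."""
    return int(kernel.split(".", 1)[0])


# Hypothetical kernel counts for a single day.
kernels = {"4.16.6_1": 1, "5.4.12_1": 7, "99.0.1_1": 1}

by_major: dict[int, int] = {}
for name, n in kernels.items():
    major = major_version(name)
    if major == 99:  # implausible major version, treated as an outlier
        continue
    by_major[major] = by_major.get(major, 0) + n
```

Filtering on the parsed integer rather than on a string prefix avoids accidentally matching minor or patch versions that happen to contain 99.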
```{python}
from notebooks.popcorn import plt_kernel_timeline
pplot(plt_kernel_timeline)
```
```{python}
from datetime import date

from notebooks.popcorn import plt_kernel_timeline

_, defs = plt_kernel_timeline.run()
weekly_kernel_df = defs["weekly_kernel_df"]

last_kernel4: date = weekly_kernel_df.filter(pl.col("major_ver") == "4")[-1][
    "date"
].item()
first_kernel5: date = weekly_kernel_df.filter(pl.col("major_ver") == "5")[0][
    "date"
].item()
last_kernel5: date = weekly_kernel_df.filter(pl.col("major_ver") == "5")[-1][
    "date"
].item()
```
A timeline analysis of the kernels running on the reporting machines shows that people generally
adopt new major kernel versions at roughly the same time. This change is especially stark between
major kernel versions 5 and 6, which seem to have traded places in usage almost overnight.
The first time that major version 5 of the kernel shows up is on {first_kernel5}. From here, it
took a long time for the last of the version 4 kernels to disappear, coinciding with the big
switch between major version 5 and 6. The last time a major version 4 kernel is seen is on
{last_kernel4}, while the last major version 5 kernels still pop up.
It would seem, then, that the people still running kernel version 4 used the opportunity of
everybody switching to the stable version of 6 to also upgrade their machines.
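The first and last sightings used in this comparison can also be derived without relying on row order. The sketch below uses `min` and `max` over made-up weekly observations; the dates are purely illustrative and are not results from the data.

```python
from datetime import date

# Hypothetical weekly observations: (week start, major version).
observations = [
    (date(2020, 1, 6), "4"),
    (date(2020, 1, 6), "5"),
    (date(2020, 1, 13), "5"),
    (date(2020, 1, 20), "4"),
    (date(2020, 1, 27), "5"),
]


def first_and_last(rows: list[tuple[date, str]], major: str) -> tuple[date, date]:
    """Earliest and latest date on which a major version was reported."""
    dates = [d for d, m in rows if m == major]
    return min(dates), max(dates)


first_kernel5, last_kernel5 = first_and_last(observations, "5")
_, last_kernel4 = first_and_last(observations, "4")
```

Taking the first and last row of a filtered frame gives the same answer only if the frame is sorted by date, which is worth verifying before trusting the dates.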
## Odds and Ends
### The PopCorn files
Let's have a look at the provided PopCorn statistics files themselves.
The files consist of a long list of packages which have been reported to the
central server that day, along with the number of installations of each
package. The number of unique installations from which these statistics are
derived is given as a single total. The file also contains the same list once
again, but broken down by the specific installed versions of each package.
So if somebody has v0.9.1 and somebody else has v0.9.3, this list counts each
version separately. A final section counts the different kernels that have
been used on that day, with their exact kernel name including major version,
minor version and any suffix.
```json
{
  "UniqueInstalls": 2,
  "Packages": {
    "ImageMagick": 1,
    "PopCorn": 2,
    "acpi": 1,
    "alsa-utils": 2,
    ...
    "xurls": 1,
    "youtube-dl": 1
  },
  "Versions": {
    "ImageMagick": {
      "6.9.9.40_1": 1
    },
    "PopCorn": {
      "0.2_1": 2
    },
    "acpi": {
      "1.7_3": 1
    },
    "alsa-utils": {
      "1.1.5_1": 1,
      "1.1.6_1": 1
    },
    ...
  },
  "XuKernel": {
    "4.16.6_1": 1
  }
}
```
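A short sketch of reading such a file with the standard library follows. The report is a hand-shortened, complete version of the excerpt above. The check that per-version counts add up to the overall package count is an assumption about the format that happens to hold for this excerpt; it is not a documented guarantee.

```python
import json

# Hand-shortened report in the shape shown above.
raw = """
{
  "UniqueInstalls": 2,
  "Packages": {"PopCorn": 2, "alsa-utils": 2},
  "Versions": {
    "PopCorn": {"0.2_1": 2},
    "alsa-utils": {"1.1.5_1": 1, "1.1.6_1": 1}
  },
  "XuKernel": {"4.16.6_1": 1}
}
"""

report = json.loads(raw)

# Assumption: the version counts of a package sum to its overall count.
mismatched = [
    name
    for name, total in report["Packages"].items()
    if sum(report["Versions"].get(name, {}).values()) != total
]

distinct_versions = {name: len(v) for name, v in report["Versions"].items()}
```

Here `mismatched` stays empty, and `distinct_versions` shows which packages were reported in more than one version that day.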
```{python}
from notebooks.popcorn import tab_pkg
outp, defs = tab_pkg.run()
outp
```
There are some missing days in the statistics.
```{python}
from notebooks.popcorn import tab_missing_days
outp, defs = tab_missing_days.run()
outp
```
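One simple way to find such gaps, sketched here with the standard library and made-up dates, is to walk every calendar day between the first and last report and keep the days without a file. The function name is illustrative and is not the notebook's implementation.

```python
from datetime import date, timedelta


def missing_days(present: set[date]) -> list[date]:
    """Days between the first and last report that have no file."""
    if not present:
        return []
    start, end = min(present), max(present)
    span = (end - start).days + 1
    every_day = (start + timedelta(days=i) for i in range(span))
    return [d for d in every_day if d not in present]


# Hypothetical report dates with a two-day gap.
present = {date(2021, 3, 1), date(2021, 3, 2), date(2021, 3, 5)}
gaps = missing_days(present)
```

Days before the first or after the last report are not counted as missing, since the series has no defined start or end beyond the files themselves.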
## Outline
- intro
- filesize
- unique installations reported from
- packages -> perhaps find new subcategories
  - global
  - relative (pkg/unique)
  - top packages
  - rare packages?
  - install distribution
- packages per time unit (find clever title, e.g. 'accumulated packages')
  - per year?
  - weekday
  - month of year (combine with weekday?)
- kernels
  - overall kernel version installations
  - kernels over time
- misc
  - missing days
  - moved days
- things we can't see (limitations)
  - packages on offer in the repositories
    - this could shed light on the bumps of users and relative package ownership

Modified date != descriptive (named) date
```{python}
from notebooks.popcorn import plt_modified_times
pplot(plt_modified_times)
```