---
title: "Voidlinux popcorn"
subtitle: "Analysis of voidlinux package and kernel statistics"
---

This notebook analyses the daily package repository statistics files,
colloquially known as 'popcorn' files, that are generated by the Void Linux
tool [`PopCorn`](https://voidlinux.org/packages/?arch=x86_64&q=popcorn) and
uploaded by users who have opted in to share them.

The tool collects a variety of statistics from the user, which it anonymizes
(you only get a UUID to ensure your stats are uploaded only once, and
that's it) and mixes with all other statistics uploaded that day on the server.
The result is an overview of the package and kernel ecosystem of the Void Linux
distribution collected from a multitude of unique installations.

Before we get into the pretty graphs I would like to point out that you too can
participate in this collection, and thereby help the distribution maintainers
know which packages are more relevant or, perhaps more importantly, less relevant
for end users.

To upload your own statistics anonymously, simply `xbps-install PopCorn` and
enable its service to run in the background with
`ln -s /etc/sv/popcorn /var/service`.

```{python}
# | echo: false
import os
from typing import Any, Awaitable, Mapping

import lets_plot as lp
import polars as pl
from lets_plot import LetsPlot
from marimo import Cell


def run_cell(cell: Cell) -> tuple[Any, Mapping[str, Any]]:
    # Run a marimo cell and return its output together with its definitions.
    ret = cell.run()
    if isinstance(ret, Awaitable):
        # Async cells are not supported here.
        raise NotImplementedError
    else:
        output, defs = ret
        return (output, defs)


# Figure size comes from the Quarto environment, with fallbacks.
fig_width, fig_height = (
    int(os.getenv("QUARTO_FIG_WIDTH") or 7),
    int(os.getenv("QUARTO_FIG_HEIGHT") or 5),
)


def pplot(cell: Cell) -> Any:
    # Apply the shared theme and figure size to the plot a cell returns.
    outp, _ = run_cell(cell)
    return (
        outp
        + lp.flavor_darcula()
        + lp.ggsize(width=fig_width * 1000, height=fig_height * 1000)
    )


LetsPlot.setup_html()
```

## Daily statistics file size

Before we take an in-depth look at the packages, let's quickly get an overview
of the stats files themselves. These are the raw `JSON` files that we are using
throughout the rest of the article, reporting installed packages and kernels,
and their respective versions.

A look at the overall file size of each daily statistics file over
time does not necessarily reveal changes in the absolute use of packages (e.g.
`neovim` being installed more or less often). Whether a package has been installed
once or 100 times, the file size barely changes, because only its count changes.
Instead, the file grows much more when both `neovim` and `emacs` are installed,
or when a variety of different versions of one package are installed.
The same holds for different versions of the kernel.

An increase in file size here mainly suggests an increase in the 'breadth' of
packages in use from the repository, whether that be a _wider variety_ of packages
or more different package _versions_ that people are interested in, and that
the community chooses to use.

So while the overall number of packages and installations (we'll get to those)
gives a general estimate of the interest in the distribution, here we see at a
glance a more 'user-choice'-aligned view of how many aisles the buffet offers,
and how many of those the users are eating from.

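To make the point concrete, here is a minimal sketch with made-up numbers (the counts are illustrative and not taken from the data, and `report_size` is a hypothetical helper). It serializes a tiny report in the `Packages` layout shown at the end of this article and compares how the size reacts to a higher count versus one additional package.

```python
import json


def report_size(packages: dict[str, int]) -> int:
    """Return the size in bytes of a serialized toy report."""
    return len(json.dumps({"Packages": packages}).encode("utf-8"))


base = report_size({"neovim": 1})
more_installs = report_size({"neovim": 100})  # same package, higher count
more_breadth = report_size({"neovim": 1, "emacs": 1})  # one more package

# A higher count only adds digits; a new package adds a whole key/value pair.
print("extra bytes for 100 installs:", more_installs - base)
print("extra bytes for a second package:", more_breadth - base)
```

The second difference is several times the first, which is why the file size tracks variety rather than raw install counts.
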
```{python}
# | echo: true
from notebooks.popcorn import plt_filesize
pplot(plt_filesize)
```

As we can see, the difference over time is massive. Especially early on, between 2019 and the
start of 2021, the number of different packages and package versions in use grew rapidly, with the
pace picking up once again starting in 2023.

From a reported file size of around 50 kB in the first days, we had easily
tripled it to over 150 kB per daily report before the end of 2019. Nowadays we
have reached just about 400 kB per daily report, roughly 8 times the size at
the beginning in 2018.

There are a few outlier days with a size of 0 kB on the server, which we had to
remove from the data. In all likelihood, those days were not reported correctly
or there was some kind of issue on the backend, so the stats for those days are
lost.
<!-- TODO: is this still true? -->
We take a look at the missing days,
among other things, at the end of this article.

There are also a few days where the modification date of the file does not
correspond to the statistical date it represents, but those are kept. This rather
points to times when the files were moved on the backend,
recreated, or otherwise modified externally; it does not mean the
data they contain are bad.

Let's not over-emphasize the results here: it is a very coarse view of the
packages and kernels underneath. Still, the shape of the curve above is a good
indication of what awaits us in other stats: 2019 to 2021 were massive years
of growth with some slow-down in 2021. From 2022 onwards the growth picks back
up again, if at a more mellow pace.

## Package statistics

Now that we have an idea of how the overall interest in the distribution has changed over time,
let's look at the actual package statistics.

The popcorn files contain two main pieces of information: the number of installs per package
(e.g. how many people have rsync installed) and the number of unique installs (i.e. unique
machines providing statistics). We will look at both of these in turn.

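As a rough illustration of how these two numbers relate, the sketch below divides the total number of reported package instances by the number of unique installs, giving the average number of packages per reporting machine. The values are invented, and `packages_per_install` is a hypothetical helper, not part of `notebooks.popcorn`.

```python
def packages_per_install(report: dict) -> float:
    """Average number of reported packages per unique installation."""
    total_instances = sum(report["Packages"].values())
    return total_instances / report["UniqueInstalls"]


# Invented toy report using the key names of the popcorn files.
toy_report = {
    "UniqueInstalls": 2,
    "Packages": {"PopCorn": 2, "acpi": 1, "alsa-utils": 2, "rsync": 1},
}

# 6 package instances spread over 2 machines
print(packages_per_install(toy_report))
```
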
```{python}
from notebooks.popcorn import plt_weekly_packages
pplot(plt_weekly_packages)
```

```{python}
from notebooks.popcorn import plt_pkg_relative
pplot(plt_pkg_relative)
```

The number of packages installed across all machines increases strongly over time.

```{python}
from notebooks.popcorn import plt_weekday_packages
pplot(plt_weekday_packages)
```

```{python}
from notebooks.popcorn import plt_month_packages
pplot(plt_month_packages)
```

```{python}
from notebooks.popcorn import plt_top_packages
pplot(plt_top_packages)
```

```{python}
from notebooks.popcorn import plt_package_distribution
pplot(plt_package_distribution)
```

```{python}
from notebooks.popcorn import plt_top_packages
_, defs = plt_top_packages.run()
df_pkg_dl = defs["df_pkg_dl"]


def get_num(df: pl.LazyFrame) -> int:
    return df.count().collect(engine="streaming").item(0, 0)


one_ten_installs = df_pkg_dl.sort("count", descending=False).filter(
    (pl.col("count") >= 1) & (pl.col("count") < 10)
)
ten_twenty_installs = df_pkg_dl.sort("count", descending=False).filter(
    (pl.col("count") >= 10) & (pl.col("count") < 20)
)
twenty_thirty = df_pkg_dl.sort("count", descending=False).filter(
    (pl.col("count") >= 20) & (pl.col("count") < 30)
)
thirty_plus = df_pkg_dl.sort("count", descending=False).filter((pl.col("count") >= 30))
```

There are `python f"{get_num(one_ten_installs):,}"` packages which have between one
and nine installations in the data, `python f"{get_num(ten_twenty_installs):,}"`
packages with between ten and 19 installations, and
`python f"{get_num(twenty_thirty):,}"` packages with between 20 and 29 installations.
`python f"{get_num(thirty_plus):,}"` packages have 30 or more installations.

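The filters above use half-open ranges, so a count of exactly 10 lands in the second bucket and a count of exactly 30 in the last one. Here is a small standard-library sketch of the same bucketing with invented counts; `bucket` and the package names are hypothetical and only mirror the boundaries of the polars filters.

```python
from collections import Counter


def bucket(count: int) -> str:
    """Map an install count to the half-open range it falls into."""
    if count < 1:
        return "none"
    if count < 10:
        return "1-9"
    if count < 20:
        return "10-19"
    if count < 30:
        return "20-29"
    return "30+"


# Invented counts sitting right on the bucket boundaries.
toy_counts = {"pkg-a": 1, "pkg-b": 9, "pkg-c": 10, "pkg-d": 29, "pkg-e": 30}
per_bucket = Counter(bucket(c) for c in toy_counts.values())
print(dict(per_bucket))
```
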
## Kernel Analysis

```{python}
from notebooks.popcorn import plt_kernel_versions
pplot(plt_kernel_versions)
```

When looking at the kernel versions used, we see a very strong jump between major kernel version
4 and major kernel version 5.

For this analysis we had to exclude {kernel_df_v99.select(pl.len()).item()} rows which were
apparently from the future, as they were running variations of major kernel version 99. In all
likelihood there is a custom kernel out there which reports its own major version as 99.
The strange version starts appearing on {kernel_df_v99.select("date").row(0)[0]} and shows up
all the way until {kernel_df_v99.select("date").row(-1)[0]}.

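How the notebook derives the major version is not shown here. A minimal sketch, assuming the major version is simply the number before the first dot (as in the `4.16.6_1` entry of the sample file further down), might separate out the version 99 rows like this. The kernel strings other than `4.16.6_1` and all counts are invented, and `major_version` is a hypothetical helper.

```python
def major_version(kernel: str) -> int:
    """Return the leading number of a kernel version such as '4.16.6_1'."""
    return int(kernel.split(".", 1)[0])


# Invented kernel counts in the shape of the kernel section of a report.
toy_kernels = {"4.16.6_1": 1, "5.4.12_1": 3, "99.0.0_1": 1}

# Keep plausible kernels and set aside the ones reporting major version 99.
plausible = {k: n for k, n in toy_kernels.items() if major_version(k) != 99}
excluded = {k: n for k, n in toy_kernels.items() if major_version(k) == 99}
print(plausible)
print(excluded)
```
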
```{python}
from notebooks.popcorn import plt_kernel_timeline
pplot(plt_kernel_timeline)
```

```{python}
from datetime import date
from notebooks.popcorn import plt_kernel_timeline
_, defs = plt_kernel_timeline.run()
weekly_kernel_df = defs["weekly_kernel_df"]

last_kernel4: date = weekly_kernel_df.filter(pl.col("major_ver") == "4")[-1][
    "date"
].item()
first_kernel5: date = weekly_kernel_df.filter(pl.col("major_ver") == "5")[0][
    "date"
].item()
last_kernel5: date = weekly_kernel_df.filter(pl.col("major_ver") == "5")[-1][
    "date"
].item()
```

A timeline analysis of the kernels reported in the daily statistics shows that people generally
adopt new major kernel versions at roughly the same time. This change is especially stark between
major kernel versions 5 and 6, which seem to have traded places in usage almost overnight.

The first time that major version 5 of the kernel shows up is on {first_kernel5}. From here, it
took a long time for the last of the version 4 kernels to disappear, coinciding with the big
switch between major versions 5 and 6. The last time a major version 4 kernel is seen is on
{last_kernel4}, while the last major version 5 kernels still pop up.
It would seem, then, that the people still running kernel version 4 used the opportunity of
everybody switching to the stable version of 6 to also upgrade their machines.

## Odds and Ends

### The PopCorn files

Let's have a look at the provided PopCorn statistics files themselves.

The files consist of a long list of packages which have been reported to the
central server that day, along with the number of instances of each package. The
number of unique installations from which these statistics are derived is given
as a single total. Each file also contains the same list once again, but broken
down by the specific installed versions of each package.

So if somebody has v0.9.1 and somebody else v0.9.3 installed, this list counts
both versions separately. Another count is the
number of different kernels that have been used on that day, with their exact
kernel name including major version, minor version and any suffix.

```json
{
  "UniqueInstalls": 2,
  "Packages": {
    "ImageMagick": 1,
    "PopCorn": 2,
    "acpi": 1,
    "alsa-utils": 2,
    ...
    "xurls": 1,
    "youtube-dl": 1
  },
  "Versions": {
    "ImageMagick": {
      "6.9.9.40_1": 1
    },
    "PopCorn": {
      "0.2_1": 2
    },
    "acpi": {
      "1.7_3": 1
    },
    "alsa-utils": {
      "1.1.5_1": 1,
      "1.1.6_1": 1
    },
    ...
  },
  "XuKernel": {
    "4.16.6_1": 1
  }
}
```

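For readers who want to poke at such a file themselves, here is a minimal sketch using only the standard library. It parses a shortened but complete toy report (the excerpt above is elided and therefore not valid JSON as printed) and counts distinct packages and package versions. The consistency check assumes that a package's total equals the sum over its versions, which holds in the excerpt but is not confirmed for every file.

```python
import json

# Shortened toy report in the same layout as the excerpt above.
raw = """
{
  "UniqueInstalls": 2,
  "Packages": {"PopCorn": 2, "alsa-utils": 2},
  "Versions": {
    "PopCorn": {"0.2_1": 2},
    "alsa-utils": {"1.1.5_1": 1, "1.1.6_1": 1}
  },
  "XuKernel": {"4.16.6_1": 1}
}
"""

report = json.loads(raw)

# Assumption: per-package totals equal the sum over that package's versions.
for name, total in report["Packages"].items():
    assert total == sum(report["Versions"][name].values())

distinct_packages = len(report["Packages"])
distinct_versions = sum(len(v) for v in report["Versions"].values())
print(distinct_packages, "packages,", distinct_versions, "package versions")
```
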
```{python}
from notebooks.popcorn import tab_pkg
outp, defs = tab_pkg.run()
outp
```

There are some missing days in the statistics.

```{python}
from notebooks.popcorn import tab_missing_days

outp, defs = tab_missing_days.run()
outp
```

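How `tab_missing_days` builds its table is not shown here. One straightforward way to find such gaps, sketched below with invented dates and a hypothetical `missing_days` helper, is to walk every calendar day between the first and last report and keep the days that have no file.

```python
from datetime import date, timedelta


def missing_days(present: list[date]) -> list[date]:
    """Return every day between the first and last date that is absent."""
    if not present:
        return []
    seen = set(present)
    first, last = min(seen), max(seen)
    n_days = (last - first).days
    return [
        first + timedelta(days=i)
        for i in range(n_days + 1)
        if first + timedelta(days=i) not in seen
    ]


# Invented report dates with a two-day gap.
toy_days = [date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 5)]
print(missing_days(toy_days))
```
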
## Outline

- intro
- filesize
- unique installations reported from
- packages -> perhaps find new subcategories
  - global
  - relative (pkg/unique)
  - top packages
  - rare packages?
  - install distribution
  - packages per time unit (find clever title, e.g. 'accumulated packages')
    - per year?
    - weekday
    - month of year (combine with weekday?)
- kernels
  - overall kernel version installations
  - kernels over time

- misc
  - missing days
  - moved days

- things we can't see (limitations)
  - packages on offer in the repositories
    - this could shed light on the bumps of users and relative package ownership

Modified date != descriptive (named) date

```{python}
from notebooks.popcorn import plt_modified_times
pplot(plt_modified_times)
```