Add daily filesize text
parent cfc8ecc4fd
commit 707632fb7d
1 changed files with 120 additions and 35 deletions
155
popcorn.qmd
|
|
@ -5,7 +5,23 @@ subtitle: "Analysis of voidlinux package and kernel statistics"
|
|||
|
||||
This notebook analyses the daily package repository statistics files,
|
||||
colloquially known as 'popcorn' files, that are generated by the Void Linux
|
||||
package manager `xbps` and uploaded by users who have opted in to share.
|
||||
tool [`PopCorn`](https://voidlinux.org/packages/?arch=x86_64&q=popcorn) and
|
||||
uploaded by users who have opted in to share with it.
|
||||
|
||||
The tool collects a variety of statistics from the user which it anonymizes
|
||||
(you only get a UUID to ensure your stats are only uploaded once, and
|
||||
that's it) and mixes with all other daily uploaded statistics on the server.
|
||||
The result is an overview of the package and kernel ecosystem of the Void Linux
|
||||
distribution collected from a multitude of unique installations.
|
||||
|
||||
Before we get into the pretty graphs I would like to point out that you too can
|
||||
participate in this collection, and thereby help the distribution maintainers
|
||||
know which packages are more relevant or, perhaps more important, less relevant
|
||||
for end users.
|
||||
|
||||
To upload your own statistics anonymously, simply `xbps-install PopCorn` and
|
||||
enable its service to run in the background with `ln -s /etc/sv/popcorn
|
||||
/var/service`.
|
||||
|
||||
```{python}
|
||||
# | echo: false
|
||||
|
|
@ -47,25 +63,28 @@ LetsPlot.setup_html()
|
|||
|
||||
## Daily statistics file size
|
||||
|
||||
The simplest operation we can do is look at the overall file size for each of the daily
|
||||
statistics files over time. The files consist of a long list of packages which have been checked
|
||||
from the repositories that day, along with the number of package instances. It also consists of
|
||||
the same list separated by specifically installed versions of packages, so if somebody has
|
||||
v0.9.1 and somebody else v0.9.3 instead this would count both packages separately.
|
||||
Before we take an in-depth look at the packages, let's quickly get an overview
|
||||
of the stats files themselves. These are the raw `JSON` files that we are using
|
||||
throughout the rest of the article, reporting installed packages and kernels,
|
||||
and their respective versions.
|
||||
|
||||
Another count is the number of different Kernels that have been used on that day, with their
|
||||
exact kernel name including major version, minor version and any suffix.
|
||||
A look at the overall file size for each of the daily statistics files over
time does not necessarily reveal changes in the absolute use of packages (e.g.
`neovim` being installed more or less often). Whether a package has been
installed once or 100 times, the file size barely changes. Instead, it
grows much more when both `neovim` and `emacs` are installed, or when a
variety of different versions of one package are installed. The same holds
for different versions of the kernel.
|
||||
|
||||
These are the major things that will lead to size increases in the file, but not just for an
|
||||
increased amount of absolute users, packages or uploads --- we will get to those shortly.
|
||||
An increase in file size here mainly suggests an increase in the 'breadth' of
|
||||
files on offer in the repository, whether that be a _wider variety_ of packages
|
||||
or more different package _versions_ that people are interested in, and that
|
||||
the community chooses to use.
|
||||
|
||||
No, an increase in file size here mainly suggests an increase in the 'breadth' of files on offer
|
||||
in the repository, whether that be a wider variety of program versions or more different
|
||||
packages that people are interested in, and those that the community chooses to use.
|
||||
|
||||
So while the overall amount of packages gives a general estimate of the interest in the
|
||||
distribution, this can show a more 'distributor'-aligned view on how many different aisles of
|
||||
the buffet people are eating from.
|
||||
So while the overall amount of packages and installations --- we'll get to those
|
||||
--- gives a general estimate of the interest in the distribution, we see at a
|
||||
glance a more 'user-choice'-aligned view of how many aisles the buffet offers,
|
||||
and how many of those the users are eating from.
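
To make the breadth-versus-count point concrete, here is a minimal sketch in
plain Python. It is not the notebook's own code: the report layout is
simplified and the version strings are made up for illustration. It only shows
that raising a count adds a few bytes, while adding packages or versions adds
whole new entries.

```python
import json


def report_size(packages, versions):
    """Byte size of a minimal PopCorn-like report (layout simplified for illustration)."""
    report = {"Packages": packages, "Versions": versions}
    return len(json.dumps(report).encode("utf-8"))


# One package, one version, installed once versus 100 times.
once = report_size({"neovim": 1}, {"neovim": {"0.9.1_1": 1}})
often = report_size({"neovim": 100}, {"neovim": {"0.9.1_1": 100}})

# The same 100 installs, but spread over two versions, plus a second package.
# All version strings here are hypothetical.
broad = report_size(
    {"neovim": 100, "emacs": 1},
    {"neovim": {"0.9.1_1": 60, "0.9.3_1": 40}, "emacs": {"29.1_1": 1}},
)

print(once, often, broad)
```

Going from 1 to 100 installs only adds a handful of digits, whereas the
broader report grows by every extra key it has to list.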
|
||||
|
||||
```{python}
|
||||
# | echo: true
|
||||
|
|
@ -75,22 +94,32 @@ pplot(plt_filesize)
|
|||
|
||||
As we can see, the difference over time is massive. Especially early on, between 2019 and the
|
||||
start of 2021, the amount of different packages and package versions used grew rapidly, with the
|
||||
pace picking up once again starting 2023.
|
||||
pace also picking up once again starting 2023.
|
||||
|
||||
There are a few outlier days with a size of 0 kB, which we will remove from the data. In all
|
||||
likelihood, those days were not reported correctly or there was some kind of issue on the
|
||||
backend so the stats for those days are lost.
|
||||
From a reported filesize of around 50kB in the first days, we had easily
tripled the filesize to over 150kB per daily report before the end of 2019.
Nowadays we have reached just about 400kB daily report size, around 8
times the size at the beginning in 2018.
|
||||
|
||||
There are also a few days where the modification date of the file does not correspond to the
|
||||
represented statistical date but those are kept. This rather points to certain times when the
|
||||
files have been moved on the backend, or recreated externally but does not mean the data are
|
||||
bad.
|
||||
There are a few outlier days with a size of 0 kB on the server, which we had to
|
||||
remove from the data. In all likelihood, those days were not reported correctly
|
||||
or there was some kind of issue on the backend so the stats for those days are
|
||||
lost.
|
||||
<!-- TODO: is this still true? -->
|
||||
We take a look at the missing days
|
||||
among other things at the end of this article.
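
As a rough illustration of that clean-up step, the sketch below drops
zero-size days and lists the dates that are then absent from the series. It
uses plain Python rather than the notebook's data frames, and the sample
records are invented for the example.

```python
from datetime import date, timedelta

# Hypothetical sample: (statistics date, file size in kB).
daily_sizes = [
    (date(2019, 1, 1), 52.1),
    (date(2019, 1, 2), 0.0),   # empty file on the server, treated as lost
    (date(2019, 1, 4), 53.4),  # 2019-01-03 has no file at all
]

# Keep only days that actually contain data.
valid = [(day, size) for day, size in daily_sizes if size > 0]


def missing_days(days):
    """Return every date between the first and last day that is not present."""
    present = set(days)
    start, end = min(present), max(present)
    span = (end - start).days
    candidates = (start + timedelta(days=n) for n in range(span + 1))
    return [day for day in candidates if day not in present]


gaps = missing_days(day for day, _ in valid)
print(gaps)
```

Removing the empty days first means they show up as gaps as well, which is
what we want when counting lost days.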
|
||||
|
||||
```{python}
|
||||
from notebooks.popcorn import tab_pkg
|
||||
outp, defs = tab_pkg.run()
|
||||
outp
|
||||
```
|
||||
There are also a few days where the modification date of the file does not
|
||||
correspond to the represented statistical date but those are kept. This rather
|
||||
points to certain times when the files have been moved on the backend,
|
||||
recreated externally, or otherwise externally modified but does not mean the
|
||||
data contained are bad.
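
A check for such mismatches could look roughly like the following. This is a
sketch under assumptions: the file name pattern `popcorn_YYYY-MM-DD.json` and
both helper names are hypothetical, chosen only to illustrate comparing the
statistical date with the modification date.

```python
from datetime import date, datetime
from pathlib import Path


def stats_date(path: Path) -> date:
    """Parse the statistics date from a name like 'popcorn_2018-05-09.json' (assumed pattern)."""
    return date.fromisoformat(path.stem.split("_", 1)[1])


def mtime_differs(path: Path) -> bool:
    """True if the file was last modified on a different (local) day than it reports on."""
    modified = datetime.fromtimestamp(path.stat().st_mtime).date()
    return modified != stats_date(path)


# Parsing the name does not touch the file system.
print(stats_date(Path("popcorn_2018-05-09.json")))
```

Files flagged this way are only marked, not dropped, in line with keeping
those days in the data.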
|
||||
|
||||
Let's not over-emphasize the results here: it is a very coarse view on the
|
||||
packages and kernels underneath. Still, the shape of the curve above is a good
|
||||
indication for what awaits us in other stats: 2019 to 2021 were massive years
|
||||
of growth with some slow-down in 2021. From 2022 onwards the growth picks back
|
||||
up again, if at a more mellow pace.
|
||||
|
||||
## Package statistics
|
||||
|
||||
|
|
@ -152,11 +181,11 @@ twenty_thirty = df_pkg_dl.sort("count", descending=False).filter(
|
|||
thirty_plus = df_pkg_dl.sort("count", descending=False).filter((pl.col("count") >= 30))
|
||||
```
|
||||
|
||||
There are `{python} f"{get_num(one_ten_installs):,}"` packages which have between one
|
||||
and ten installations in the data, `{python} f"{get_num(ten_twenty_installs):,}"`
|
||||
There are `{python} f"{get_num(one_ten_installs):,}"` packages which have between one
and ten installations in the data, `{python} f"{get_num(ten_twenty_installs):,}"`
|
||||
packages between eleven and 20 installations, and
|
||||
`{python} f"{get_num(twenty_thirty):,}"` packages between 21 and 30 installations.
|
||||
`{python} f"{get_num(thirty_plus):,}"` packages have over 30 installations.
|
||||
`{python} f"{get_num(twenty_thirty):,}"` packages between 21 and 30 installations.
`{python} f"{get_num(thirty_plus):,}"` packages have over 30 installations.
|
||||
|
||||
## Kernel Analysis
|
||||
|
||||
|
|
@ -171,7 +200,7 @@ When looking at the kernel versions used, we see a very strong jump between majo
|
|||
For this analysis we had to exclude {kernel_df_v99.select(pl.len()).item()} rows which were
|
||||
apparently from the future, as they were running variations of major kernel version 99. In all
|
||||
likelihood there is a custom kernel version out there which reports its own major version as 99.
|
||||
The strange version starts appearing on {kernel_df_v99.select("date").row(0)[0]} and shows up
|
||||
The strange version starts appearing on {kernel_df_v99.select("date").row(0)[0]} and shows up
|
||||
all the way until {kernel_df_v99.select("date").row(-1)[0]}.
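
The exclusion itself boils down to reading the major version off the kernel
string. A minimal sketch in plain Python, with made-up kernel counts and a
hypothetical helper name (the notebook works on `kernel_df_v99` instead):

```python
def kernel_major(version: str) -> int:
    """Return the major version of a kernel string such as '4.16.6_1'."""
    return int(version.split(".", 1)[0])


# Hypothetical daily kernel counts, keyed by full kernel version string.
kernels = {"4.16.6_1": 1, "6.1.31_1": 5, "99.0.0_1": 1}

# Split into implausible 'from the future' rows and the rest.
from_the_future = {k: n for k, n in kernels.items() if kernel_major(k) == 99}
plausible = {k: n for k, n in kernels.items() if kernel_major(k) != 99}

print(plausible)
print(from_the_future)
```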
|
||||
|
||||
```{python}
|
||||
|
|
@ -209,6 +238,62 @@ everybody switching to the stable version of 6 to also upgrade their machines.
|
|||
|
||||
## Odds and Ends
|
||||
|
||||
### The PopCorn files
|
||||
|
||||
Let's have a look at the provided PopCorn statistics files themselves.
|
||||
|
||||
The files consist of a long list of packages which have been reported to the
central server that day, along with the number of package instances. The number
of unique installations from which these statistics are derived is given as a
sum. The file also contains the same list once again, but separated by
specifically installed versions of packages.
|
||||
|
||||
So if somebody has v0.9.1 and somebody else v0.9.3 instead this list counts the
|
||||
number of both packages with their versions separately. Another count is the
|
||||
number of different Kernels that have been used on that day, with their exact
|
||||
kernel name including major version, minor version and any suffix.
|
||||
|
||||
```json
|
||||
{
|
||||
"UniqueInstalls": 2,
|
||||
"Packages": {
|
||||
"ImageMagick": 1,
|
||||
"PopCorn": 2,
|
||||
"acpi": 1,
|
||||
"alsa-utils": 2,
|
||||
...
|
||||
"xurls": 1,
|
||||
"youtube-dl": 1
|
||||
},
|
||||
"Versions": {
|
||||
"ImageMagick": {
|
||||
"6.9.9.40_1": 1
|
||||
},
|
||||
"PopCorn": {
|
||||
"0.2_1": 2
|
||||
},
|
||||
"acpi": {
|
||||
"1.7_3": 1
|
||||
},
|
||||
"alsa-utils": {
|
||||
"1.1.5_1": 1,
|
||||
"1.1.6_1": 1
|
||||
},
|
||||
...
|
||||
},
|
||||
"XuKernel": {
|
||||
"4.16.6_1": 1
|
||||
}
}
|
||||
```
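
To show how this structure can be read, here is a small sketch with the
standard `json` module. The sample is a shortened copy of the excerpt above.
The consistency check at the end is an assumption drawn from that excerpt,
where per-version counts add up to the package count, not a documented
guarantee of the format.

```python
import json

# Shortened copy of the excerpt above, without the elided entries.
sample = json.loads("""
{
  "UniqueInstalls": 2,
  "Packages": {"PopCorn": 2, "acpi": 1, "alsa-utils": 2},
  "Versions": {
    "PopCorn": {"0.2_1": 2},
    "acpi": {"1.7_3": 1},
    "alsa-utils": {"1.1.5_1": 1, "1.1.6_1": 1}
  },
  "XuKernel": {"4.16.6_1": 1}
}
""")

# Breadth of the report: distinct packages and distinct package versions.
distinct_packages = len(sample["Packages"])
distinct_versions = sum(len(v) for v in sample["Versions"].values())

# Assumed invariant: version counts of a package sum to its package count.
consistent = all(
    sum(sample["Versions"][name].values()) == count
    for name, count in sample["Packages"].items()
)

print(distinct_packages, distinct_versions, consistent)
```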
|
||||
|
||||
|
||||
```{python}
|
||||
from notebooks.popcorn import tab_pkg
|
||||
outp, defs = tab_pkg.run()
|
||||
outp
|
||||
```
|
||||
|
||||
|
||||
There are some missing days in the statistics.
|
||||
|
||||
```{python}
|
||||
|
|
|
|||