From 707632fb7dc525100a67a72f51a073b25c09f7ac Mon Sep 17 00:00:00 2001 From: Marty Oehme Date: Wed, 8 Oct 2025 09:27:32 +0200 Subject: [PATCH] Add daily filesize text --- popcorn.qmd | 155 ++++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 120 insertions(+), 35 deletions(-) diff --git a/popcorn.qmd b/popcorn.qmd index de468f7..7cd4458 100644 --- a/popcorn.qmd +++ b/popcorn.qmd @@ -5,7 +5,23 @@ subtitle: "Analysis of voidlinux package and kernel statistics" This notebook analyses the daily package repository statistics files, colloquially known as 'popcorn' files, that are generated by the Void Linux -package manager `xbps` and uploaded by users who have opted in to share. +tool [`PopCorn`](https://voidlinux.org/packages/?arch=x86_64&q=popcorn) and +uploaded by users who have opted in to share with it. + +The tool collects a variety of statistics from the user which it anonymizes +(you only get a uuid to ensure your stats are only uploaded once, and +that's it) and mixes with all other daily uploaded statistics on the server. +The result is an overview of the package and kernel ecosystem of the Void Linux +distribution collected from a multitude of unique installations. + +Before we get into the pretty graphs I would like to point out that you too can +participate in this collection, and thereby help the distribution maintainers +know which packages are more relevant or, perhaps more important, less relevant +for end users. + +To upload your own statistics anonymously, simply `xbps-install PopCorn` and +enable its service to run in the background with `ln -s /etc/sv/popcorn +/var/service`. ```{python} # | echo: false @@ -47,25 +63,28 @@ LetsPlot.setup_html() ## Daily statistics file size -The simplest operation we can do is look at the overall file size for each of the daily -statistics files over time. 
The files consist of a long list of packages which have been checked -from the repositories that day, along with the number of package instances. It also consists of -the same list separated by specifically installed versions of packages, so if somebody has -v0.9.1 and somebody else v0.9.3 instead this would count both packages separately. +Before we take an in-depth look at the packages, let's quickly get an overview +of the stats files themselves. These are the raw `JSON` files that we are using +throughout the rest of the article, reporting installed packages and kernels, +and their respective versions. -Another count is the number of different Kernels that have been used on that day, with their -exact kernel name including major version, minor version and any suffix. +A look at the overall file size for each of the daily statistics files over +time reveals not necessarily changes in the absolute use of packages (e.g. +`neovim` being installed more or less often). Whether it has been downloaded +once or 100 times, the file size does not change drastically. Instead it +increases much more drastically when both `neovim` and `emacs` are installed, +or a variety of different versions for one of the packages are installed. +Similarly for different versions of the kernel. -These are the major things that will lead to size increases in the file, but not just for an -increased amount of absolute users, packages or uploads --- we will get to those shortly. +An increase in file size here mainly suggests an increase in the 'breadth' of +files on offer in the repository, whether that be a _wider variety_ of packages +or more different package _versions_ that people are interested in, and that +the community chooses to use. 
-No, an increase in file size here mainly suggests an increase in the 'breadth' of files on offer -in the repository, whether that be a wider variety of program versions or more different -packages that people are interested in, and those that the community chooses to use. - -So while the overall amount of packages gives a general estimate of the interest in the -distribution, this can show a more 'distributor'-aligned view on how many different aisles of -the buffet people are eating from. +So while the overall amount of packages and installations --- we'll get to those +--- gives a general estimate of the interest in the distribution, we see at a +glance a more 'user-choice'-aligned view of how many aisles the buffet offers, +and how many of those the users are eating from. ```{python} # | echo: true @@ -75,22 +94,32 @@ pplot(plt_filesize) As we can see, the difference over time is massive. Especially early on, between 2019 and the start of 2021, the amount of different packages and package versions used grew rapidly, with the -pace picking up once again starting 2023. +pace also picking up once again starting 2023. -There are a few outlier days with a size of 0 kB, which we will remove from the data. In all -likelihood, those days were not reported correctly or there was some kind of issue on the -backend so the stats for those days are lost. +From a reported filesize of around 50kB in the first days before the end of +2019 we have easily tripled the filesize to over 150kB needed for the report +per day. Nowadays we have reached just about 400kB daily report size, over 8 +times the size beginning 2018. -There are also a few days where the modification date of the file does not correspond to the -represented statistical date but those are kept. This rather points to certain times when the -files have been moved on the backend, or recreated externally but does not mean the data are -bad. 
+There are a few outlier days with a size of 0 kB on the server, which we had to +remove from the data. In all likelihood, those days were not reported correctly +or there was some kind of issue on the backend so the stats for those days are +lost. + +We take a look at the missing days +among other things at the end of this article. -```{python} -from notebooks.popcorn import tab_pkg -outp, defs = tab_pkg.run() -outp -``` +There are also a few days where the modification date of the file does not +correspond to the represented statistical date but those are kept. This rather +points to certain times when the files have been moved on the backend, +recreated externally, or otherwise externally modified but does not mean the +data contained are bad. + +Let's not over-emphasize the results here: it is a very coarse view on the +packages and kernels underneath. Still, the shape of the curve above is a good +indication of what awaits us in other stats: 2019 to 2021 were massive years +of growth with some slow-down in 2021. From 2022 onwards the growth picks back +up again, if at a more mellow pace. ## Package statistics @@ -152,11 +181,11 @@ twenty_thirty = df_pkg_dl.sort("count", descending=False).filter( thirty_plus = df_pkg_dl.sort("count", descending=False).filter((pl.col("count") >= 30)) ``` -There are `{python} f"{get_num(one_ten_installs):,}"` packages which have between one -and ten installations in the data, `{python} f"{get_num(ten_twenty_installs):,}"` packages between eleven and 20 installations, and -`{python} f"{get_num(twenty_thirty):,}"` packages between 21 and 30 installations. -`{python} f"{get_num(thirty_plus):,}"` packages have over 30 installations. +There are `{python} f"{get_num(one_ten_installs):,}"` packages which have between one +and ten installations in the data, `{python} f"{get_num(ten_twenty_installs):,}"` packages between eleven and 20 installations, and +`{python} f"{get_num(twenty_thirty):,}"` packages between 21 and 30 installations. 
+`{python} f"{get_num(thirty_plus):,}"` packages have over 30 installations. ## Kernel Analysis @@ -171,7 +200,7 @@ When looking at the kernel versions used, we see a very strong jump between majo For this analysis we had to exclude {kernel_df_v99.select(pl.len()).item()} rows which were apparently from the future, as they were running variations of major kernel version 99. In all likelihood there is a custom kernel version out there which reports its own major version as 99. -The strange version starts appearing on {kernel_df_v99.select("date").row(0)[0]} and shows up +The strange version starts appearing on {kernel_df_v99.select("date").row(0)[0]} and shows up all the way until {kernel_df_v99.select("date").row(-1)[0]}. ```{python} @@ -209,6 +238,62 @@ everybody switching to the stable version of 6 to also upgrade their machines. ## Odds and Ends +### The PopCorn files + +Let's have a look at the provided PopCorn statistics files themselves. + +The files consist of a long list of packages which have been reported to the +central server that day, along with the number of package instances. The number +of unique installations from which these statistics are derived is given as a +sum. The files also contain the same list once again, but separated by +specifically installed versions of packages. + +So if somebody has v0.9.1 and somebody else v0.9.3 instead this list counts the +number of both packages with their versions separately. Another count is the +number of different Kernels that have been used on that day, with their exact +kernel name including major version, minor version and any suffix. + +```json +{ + "UniqueInstalls": 2, + "Packages": { + "ImageMagick": 1, + "PopCorn": 2, + "acpi": 1, + "alsa-utils": 2, + ... + "xurls": 1, + "youtube-dl": 1 + }, + "Versions": { + "ImageMagick": { + "6.9.9.40_1": 1 + }, + "PopCorn": { + "0.2_1": 2 + }, + "acpi": { + "1.7_3": 1 + }, + "alsa-utils": { + "1.1.5_1": 1, + "1.1.6_1": 1 + }, + ... 
+ }, + "XuKernel": { + "4.16.6_1": 1 + } +``` + + +```{python} +from notebooks.popcorn import tab_pkg +outp, defs = tab_pkg.run() +outp +``` + + There are some missing days in the statistics. ```{python}