Add appendix text

Marty Oehme 2025-10-08 20:30:57 +02:00
parent 06ee312c80
commit 2c5cf37b2c
Signed by: Marty
GPG key ID: 4E535BC19C61886E


@ -523,6 +523,57 @@ provides a more macro-level view on how big the statistics have grown to be over
We can see that, as each individually reported day adds up to 400KB nowadays, the
cumulative size is up to almost 700MB currently.
### Packages monthwise and per weekday
Let's also look at the packages installed on systems for different time slices.
We'll start with a look at the packages per weekday.
```{python}
from notebooks.popcorn import plt_weekday_packages
pplot(plt_weekday_packages)
```
There is no significant difference between the individual weekdays, as we would
expect. It would be strange for there to be a specific day on which everybody
decides to install or uninstall packages.
That said, there is some slight variation: Wednesdays generally have slightly
fewer total packages than the other days, while Tuesdays sit slightly above
the curve.
Let's just imagine everybody gets bored on Tuesday, installs a new package and
drops it again by Wednesday, along with a slew of other packages. Try-out
Tuesdays and Wastebin Wednesdays if you will.
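To make the weekday comparison concrete, here is a minimal sketch of how such a per-weekday average could be computed. It is not the code behind `plt_weekday_packages`; the function name `mean_packages_per_weekday` and the sample totals are made up for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Fixed English names, so the result does not depend on the system locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def mean_packages_per_weekday(daily_totals):
    """Average the daily package totals separately for each weekday."""
    by_weekday = defaultdict(list)
    for day, total in daily_totals.items():
        # date.weekday() returns 0 for Monday through 6 for Sunday.
        by_weekday[WEEKDAYS[day.weekday()]].append(total)
    return {name: mean(values) for name, values in by_weekday.items()}


# Invented totals, NOT taken from the real statistics.
sample = {
    date(2025, 9, 1): 100,   # Monday
    date(2025, 9, 2): 110,   # Tuesday
    date(2025, 9, 3): 90,    # Wednesday
    date(2025, 9, 8): 102,   # Monday
    date(2025, 9, 9): 112,   # Tuesday
    date(2025, 9, 10): 92,   # Wednesday
}
weekday_means = mean_packages_per_weekday(sample)
```

Averaging per weekday, rather than summing, keeps the comparison fair even if one weekday happens to appear more often in the data than another.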
Alright, but let's also take a look at the package numbers per month instead.
```{python}
from notebooks.popcorn import plt_month_packages
pplot(plt_month_packages)
```
Here we can see a bit more variation. First, it is important to note that I have
removed the months of 2018 prior to October from the analysis and cut off any
days after September 2025, so that only full years are represented and no
month is present more often than the others.[^months-removed]
[^months-removed]: I chose to drop the first couple of months in the data
rather than the most recent ones, as fewer people were collecting data at the
time, so we lose less. Additionally, I presume people are generally more
interested in current statistics than in older ones.
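The trimming to full years can be sketched as follows. This is an illustration rather than the actual analysis code: the names `START`, `END` and `trim_to_full_years` are hypothetical, and the list of days is a synthetic calendar, not the real data set.

```python
from collections import Counter
from datetime import date, timedelta

START = date(2018, 10, 1)  # first day kept: October 2018 onwards
END = date(2025, 9, 30)    # last day kept: up to the end of September 2025


def trim_to_full_years(days):
    """Keep only the days inside the full-year window."""
    return [d for d in days if START <= d <= END]


# Synthetic calendar for illustration; the real data starts and ends elsewhere.
first, last = date(2018, 1, 1), date(2025, 10, 8)
all_days = [first + timedelta(days=i) for i in range((last - first).days + 1)]

kept = trim_to_full_years(all_days)

# Count in how many distinct years each calendar month is represented.
year_months = {(d.year, d.month) for d in kept}
years_per_month = Counter(month for _, month in year_months)
```

With this window every calendar month is represented by the same number of years, which is the property the monthly comparison relies on.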
It is quite surprising to me just how much variation is visible in the results:
months from October to February have markedly fewer packages than the spring
and summer months. Are people generally more willing to use and try out new
packages in the summer? Alternatively, did the major usage dips happen to take
place during winter, while the increases in usage occurred more toward summer?
I have not delved deep into the interpretation of these questions, but it may
be interesting to do so. The last option, of course, is that the data itself,
the data collection or analysis contains an error that I am not aware of.
### Missing days and dates
There are some missing days in the statistics.
```{python}
@ -532,17 +583,19 @@ outp, defs = tab_missing_days.run()
outp
```
### Packages monthwise and per weekday
These missing days occur primarily at the end of January 2019 and
throughout 2025. However, with over 2600 days where the statistics _are_
available, these rows represent an insignificant issue for the overall data.
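For reference, gaps like these can be found by walking the expected date range and checking which days are absent. The sketch below is not the implementation of `tab_missing_days`; the function `find_missing_days` and the sample dates are invented for illustration.

```python
from datetime import date, timedelta


def find_missing_days(available, start, end):
    """Return every day between start and end (inclusive) not in available."""
    present = set(available)
    missing = []
    current = start
    while current <= end:
        if current not in present:
            missing.append(current)
        current += timedelta(days=1)
    return missing


# Invented example: nine consecutive days with a three-day hole in the middle.
start, end = date(2020, 3, 1), date(2020, 3, 9)
available = [start + timedelta(days=i) for i in range(9) if i not in (3, 4, 5)]

missing = find_missing_days(available, start, end)
```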
```{python}
from notebooks.popcorn import plt_weekday_packages
pplot(plt_weekday_packages)
```
It would seem there was some kind of issue collecting or storing the collected
data at that point in 2019, which means a few days in a row are missing. This
skews absolute numbers for that week downwards, as well as any weekly averages
relying on this date range.
```{python}
from notebooks.popcorn import plt_month_packages
pplot(plt_month_packages)
```
However, no significant visual differences stem from this fact, which is why it
is not called out in the main article. It remains an interesting fact and,
were this a more rigorous investigation, perhaps worth taking into account
as a source of bias in the results, but for our purposes it is not too bad.
## Outline
@ -570,10 +623,3 @@ pplot(plt_month_packages)
- things we can't see (limitations)
- packages on offer in the repositories
- this could shed light on the bumps of users and relative package ownership
Modified date != descriptive (named) date
```{python}
from notebooks.popcorn import plt_modified_times
pplot(plt_modified_times)
```