diff --git a/CHANGELOG.md b/CHANGELOG.md
index e69de29..376255c 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -0,0 +1,10 @@
+# CHANGELOG
+
+## 2025-09-30
+
+- Clean output data until 2025-09-24
+- Removal of 0-byte empty raw data files
+
+## 2025-09-29
+
+- Dirty output data until 2025-09-25
diff --git a/README.md b/README.md
index 94b570e..e7c6e20 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,57 @@
-# Project
+# Project Popcorn Voidlinux
+
+Contains the gathered data of the statistics collection for all
+available dates (2025-09-24 as of today) in easier to work with CSV form.
+
+Data can be cleaned and processed with the available code.
+Any action can easily be started using [`just`](https://github.com/casey/just) with the available `justfile`.
 
 ## Dataset structure
 
-- All inputs (i.e. building blocks from other sources) are located in
-  `inputs/`.
+- All inputs (i.e. building blocks from other sources) are located in `input/`.
 - All custom code is located in `code/`.
+- All final output data is located in `output/`
+
+## Output data structure
+
+### Files
+
+Represents information about the individual JSON files available in the raw dataset.
+
+Contained in `files.csv`, 4 columns:
+
+- `date`: the date a specific file is relevant for
+- `filename`: the full filename as it exists in the `input/` directory
+- `mtime`: the last modification time of the file on the system
+- `filesize`: the size of the file, in bytes
+
+### Kernels
+
+Represents information about the kernel versions represented in the raw dataset.
+
+Contained in `kernels.csv`, 3 columns:
+
+- `date`: the date a specific file is relevant for
+- `kernel`: the full kernel name that is available in the raw data, including major version, minor
+  version and suffix
+- `downloads`: the amount of times the kernel has been seen on the observation date
+
+### Packages
+
+Represents information about the package versions represented in the raw dataset.
+
+Contained in `packages.csv`, 4 columns:
+
+- `date`: the date a specific file is relevant for
+- `package`: the full package name as it is available in the raw data
+- `version`: the full package version as it is available in the raw data
+- `count`: the amount of times the package and version combination has been seen on the observation date
+
+### Unique installs
+
+Represents information about the unique system installations represented in the raw dataset.
+
+Contained in `packages.csv`, 2 columns:
+
+- `date`: the date a specific file is relevant for
+- `unique`: the amount of unique installations counted on the observation date
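
The column layout the README describes for `packages.csv` could be consumed roughly as in the sketch below. It is not part of this change and makes several assumptions: that the CSV has a header row with the documented column names, that `count` holds integers, and that summing over versions is a useful aggregation. The function name `totals_per_package` and all sample values are hypothetical and invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows using the column names from the README.
# The package names, versions and counts are invented for illustration.
SAMPLE = """date,package,version,count
2025-09-24,example-pkg,1.0_1,3
2025-09-24,example-pkg,1.1_1,5
2025-09-24,other-pkg,2.0_1,7
"""


def totals_per_package(csv_file):
    """Sum `count` over all versions of each (date, package) pair."""
    totals = defaultdict(int)
    # Assumes a header row; DictReader maps each row to its column names.
    for row in csv.DictReader(csv_file):
        totals[(row["date"], row["package"])] += int(row["count"])
    return dict(totals)


totals = totals_per_package(io.StringIO(SAMPLE))
for (date, package), count in sorted(totals.items()):
    print(date, package, count)
```

To run this against the real data, the `io.StringIO(SAMPLE)` argument would be replaced by a file opened with `open("output/packages.csv", newline="")`, assuming the file lives at that path as the README suggests.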