nuclear_explosions/notebooks/groupby_keep_zero-values.qmd
2024-06-22 23:11:49 +02:00


<!-- TODO: Load missing data from main nuclear_explosions.qmd notebook -->
The following is a simple group-by that counts the rows for each combination of date and country:
```{python}
# | label: fig-percountry-drop
# | fig-cap: "Nuclear explosions by country, 1945-98"
per_country = df.group_by(pl.col("date", "country")).agg(pl.len()).sort("date")
g = sns.lineplot(data=per_country, x="date", y="len", hue="country")
g.set_xlabel("Year")
g.set_ylabel("Count")
plt.setp(
    g.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor"
)  # ensure rotated right-anchor
plt.show()
```
This works well for grouping in general, but there is an issue:
if a country has no entries at all in a given year,
the resulting DataFrame will not contain a `Date | Cty | 0` row; the row is simply missing.
That can be desirable for some applications, but if we then
draw a line plot from this data, the line is drawn straight between the
years that do have values and **does not drop to 0 for the years where a country has no entry**.
We can fix this by first building the cross product of all keys we always want a row for.
Then we do the group-by as before and left-join its result onto this cross product.
The end result keeps every row of the cross product while still carrying the aggregated `len`
column as before. Wherever a row has no `len` value after the join, we fill in a 0.
```{python}
# | label: fig-percountry-keep
# | fig-cap: "Nuclear explosions by country, 1945-98"
keys = df.select("date").unique().join(df.select("country").unique(), how="cross")
per_country = keys.join(
    df.group_by(["date", "country"], maintain_order=True).len(),
    on=["date", "country"],
    how="left",
    coalesce=True,
).with_columns(pl.col("len").fill_null(0))
g = sns.lineplot(data=per_country, x="date", y="len", hue="country")
g.set_xlabel("Year")
g.set_ylabel("Count")
plt.setp(
    g.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor"
)  # ensure rotated right-anchor
plt.show()
```
A neater, function-based version (using the same approach under the hood) can be found
here: https://github.com/pola-rs/polars/issues/15997#issuecomment-2089362557