<!-- TODO: Load missing data from main nuclear_explosions.qmd notebook -->

The following is a simple group-by that counts the number of rows per date and country:

```{python}
# | label: fig-percountry-drop
# | fig-cap: "Nuclear explosions by country, 1945-98"
per_country = df.group_by(pl.col("date", "country")).agg(pl.len()).sort("date")

g = sns.lineplot(data=per_country, x="date", y="len", hue="country")
g.set_xlabel("Year")
g.set_ylabel("Count")
plt.setp(
    g.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor"
)  # ensure rotated right-anchor
plt.show()
```

This works well for grouping in general, but there is an issue:
if there is a year in which a country has no entries at all,
the resulting DataFrame will not contain a row like `Date | Cty | 0`; it will contain no row for that combination at all.

This can be desirable for some applications. But if we then
draw a line plot from this result, the plot interpolates between the
existing country values and does **not drop the line down to 0 for the years where a country does not have an entry**.

We can fix this by first building a cross product of all keys we always want a row for.
Then we do the group-by as before, but left-join its result onto this cross product.

The end result keeps every row from the cross product while still aggregating into a `len`
column as before. Wherever no `len` value exists after the join, we fill in a 0 instead.

```{python}
# | label: fig-percountry-keep
# | fig-cap: "Nuclear explosions by country, 1945-98"
keys = df.select("date").unique().join(df.select("country").unique(), how="cross")
per_country = keys.join(
    df.group_by(["date", "country"], maintain_order=True).len(),
    on=["date", "country"],
    how="left",
    coalesce=True,
).with_columns(pl.col("len").fill_null(0))

g = sns.lineplot(data=per_country, x="date", y="len", hue="country")
g.set_xlabel("Year")
g.set_ylabel("Count")
plt.setp(
    g.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor"
)  # ensure rotated right-anchor
plt.show()
```

A tidier, function-based version (using the same approach under the hood) can be found
here: https://github.com/pola-rs/polars/issues/15997#issuecomment-2089362557
