initial commit

2024-06-22 23:11:49 +02:00
commit cceb1d1ec0
6 changed files with 5987 additions and 0 deletions
--- a/notebooks/groupby_keep_zero-values.qmd
+++ b/notebooks/groupby_keep_zero-values.qmd
@@ -0,0 +1,54 @@
+<!-- TODO: Load missing data from main nuclear_explosions.qmd notebook -->
+
+The following is a simple group-by that counts the number of rows per date and country:
+
+```{python}
+# | label: fig-percountry-drop
+# | fig-cap: "Nuclear explosions by country, 1945-98"
+per_country = df.group_by(pl.col("date", "country")).agg(pl.len()).sort("date")
+
+g = sns.lineplot(data=per_country, x="date", y="len", hue="country")
+g.set_xlabel("Year")
+g.set_ylabel("Count")
+plt.setp(
+    g.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor"
+)  # ensure rotated right-anchor
+plt.show()
+```
+
+This works well for grouping in general, but there is an issue:
+if a country has no entries at all in a given year,
+the resulting DataFrame will not contain a `Date | Cty | 0` row; it will have no row for that combination at all.
+
+This can be desirable for some applications, but if we then
+draw a line plot from the result, the plot interpolates between the existing
+country values and **does not drop the line down to 0 for the years where a country has no entry**.
+
+We can fix this by first building the cross product of all keys for which we always want a row.
+Then we run the group-by as before and left-join its result onto this cross product.
+
+The end result keeps every row of the cross product while still aggregating into a `len`
+column as before. Where no `len` value exists after the join, we simply fill in a 0.
+
+```{python}
+# | label: fig-percountry-keep
+# | fig-cap: "Nuclear explosions by country, 1945-98"
+keys = df.select("date").unique().join(df.select("country").unique(), how="cross")
+per_country = keys.join(
+    df.group_by(["date", "country"], maintain_order=True).len(),
+    on=["date", "country"],
+    how="left",
+    coalesce=True,
+).with_columns(pl.col("len").fill_null(0))
+
+g = sns.lineplot(data=per_country, x="date", y="len", hue="country")
+g.set_xlabel("Year")
+g.set_ylabel("Count")
+plt.setp(
+    g.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor"
+)  # ensure rotated right-anchor
+plt.show()
+```
+
+A tidier, function-based solution (which uses the same approach under the hood) can be found
+here: https://github.com/pola-rs/polars/issues/15997#issuecomment-2089362557
--- a/notebooks/sns_objects-style.qmd
+++ b/notebooks/sns_objects-style.qmd
@@ -0,0 +1,75 @@
+
+ constructed with seaborn object-style plots instead.
+These kinds of plots fit the workflow I use and the way I think about plotting much better,
+clearly delineating between a plot;
+some visual mark on the plot;
+some statistical transformation;
+and some movement, labeling or scaling operation.
+They are also, however, fairly new and still considered experimental.
+
+They also don't allow *quite* the customization that the other plots do,
+and, with regard to ticks and labels, they either seem a little buggy or I have not fully understood them yet.
+
+```{python}
+# | label: fig-groundlevel-so
+# | fig-cap: "Nuclear explosions, 1945-98"
+import seaborn.objects as so
+import matplotlib.dates as mdates
+
+above_cat = pl.Series(
+    [
+        "ATMOSPH",
+        "AIRDROP",
+        "TOWER",
+        "BALLOON",
+        "SURFACE",
+        "BARGE",
+        "ROCKET",
+        "SPACE",
+        "SHIP",
+        "WATERSUR",
+        "WATER SU",
+    ]
+)
+df_groundlevel = (
+    df.with_columns(
+        # flag tests whose type is one of the above-ground categories
+        above_ground=pl.col("type").is_in(above_cat.to_list())
+    )
+    .group_by(pl.col("year", "country", "above_ground"))
+    .agg(count=pl.len())
+    .sort("year")
+)
+
+fig, ax = plt.subplots()
+ax.xaxis.set_tick_params(rotation=90)
+
+from seaborn import axes_style
+p = (
+    so.Plot(df_groundlevel, x="year", y="count", color="country")
+    .add(
+        so.Bars(),
+        so.Stack(),
+        data=df_groundlevel.filter(pl.col("above_ground")).sort("country"),
+    )
+    .add(
+        so.Bars(),
+        so.Stack(),
+        data=df_groundlevel.filter(~pl.col("above_ground")).with_columns(
+            count=pl.col("count") * -1
+        ).sort("country"),
+    )
+    .label(x="Year", y="Count")
+    .scale(
+        x=so.Continuous().tick(locator=mdates.YearLocator(base=5), minor=4).label(like="{x:.0f}"),
+        # x=so.Nominal().tick(locator=mdates.YearLocator(base=5), minor=4), # this might work in the future
+    )
+    .theme({
+        **axes_style("darkgrid"),
+        "xtick.bottom": True,
+        "ytick.left": True
+    })
+    .on(ax)
+    .plot()
+)
+```