initial commit

2024-06-22 23:11:49 +02:00 · 2024-06-22 23:11:49 +02:00 · cceb1d1ec0
commit cceb1d1ec0
6 changed files with 5987 additions and 0 deletions
--- a/data/nuclear_explosions.csv
+++ b/data/nuclear_explosions.csv
--- a/notebooks/groupby_keep_zero-values.qmd
+++ b/notebooks/groupby_keep_zero-values.qmd
@ -0,0 +1,54 @@
+<!-- TODO: Load missing data from main nuclear_explosions.qmd notebook -->
+
+The following is a simple groupby, counting the len of country rows per date:
+
+```{python}
+# | label: fig-percountry-drop
+# | fig-cap: "Nuclear explosions by country, 1945-98"
+per_country = df.group_by(pl.col("date", "country")).agg(pl.len()).sort("date")
+
+g = sns.lineplot(data=per_country, x="date", y="len", hue="country")
+g.set_xlabel("Year")
+g.set_ylabel("Count")
+plt.setp(
+    g.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor"
+)  # ensure rotated right-anchor
+plt.show()
+```
+
+This works well to group generally, but there is an issue:
+If there is a year where a country did not have any entries at all,
+the resulting df will not have `Date | Cty | 0` but instead will not have an entry at all.
+
+This can be desirable for some applications, but for example if we then
+draw a line plot based on this it would interpolate between the
+country values and **not drop the line down to 0 for the years where a country does not have an entry**.
+
+We can fix it by first doing a cross product of all keys we always want to have a row for.
+Then we do the group by but supply it to a left-join on this cross product.
+
+End result is we keep all the rows from the cross-product, but we still aggregate and have a len
+column as before. For those where we don't have a len value we finally just fill in a 0 instead.
+
+```{python}
+# | label: fig-percountry-keep
+# | fig-cap: "Nuclear explosions by country, 1945-98"
+keys = df.select("date").unique().join(df.select("country").unique(), how="cross")
+per_country = keys.join(
+    df.group_by(["date", "country"], maintain_order=True).len(),
+    on=["date", "country"],
+    how="left",
+    coalesce=True,
+).with_columns(pl.col("len").fill_null(0))
+
+g = sns.lineplot(data=per_country, x="date", y="len", hue="country")
+g.set_xlabel("Year")
+g.set_ylabel("Count")
+plt.setp(
+    g.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor"
+)  # ensure rotated right-anchor
+plt.show()
+```
+
+A more nicely function-based solution (though using the same solution under the hood) can be found
+here: https://github.com/pola-rs/polars/issues/15997#issuecomment-2089362557
--- a/notebooks/sns_objects-style.qmd
+++ b/notebooks/sns_objects-style.qmd
@ -0,0 +1,75 @@
+
+ constructed with seaborn object-style plots instead.
+These kind of plots are much more structured for the workflow I use and the way I think about plotting,
+clearly delineating between a plot;
+some visual on the plot;
+some statistical transformation;
+some movement, labeling or scaling operation.
+They are also, however, fairly new and still considered experimental.
+
+They also don't allow *quite* the customization that the other plots do,
+and seem either a little buggy or I have not fully understood them yet in regards to ticks and labels.
+
+```{python}
+# | label: fig-groundlevel-so
+# | fig-cap: "Nuclear explosions, 1945-98"
+import seaborn.objects as so
+import matplotlib.dates as mdates
+
+above_cat = pl.Series(
+    [
+        "ATMOSPH",
+        "AIRDROP",
+        "TOWER",
+        "BALLOON",
+        "SURFACE",
+        "BARGE",
+        "ROCKET",
+        "SPACE",
+        "SHIP",
+        "WATERSUR",
+        "WATER SU",
+    ]
+)
+df_groundlevel = (
+    df.with_columns(
+    above_ground=pl.col("type").map_elements(
+        lambda x: True if x in above_cat else False, return_dtype=bool
+    ))
+    .group_by(pl.col("year", "country", "above_ground"))
+    .agg(count=pl.len())
+    .sort("year")
+)
+
+fig, ax = plt.subplots()
+ax.xaxis.set_tick_params(rotation=90)
+
+from seaborn import axes_style
+p = (
+    so.Plot(df_groundlevel, x="year", y="count", color="country")
+    .add(
+        so.Bars(),
+        so.Stack(),
+        data=df_groundlevel.filter(pl.col("above_ground") == True).sort("country"),
+    )
+    .add(
+        so.Bars(),
+        so.Stack(),
+        data=df_groundlevel.filter(pl.col("above_ground") == False).with_columns(
+            count=pl.col("count") * -1
+        ).sort("country"),
+    )
+    .label(x="Year", y="Count")
+    .scale(
+        x=so.Continuous().tick(locator=mdates.YearLocator(base=5), minor=4).label(like="{x:.0f}"),
+        # x=so.Nominal().tick(locator=mdates.YearLocator(base=5), minor=4), # this might work in the future
+    )
+    .theme({
+        **axes_style("darkgrid"),
+        "xtick.bottom": True,
+        "ytick.left": True
+    })
+    .on(ax)
+    .plot()
+)
+```
--- a/nuclear_explosions.qmd
+++ b/nuclear_explosions.qmd
@ -0,0 +1,382 @@
+---
+title: Nuclear Explosions
+author: Marty Oehme
+output-dir: out
+references:
+    - type: techreport
+      id: Bergkvist2000
+      author:
+        - family: Bergkvist
+          given: Nils-Olov
+        - family: Ferm
+          given: Ragnhild
+      issued:
+        date-parts:
+          - - 2000
+            - 7
+            - 1
+      title: "Nuclear Explosions 1945 - 1998"
+      page: 1-42
+      issn: 1104-9154
+format:
+    html:
+        toc: true
+        code-fold: true
+    typst:
+        toc: true
+        echo: false
+    docx:
+        toc: true
+        echo: false
+---
+
+```{python}
+#| label: setup
+#| echo: false
+import matplotlib.dates as mdates
+import numpy as np
+import pandas as pd
+import polars as pl
+import polars.selectors as cs
+import seaborn as sns
+from matplotlib import pyplot as plt
+
+sns.set_theme(style="darkgrid")
+sns.set_context("notebook")
+
+schema_overrides = (
+    {
+        col: pl.Categorical
+        for col in ["type", "name", "purpose", "country", "source", "region"]
+    }
+    | {col: pl.Float64 for col in ["magnitude_body", "magnitude_surface"]}
+    | {col: pl.String for col in ["year", "name"]}
+)
+
+df = (
+    pl.read_csv(
+        "data/nuclear_explosions.csv",
+        schema_overrides=schema_overrides,
+        null_values=["NA"],
+    )
+    .with_columns(date=pl.col("year").str.strptime(pl.Date, "%Y"))
+    .with_columns(year=pl.col("date").dt.year().cast(pl.Int32))
+)
+```
+
+## Introduction
+
+The following is a re-creation and expansion of some of the graphs found in the
+@Bergkvist2000 produced report on nuclear explosions between 1945 and 1998. It
+is primarily a reproduction of key plots from the original report.
+Additionally, it serves as a exercise in plotting with the python library
+seaborn and the underlying matplotlib. Lastly, it approaches some less well
+tread territory for data science in the python universe as it uses the python
+library polars-rs for data loading and transformation. All the code used to
+transform the data and create the plots is available directly within the full
+text document, and separately as well. PDF and Docx formats are available with
+the plotting results only.
+
+Their original purpose was the collection of a long list of all the nuclear
+explosions occurring between those years, as well as analysing the responsible
+nations, tracking the types and purposes of the explosions, as well as
+connecting the rise and fall of nuclear explosion numbers to historical events
+throughout.
+
+## Total numbers
+
+::: {.callout-note}
+
+## Nuclear devices
+
+There are two main kinds of nuclear device: those based entirely, on fission,
+or the splitting of heavy atomic nucleii (previously known as atomic devices)
+and those in which the main energy is obtained by means of fusion, or of -light
+atomic nucleii (hydrogen or thermonuclear devices). A fusion explosion must
+however be initiated with the help of a fission device. The strength of a
+fusion explosion can be practically unlimited. The explosive power of a
+nuclear explosion is expressed in ktlotons, (kt) or megatons (Mt), which
+correspond to 1000 and i million'tonnes, of conventional explosive (TNT),
+respectively.
+
+[@Bergkvist2000, 6]
+:::
+
+We begin by investigating a table containing all the absolute counts and yields
+each country had explode, seen in @tbl-yields.
+
+```{python}
+# | label: tbl-yields
+# | tbl-cap: "Total number and yields of explosions"
+
+from great_tables import GT, md
+
+df_yields = (
+    df.select(["country", "id_no", "yield_lower", "yield_upper"])
+    .with_columns(yield_avg=pl.mean_horizontal(pl.col(["yield_lower", "yield_upper"])))
+    .group_by("country")
+    .agg(
+        pl.col("id_no").len().alias("count"),
+        pl.col("yield_avg").sum(),
+    )
+    # .with_columns(country=pl.col("country").cast(pl.String).str.to_titlecase())
+    .sort("count", descending=True)
+)
+
+(
+    GT(df_yields)
+    .tab_source_note(
+        source_note="Source: Author's elaboration based on Bergkvist and Ferm (2000)."
+    )
+    .tab_spanner(label="Totals", columns=["count", "yield_avg"])
+    .tab_stub(rowname_col="country")
+    .tab_stubhead(label="Country")
+    .cols_label(
+        count="Count",
+        yield_avg="Yield in kt",
+    )
+    .fmt_integer(columns="count")
+    .fmt_number(columns="yield_avg", decimals=1)
+)
+```
+
+## Numbers over time
+
+When investigating the nuclear explosions in the world, let us first start by
+looking at how many explosions occurred each year in total. This hides the
+specific details of who was responsible and which types were involved but
+instead paints a much stronger picture of the overall dimension of nuclear
+testing, as can be seen in @fig-total.
+
+```{python}
+# | label: fig-total
+# | fig-cap: "Total Nuclear explosions, 1945-98"
+per_year = df.group_by(pl.col("year")).agg(pl.len()).sort("year")
+with sns.axes_style(
+    "darkgrid", {"xtick.bottom": True, "ytick.left": True}
+):
+    g = sns.barplot(data=per_year, x="year", y="len", order=range(1945, 1999), width=1)
+    g.set_xlabel("Year")
+    g.set_ylabel("Count")
+    plt.setp(
+        g.get_xticklabels(),
+        rotation=90,
+        ha="right",
+        va="center",
+        rotation_mode="anchor",
+    )  # ensure rotated right-anchor
+    g.set_xticks(g.get_xticks(), minor=True) # enable minor ticks every entry
+    g.set_xticks(g.get_xticks()[::2])  # enable major ticks every 2nd entry
+    plt.show()
+del per_year
+```
+
+As we can see, the numbers of explosions rise increasingly towards 1957 and
+sharply until 1958, before dropping off for a year in 1959. The reasons for
+this drop are not entirely clear, but it is very likely that the data are
+simply missing for these years.
+<!-- FIXME: The reasons for this are a non-proliferation pact, in article -->
+
+<!-- TODO: Extract exact numbers from data on-the-fly -->
+There is another, very steep, rise in 1962 with over 175 recorded explosions,
+before an even sharper drop-off the following year down to just 50 explosions.
+
+Afterwards the changes appear less sharp and the changes remain between 77 and
+24 explosions per year, with a slight downward tendency.
+
+While these numbers show the overall proliferation of nuclear power, let us now
+instead turn to the contributions by individual countries. A split in the number
+of explosions over time by country can be seen in @fig-percountry.
+
+```{python}
+# | label: fig-percountry
+# | fig-cap: "Nuclear explosions by country, 1945-98"
+keys = df.select("date").unique().join(df.select("country").unique(), how="cross")
+per_country = keys.join(
+    df.group_by(["date", "country"], maintain_order=True).len(),
+    on=["date", "country"],
+    how="left",
+    coalesce=True,
+).with_columns(pl.col("len").fill_null(0))
+
+g = sns.lineplot(data=per_country, x="date", y="len", hue="country")
+g.set_xlabel("Year")
+g.set_ylabel("Count")
+plt.setp(
+    g.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor"
+)  # ensure rotated right-anchor
+plt.show()
+del per_country
+```
+
+Once again we can see the visibly steep ramp-up to 1962, though it becomes
+clear that this was driven both by the USSR and the US. Of course the graph
+also makes visible the sheer unmatched number of explosions emenating from both
+of the countries, with only France catching up to the US numbers and China
+ultimately overtaking them in the 1990s.
+
+However, here it also becomes more clear how the UK was responsible for some
+early explosions in the late 1950s and early 1960s already, as well as the rise
+in France's nuclear testing from the early 1960s onwards to around 1980, before
+slowly decreasing in intensity afterwards.
+
+Let us turn to a cross-cut through the explosions in @fig-groundlevel, focusing
+on the number of explosions that have occurred underground and above-ground
+respectively.[^aboveground]
+
+[^aboveground]: Detonations counted as above ground are made up of atmospheric,
+airdrop, tower, balloon, barge or ship, rocket and water surface detonations.
+Any other detonation is counted as below ground, primarily taking place in
+tunnels, shafts and galleries.
+
+```{python}
+# | label: fig-groundlevel
+# | fig-cap: "Nuclear explosions above and below ground, 1945-98"
+from polars import Boolean
+
+above_cat = pl.Series(
+    [
+        "ATMOSPH",
+        "AIRDROP",
+        "TOWER",
+        "BALLOON",
+        "SURFACE",
+        "BARGE",
+        "ROCKET",
+        "SPACE",
+        "SHIP",
+        "WATERSUR",
+        "WATER SU",
+    ]
+)
+df_groundlevel = (
+    df.with_columns(
+        above_ground=pl.col("type").map_elements(
+            lambda x: True if x in above_cat else False, return_dtype=Boolean
+        )
+    )
+    .group_by(pl.col("date", "country", "above_ground"))
+    .agg(count=pl.len())
+    .sort("date")
+)
+
+with sns.axes_style("darkgrid", {"xtick.bottom": True, "ytick.left": True}):
+    for above_ground in [True, False]:
+        g = sns.histplot(
+            data=df_groundlevel.filter(
+                pl.col("above_ground") == above_ground
+            ).with_columns(
+                count=pl.col("count") * (1 if above_ground else -1),
+            ),
+            x="date",
+            weights="count",
+            hue="country",
+            multiple="stack",
+            binwidth=365,
+        )
+
+    g.xaxis.set_major_locator(mdates.YearLocator(base=5))
+    g.xaxis.set_minor_locator(mdates.YearLocator())
+    plt.setp(
+        g.get_xticklabels(), rotation=90, ha="right", va="top", rotation_mode="anchor"
+    )
+    # FIXME get dynamic range for yticks instead of hardcoding
+    g.set_yticks(np.arange(-130, 140, 20))
+    g.set_yticks(np.arange(-130, 140, 10), minor=True)
+    plt.show()
+del df_groundlevel
+```
+
+This plot paints a different picture yet again: while overall the number of
+explosions still rise and fall with some early sharp spikes, we can see a clear
+shift from above-ground to underground tests, starting with the year 1962.
+
+## Locations
+
+Finally, let's view a map of the world with the explosions marked.
+
+```{python}
+# | label: fig-worldmap
+# | fig-cap: "World map of nuclear explosions, 1945-98"
+import folium
+import geopandas as gpd
+from shapely.geometry import Point
+
+def set_style() -> pl.Expr:
+    return (
+        pl.when(pl.col("country") == "USSR")
+        .then(pl.lit({"color": "red"}, allow_object=True))
+        .otherwise(pl.lit({"color": "blue"}, allow_object=True))
+    )
+
+
+geom = [Point(xy) for xy in zip(df["longitude"], df["latitude"])]
+# df_pd = df.with_columns(style=set_style()).to_pandas().set_index("date")
+df_pd = df.with_columns().to_pandas().set_index("date")
+gdf = gpd.GeoDataFrame(
+    df_pd,
+    crs="EPSG:4326",
+    geometry=gpd.points_from_xy(x=df_pd["longitude"], y=df_pd["latitude"]),
+)
+del df_pd
+
+country_colors = {
+    "USA": "darkblue",
+    "USSR": "darkred",
+    "FRANCE": "pink",
+    "UK": "black",
+    "CHINA": "purple",
+    "INDIA": "orange",
+    "PAKIST": "green",
+}
+
+m = folium.Map(tiles="cartodb positron")
+for country in country_colors.keys():
+    fg = folium.FeatureGroup(name=country, show=True).add_to(m)
+    folium.GeoJson(
+        gdf[gdf["country"].str.contains(country)],
+        name="Nuclear Explosions",
+        marker=folium.Circle(radius=3, fill_opacity=0.4),
+        style_function=lambda x: {
+            "color": country_colors[x["properties"]["country"]],
+            "radius": (
+                x["properties"]["magnitude_body"]
+                if x["properties"]["magnitude_body"] > 0
+                else 1.0
+            )
+            * 10,
+        },
+        tooltip=folium.GeoJsonTooltip(fields=["year", "country", "type"]),
+        highlight_function=lambda x: {"fillOpacity": 0.8},
+        popup=folium.GeoJsonPopup(
+            fields=[
+                "year",
+                "country",
+                "region",
+                "source",
+                "latitude",
+                "longitude",
+                "magnitude_body",
+                "magnitude_surface",
+                "depth",
+                "yield_lower",
+                "yield_upper",
+                "purpose",
+                "name",
+                "type",
+            ]
+        ),
+    ).add_to(fg)
+folium.LayerControl().add_to(m)
+m
+```
+
+That is all for now.
+There are undoubtedly more explorations to undertake,
+but this is it for the time being.
+
+<!-- Ideas TODO:
+- do not just use 'count' of explosions but yields
+    - compare number to yields for ctrys
+- count up total number per country in table
+-->
--- a/poetry.lock
+++ b/poetry.lock
--- a/pyproject.toml
+++ b/pyproject.toml
@ -0,0 +1,32 @@
+[tool.poetry]
+name = "nuclear-explosions"
+version = "0.1.0"
+description = ""
+authors = ["Marty Oehme <marty.oehme@gmail.com>"]
+readme = "README.md"
+package-mode = false
+
+[tool.poetry.dependencies]
+python = "^3.12"
+polars = "^0.20.31"
+seaborn = "^0.13.2"
+pipefunc = "^0.18.1"
+pyarrow = "^16.1.0"
+great-tables = "^0.9.0"
+geopandas = "^0.14.4"
+folium = "^0.17.0"
+
+
+[tool.poetry.group.dev.dependencies]
+pynvim = "^0.5.0"
+pyperclip = "^1.8.2"
+jupyter-client = "^8.6.2"
+jupyter = "^1.0.0"
+
+[build-system]
+requires = ["poetry-core"]
+build-backend = "poetry.core.masonry.api"
+
+[tool.pyright]
+typeCheckingMode = "basic"
+reportUnusedExpression = false