initial commit

This commit is contained in:
Marty Oehme 2024-06-22 23:11:49 +02:00
commit cceb1d1ec0
Signed by: Marty
GPG key ID: EDBF2ED917B2EF6A
6 changed files with 5987 additions and 0 deletions

2052
data/nuclear_explosions.csv Normal file

File diff suppressed because it is too large Load diff

View file

@ -0,0 +1,54 @@
<!-- TODO: Load missing data from main nuclear_explosions.qmd notebook -->
The following is a simple groupby, counting the len of country rows per date:
```{python}
# | label: fig-percountry-drop
# | fig-cap: "Nuclear explosions by country, 1945-98"
per_country = df.group_by(pl.col("date", "country")).agg(pl.len()).sort("date")
g = sns.lineplot(data=per_country, x="date", y="len", hue="country")
g.set_xlabel("Year")
g.set_ylabel("Count")
plt.setp(
g.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor"
) # ensure rotated right-anchor
plt.show()
```
This works well to group generally, but there is an issue:
If there is a year where a country did not have any entries at all,
the resulting df will not have `Date | Cty | 0` but instead will not have an entry at all.
This can be desirable for some applications, but for example if we then
draw a line plot based on this it would interpolate between the
country values and **not drop the line down to 0 for the years where a country does not have an entry**.
We can fix it by first doing a cross product of all keys we always want to have a row for.
Then we do the group by but supply it to a left-join on this cross product.
End result is we keep all the rows from the cross-product, but we still aggregate and have a len
column as before. For those where we don't have a len value we finally just fill in a 0 instead.
```{python}
# | label: fig-percountry-keep
# | fig-cap: "Nuclear explosions by country, 1945-98"
keys = df.select("date").unique().join(df.select("country").unique(), how="cross")
per_country = keys.join(
df.group_by(["date", "country"], maintain_order=True).len(),
on=["date", "country"],
how="left",
coalesce=True,
).with_columns(pl.col("len").fill_null(0))
g = sns.lineplot(data=per_country, x="date", y="len", hue="country")
g.set_xlabel("Year")
g.set_ylabel("Count")
plt.setp(
g.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor"
) # ensure rotated right-anchor
plt.show()
```
A more nicely function-based solution (though using the same solution under the hood) can be found
here: https://github.com/pola-rs/polars/issues/15997#issuecomment-2089362557

View file

@ -0,0 +1,75 @@
constructed with seaborn object-style plots instead.
These kind of plots are much more structured for the workflow I use and the way I think about plotting,
clearly delineating between a plot;
some visual on the plot;
some statistical transformation;
some movement, labeling or scaling operation.
They are also, however, fairly new and still considered experimental.
They also don't allow *quite* the customization that the other plots do,
and seem either a little buggy or I have not fully understood them yet in regards to ticks and labels.
```{python}
# | label: fig-groundlevel-so
# | fig-cap: "Nuclear explosions, 1945-98"
import seaborn.objects as so
import matplotlib.dates as mdates
above_cat = pl.Series(
[
"ATMOSPH",
"AIRDROP",
"TOWER",
"BALLOON",
"SURFACE",
"BARGE",
"ROCKET",
"SPACE",
"SHIP",
"WATERSUR",
"WATER SU",
]
)
df_groundlevel = (
df.with_columns(
above_ground=pl.col("type").map_elements(
lambda x: True if x in above_cat else False, return_dtype=bool
))
.group_by(pl.col("year", "country", "above_ground"))
.agg(count=pl.len())
.sort("year")
)
fig, ax = plt.subplots()
ax.xaxis.set_tick_params(rotation=90)
from seaborn import axes_style
p = (
so.Plot(df_groundlevel, x="year", y="count", color="country")
.add(
so.Bars(),
so.Stack(),
data=df_groundlevel.filter(pl.col("above_ground") == True).sort("country"),
)
.add(
so.Bars(),
so.Stack(),
data=df_groundlevel.filter(pl.col("above_ground") == False).with_columns(
count=pl.col("count") * -1
).sort("country"),
)
.label(x="Year", y="Count")
.scale(
x=so.Continuous().tick(locator=mdates.YearLocator(base=5), minor=4).label(like="{x:.0f}"),
# x=so.Nominal().tick(locator=mdates.YearLocator(base=5), minor=4), # this might work in the future
)
.theme({
**axes_style("darkgrid"),
"xtick.bottom": True,
"ytick.left": True
})
.on(ax)
.plot()
)
```

382
nuclear_explosions.qmd Normal file
View file

@ -0,0 +1,382 @@
---
title: Nuclear Explosions
author: Marty Oehme
output-dir: out
references:
- type: techreport
id: Bergkvist2000
author:
- family: Bergkvist
given: Nils-Olov
- family: Ferm
given: Ragnhild
issued:
date-parts:
- - 2000
- 7
- 1
title: "Nuclear Explosions 1945 - 1998"
page: 1-42
issn: 1104-9154
format:
html:
toc: true
code-fold: true
typst:
toc: true
echo: false
docx:
toc: true
echo: false
---
```{python}
#| label: setup
#| echo: false
import matplotlib.dates as mdates
import numpy as np
import pandas as pd
import polars as pl
import polars.selectors as cs
import seaborn as sns
from matplotlib import pyplot as plt
sns.set_theme(style="darkgrid")
sns.set_context("notebook")
schema_overrides = (
{
col: pl.Categorical
for col in ["type", "name", "purpose", "country", "source", "region"]
}
| {col: pl.Float64 for col in ["magnitude_body", "magnitude_surface"]}
| {col: pl.String for col in ["year", "name"]}
)
df = (
pl.read_csv(
"data/nuclear_explosions.csv",
schema_overrides=schema_overrides,
null_values=["NA"],
)
.with_columns(date=pl.col("year").str.strptime(pl.Date, "%Y"))
.with_columns(year=pl.col("date").dt.year().cast(pl.Int32))
)
```
## Introduction
The following is a re-creation and expansion of some of the graphs found in the
@Bergkvist2000 produced report on nuclear explosions between 1945 and 1998. It
is primarily a reproduction of key plots from the original report.
Additionally, it serves as a exercise in plotting with the python library
seaborn and the underlying matplotlib. Lastly, it approaches some less well
tread territory for data science in the python universe as it uses the python
library polars-rs for data loading and transformation. All the code used to
transform the data and create the plots is available directly within the full
text document, and separately as well. PDF and Docx formats are available with
the plotting results only.
Their original purpose was the collection of a long list of all the nuclear
explosions occurring between those years, as well as analysing the responsible
nations, tracking the types and purposes of the explosions, as well as
connecting the rise and fall of nuclear explosion numbers to historical events
throughout.
## Total numbers
::: {.callout-note}
## Nuclear devices
There are two main kinds of nuclear device: those based entirely, on fission,
or the splitting of heavy atomic nucleii (previously known as atomic devices)
and those in which the main energy is obtained by means of fusion, or of -light
atomic nucleii (hydrogen or thermonuclear devices). A fusion explosion must
however be initiated with the help of a fission device. The strength of a
fusion explosion can be practically unlimited. The explosive power of a
nuclear explosion is expressed in ktlotons, (kt) or megatons (Mt), which
correspond to 1000 and i million'tonnes, of conventional explosive (TNT),
respectively.
[@Bergkvist2000, 6]
:::
We begin by investigating a table containing all the absolute counts and yields
each country had explode, seen in @tbl-yields.
```{python}
# | label: tbl-yields
# | tbl-cap: "Total number and yields of explosions"
from great_tables import GT, md
df_yields = (
df.select(["country", "id_no", "yield_lower", "yield_upper"])
.with_columns(yield_avg=pl.mean_horizontal(pl.col(["yield_lower", "yield_upper"])))
.group_by("country")
.agg(
pl.col("id_no").len().alias("count"),
pl.col("yield_avg").sum(),
)
# .with_columns(country=pl.col("country").cast(pl.String).str.to_titlecase())
.sort("count", descending=True)
)
(
GT(df_yields)
.tab_source_note(
source_note="Source: Author's elaboration based on Bergkvist and Ferm (2000)."
)
.tab_spanner(label="Totals", columns=["count", "yield_avg"])
.tab_stub(rowname_col="country")
.tab_stubhead(label="Country")
.cols_label(
count="Count",
yield_avg="Yield in kt",
)
.fmt_integer(columns="count")
.fmt_number(columns="yield_avg", decimals=1)
)
```
## Numbers over time
When investigating the nuclear explosions in the world, let us first start by
looking at how many explosions occurred each year in total. This hides the
specific details of who was responsible and which types were involved but
instead paints a much stronger picture of the overall dimension of nuclear
testing, as can be seen in @fig-total.
```{python}
# | label: fig-total
# | fig-cap: "Total Nuclear explosions, 1945-98"
per_year = df.group_by(pl.col("year")).agg(pl.len()).sort("year")
with sns.axes_style(
"darkgrid", {"xtick.bottom": True, "ytick.left": True}
):
g = sns.barplot(data=per_year, x="year", y="len", order=range(1945, 1999), width=1)
g.set_xlabel("Year")
g.set_ylabel("Count")
plt.setp(
g.get_xticklabels(),
rotation=90,
ha="right",
va="center",
rotation_mode="anchor",
) # ensure rotated right-anchor
g.set_xticks(g.get_xticks(), minor=True) # enable minor ticks every entry
g.set_xticks(g.get_xticks()[::2]) # enable major ticks every 2nd entry
plt.show()
del per_year
```
As we can see, the numbers of explosions rise increasingly towards 1957 and
sharply until 1958, before dropping off for a year in 1959. The reasons for
this drop are not entirely clear, but it is very likely that the data are
simply missing for these years.
<!-- FIXME: The reasons for this are a non-proliferation pact, in article -->
<!-- TODO: Extract exact numbers from data on-the-fly -->
There is another, very steep, rise in 1962 with over 175 recorded explosions,
before an even sharper drop-off the following year down to just 50 explosions.
Afterwards the changes appear less sharp and the changes remain between 77 and
24 explosions per year, with a slight downward tendency.
While these numbers show the overall proliferation of nuclear power, let us now
instead turn to the contributions by individual countries. A split in the number
of explosions over time by country can be seen in @fig-percountry.
```{python}
# | label: fig-percountry
# | fig-cap: "Nuclear explosions by country, 1945-98"
keys = df.select("date").unique().join(df.select("country").unique(), how="cross")
per_country = keys.join(
df.group_by(["date", "country"], maintain_order=True).len(),
on=["date", "country"],
how="left",
coalesce=True,
).with_columns(pl.col("len").fill_null(0))
g = sns.lineplot(data=per_country, x="date", y="len", hue="country")
g.set_xlabel("Year")
g.set_ylabel("Count")
plt.setp(
g.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor"
) # ensure rotated right-anchor
plt.show()
del per_country
```
Once again we can see the visibly steep ramp-up to 1962, though it becomes
clear that this was driven both by the USSR and the US. Of course the graph
also makes visible the sheer unmatched number of explosions emenating from both
of the countries, with only France catching up to the US numbers and China
ultimately overtaking them in the 1990s.
However, here it also becomes more clear how the UK was responsible for some
early explosions in the late 1950s and early 1960s already, as well as the rise
in France's nuclear testing from the early 1960s onwards to around 1980, before
slowly decreasing in intensity afterwards.
Let us turn to a cross-cut through the explosions in @fig-groundlevel, focusing
on the number of explosions that have occurred underground and above-ground
respectively.[^aboveground]
[^aboveground]: Detonations counted as above ground are made up of atmospheric,
airdrop, tower, balloon, barge or ship, rocket and water surface detonations.
Any other detonation is counted as below ground, primarily taking place in
tunnels, shafts and galleries.
```{python}
# | label: fig-groundlevel
# | fig-cap: "Nuclear explosions above and below ground, 1945-98"
from polars import Boolean
above_cat = pl.Series(
[
"ATMOSPH",
"AIRDROP",
"TOWER",
"BALLOON",
"SURFACE",
"BARGE",
"ROCKET",
"SPACE",
"SHIP",
"WATERSUR",
"WATER SU",
]
)
df_groundlevel = (
df.with_columns(
above_ground=pl.col("type").map_elements(
lambda x: True if x in above_cat else False, return_dtype=Boolean
)
)
.group_by(pl.col("date", "country", "above_ground"))
.agg(count=pl.len())
.sort("date")
)
with sns.axes_style("darkgrid", {"xtick.bottom": True, "ytick.left": True}):
for above_ground in [True, False]:
g = sns.histplot(
data=df_groundlevel.filter(
pl.col("above_ground") == above_ground
).with_columns(
count=pl.col("count") * (1 if above_ground else -1),
),
x="date",
weights="count",
hue="country",
multiple="stack",
binwidth=365,
)
g.xaxis.set_major_locator(mdates.YearLocator(base=5))
g.xaxis.set_minor_locator(mdates.YearLocator())
plt.setp(
g.get_xticklabels(), rotation=90, ha="right", va="top", rotation_mode="anchor"
)
# FIXME get dynamic range for yticks instead of hardcoding
g.set_yticks(np.arange(-130, 140, 20))
g.set_yticks(np.arange(-130, 140, 10), minor=True)
plt.show()
del df_groundlevel
```
This plot paints a different picture yet again: while overall the number of
explosions still rise and fall with some early sharp spikes, we can see a clear
shift from above-ground to underground tests, starting with the year 1962.
## Locations
Finally, let's view a map of the world with the explosions marked.
```{python}
# | label: fig-worldmap
# | fig-cap: "World map of nuclear explosions, 1945-98"
import folium
import geopandas as gpd
from shapely.geometry import Point
def set_style() -> pl.Expr:
return (
pl.when(pl.col("country") == "USSR")
.then(pl.lit({"color": "red"}, allow_object=True))
.otherwise(pl.lit({"color": "blue"}, allow_object=True))
)
geom = [Point(xy) for xy in zip(df["longitude"], df["latitude"])]
# df_pd = df.with_columns(style=set_style()).to_pandas().set_index("date")
df_pd = df.with_columns().to_pandas().set_index("date")
gdf = gpd.GeoDataFrame(
df_pd,
crs="EPSG:4326",
geometry=gpd.points_from_xy(x=df_pd["longitude"], y=df_pd["latitude"]),
)
del df_pd
country_colors = {
"USA": "darkblue",
"USSR": "darkred",
"FRANCE": "pink",
"UK": "black",
"CHINA": "purple",
"INDIA": "orange",
"PAKIST": "green",
}
m = folium.Map(tiles="cartodb positron")
for country in country_colors.keys():
fg = folium.FeatureGroup(name=country, show=True).add_to(m)
folium.GeoJson(
gdf[gdf["country"].str.contains(country)],
name="Nuclear Explosions",
marker=folium.Circle(radius=3, fill_opacity=0.4),
style_function=lambda x: {
"color": country_colors[x["properties"]["country"]],
"radius": (
x["properties"]["magnitude_body"]
if x["properties"]["magnitude_body"] > 0
else 1.0
)
* 10,
},
tooltip=folium.GeoJsonTooltip(fields=["year", "country", "type"]),
highlight_function=lambda x: {"fillOpacity": 0.8},
popup=folium.GeoJsonPopup(
fields=[
"year",
"country",
"region",
"source",
"latitude",
"longitude",
"magnitude_body",
"magnitude_surface",
"depth",
"yield_lower",
"yield_upper",
"purpose",
"name",
"type",
]
),
).add_to(fg)
folium.LayerControl().add_to(m)
m
```
That is all for now.
There are undoubtedly more explorations to undertake,
but this is it for the time being.
<!-- Ideas TODO:
- do not just use 'count' of explosions but yields
- compare number to yields for ctrys
- count up total number per country in table
-->

3392
poetry.lock generated Normal file

File diff suppressed because it is too large Load diff

32
pyproject.toml Normal file
View file

@ -0,0 +1,32 @@
[tool.poetry]
name = "nuclear-explosions"
version = "0.1.0"
description = ""
authors = ["Marty Oehme <marty.oehme@gmail.com>"]
readme = "README.md"
package-mode = false
[tool.poetry.dependencies]
python = "^3.12"
polars = "^0.20.31"
seaborn = "^0.13.2"
pipefunc = "^0.18.1"
pyarrow = "^16.1.0"
great-tables = "^0.9.0"
geopandas = "^0.14.4"
folium = "^0.17.0"
[tool.poetry.group.dev.dependencies]
pynvim = "^0.5.0"
pyperclip = "^1.8.2"
jupyter-client = "^8.6.2"
jupyter = "^1.0.0"
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
[tool.pyright]
typeCheckingMode = "basic"
reportUnusedExpression = false