Add meta-article

Contains thoughts about the process writing the article.
Marty Oehme 2024-07-03 16:29:48 +02:00
parent 7d1d929b0e
commit 824ad25e60
Signed by: Marty
GPG key ID: EDBF2ED917B2EF6A

meta.md Normal file

@ -0,0 +1,157 @@
This page documents some meta observations from my time recreating the nuclear explosions analysis in this post:
mostly little tips for working well with python polars and seaborn, and little tricks for integrating them with geopandas visualizations.
## From a lat/long polars dataframe to geopandas
Going from a polars frame to one we can use for GIS operations with geopandas is fairly simple:
We first move from a polars to an indexed pandas frame; in this case I have indexed on the date of each explosion.
We can use this intermediate dataframe to fill a geopandas frame which is built from the points of the lat/long columns,
using the `gpd.points_from_xy()` function to create spatial `Point` objects from simple pandas Series.
Finally, we need to set the `crs=` argument, for which this visualization simply uses `EPSG:4326`, i.e. WGS 84 latitude/longitude in degrees (this will generally be the same for global mappings).
```python
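import geopandas as gpd

# `df` is the polars frame holding "date", "latitude" and "longitude" columns
df_pd = df.to_pandas().set_index("date")
gdf = gpd.GeoDataFrame(
    df_pd,
    crs="EPSG:4326",
    # x is the longitude, y the latitude
    geometry=gpd.points_from_xy(x=df_pd["longitude"], y=df_pd["latitude"]),
)
del df_pd  # the intermediate pandas frame is no longer needed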
```
## Keeping the same seaborn color palette for the same categories
For the analysis, I have multiple plots which distinguish between the different countries undertaking nuclear detonations.
The country category thus appears repeatedly, and with static values (i.e. it will always contain 'US', 'USSR', 'China', 'France' and so on).
Now, seaborn has very nice functionality to automatically give different hues to categories like these in plots,
but how do we ensure that the hues given remain _the same_ throughout?
One way of achieving it would be to keep the order of categories the same throughout all plots.
However, this seems hidden,
often adds to the strain of just getting to the right data frame calculations,
appears a little too magic for my liking and, to top it off,
is even harder to achieve with some of polars' parallelized operations.
Instead we can explicitly map our categories to colors.
In my case, my categories for this example will always be the different countries:
```python
country_colors = {
    "US": "blue",
    "USSR": "red",
    "France": "pink",
    "UK": "black",
    "China": "purple",
    "India": "orange",
    "Pakistan": "green",
}
```
These are colors seaborn understands and can be given to a plot via the keyword option `palette=country_colors` which will pass along the colors above to the respective plot.
However, one advantage of seaborn is its nice in-built color schemes (i.e. palettes) which we will not make use of if we instead hard-code our color preferences like this.
Instead, we can directly access seaborn's color palette with `sns.color_palette()` which we can then use to explicitly map our categories to colors:
```python
import seaborn as sns

# the currently active seaborn palette, as a list of RGB tuples
cp = sns.color_palette()
country_colors = {
    "US": cp[0],
    "USSR": cp[3],
    "France": cp[6],
    "UK": cp[5],
    "China": cp[4],
    "India": cp[1],
    "Pakistan": cp[2],
}
```
This mapping is passed exactly the same way as the other.
Now, we've ensured that colors in plots (that have the countries as hue category) will all have the same color for the same country throughout.
At the same time we have a single spot in which we can change the actual color theme seaborn uses, instead of hard-coding our preferences throughout.
I find this very useful when creating analyses with similar categories throughout.
In the nuclear analysis there is a folium geospatial (GeoJson) map at the very end which uses colors to distinguish between the countries once again.
Here we can make use of almost the same strategy, with the one caveat that folium expects the colors in hexadecimal format, while seaborn internally stores them as RGB value tuples.
What we can do, then, is use a simple translation function which converts from one format to the other on the fly,
and inject that into the map creation method of folium:
```python
import folium


def rgb_to_hex(rgb: tuple[float, float, float]) -> str:
    # seaborn stores channels as floats in [0, 1]; folium wants "#rrggbb"
    return "#" + "".join(format(int(c * 255), "02x") for c in rgb)


# named `m` rather than `map` to avoid shadowing the Python builtin
m = folium.Map(tiles="cartodb positron")
folium.GeoJson(
    gdf,
    name="Nuclear Explosions",
    marker=folium.Circle(radius=3, fill_opacity=0.4),
    style_function=lambda x: {
        "color": rgb_to_hex(country_colors[x["properties"]["country"]]),
        # scale by magnitude, falling back to 1.0 if it is not positive
        "radius": (
            x["properties"]["magnitude_body"]
            if x["properties"]["magnitude_body"] > 0
            else 1.0
        )
        * 10,
    },
).add_to(m)
```
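To see what the style function actually receives and returns, here is a minimal standalone sketch that runs without folium or seaborn.
It assumes the style function is called with one GeoJSON feature dictionary at a time; the color values, the `style_feature` name and the sample feature are made up for illustration and are not part of the analysis.
```python
def rgb_to_hex(rgb: tuple[float, float, float]) -> str:
    """Convert an RGB tuple with channels in [0, 1] to a '#rrggbb' string."""
    return "#" + "".join(format(int(c * 255), "02x") for c in rgb)


# Hypothetical stand-in for `country_colors`; the real values would be
# the RGB tuples taken from `sns.color_palette()`.
country_colors = {
    "US": (0.0, 0.0, 1.0),
    "USSR": (1.0, 0.0, 0.0),
}


def style_feature(feature: dict) -> dict:
    """Same logic as the lambda in the map code, as a named function."""
    props = feature["properties"]
    magnitude = props["magnitude_body"]
    return {
        "color": rgb_to_hex(country_colors[props["country"]]),
        # non-positive magnitudes fall back to 1.0 before scaling
        "radius": (magnitude if magnitude > 0 else 1.0) * 10,
    }


# A hand-written GeoJSON-like feature; coordinates and values are invented.
feature = {
    "type": "Feature",
    "properties": {"country": "USSR", "magnitude_body": 0.0},
    "geometry": {"type": "Point", "coordinates": [78.0, 50.0]},
}

print(style_feature(feature))  # {'color': '#ff0000', 'radius': 10.0}
print(rgb_to_hex((0.5, 0.5, 0.5)))  # #7f7f7f
```
Note that `int()` truncates rather than rounds, so a channel value of 0.5 becomes `7f` and not `80`; for map markers the difference is not visible.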
## Using dictionary keys to create folium map layers
As a bonus we can even use our color category keys to create different layers on the folium map which can be turned on and off individually.
Thus we can decide which country's detonations we want to visualize.
Of course, we could also create these keys dynamically from the polars dataframe by extracting the `.unique()` elements of its "country" column (even though we use the geopandas frame for display),
but here I am using my explicit mapping instead.
The implementation needs only two additional lines and a loop:
we loop through our keys and add a new layer for each one,
keeping only the rows which exactly match the key by using a pandas filter.
```python
m = folium.Map(tiles="cartodb positron")
for country in country_colors.keys():
    fg = folium.FeatureGroup(name=country, show=True).add_to(m)
    folium.GeoJson(
        gdf[gdf["country"] == country],
        name="Nuclear Explosions",
        marker=folium.Circle(radius=3, fill_opacity=0.4),
        style_function=lambda x: {
            "color": rgb_to_hex(country_colors[x["properties"]["country"]]),
            "radius": (
                x["properties"]["magnitude_body"]
                if x["properties"]["magnitude_body"] > 0
                else 1.0
            )
            * 10,
        },
    ).add_to(fg)
folium.LayerControl().add_to(m)
```
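For the dynamic alternative mentioned above, a rough sketch of building the mapping from the data could look like the following.
The `build_color_map` helper is hypothetical, and the sample list and palette are stand-ins: in the analysis the categories would come from something like `df["country"].unique().to_list()` and the palette from `sns.color_palette()`.
```python
from collections.abc import Iterable, Sequence


def build_color_map(categories: Iterable[str], palette: Sequence) -> dict:
    """Assign each unique category one palette color, in sorted order."""
    # Sorting makes the assignment independent of the order in which
    # the categories arrive.
    unique = sorted(set(categories))
    if len(unique) > len(palette):
        raise ValueError("palette has fewer colors than there are categories")
    return dict(zip(unique, palette))


# Made-up stand-ins for the country column and the seaborn palette.
countries = ["USSR", "US", "France", "US", "USSR"]
palette = [(0.1, 0.2, 0.3), (0.4, 0.5, 0.6), (0.7, 0.8, 0.9)]

country_colors = build_color_map(countries, palette)
print(country_colors)
# {'France': (0.1, 0.2, 0.3), 'US': (0.4, 0.5, 0.6), 'USSR': (0.7, 0.8, 0.9)}
```
The trade-off is that a newly appearing country shifts the colors of every country sorted after it, which is one reason to prefer the explicit mapping.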
## Remaining issues
While working with polars is wonderful and seaborn takes a lot of the stress out of getting halfway nicely formatted plots on the first attempt,
some pain points remain.
While I am cautiously optimistic, seaborn's 'objects-style' interface still remains woefully undercooked.
It is already possible to create some basic plots with it and its declarative style is wonderful
(as in, it really matches the mental model I have of drawing individual plot elements into a coherent whole).
But for anything more complex --- which in my opinion is exactly where this interface will really shine ---
it remains out of reach because of missing methods and implementations.
This is, of course, ideally just a temporary issue until the implementation gets better,
but until then we are still stuck with the rather strange mish-mash of seaborn simplicity with matplotlib exactness,
and having to know when to leave the former behind and delve into the arcane API of the latter.
Additionally, when combined with quarto for publishing, some more pain points appear.
One that has been true for the longest time, and will likely remain so for the foreseeable future,
is that tables beyond a certain complexity are just _painful_ in quarto multi-output publishing.
This project made use of the fantastic python library [great tables]() which indeed lives up to its name and produces absolutely great tables with very little effort.
However, it primarily targets the html format.
Getting this format into shape for quarto to then translate it into the pandoc AST and ultimately into whatever output format is requested is not pretty.
For example, LaTeX routinely just crashes instead of rendering the table correctly into a PDF file.