nuclear_explosions/meta.md

title: Reproducible blog posts
subtitle: Moving from Quarto manuscript to Astro markdown post output
description: Moving from Quarto manuscript to Astro markdown post output
pubDate: 2024-07-08T10:06:38
weight: 10
tags: python, astro

This page documents some meta observations about my time recreating the nuclear explosions in the last post.

It goes over some little tips to work well with python polars and seaborn, as well as little tricks to integrate them and geopandas visualizations.

Finally, I make some observations on the actual process of transferring this produced output into a blog post on my website, written in the Astro static site builder framework.

Data modelling and visualization

From a lat/long polars dataframe to geopandas

Going from a polars frame to one we can use for GIS operations with geopandas is fairly simple. We first convert the polars frame to an indexed pandas frame; in this case I have indexed on the date of each explosion.

We can use this intermediate dataframe to fill a geopandas frame whose geometry is built from the lat/long columns, using the gpd.points_from_xy() function to create spatial Point objects from plain pandas Series. Finally, we need to set a coordinate reference system via the 'crs=' argument, for which this visualization simply uses EPSG:4326 (WGS 84 latitude/longitude, which will generally be the right choice for global lat/long data).

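import geopandas as gpd

# df is the polars frame with "date", "latitude" and "longitude" columns
df_pd = df.to_pandas().set_index("date")
gdf = gpd.GeoDataFrame(
    df_pd,
    crs="EPSG:4326",  # WGS 84 latitude/longitude
    geometry=gpd.points_from_xy(x=df_pd["longitude"], y=df_pd["latitude"]),
)
# the intermediate pandas frame is no longer needed
del df_pd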

Keeping the same seaborn color palette for the same categories

For the analysis, I have multiple plots which distinguish between the different countries undertaking nuclear detonations. The country category thus appears repeatedly, and with static values (i.e. it will always contain 'US', 'USSR', 'China', 'France' and so on).

Now, seaborn has very nice functionality to automatically give different hues to categories like these in plots, but how do we ensure that the hues given remain the same throughout?

One way of achieving it would be to keep the order of categories the same throughout all plots. However, this seems hidden, often adds to the strain of just getting to the right data frame calculations, appears a little too magic for my liking and, to top it off, is even harder to achieve with some of polars' parallelized operations.

Instead we can explicitly map our categories to colors. In my case, my categories for this example will always be the different countries:

country_colors = {
    "US": 'blue',
    "USSR": 'red',
    "France": 'pink',
    "UK": 'black',
    "China": 'purple',
    "India": 'orange',
    "Pakistan": 'green',
}

These are colors seaborn understands and can be given to a plot via the keyword option palette=country_colors which will pass along the colors above to the respective plot.

However, one advantage of seaborn is its nice in-built color schemes (i.e. palettes) which we will not make use of if we instead hard-code our color preferences like this. Instead, we can directly access seaborn's color palette with sns.color_palette() which we can then use to explicitly map our categories to colors:

import seaborn as sns

# the currently active palette as a list of RGB tuples
cp = sns.color_palette()
country_colors = {
    "US": cp[0],
    "USSR": cp[3],
    "France": cp[6],
    "UK": cp[5],
    "China": cp[4],
    "India": cp[1],
    "Pakistan": cp[2],
}

This mapping is passed exactly the same way as the other. Now, we've ensured that colors in plots (that have the countries as hue category) will all have the same color for the same country throughout. At the same time we have a single spot in which we can change the actual color theme seaborn uses, instead of hard-coding our preferences throughout.

I find this very useful when creating analyses with similar categories throughout.

In the nuclear analysis there is a folium geospatial (GeoJson) map at the very end which uses colors to distinguish between the countries once again. Here we can make use of almost the same strategy, with the one caveat that folium expects the colors in hexadecimal format, while seaborn internally stores them as RGB value tuples.

What we can do, then, is use a simple translation function which converts from one format to the other on the fly, and inject that into the map creation method of folium:

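import folium


def rgb_to_hex(rgb: tuple[float, float, float]) -> str:
    # round() instead of int() so float inaccuracies cannot truncate a channel
    return "#" + "".join(format(round(c * 255), "02x") for c in rgb)


# named 'm' to avoid shadowing the built-in map()
m = folium.Map(tiles="cartodb positron")
folium.GeoJson(
    gdf,
    name="Nuclear Explosions",
    marker=folium.Circle(radius=3, fill_opacity=0.4),
    style_function=lambda x: {
        # look up the country's palette color and convert it for folium
        "color": rgb_to_hex(country_colors[x["properties"]["country"]]),
        # scale the circle by magnitude_body, falling back to 1.0
        "radius": (
            x["properties"]["magnitude_body"]
            if x["properties"]["magnitude_body"] > 0
            else 1.0
        )
        * 10,
    },
).add_to(m)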
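As a quick sanity check of the conversion, here is a minimal standalone sketch. It restates the helper with rounding and clamping (both are my assumptions, not something the analysis itself requires) and runs it on a few hand-picked tuples instead of an actual seaborn palette.

```python
def rgb_to_hex(rgb: tuple[float, float, float]) -> str:
    """Convert an RGB tuple with channels in 0..1 to a '#rrggbb' string."""
    # Clamp each channel to 0..1, then round to the nearest 8-bit value
    # (assumed behaviour for this sketch)
    channels = [round(min(max(c, 0.0), 1.0) * 255) for c in rgb]
    return "#" + "".join(format(c, "02x") for c in channels)


# Hand-picked sample colors, standing in for entries of a seaborn palette
samples = {
    "black": rgb_to_hex((0.0, 0.0, 0.0)),
    "white": rgb_to_hex((1.0, 1.0, 1.0)),
    "red": rgb_to_hex((1.0, 0.0, 0.0)),
}
```

Since seaborn already pulls in matplotlib, matplotlib.colors.to_hex() should be usable for the same conversion if you prefer not to carry your own helper.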

Using dictionary keys to create folium map layers

As a bonus we can even use our color category keys to create different layers on the folium map which can be turned on and off individually. Thus we can decide which country's detonations we want to visualize.

Of course, we could also create these keys dynamically from the polars dataframe by extracting the .unique() elements of its "country" column (even though we use the geopandas frame for display), but here I am using my explicit mapping instead.

The implementation needs only two additional lines and a loop: we loop through our keys and add a new layer for each one, keeping only the rows whose country exactly matches the key by using a pandas boolean filter.

m = folium.Map(tiles="cartodb positron")
for country in country_colors.keys():
    fg = folium.FeatureGroup(name=country, show=True).add_to(m)
    folium.GeoJson(
        gdf[gdf["country"] == country],
        name="Nuclear Explosions",
        marker=folium.Circle(radius=3, fill_opacity=0.4),
        style_function=lambda x: {
            "color": rgb_to_hex(country_colors[x["properties"]["country"]]),
            "radius": (
                x["properties"]["magnitude_body"]
                if x["properties"]["magnitude_body"] > 0
                else 1.0
            )
            * 10,
        },
    ).add_to(fg)
folium.LayerControl().add_to(m)
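The dynamic alternative mentioned earlier (deriving the keys from the data instead of writing them out) could be sketched roughly like this. The function name build_color_map, the sample country list and the three-color stand-in palette are all hypothetical; in the real analysis the categories would come from something like df["country"].unique().to_list() and the palette from sns.color_palette().

```python
from itertools import cycle

# Stand-in for sns.color_palette(): a few RGB tuples in the 0..1 range
# (hypothetical values, chosen only for illustration)
palette = [(0.12, 0.47, 0.71), (1.0, 0.5, 0.05), (0.17, 0.63, 0.17)]


def build_color_map(categories, colors):
    """Map each distinct category to a color in a stable (sorted) order."""
    unique = sorted(set(categories))
    # cycle() reuses colors if there are more categories than colors
    return dict(zip(unique, cycle(colors)))


# Sample data standing in for the "country" column
countries = ["US", "USSR", "US", "France", "China", "USSR"]
country_colors = build_color_map(countries, palette)
```

Sorting the unique values is what keeps the mapping stable between runs, since the order of unique elements is otherwise not something I would want to rely on. The trade-off against the explicit dictionary is that a new country appearing in the data can shift the colors of all the others.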

Output wrangling

Multiple project profiles

During development and analysis I have only had a single project which then in turn targeted two formats: html for previews and dynamic elements and pdf (in truth the new typst) for checking static elements. The following _quarto.yml file describes a full working project:

author: Marty Oehme
csl: https://www.zotero.org/styles/apa

project:
  type: default
  output-dir: output
  render:
    - index.qmd
    - meta.md

format:
  html:
    code-fold: true
    toc: true
    echo: true
  typst:
    toc: true
    echo: false
    citeproc: true
  docx:
    toc: true
    echo: false

This works well for single-'target' deployments which may arrive in multiple formats but are fundamentally the same. What happens, however, if we target something completely different (in my case this Astro blog) which may not even reside in the same directory?

We can create what quarto calls 'project profiles', simply by creating additional _quarto-myprofile.yml files in the project root. They will grab all the yaml data from the original _quarto.yml file and then add to and overwrite it with their own file's data to create the overall profile.

So if we have the following two files:

# _quarto.yml
author: Marty Oehme
csl: https://www.zotero.org/styles/apa

# _quarto-local.yml
project:
  type: default
  output-dir: output
  render:
    - index.qmd
    - meta.md

format:
  html:
    code-fold: true
    toc: true
    echo: true
  typst:
    toc: true
    echo: false
    citeproc: true
  docx:
    toc: true
    echo: false

We have essentially recreated the above project, only as a project 'profile' to be invoked as quarto render --profile local.
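Conceptually, the way a profile file is layered over the base file can be pictured as a recursive dictionary merge. The following is only a mental model under my own assumptions, not Quarto's actual implementation: merge_config is a hypothetical helper, the configs are written as plain dicts rather than parsed yaml, and lists are simply replaced wholesale, which may differ from Quarto's real merge rules.

```python
from copy import deepcopy


def merge_config(base: dict, overlay: dict) -> dict:
    """Recursively merge overlay into a copy of base; overlay wins on conflicts."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # both sides are mappings: descend and merge key by key
            merged[key] = merge_config(merged[key], value)
        else:
            # scalars and lists from the overlay replace the base value
            merged[key] = deepcopy(value)
    return merged


# _quarto.yml, written out as a dict
base = {"author": "Marty Oehme", "csl": "https://www.zotero.org/styles/apa"}
# an abridged _quarto-local.yml, written out as a dict
local = {
    "project": {
        "type": "default",
        "output-dir": "output",
        "render": ["index.qmd", "meta.md"],
    },
    "format": {"html": {"toc": True, "echo": True}},
}
effective = merge_config(base, local)
```

The resulting dict carries the author and csl keys from the base file alongside the project and format keys from the profile, which is the 'add and overwrite' behaviour described above.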

Now, however, we can add a second _quarto-remote.yml profile:

# _quarto-remote.yml
project:
  type: default
  output-dir: some/remote/directory/maybe/even/over/nfs/or/sshfs
  render:
    - index.qmd

format:
  hugo-md:
    preserve-yaml: true
    code-fold: true
    keep-ipynb: true
    wrap: none
  typst:
    toc: true
    echo: false
    citeproc: true

If we invoke this profile with quarto render --profile remote we output to a different directory altogether, and have completely different render targets than in the local profile (in this case the same typst format and the new hugo-md format, while not rendering to docx at all).

This way we can separate different deployments beyond just carrying different formats by actually extending and overwriting all kinds of project options.[1]

If we don't invoke a profile, we have no explicit render or format targets and no output dir set. However, there is also a way to set a 'default' project profile (for which we don't have to pass the option each time).

We can do so by slightly extending the base _quarto.yml file.

# _quarto.yml
author: Marty Oehme
csl: https://www.zotero.org/styles/apa

profile:
  group:
    - [local, remote]

The two profiles are now in a 'profile group', of which only one can ever be active and of which the first one in the list will automatically be applied when invoking quarto render without any additional options.
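The selection rule for a profile group can be illustrated with a few lines of code. This is just a sketch of the rule as described here, not how Quarto resolves profiles internally; active_profiles is a hypothetical function, and raising an error when two profiles of the same group are requested is my assumption of how such a conflict would be handled.

```python
def active_profiles(groups: list[list[str]], requested: list[str]) -> list[str]:
    """Return the profiles in effect: requested ones plus each group's default."""
    active = list(requested)
    for group in groups:
        chosen = [name for name in requested if name in group]
        if len(chosen) > 1:
            # assumed conflict handling: only one member of a group may be active
            raise ValueError(f"only one of {group} may be active, got {chosen}")
        if not chosen:
            # nothing requested from this group: its first entry is the default
            active.append(group[0])
    return active


groups = [["local", "remote"]]
default_run = active_profiles(groups, [])         # plain 'quarto render'
remote_run = active_profiles(groups, ["remote"])  # 'quarto render --profile remote'
```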

This is how I have been doing it for the nuclear analysis: have a local (in my case I simply called it 'default') profile which renders the current project to a local working directory using most of the usual quarto output, such as html preview, and static outputs to double-check how everything is displayed.

Then, I added another profile on top which I called 'blog' and which outputs its renders directly into the correct post directory of my blog. There are, however, some remaining issues, detailed below.

Static content in an Astro blog page

One issue arises in that quarto has its own way of stowing external fragments (like the PNGs of visualizations) and this often does not automatically work with static site generators which expect static files like this to reside in the 'static' (or 'public') directory instead of next to the manuscript.

I have overcome this issue with a 'post-script' which runs after the main quarto processing is done, by adding to the relevant _quarto.yml:

project:
  type: default
  output-dir: /path/to/my/blog/post/2024-07-02-directory
  render:
    - index.qmd
  post-render:
    - tools/move-static-to-blog.py /path/to/my/blog/static/dir

This way, we can first create all the necessary outputs in the normal quarto output directory and afterwards have a script which takes the resulting static files and copies them to the correct place in the blog's public directory.

I am not a huge fan of the amount of hard-coding this approach requires but it does seem like the easiest way to just be able to hit render and have working results.

The following is one example of how to use python to move the required files to a specific static directory:

#!/usr/bin/env python3
import os
import shutil
import sys
from pathlib import Path

# Safeguards to only move when necessary
if not os.getenv("QUARTO_PROJECT_RENDER_ALL"):
    sys.exit(0)
q_output_dir = os.getenv("QUARTO_PROJECT_OUTPUT_DIR")
if not q_output_dir or not Path(q_output_dir).is_dir():
    print(f"ERROR: Output dir {q_output_dir!r} given by Quarto *does not exist*.")
    sys.exit(1)

args = sys.argv

files: list[Path] = []
# Get the correct dest and files from args
if len(args) < 2:
    print("Static output file directory for blog post-render is required.")
    sys.exit(1)
else:
    dest = Path(args[1])
    if len(args) > 2:
        for f in args[2:]:
            files.append(Path(f))

# Move safeguards
if not dest.is_dir():
    print(f"ERROR: Static output directory {dest} *does not exist*.")
    sys.exit(1)

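# Without explicit files, collect everything below Quarto's '*_files' output dirs
if not files:
    source = Path(q_output_dir)
    for dirname in os.listdir(source):
        if dirname.endswith("_files"):
            dirpath = source.joinpath(dirname)
            # Path.walk() requires Python 3.12 or newer
            for root, loc_dirs, loc_files in dirpath.walk():
                for file in loc_files:
                    files.append(root.joinpath(file))

# Copy every file flat into the root of the static directory
for f in files:
    shutil.copy(f, dest)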

It simply requires the target directory as the first argument and uses the QUARTO_PROJECT_OUTPUT_DIR env var (which Quarto supplies to any post-render script) as the source. Then it copies either all files that have been mentioned as additional arguments (safer) or all files that it finds in directories ending in '_files' (more dangerous).

Now all additional files reside in the root of the static file dir. If you instead want to 'rebuild' the same structure in the static dir as in the source dir for your assets, you will have to adjust the script to map between the root-relative file paths in the two folders (and automatically generate new directories if necessary).
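One way such a structure-preserving variant could look is sketched below. It is not the script I actually use: copy_preserving_structure is a hypothetical helper, and the demo at the bottom builds a throwaway directory tree with a fake image purely so the sketch can run on its own.

```python
import shutil
import tempfile
from pathlib import Path


def copy_preserving_structure(source_root: Path, dest_root: Path) -> list[Path]:
    """Copy all files below '*_files' dirs in source_root to dest_root, keeping relative paths."""
    copied = []
    for asset_dir in sorted(source_root.glob("*_files")):
        if not asset_dir.is_dir():
            continue
        for src in sorted(asset_dir.rglob("*")):
            if not src.is_file():
                continue
            # same path below dest_root as the file had below source_root
            target = dest_root / src.relative_to(source_root)
            # create missing intermediate directories on the fly
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, target)
            copied.append(target)
    return copied


# Demo on a temporary tree standing in for the quarto output and blog static dirs
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp, "output")
    dest = Path(tmp, "static")
    figure_dir = source / "index_files" / "figure-html"
    figure_dir.mkdir(parents=True)
    (figure_dir / "plot.png").write_bytes(b"fake image")
    dest.mkdir()
    copied = copy_preserving_structure(source, dest)
    relative = [p.relative_to(dest).as_posix() for p in copied]
    all_exist = all(p.is_file() for p in copied)
```

Keeping the relative paths has the side benefit that two manuscripts with identically named figures no longer overwrite each other in the static dir.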

This should take care of placing static files in the right places.

Dynamic content in an Astro blog page

Getting the folium/leaflet map to work in a static site generator like Astro was a bit of a pain. Essentially, the concept of getting it to work is the same as for static content above:

We save the folium-produced html output as a static file and place that in the static file directory of the blog. Then we integrate it into the page with an iframe html element.
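The two steps could be wrapped up roughly as follows. This is a sketch under assumptions: export_map, its url_prefix parameter and the iframe dimensions are made up for illustration, and the small _FakeMap class only stands in for a folium map so the example runs without folium installed. With a real folium.Map, the .save() call writes the standalone html page.

```python
import tempfile
from html import escape
from pathlib import Path


def export_map(map_obj, static_dir: Path, filename: str, url_prefix: str = "/") -> str:
    """Save a map as a standalone html file and return an iframe snippet pointing at it."""
    static_dir.mkdir(parents=True, exist_ok=True)
    # folium maps provide .save(path), which writes a self-contained html page
    map_obj.save(str(static_dir / filename))
    # files in the static dir are assumed to be served from the site root
    src = escape(url_prefix.rstrip("/") + "/" + filename, quote=True)
    return f'<iframe src="{src}" width="100%" height="500" loading="lazy"></iframe>'


class _FakeMap:
    """Stand-in for folium.Map, exposing only the .save() method used above."""

    def save(self, path: str) -> None:
        Path(path).write_text("<html><body>map</body></html>")


with tempfile.TemporaryDirectory() as tmp:
    snippet = export_map(_FakeMap(), Path(tmp, "static"), "nuclear-map.html")
    file_written = Path(tmp, "static", "nuclear-map.html").is_file()
```

The returned snippet is what would be pasted (or emitted as raw html) into the post at the place where the map should appear.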

However, some issues arise in producing the static html file in the first place.

Remaining issues

While working with polars is wonderful and seaborn takes a lot of the stress out of creating half-way nicely formatted plots on the first pass, some pain points remain.

While I am cautiously optimistic, seaborn's 'objects-style' interface still remains woefully undercooked. It is already possible to create some basic plots with it and its declarative style is wonderful (as in, it really matches the mental model I have of drawing individual plot elements into a coherent whole). But anything more complex (which in my opinion is exactly where this interface will really shine) remains out of reach because of missing methods and implementations. This is, of course, ideally just a temporary issue until the implementation gets better, but until then we are still stuck with the stranger mish-mash of seaborn simplicity and matplotlib exactness, and with having to know when to leave the former behind and delve into the arcane API of the latter.

Additionally, when combined with quarto for publishing, some more pain points appear. One that has been true for the longest time, and will likely remain so for the foreseeable future, is that tables beyond a certain complexity are just painful in quarto multi-output publishing. This project made use of the fantastic python library great tables, which indeed lives up to its name and produces absolutely great tables with very little effort. However, it primarily targets the html format. Getting this format into shape for quarto to then translate it into the pandoc AST, and ultimately into whatever output format is requested, is not pretty. For example, LaTeX routinely just crashes instead of rendering the table correctly into a PDF file.


  1. It would for example even be conceivable to have one project profile targeting a locally output book project type while a second targets the deployment of a remote website type from the same source material.