feat(script): Add first data descriptions

Added preliminary data visualizations and descriptions for the current
identified pool. Not for anything screened yet, but for raw identified
results to use in guiding my screening.
Marty Oehme 2023-11-12 18:12:07 +01:00
parent 7391c03582
commit 7a2d12d252
Signed by: Marty
GPG key ID: EDBF2ED917B2EF6A


@@ -28,14 +28,18 @@ zotero:
```{python}
#| echo: false
from pathlib import Path
data_dir=Path("./02-data")
DATA_DIR=Path("./02-data")
BIB_PATH = DATA_DIR.joinpath("raw/01_wos-sample_2023-11-02")
## standard imports
from IPython.core.display import Markdown as md
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from tabulate import tabulate
sns.set_style("whitegrid")
```
# Introduction
@@ -280,7 +284,7 @@ Since scoping reviews allow both broad and in-depth analyses, they are the most
import bibtexparser
bib_string=""
for partial_bib in data_dir.joinpath("raw/wos").glob("*.bib"):
for partial_bib in BIB_PATH.glob("*.bib"):
    with open(partial_bib) as f:
        bib_string+="\n".join(f.readlines())
sample = bibtexparser.parse_string(bib_string)
@@ -542,11 +546,81 @@ The results to be identified in the matrix include a study's: i) key outcome m
sample_size = len(sample.entries)
md(f"""
The exploratory execution of queries results in an initial sample of {sample_size} studies after the identification process.
The majority of studies result from the income inequality cluster of the Boolean search, with horizontal cluster terms used often but rarely on their own.
The exploratory execution of queries results in an initial sample of {sample_size} potential studies after the identification process.
This sample contains all identified studies: duplicates have not yet been removed, and neither superseded literature nor any other screening criteria have been controlled for.
""")
```
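Since the sample size above still includes duplicates, a rough duplicate count can help judge how much it may shrink during screening. The following is a minimal sketch, not part of the committed script: the helper names `normalize_title` and `count_probable_duplicates` and the example titles are hypothetical, and matching on normalized titles is only an assumed heuristic.

```python
import re

import pandas as pd


def normalize_title(title: str) -> str:
    """Lowercase a title and collapse punctuation/whitespace for rough matching."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def count_probable_duplicates(titles: list[str]) -> int:
    """Count entries whose normalized title already appeared earlier in the list."""
    normalized = pd.Series([normalize_title(t) for t in titles])
    return int(normalized.duplicated().sum())


# hypothetical titles, for illustration only
example_titles = [
    "Income Inequality and Labour Markets",
    "Income inequality and labour markets.",
    "Minimum Wages in Developing Countries",
]
n_duplicates = count_probable_duplicates(example_titles)
```

Applied to the real sample, the titles could be taken from the `Title` field of each entry. Title matching will miss duplicates with differing titles and may flag distinct works that share a title, so the count is only indicative.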
The currently identified literature rises almost continuously in volume,
with small decreases between 2001 and 2008, as well as more significant ones in 2012 and 2016,
as can be seen in @fig-publications-per-year.
Keep in mind that these results are not yet screened for their full relevance to the topic at hand and are so far only *potentially* relevant in that they match the search pattern, so an increased result output does not necessarily mean a clearly rising amount of relevant literature.
```{python}
#| label: fig-publications-per-year
#| fig-cap: Publications per year
reformatted = []
for e in sample.entries:
    reformatted.append([e["Year"], e["Author"], e["Title"], e["Type"], e["Times-Cited"], e["Usage-Count-Since-2013"]])
bib_df = pd.DataFrame(reformatted, columns = ["Year", "Author", "Title", "Type", "Cited", "Usage"])
bib_df["Date"] = pd.to_datetime(bib_df["Year"], format="%Y")
bib_df["Year"] = bib_df["Date"].dt.year
# only keep newer entries
bib_df = bib_df[bib_df["Year"] >= 2000]
# create dummy category for white or gray lit type (based on 'article' appearing in type)
bib_df["Type"].value_counts()
bib_df["Literature"] = np.where(bib_df["Type"].str.contains("article", case=False, regex=False), "white", "gray")
bib_df["Literature"] = bib_df["Literature"].astype("category")
# plot by year, distinguished by literature type
ax = sns.countplot(bib_df, x="Year", hue="Literature")
ax.tick_params(axis='x', rotation=45)
# ax.set_xlabel("")
plt.tight_layout()
plt.show()
```
Anomalies such as the relatively significant dips in output in 2016 and 2012 become especially interesting against the strong later increase of output.
While this can mean a decreased interest or different focus points within academia during those time spans,
it may also point towards missing alternative term clusters that are newly arising, or a re-focus towards different interventions, and should thus be kept in mind for future scoping efforts.
Looking at the distribution between white and gray literature, a strong difference can be seen, with white literature clearly outnumbering gray literature.
This gap should not be surprising, since our database query efforts are primarily aimed at finding the most current versions of white literature.
The gap will perhaps shrink once the snowballing process is fully completed,
though it should remain clearly visible during the entire scoping process as a sign of a well-targeted identification step.
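The dips and the white/gray gap described above can also be read off numerically rather than from the plot alone. The sketch below assumes a data frame shaped like `bib_df` with `Year` and `Literature` columns; the function names and the toy data are hypothetical and only illustrate the calculation.

```python
import pandas as pd


def yearly_counts_with_change(df: pd.DataFrame) -> pd.DataFrame:
    """Count publications per year and the change against the previous year present in the data."""
    counts = df.groupby("Year").size().rename("Publications").to_frame()
    counts["Change"] = counts["Publications"].diff()
    return counts


def literature_share(df: pd.DataFrame) -> pd.Series:
    """Share of each literature type across the whole sample."""
    return df["Literature"].value_counts(normalize=True)


# toy data, for illustration only
toy_df = pd.DataFrame({
    "Year": [2010, 2010, 2011, 2011, 2011, 2012, 2013, 2013],
    "Literature": ["white", "gray", "white", "white", "gray", "white", "white", "white"],
})
changes = yearly_counts_with_change(toy_df)
# years whose output fell compared to the previous year in the data
dip_years = changes.index[changes["Change"] < 0].tolist()
shares = literature_share(toy_df)
```

Note that `diff()` compares against the previous year that appears in the data, so a year without any publications would be skipped rather than counted as zero.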
@fig-citations-per-year-avg shows the average number of citations for all studies published within an individual year.
From the current un-screened literature sample, several patterns become visible:
First, in general, citation counts are slightly decreasing, as should be expected for newer publications: less time has passed in which their contents could be dissected and distributed, or in which repeat citations could accumulate.
```{python}
#| label: fig-citations-per-year-avg
#| fig-cap: Average citations per year
bib_df["Cited"] = bib_df["Cited"].astype("int")
grpd = bib_df.groupby(["Year"], as_index=False)["Cited"].mean()
ax = sns.barplot(grpd, x="Year", y="Cited")
ax.tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()
```
Second, while such a decrease is visible in relatively recent years (especially 2019--2023), it is not a linear decrease throughout but rather an erratic yet broadly stable citation output.
This points to, first, no decrease in academic interest in the topic over this period of time,
second, no linearly developing concentration or centralization of knowledge output and dissemination,
and third, potentially no clear-cut increase of *relevant* output over time either.
Lastly, several years such as 2001, 2002, 2005 and 2008 are clear outliers with their high average citation counts, which can point to one of several things:
It can point to clusters of relevant literature feeding wider dissemination or cross-disciplinary interest, a possible sign of still somewhat unfocused research production which does not approach from a single coherent perspective yet.
It can also point to a centralization of knowledge production, with studies feeding more intensely off each other during the review process, a possible sign of more focused knowledge production and thus valuable to more closely review during the screening process.
Or it may mean that clearly influential studies have been produced during those years, a possibility which may be more relevant during the early years (2000--2008).
This is because, as @fig-publications-per-year showed, the overall output was nowhere near as rich as in the following years, allowing single influential works to skew the visible means for those years.
In all of these cases, such outliers should provide clear points of interest during the screening process for possible re-evaluation of current term clusters for scoping.
Should they point towards gaps in (or over-optimization of) specific areas of interest during those time-frames or more generally, they may provide an impetus for tweaking the identification query terms to better align with the prevailing literature output.
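One way to check whether an outlier year is driven by a single influential work is to compare the yearly mean with the median and with the share of citations held by the most-cited study. This is a minimal sketch under assumptions: it expects a data frame shaped like `bib_df` with `Year` and an integer `Cited` column, and the function name `citation_summary`, the `top_share` column, and the toy data are hypothetical.

```python
import pandas as pd


def citation_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize citations per year, including how dominant the top study is."""
    grouped = df.groupby("Year")["Cited"]
    summary = grouped.agg(["count", "mean", "median", "max"])
    # share of a year's total citations held by its single most-cited study
    summary["top_share"] = summary["max"] / grouped.sum()
    return summary


# toy data, for illustration only
toy = pd.DataFrame({
    "Year": [2001, 2001, 2001, 2020, 2020, 2020, 2020],
    "Cited": [300, 4, 2, 12, 10, 9, 11],
})
summary = citation_summary(toy)
```

A year with few studies, a mean far above its median, and a high `top_share` would suggest a single influential work rather than a broad cluster of well-cited literature.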
# Synthesis of Evidence
This section will present a synthesis of evidence from the scoping review.