This repository contains scripts for bibliometric analysis of the literature that cites dimensionality reduction method papers. These scripts download the entire Semantic Scholar Academic Graph (S2AG) and insert all paper and citation records into a PostgreSQL database. Jupyter notebooks plot the results of queries against this database. The scripts that download the Semantic Scholar bulk files and insert records into the SQL database (everything upstream of querying the database) are general and could be reused for other bibliometric analyses.
The database insertion scripts are quite dumb/inefficient (100 Python for loops running in parallel SLURM jobs, iterating over almost 1 TB of .jsonl files, inserting in batches of 1,000 records), but the single Postgres server node cannot really handle more simultaneous clients than this anyway.
Primary key constraints in each model prevent the insertion of duplicate records during the repair step described below.
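As a rough illustration, here is a minimal sketch of what one of these models might look like, assuming the Django ORM (the mentions of models, CharField, and primary key constraints suggest a Django-style schema); the field names are illustrative, not the repository's actual schema.

```python
# Minimal sketch of a paper model, assuming the Django ORM; field names are
# illustrative and not the repository's actual schema.
from django.db import models

class Paper(models.Model):
    # Using the S2 corpus ID as the primary key means that re-inserting the
    # same record during the repair step violates the primary key constraint
    # instead of silently creating a duplicate row.
    corpus_id = models.BigIntegerField(primary_key=True)
    title = models.TextField(null=True)
    # A venue longer than max_length causes the batch's transaction to fail
    # (see the repair step below).
    venue = models.CharField(max_length=255, null=True)
    year = models.IntegerField(null=True)
```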
conda env create -f environment.yml
Also see DB setup
Request a Semantic Scholar API key from https://www.semanticscholar.org/product/api#api-key
To install altair_saver, you will need to have NodeJS (tested with v20) and NPM (v8 or earlier) installed.
If the installation fails, run the NPM install command manually:
npm install vega-lite vega-cli canvas
On macOS, the canvas package may require some Homebrew dependencies (e.g., pkg-config, cairo, and pango).
Adjust the time (-t) that will be requested for the srun command based on the intended client jobs that will be run against the server.
srun -p medium --pty -t 2-12:00 -n 4 --mem 16G bash
cd ~/lab
module load postgresql
pg_ctl -D $(pwd)/hd-vis-db -l logfile start
hostname
Copy the hostname; it will be used as the value of HD_VIS_DB_HOST below.
This step took approximately one week to run (with 50-100 jobs running simultaneously, as specified with -j).
This command will download the bulk database files, unzip them, and then begin reading through the JSONL files and inserting records into the Postgres database.
cd ~/research/hd-vis-scripts
conda activate hd-vis
export SLURM_ACCOUNT=$(sshare -u mk596 -U | cut -d ' ' -f 1 | tail -n 1)
export HD_VIS_DB_NAME=mk596
export HD_VIS_DB_USER=mk596
export HD_VIS_DB_HOST="HOST_HERE"
export HD_VIS_DB_PASSWORD="MY_PASSWORD_HERE"
export S2_API_KEY="MY_KEY_HERE"
snakemake -j 50 --rerun-triggers mtime --keep-incomplete --keep-going --latency-wait 30 --slurm \
--default-resources slurm_account=$SLURM_ACCOUNT slurm_partition=short runtime=30
The .jsonl.gz files are 220 GB total. After unzipping, the .jsonl files are 840 GB total.
The insertion scripts use transactions with batches of 1,000 records at a time.
The transaction will fail if one of those 1,000 record insertions encounters an error (e.g., if a venue name is longer than the CharField allows).
The insertion scripts store the indices of failed batches.
The repair scripts try to insert individual records within those failed batches.
This is slower, as we now perform a transaction for every record (as opposed to every 1,000 records), but we only need to iterate over the records within failed batches.
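The following sketch illustrates this batch-then-repair control flow, again assuming the Django ORM; the function names and the make_model callable are hypothetical and only meant to show the pattern (one transaction per batch during insertion, one transaction per record during repair), not the repository's actual code.

```python
# Illustrative sketch of the batch-insert / repair control flow, assuming the
# Django ORM; function names and the make_model callable are hypothetical.
from django.db import DatabaseError, transaction

BATCH_SIZE = 1000

def insert_in_batches(records, make_model):
    """Insert records 1,000 at a time, one transaction per batch, and return
    the indices of batches whose transaction failed."""
    failed_batches = []
    for batch_number, start in enumerate(range(0, len(records), BATCH_SIZE)):
        try:
            with transaction.atomic():  # one failed record aborts the whole batch
                for record in records[start:start + BATCH_SIZE]:
                    make_model(record).save(force_insert=True)
        except DatabaseError:
            failed_batches.append(batch_number)
    return failed_batches

def repair_batches(records, failed_batches, make_model):
    """Retry only the failed batches, one transaction per record, so a single
    bad record (e.g., an overlong venue name) no longer sinks its neighbors."""
    for batch_number in failed_batches:
        start = batch_number * BATCH_SIZE
        for record in records[start:start + BATCH_SIZE]:
            try:
                with transaction.atomic():
                    make_model(record).save(force_insert=True)
            except DatabaseError:
                pass  # still-bad records and primary key duplicates are skipped
```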
The repair step took approximately one week to run (with 50-100 jobs running simultaneously, as specified with -j).
cd ~/research/hd-vis-scripts
conda activate hd-vis
export SLURM_ACCOUNT=$(sshare -u mk596 -U | cut -d ' ' -f 1 | tail -n 1)
export HD_VIS_DB_NAME=mk596
export HD_VIS_DB_USER=mk596
export HD_VIS_DB_HOST="HOST_HERE"
export HD_VIS_DB_PASSWORD="MY_PASSWORD_HERE"
export S2_API_KEY="MY_KEY_HERE"
snakemake repair_all -j 50 --rerun-triggers mtime --keep-incomplete --keep-going --latency-wait 30 --slurm \
--default-resources slurm_account=$SLURM_ACCOUNT slurm_partition=short runtime=180
After inserting records and repairing errors, the database directory should be approximately 142 GB:
du -sh ~/lab/hd-vis-db
# 142G /home/mk596/lab/hd-vis-db
These queries take approximately 3 hours to run.
srun -p interactive --pty -t 8:00:00 -n 4 --mem 32G bash
cd ~/research/hd-vis-scripts
conda activate hd-vis
export HD_VIS_DB_NAME=mk596
export HD_VIS_DB_USER=mk596
export HD_VIS_DB_HOST="HOST_HERE"
export HD_VIS_DB_PASSWORD="MY_PASSWORD_HERE"
cd src
python q01_update_preprints.py
python q02_update_papers.py
python q03_get_field_year_percentiles.py
python q04_get_citing_methods_l1.py
python q05_get_citing_methods_l2.py
python q06_get_l1_papers_fields.py
python q07_get_l1_papers_fields.py
python q08_summary_stats.py
The resulting extracted CSV files are approximately 2 GB total.
If the presigned URLs in data/raw/citations_meta.json and data/raw/papers_meta.json have expired, then:
snakemake -j 1 --until download_s2_papers_meta
snakemake -j 1 --until download_s2_citations_meta
snakemake -j 1 --touch
Execute download rules on different nodes:
snakemake -j 10 --until download_s2_papers_bulk_part --slurm \
--default-resources slurm_account=$SLURM_ACCOUNT slurm_partition=short runtime=30
snakemake -j 10 --until download_s2_citations_bulk_part --slurm \
--default-resources slurm_account=$SLURM_ACCOUNT slurm_partition=short runtime=30
Resources: