Create and maintain a Neo4j graph database of JOSS journal activities

JOSS-graph

GitHub scraping utilities for creating and maintaining a Neo4j graph database of JOSS submission, review, and publication activities.

Database stats

TBD

Background

The Journal of Open Source Software (JOSS) is an open journal that publishes peer-reviewed scientific research software. All submissions, reviews, editing, and formal publications are performed online via GitHub repositories and user accounts.

JOSS Topic Editors often choose reviewers from a central list of people who have agreed to review and have entered preference information, including preferred programming languages and areas of expertise. The list has grown to several hundred records and is cumbersome to use. Editors may also want to know a reviewer's review history and review frequency, or the topics that submitters have written on (for the purpose of soliciting reviews from them). This repo contains a graph model and scripts to slurp journal activity from the JOSS GitHub repos joss-papers and joss-reviews into a Neo4j graph database, so that such questions become easy to ask.

Model

The graph model is described in joss-model.yaml. The format complies with the Model Description File format in Bento.

JOSS graph diagram

There are five main node types:

  • person
  • assignment
  • submission
  • paper
  • issue

person

A person node represents an individual. It records the following properties:

  • handle (GitHub handle)
  • real_name (full name if available)
  • orcid (ORCID if available)
  • email (if available)
  • affiliation (if available)

The person nodes of reviewers who have provided language and topic preferences may be linked to language and topic nodes. language nodes have a name property; topic nodes have a content property. These property values are normalized to lower case and use only spaces for whitespace.
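The normalization rule above can be sketched in a few lines of Python. This is only an illustration, not the code the loader scripts use: the function name normalize_value is hypothetical, and trimming the ends and collapsing runs of whitespace into a single space are assumptions beyond what the rule states.

```python
import re


def normalize_value(raw: str) -> str:
    """Lower-case a value and replace each run of whitespace with one space.

    Assumption: leading/trailing whitespace is dropped and runs of
    whitespace are collapsed; the README only states "lower case" and
    "only spaces for whitespace".
    """
    return re.sub(r"\s+", " ", raw.strip()).lower()


print(normalize_value("Machine\tLearning"))  # machine learning
```

Applying the same function to values typed into a query keeps lookups against language.name and topic.content consistent with the stored form.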

submission

A submission node records an instance of formal submission to JOSS, instantiated as a pre-review issue (and follow-up review issue) in joss-reviews. It has the following properties:

  • title
  • disposition, one of (review_pending, under_review, paused, accepted, published, withdrawn, rejected, closed)
  • joss_doi
  • repository, URL of the submission's GitHub repo
  • prerev_issue_number, issue number of the submission's pre-review issue
  • review_issue_number, issue number of the submission's review issue (if any)

submission nodes are linked to issue nodes by has_prereview_issue and has_review_issue relationships.

issue

An issue node represents a GitHub issue. It has the following properties:

  • number
  • closed_date, a datetime in ISO 8601 UTC, e.g., 2019-06-30T23:15:13Z
  • created_date, in ISO 8601 UTC
  • url, the GitHub URL of the issue
  • labels, a single string of GitHub labels on the issue, separated by the pipe character |
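Client code that reads issue nodes will typically want real datetimes and a list of labels. A minimal Python sketch under the formats described above; the helper names parse_issue_date and split_labels and the label values in the example are hypothetical.

```python
from datetime import datetime, timezone


def parse_issue_date(value: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp such as 2019-06-30T23:15:13Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    # The trailing Z means UTC, so attach that time zone explicitly.
    return parsed.replace(tzinfo=timezone.utc)


def split_labels(labels: str) -> list:
    """Split the pipe-separated labels string; an empty string gives []."""
    return [label for label in labels.split("|") if label]


closed = parse_issue_date("2019-06-30T23:15:13Z")
print(closed.isoformat())                   # 2019-06-30T23:15:13+00:00
print(split_labels("accepted|published"))   # ['accepted', 'published']
```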

paper

paper nodes record the location and DOI information of published submissions. A paper is linked to its corresponding submission via a from_submission relationship. A paper node has the following properties:

  • title
  • joss_doi
  • archive_doi, DOI of the software archive for the paper, frequently found on Zenodo
  • url, the paper's URL at https://joss.theoj.org
  • published_date (YYYY-MM)
  • volume, JOSS volume number
  • issue, JOSS issue number

assignment

An assignment node records a single "encounter event" between a person and a submission. Each assignment is linked to a single person by an "assigned_to" relationship, and to a single submission by an "assigned_for" relationship. An assignment node has the following properties:

  • role, one of (author, submitter, reviewer, editor, eic)
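To make the person/assignment/submission pattern concrete, here is a hedged Python sketch that builds a parameterized Cypher statement for recording one assignment. It is not taken from the loader scripts. The function name assignment_statement is hypothetical, matching the submission by prerev_issue_number is an assumption, and so is the direction of assigned_for (assigned_to is drawn from assignment to person, as the sample queries suggest).

```python
# Allowed values of the assignment role property, as listed above.
ROLES = ("author", "submitter", "reviewer", "editor", "eic")

# Assumption: both relationships point away from the assignment node.
ASSIGNMENT_QUERY = (
    "MATCH (p:person {handle: $handle}), "
    "(s:submission {prerev_issue_number: $prerev_issue_number}) "
    "MERGE (p)<-[:assigned_to]-(a:assignment {role: $role})"
    "-[:assigned_for]->(s)"
)


def assignment_statement(handle, prerev_issue_number, role):
    """Return (query, parameters) for one person/submission assignment."""
    if role not in ROLES:
        raise ValueError(f"unknown role: {role!r}")
    params = {
        "handle": handle,
        "prerev_issue_number": prerev_issue_number,
        "role": role,
    }
    return ASSIGNMENT_QUERY, params


# Hypothetical handle and issue number, for illustration only.
query, params = assignment_statement("example-user", 1234, "reviewer")
print(query)
print(params)
```

Because each assignment carries exactly one role, a person who both authors and reviews for JOSS ends up with several assignment nodes, which is what the sample queries rely on.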

Sample queries

  • Q. How many papers has JOSS published to date?

     match (p:paper) return count(p);
    
  • Q. How many authors are also reviewers?

     match (p:person)--(a:assignment {role:"author"}), (p)--(b:assignment {role:"reviewer"})
     return count( distinct p );
    
  • Q. Who are the top 10 all-time reviewers by papers reviewed?

     match (p:person)--(a:assignment {role:"reviewer"})
     return p.handle, count(a) as num_papers order by num_papers desc limit 10;
    
  • Q. Which potential reviewers have also published in JOSS with submitting author "labarba"?

     match (a:person {handle:"labarba"})--(:assignment {role:"author"})--(s:submission) with a, s
     match (p:person)--(:assignment {role:"author"})--(s) where (p) <> (a) and
     (p)--(:assignment {role:"reviewer"})
     return distinct a.handle as a, p.handle as h, p.orcid as o order by h;
    
  • Q. What is the current submission/publication ratio for JOSS Topic Editors?

     match (p:person)<--(:assignment {role:"editor"}) with distinct p as e
     match (e)--(:assignment {role:"editor"})--(s:submission) 
     optional match (s)--(b:paper)
     return e.handle, toFloat(count(b))*100.0/toFloat(count(s)) as ratio, count(s) as n
     order by n desc;
    

Scripts

The following scripts are provided here to update a current Neo4j instance of JOSS-graph:

  • update-ghquery.pl - queries both the graph and the GitHub GraphQL (a.k.a. v4) endpoint to update submissions and publications

  • load-update.pl - converts the JSON output of update-ghquery.pl into Cypher statements

These rely on the Perl modules in the lib directory. The machinery can be built locally by cloning the repo, cd'ing to the main directory, and executing:

curl -L https://cpanmin.us | perl - App::cpanminus
cpanm Module::Build
cpanm -n Time::Zone # avoids a current bug in TimeDate tests
perl Build.PL
./Build
./Build installdeps --cpan_client cpanm
./Build install

Using Docker containers is easier.

Docker

Set up a Neo4j instance running in a Docker container, using the community Neo4j images available on Docker Hub (https://hub.docker.com). To prime this instance with the Neo4j v4.4 dump of the 2023-01-02 JOSS graph, first create a data directory on your destination system and load the dump into it, then point a Neo4j Docker container at that directory:

cd docker
export LOC=~/jg/neo4j/data # e.g.
mkdir -p $LOC
gunzip jg.20230102.v4-4.dump.gz
mv jg.20230102.v4-4.dump $LOC
# load dump into $LOC
docker run -v$LOC:/data --rm neo4j:4.4 \
  neo4j-admin load --from=data/jg.20230102.v4-4.dump
docker run -d -p7474:7474 -p7473:7473 -p7687:7687 -v$LOC:/data \
  --name jossgraph neo4j:4.4 

After running these commands, a live database should be accessible at http://localhost:7474.

To update the database, run the following container:

 docker run -d --rm \
   -e NEO_URL=localhost \
   -e GHCRED=ghp_XXXXXXXXX \
   -e NEOUSER=neo4j \
   -e NEOPASS=<password> \
   maj1/jg_manager

See the cron directory for scripts to perform this unattended at intervals.

License

Perl (GNU GPLv2 / Artistic License)