Primary language: Jupyter Notebook. License: Creative Commons Zero v1.0 Universal (CC0-1.0).

RCPublications

This repo provides metadata about publications for the Rich Context knowledge graph, which links publications to datasets.

The links and other metadata that are represented here originate from manually-curated documents provided by our community of researchers, agencies, and other data providers.

Updates arrive in multiple drops, and the manual curation is performed in a separate repo before anything is committed here: https://github.com/NYU-CI/RichContextMetadata/tree/master/metadata

Also, before working in this repo you must set up your pre-commit hooks for Git:

chmod +x .githooks/pre-commit
bash .githooks/one-time-hook-setup.sh

Instructions for adding new publications

  1. Create a branch
  2. Prepare the data in your CSV file
  3. Generate a JSON file to add as a partition
  4. Rebase if needed
  5. Run unit tests prior to commit

1. Create a branch

Create a new branch with the same name as your metadata/ subdirectory.

Example: git checkout -b 20190717_usda_wic

2. Prepare the data in your CSV file

Identify the CSV within your metadata/ subdirectory. There may be multiple sheets in the original spreadsheet provided by the partner, so make sure you've selected the one created by someone on our team.

If your CSV lists a publication with a DOI but no URL, construct a URL in a new column in the CSV before proceeding: https://www.doi.org/<doi>

Excel code: ="https://www.doi.org/" & <doi_cell>
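
If you would rather fill the column in with a script than in Excel, the same rule can be sketched in Python. This is only an illustration: the column names `doi` and `url` and the sample rows are assumptions, so adjust them to match the headers in your own CSV.

```python
import csv
import io

DOI_RESOLVER = "https://www.doi.org/"


def fill_missing_urls(rows, doi_field="doi", url_field="url"):
    """Return copies of the rows, building a URL from the DOI where the URL is empty.

    The field names are hypothetical defaults, not names required by this repo.
    """
    filled = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are left untouched
        doi = (row.get(doi_field) or "").strip()
        url = (row.get(url_field) or "").strip()
        if doi and not url:
            row[url_field] = DOI_RESOLVER + doi
        filled.append(row)
    return filled


# Made-up sample data standing in for a partner CSV.
sample = io.StringIO(
    "title,doi,url\n"
    "Paper A,10.1000/xyz123,\n"
    "Paper B,,https://example.org/b\n"
)
rows = fill_missing_urls(csv.DictReader(sample))
for row in rows:
    print(row["title"], row["url"])
```

Rows that already have a URL are left alone, and rows with neither a DOI nor a URL stay empty so they can be caught when you check the required fields.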

Finally, make sure every entry in your CSV file has the minimum required fields, and remove any entries that don't.
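
Dropping incomplete entries can also be scripted. The sketch below is a rough illustration, not part of this repo's tooling: `REQUIRED` is a placeholder list, so substitute the actual required field names before using it.

```python
def split_by_required_fields(rows, required_fields):
    """Split rows into (complete, incomplete) based on non-empty required fields."""
    complete, incomplete = [], []
    for row in rows:
        has_all = all(str(row.get(field) or "").strip() for field in required_fields)
        (complete if has_all else incomplete).append(row)
    return complete, incomplete


# Placeholder field names, assumed for illustration only.
REQUIRED = ["title", "url"]

# Made-up rows, as csv.DictReader would yield them.
rows = [
    {"title": "Paper A", "url": "https://example.org/a"},
    {"title": "Paper B", "url": ""},
    {"title": "", "url": "https://example.org/c"},
]
kept, dropped = split_by_required_fields(rows, REQUIRED)
print(len(kept), "kept;", len(dropped), "dropped")
```

Keeping the dropped rows in a separate list, rather than discarding them silently, makes it easy to review what was removed before you generate the JSON partition.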

3. Generate a JSON file to add as a partition

Use the scripts/publications_export_template.py script to generate a JSON file to add to the partitions/ directory.

  • Navigate to your subdirectory in RichContextMetadata/metadata where your CSV is stored

  • Copy the name of the directory where your CSV is located, and copy the file name of the CSV you want to export

    • Note: The directory name will become part of the new JSON file's name.
  • Execute python scripts/publications_export_template.py <directory_name> <csv_file_name> in a terminal or in your favorite IDE.

  • If you want to specify your own filename for the JSON partition, add it as a third argument, e.g.: python scripts/publications_export_template.py <directory_name> <csv_file_name> <json_file_name>

    • This script will export your publications metadata to the partitions/ directory
  • If you run into any problems with the template, post a GitHub issue on this repo

Once the script has finished running without errors, check the RCPublications/partitions subdirectory to make sure that the JSON file has the required fields and was exported properly.
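
A quick manual check like this can be approximated with a few lines of Python. The sketch below rests on two assumptions that are not confirmed by this README: that a partition is a JSON array of objects, and that `title` and `url` are among the required fields. Treat both as placeholders.

```python
import json


def find_incomplete_entries(entries, required_fields):
    """Return (index, missing_fields) pairs for entries lacking a required field."""
    problems = []
    for index, entry in enumerate(entries):
        missing = [field for field in required_fields if not entry.get(field)]
        if missing:
            problems.append((index, missing))
    return problems


# Made-up partition content; in practice you would json.load() your exported file.
partition_text = json.dumps([
    {"title": "Paper A", "url": "https://example.org/a"},
    {"title": "Paper B"},
])
entries = json.loads(partition_text)
problems = find_incomplete_entries(entries, ["title", "url"])
for index, missing in problems:
    print(f"entry {index} is missing: {', '.join(missing)}")
```

This does not replace the unit tests in test.py; it is only a fast way to spot obviously incomplete entries before you commit.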

4. Rebase if needed

Since our team is generally working on different partitions in parallel, you'll often need to rebase before creating a pull request:

git rebase master
git push -f origin

Sometimes there may be merge conflicts, which you'll need to fix manually before you can continue. See this Git rebase tutorial for more details.

5. Run unit tests prior to commit

Run the unit tests on your new JSON file partition prior to commit:

python test.py partitions/20190717_usda_wic_publications.json

Caveats

  • We handle duplicate titles downstream, during graph management.