Download the dataset from Harvard's Dataverse (27.5 GB compressed; 500 GB uncompressed)



This script takes JSON files from CrossRef as input and converts them to citeproc JSON. It then uses the citeproc JS library to create citation strings in ~1500 CSL styles. The citation styles are included under dataset-creation/csl. The final output is a CSV file of tagged XML citations.
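The conversion step described above can be sketched roughly as follows. This is not the code from generateCSVcitationdataset.js; the function name `crossrefToCsl`, the `TYPE_MAP` table, and the choice of fields are illustrative assumptions based on the public CrossRef and CSL JSON formats, where CrossRef stores titles as arrays and CSL expects plain strings.

```javascript
// Hypothetical, partial mapping from CrossRef work types to CSL item types.
const TYPE_MAP = {
  'journal-article': 'article-journal',
  'book-chapter': 'chapter',
  'proceedings-article': 'paper-conference',
  book: 'book',
};

// CrossRef wraps titles in arrays; CSL JSON expects a single string.
const first = (value) => (Array.isArray(value) ? value[0] : value);

// Convert one CrossRef work record into a CSL JSON item (illustrative subset of fields).
function crossrefToCsl(work) {
  const item = {
    id: work.DOI,
    DOI: work.DOI,
    type: TYPE_MAP[work.type] || 'article', // assumed fallback type
    title: first(work.title),
    'container-title': first(work['container-title']),
    // Keep only the name parts; CrossRef also carries sequence, affiliation, etc.
    author: (work.author || []).map(({ given, family }) => ({ given, family })),
    issued: work.issued, // both formats use { 'date-parts': [[y, m, d]] }
    volume: work.volume,
    issue: work.issue,
    page: work.page,
  };
  // Drop fields that the CrossRef record did not provide.
  return Object.fromEntries(
    Object.entries(item).filter(([, value]) => value !== undefined)
  );
}

// Example with a made-up record (the DOI below is a placeholder, not a real entry).
const exampleWork = {
  DOI: '10.1000/example',
  type: 'journal-article',
  title: ['A Title'],
  'container-title': ['A Journal'],
  author: [{ given: 'Ada', family: 'Lovelace', sequence: 'first' }],
  issued: { 'date-parts': [[2017, 4, 2]] },
};
console.log(JSON.stringify(crossrefToCsl(exampleWork), null, 2));
```

The resulting object is the kind of item a citeproc engine consumes together with a CSL style file; the actual script may map more fields or handle types differently.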


💻Terminal (preferably 🐧Linux)

  1. Install nodejs

curl -sL | sudo -E bash

sudo apt-get install -y nodejs

  2. Node Modules

Citeproc and citeproc-js-node are required for this script. Working versions of these libraries are contained in this repo under /dataset-creation/node_modules. It is recommended to use these versions.

The folder /node_modules containing citeproc and citeproc-js-node must be placed under your home directory if it is not already present there.

  3. Required libraries

cd node_modules

  • Install npmlog: npm install npmlog
  • Install xmldom: npm install xmldom
  • fs - for reading files (built into Node.js, so no installation is needed)
  • zip and unzip (optional): sudo apt install zip and sudo apt install unzip


This script takes JSON files from CrossRef as input. A script for downloading random CrossRef entries can be found in /dataset-creation/crossref/. Alternatively, a dump of CrossRef metadata (2017-04-02) can be used.

Input files should be placed in the folder /dataset-creation/inputFiles/ . A sample is provided.

Citation Styles

1568 CSL citation styles are located under dataset-creation/csl. A small number of the original styles were removed because they did not work.

Create Citation Dataset

Under the folder /dataset-creation run the following:

node generateCSVcitationdataset.js tags [input_filename]

For example, using the provided sample file:

node generateCSVcitationdataset.js tags sampleCrossref.json

This script will generate the CSV citation file and save it to /dataset-creation/outputFiles/ . The output file will be named 'output_[input_filename].csv'. It will also create cslciteproc.json under /dataset-creation/cslCiteprocOutput/


If you have a large number of input files it may be better to run ./ under /dataset-creation. This script will loop through all input files located in /dataset-creation/inputFiles. For each file it will run generateCSVcitationdataset. Each output file will then be zipped and saved in /dataset-creation/outputFiles. The unzipped version will be removed.
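The batch loop described above could look roughly like the Node.js sketch below. It is not the repository's batch script; the function names `outputNameFor`, `planBatch`, and `runBatch` are hypothetical. It assumes input files end in .json, that the output name drops the input extension (as the sample output_sampleCrossref.csv suggests), and that the zip command-line tool is installed.

```javascript
const { execFileSync } = require('child_process');
const fs = require('fs');
const path = require('path');

// Expected CSV name for an input file, e.g. sampleCrossref.json -> output_sampleCrossref.csv.
// Assumption: the extension is dropped, inferred from the sample output file name.
function outputNameFor(inputName) {
  const stem = path.basename(inputName, path.extname(inputName));
  return `output_${stem}.csv`;
}

// Pure planning step: decide which files to process and what each one produces.
function planBatch(inputNames) {
  return inputNames
    .filter((name) => name.endsWith('.json')) // assumed input extension
    .sort()
    .map((name) => {
      const csv = outputNameFor(name);
      return { input: name, csv, zip: `${csv}.zip` };
    });
}

// Run the generator on every input file, zip each CSV, then delete the unzipped CSV.
// datasetDir is the path to the dataset-creation folder.
function runBatch(datasetDir) {
  const inputDir = path.join(datasetDir, 'inputFiles');
  const outputDir = path.join(datasetDir, 'outputFiles');
  for (const job of planBatch(fs.readdirSync(inputDir))) {
    execFileSync('node', ['generateCSVcitationdataset.js', 'tags', job.input], {
      cwd: datasetDir,
      stdio: 'inherit',
    });
    // -j stores the file without its directory path inside the archive.
    execFileSync('zip', ['-j', job.zip, job.csv], { cwd: outputDir });
    fs.unlinkSync(path.join(outputDir, job.csv));
  }
}
```

Calling `runBatch('/path/to/dataset-creation')` would process the files one at a time. Keeping `planBatch` separate from the side effects makes it easy to print the plan before starting a long run.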


The output will be a CSV file with the following columns:

  • doi
  • articleType (journal, book etc.)
  • citationStyle
  • citationStringAnnotated

articleType and citationStyle are stored as indexes; the corresponding index files are located under /dataset-creation/indexes/ . citationStringAnnotated is an annotated XML citation. A sample output file is given in outputFiles/output_sampleCrossref.csv. Any errors, such as a citation style that did not work, are logged to log.txt.
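To illustrate how an annotated citation string might be consumed, the sketch below pulls out tagged fields and recovers the plain citation text. The tag names in the sample (`author`, `year`, `title`) and the helper names `extractFields` and `stripTags` are assumptions made for illustration; check the sample output file for the tags the dataset actually uses. The regular expressions handle only flat, non-nested tags and do not decode XML entities, so a real XML parser such as xmldom would be more robust.

```javascript
// Collect { tag, text } pairs from flat (non-nested) tags such as <title>...</title>.
function extractFields(annotated) {
  const fields = [];
  const pattern = /<([A-Za-z][\w-]*)>([^<]*)<\/\1>/g;
  let match;
  while ((match = pattern.exec(annotated)) !== null) {
    fields.push({ tag: match[1], text: match[2] });
  }
  return fields;
}

// Remove all opening and closing tags, leaving the citation as plain text.
function stripTags(annotated) {
  return annotated.replace(/<\/?[A-Za-z][\w-]*>/g, '');
}

// Made-up example string; the real tag vocabulary may differ.
const sample =
  '<author>Lovelace, A.</author> (<year>2017</year>). <title>A Title</title>.';

console.log(extractFields(sample));
console.log(stripTags(sample));
```

The field list gives token-level labels for training a citation parser, while the stripped string is the corresponding unlabelled input.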


MIT License

Main Authors and Contributors

  • Mark Grennan grennama (@)
  • Joeran Beel beelj (@)
  • Martin Schibel
  • Andrew Collins
  • Dominika Tkaczyk