Extract references from PDF bibliographies and find the corresponding papers in OpenReview
- Docker (For pre-packaged Grobid containers)
- node >= 16
> docker pull lfoppiano/grobid:0.7.2
> docker run --rm -p 8070:8070 lfoppiano/grobid:0.7.2
Note: The first time the Grobid REST API is called after starting the container, it will load up a bunch of model files on demand, which takes a long time (~15-20 seconds or more, depending on the machine). The command line app might timeout on the first usage.
> npm install
> npm run build
Install (with sudo if necessary) to system node installation:
> npm i -g ./
Grobid can use Biblio-Glutton to improve performance if necessary.
> pdf-ref-resolver extract-references --help
Extract PDF references using Grobid service, match with OpenReview papers
Options:
--version Show version number [boolean]
--help Show help [boolean]
--pdf Input pdf file [string] [required]
--output-path Specify a directory to write output. Defaults to same
as input PDF [string]
--config Path to config file [string] [required]
--overwrite Overwrite any existing output file
[boolean] [default: false]
--with-matched only output references that match an OpenReview note
[boolean] [default: false]
--with-unmatched only output references that *do not* match an
OpenReview note [boolean] [default: false]
--with-partial-matched only output references with match < 100% to an
OpenReview note [boolean] [default: false]
--with-source include the raw Grobid data in the output (verbose,
for debugging) [boolean] [default: false]
Configuration to specify REST endpoints and credentials
{
"openreview": {
"restApi": "https://api.openreview.net",
"restApi2": "https://api2.openreview.net",
"restUser": "my-username",
"restPassword": "my-password"
},
}
> pdf-ref-resolver extract-references --pdf ./path/to/input.pdf --config ~/my-config.json --output-path .
> pdf-ref-resolver extract-references --pdf ./path/to/input.pdf --config ~/my-config.json
Summary header block, with number of Grobid extracted references, and counts of those with valid titles and matches to OpenReview notes.
{
"summary": {
"references": 25,
"withTitles": 25,
"withNoteMatches": 13
},
"references": [
Reference Entry:
{
"refNumber": 5,
Title / Author fields as extracted by Grobid. Field 'isValid' will be false if the reference had no title, or could not be processed for some reason. Messages as to reason why will be recorded in the "warnings" array.
"title": "Expectation Propagation in Gaussian process dynamical systems",
"authors": [
"M P Deisenroth",
"S Mohamed"
],
"isValid": true,
"warnings": [],
Array of notes from OpenReview that matched. Notes are matched by first searching OpenReview, using Grobid-extracted title as keywords, then filtered with a string similarity function. Fields titleMatch/nameMatch are numbers 0-100, 100 being a perfect match.
Name matching uses a string similarity function that doesn't penalize deletions from the longer version of the name to the shorter, to allow matching between full and abbreviated names, so, e.g., "Marc Peter Deisenroth" -> "M P Deisenroth" is a 100% match.
"openreviewMatches": [
{
"id": "HkNwk_ZObS",
"authors": [
{
"name": "Marc Peter Deisenroth",
"id": "~Marc_Deisenroth1",
"nameMatch": 100
},
{
"name": "Shakir Mohamed",
"id": "~Shakir_Mohamed1",
"nameMatch": 100
}
],
"titleMatch": 100,
"apiSource": "api.openreview.net"
}
]
},
]