Grobid is well documented, but retraining and evaluating on existing gold-standard corpora isn't straightforward: most of the docs assume that you're manually annotating the training data automatically generated by Grobid. Currently, our process of generating data for retraining requires a bash environment, Python, and Java, and looks like this:
-
If you have Word documents, convert them to PDF; Grobid uses PDF input.
$ unoconv -f pdf *.docx
-
Run Grobid's createTraining step to generate documents that end in .fulltext or .header; these are used for training input. It also generates under-tagged XML, which we'll be replacing. From Grobid's top-level directory, run
$ java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.5.2-onejar.jar -gH grobid-home -dIn /path/to/pdfs -dOut /path/to/output/directory -exe createTraining
-
If your gold-standard corpora are in JATS rather than TEI format, convert them to a TEI flavour that Grobid likes by running Pub2TEI. Unfortunately it's XSLT 2.0, so you'll need Saxon; clone Pub2TEI and run Saxon with Publishers.xsl, like
$ transform -s:/path/to/JATS/ -xsl:Pub2TEI/Stylesheets/Publishers.xsl -o:/path/to/output/directory
-
Generate pdftotext output with -raw to match Grobid's preferred linebreak patterning.
$ for x in *.pdf; do pdftotext -raw "$x"; done
-
Run linebreaker.py with your gold-standard TEI as the first argument and the pdftotext -raw output as the second. This inserts <lb> elements, which signify linebreaks for Grobid's parser, into the known-good target XML. If the XML and the text files are in the same directory, you can do
$ for x in *.xml; do python linebreaker.py "$x" "${x%.xml}.txt"; done
Note that linebreaker currently handles body text only; adapting it for headers would be welcome! This output should replace the XML generated by the createTraining step.
-
You are now ready to retrain Grobid! Performance is unfortunately quite slow right now: Wapiti is theoretically multithreaded, but Grobid executes it with a single thread, and even after tweaks to its config it will generally process only 5-10 documents an hour. You can retrain with
$ java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-0.5.2-onejar.jar 0 fulltext -gH grobid-home
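To make the linebreaker step above more concrete, here is a toy sketch of the underlying idea: walk the gold-standard text and the pdftotext -raw output in parallel, and emit a linebreak marker wherever the raw output ends a line. This is not the code in linebreaker.py. The function name `insert_linebreaks`, the `<lb/>` spelling of the marker, and the choice to work on plain text rather than on the TEI tree are all assumptions made for illustration, and it only handles raw lines whose words match the gold text exactly.

```python
def insert_linebreaks(gold_text, raw_text, marker="<lb/>"):
    """Re-emit gold_text with `marker` wherever raw_text ends a line.

    Hypothetical helper: assumes whitespace-separated tokens in the gold
    text match the raw pdftotext tokens exactly (no hyphenation repair).
    """
    gold = gold_text.split()
    out = []
    pos = 0  # index of the next gold token that has not been emitted yet
    for line in raw_text.splitlines():
        words = line.split()
        if not words:
            continue  # blank line in the raw output
        end = pos + len(words)
        if gold[pos:end] == words:
            # The raw line lines up with the gold text: keep it, mark the break.
            out.append(" ".join(words) + marker)
            pos = end
        # Otherwise the raw line is not part of the gold text at this point
        # (for example a running header), so it is skipped.
    if pos < len(gold):
        # Gold text that never matched a raw line is kept, without a marker.
        out.append(" ".join(gold[pos:]))
    return "\n".join(out)
```

A real implementation has to cope with hyphenated words, differing whitespace, and markup nested inside the text, none of which this sketch attempts.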
The models generated above are placed into the correct subdirectory of the Grobid build automatically and can be used right away. If you want to evaluate their performance on a large corpus, the easiest way is Grobid's Pub2TEI end-to-end evaluation method. Per the Grobid documentation, create a directory like:
├── article1
│   ├── article1.pdf
│   ├── article1.pub2tei.tei.xml
│   └── article1.nxml
│
├── article2
│   ├── article2.pdf
│   ├── article2.pub2tei.tei.xml
│   └── article2.nxml
...
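A layout like this could be assembled with a short script. The sketch below is one possible approach, not part of Grobid: the function name `build_eval_layout` is hypothetical, and it assumes each gold-standard TEI file is named after its PDF (`article1.pdf` pairs with `article1.xml`), which may not match your corpus.

```python
import shutil
from pathlib import Path


def build_eval_layout(pdf_dir, tei_dir, out_dir):
    """Copy each PDF and its TEI file into out_dir/<name>/.

    Hypothetical helper. Assumes <name>.pdf in pdf_dir pairs with
    <name>.xml in tei_dir. Returns the names that were copied.
    """
    pdf_dir, tei_dir, out_dir = Path(pdf_dir), Path(tei_dir), Path(out_dir)
    created = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        tei = tei_dir / f"{pdf.stem}.xml"
        if not tei.is_file():
            continue  # no gold standard for this PDF, so leave it out
        article_dir = out_dir / pdf.stem
        article_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(pdf, article_dir / pdf.name)
        # Use the file suffix shown in the directory listing above.
        shutil.copy2(tei, article_dir / f"{pdf.stem}.pub2tei.tei.xml")
        created.append(pdf.stem)
    return created
```

Calling `build_eval_layout("/path/to/pdfs", "/path/to/tei", "/path/to/this/directory")` would then produce one subdirectory per article that has both files.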
You should already have the needed files from earlier, but evaluate on a set of documents that differs from the one used for training. From Grobid's top-level directory, run
$ ./gradlew Pub2TeiEval -Pp2t=/path/to/this/directory -Prun=1
There's your evaluation data!
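One way to keep the evaluation documents separate from the training documents is a fixed, seeded split over document names. The sketch below is only an illustration under assumptions: `split_stems`, the 20% evaluation fraction, and the seed are all arbitrary choices, not anything Grobid prescribes.

```python
import random


def split_stems(stems, eval_fraction=0.2, seed=0):
    """Split document names into (training, evaluation) lists.

    Hypothetical helper. The same names and seed always give the same
    split, and no name ends up in both lists.
    """
    ordered = sorted(stems)  # sort first so input order does not matter
    random.Random(seed).shuffle(ordered)
    # Hold out at least one document whenever there is anything to split.
    n_eval = max(1, round(len(ordered) * eval_fraction)) if ordered else 0
    return sorted(ordered[n_eval:]), sorted(ordered[:n_eval])
```

The training names are the documents to feed through the retraining steps, and the evaluation names are the ones to copy into the evaluation directory.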