linebreaker

script for transforming TEI into training data for Grobid


Using linebreaker to generate training data for Grobid from gold standard corpora

Grobid is well documented, but retraining and evaluating it on existing gold-standard corpora isn't straightforward: most of the docs assume that you're manually correcting training data that Grobid generated automatically. Currently, our process for generating retraining data requires a bash environment, Python, and Java, and looks like this:

  1. If you have Word documents, convert them to PDF; Grobid uses PDF input. $ unoconv -f pdf *.docx

  2. Run Grobid's createTraining step to generate documents that end in .fulltext or .header; these are used for training input. It also generates under-tagged XML, which we'll be replacing. From Grobid's top-level directory, run java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.5.2-onejar.jar -gH grobid-home -dIn /path/to/pdfs -dOut /path/to/output/directory -exe createTraining

  3. If your gold-standard corpora are in JATS rather than TEI format, convert them to a TEI flavour that Grobid likes by running Pub2TEI. Unfortunately it's XSLT 2, so you'll need Saxon; clone Pub2TEI and call Saxon with Publishers.xsl, like transform -s:/path/to/JATS/ -xsl:Pub2TEI/Stylesheets/Publishers.xsl -o:/path/to/output/directory

  4. Generate pdftotext output with -raw to match Grobid's preferred linebreak patterning. $ for x in *.pdf; do pdftotext -raw "$x"; done

  5. Run linebreaker.py with your gold-standard TEI as the first argument and the pdftotext -raw output as the second. This inserts <lb> elements, which signal linebreaks to Grobid's parser, into the known-good target XML. If the XML and text files are in the same directory, you can do $ for x in *.xml; do python linebreaker.py "$x" "${x%.xml}.txt"; done. Note that linebreaker currently handles body text only; adapting it for headers would be welcome! The output should replace the XML generated by the createTraining step.

  6. You are now ready to retrain Grobid! Performance is unfortunately quite slow right now -- Wapiti is theoretically multithreaded but Grobid will execute it with a single thread, and even after making tweaks to its config, it will generally only process 5-10 documents in an hour. You can retrain with java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-0.5.2-onejar.jar 0 fulltext -gH grobid-home
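To make the idea behind step 5 concrete, here is a minimal sketch of aligning pdftotext -raw lines against gold-standard text and marking each line end. It is not the algorithm linebreaker.py actually uses: the function name insert_linebreaks, the greedy token matching, and the choice to work on a plain string rather than on TEI nodes are all assumptions made for illustration. It does not handle words hyphenated across line ends.

```python
def insert_linebreaks(gold_text, raw_lines, marker="<lb/>"):
    """Greedily align raw text lines to gold text and mark each line end.

    Hypothetical helper for illustration only; not taken from linebreaker.py.
    gold_text: body text extracted from the gold-standard XML.
    raw_lines: lines of the matching `pdftotext -raw` output.
    """
    gold_tokens = gold_text.split()
    out = []
    pos = 0  # index of the first gold token not yet emitted

    for line in raw_lines:
        line_tokens = line.split()
        if not line_tokens:
            continue  # blank line in the raw text
        n = len(line_tokens)

        # Find the first place at or after `pos` where the whole line matches.
        start = None
        for i in range(pos, len(gold_tokens) - n + 1):
            if gold_tokens[i:i + n] == line_tokens:
                start = i
                break
        if start is None:
            continue  # line absent from the gold text, e.g. a running header

        # Emit everything up to the end of the matched line, then the marker.
        out.extend(gold_tokens[pos:start + n])
        out.append(marker)
        pos = start + n

    out.extend(gold_tokens[pos:])  # trailing gold text with no matching line
    return " ".join(out)


if __name__ == "__main__":
    gold = "Grobid extracts structure from scholarly PDFs using CRF models."
    raw = ["Grobid extracts structure", "from scholarly PDFs using", "CRF models."]
    print(insert_linebreaks(gold, raw))
```

Raw lines that never appear in the gold text, such as page numbers, are skipped rather than forced into the output, so a mismatch never corrupts the known-good content.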

Evaluating a trained Grobid model

Models generated above are placed automatically into the correct subdirectory of a Grobid build and are picked up from there without further setup. If you want to evaluate their performance on a large corpus, the easiest way to do so is with Grobid's Pub2TEI end-to-end evaluation method. Per the Grobid documentation, create a directory like:

├── article1
│   ├── article1.pdf
│   ├── article1.pub2tei.tei.xml
│   └── article1.nxml
│
├── article2
│   ├── article2.pdf
│   ├── article2.pub2tei.tei.xml
│   └── article2.nxml
...

You should already have the needed files from earlier, but evaluate on a set that differs from the one used for training, ideally documents the model has not seen. From Grobid's top-level directory, run ./gradlew Pub2TeiEval -Pp2t=/path/to/this/directory -Prun=1. There's your evaluation data!
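If your PDFs, TEI files, and NXML files currently sit in three flat directories, the one-subdirectory-per-article layout can be assembled with a sketch like the following. The function name build_eval_layout is hypothetical, and the code assumes that the TEI files are already named <stem>.pub2tei.tei.xml and that all three files for an article share the same stem; rename your Pub2TEI output first if it doesn't follow that pattern.

```python
import shutil
from pathlib import Path


def build_eval_layout(pdf_dir, tei_dir, nxml_dir, out_dir):
    """Copy complete PDF/TEI/NXML triples into one subdirectory per article.

    Hypothetical helper for illustration. Assumes files are named
    <stem>.pdf, <stem>.pub2tei.tei.xml and <stem>.nxml.
    Returns the list of article stems that were copied.
    """
    pdf_dir, tei_dir, nxml_dir, out_dir = (
        Path(p) for p in (pdf_dir, tei_dir, nxml_dir, out_dir)
    )
    built = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        stem = pdf.stem
        tei = tei_dir / f"{stem}.pub2tei.tei.xml"
        nxml = nxml_dir / f"{stem}.nxml"
        if not (tei.exists() and nxml.exists()):
            continue  # skip articles that lack one of the three files

        target = out_dir / stem
        target.mkdir(parents=True, exist_ok=True)
        for src in (pdf, tei, nxml):
            shutil.copy2(src, target / src.name)
        built.append(stem)
    return built
```

Calling build_eval_layout("pdfs", "tei", "nxml", "eval") would then give you a directory to pass as the -Pp2t argument. Articles with a missing file are skipped, so the returned list tells you which ones actually made it into the evaluation set.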