INL/BlackLab

Support for CoNLL-U format

Closed this issue · 3 comments

(Requested by @JessedeDoes)
Expand TSV input type to be able to deal with the CoNLL-U format.

The format is basically TSV with two special features (points 2 and 3):

  1. Word lines containing the annotation of a word/token in 10 fields separated by single tab characters; see below.
  2. Blank lines marking sentence boundaries.
  3. Comment lines starting with hash (#).
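To make the three line types concrete, here is a minimal Python sketch that classifies a single CoNLL-U line. The ten column names come from the CoNLL-U specification; the function name classify_line and its return shape are made up for illustration and are not part of BlackLab.

```python
# Column names as defined by the CoNLL-U specification.
CONLLU_FIELDS = (
    "ID", "FORM", "LEMMA", "UPOS", "XPOS",
    "FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
)


def classify_line(line):
    """Classify one line of CoNLL-U input (hypothetical helper).

    Returns a (kind, payload) tuple where kind is "blank", "comment" or "word".
    """
    line = line.rstrip("\r\n")
    if not line.strip():
        # Blank line: marks a sentence boundary.
        return ("blank", None)
    if line.startswith("#"):
        # Comment line: keep the text after the hash.
        return ("comment", line[1:].strip())
    values = line.split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(
            "expected %d tab-separated fields, got %d"
            % (len(CONLLU_FIELDS), len(values))
        )
    # Word line: map the ten fields to their column names.
    return ("word", dict(zip(CONLLU_FIELDS, values)))
```

For example, classify_line("1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_") yields a "word" tuple whose payload maps FORM to "Dogs" and LEMMA to "dog".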

So we should probably add two options, e.g. blankLinesMarkSentenceBoundaries (default: false) and commentLineCharacter (any line whose first character is this character is skipped; default: none).
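The intended behaviour of the two proposed options could be sketched roughly as follows. This is Python pseudocode-made-runnable, not BlackLab code: the function read_tabular and its parameters are hypothetical stand-ins for the proposed options, and skipping blank lines when the first option is off is an assumption.

```python
def read_tabular(lines,
                 blank_lines_mark_sentence_boundaries=False,
                 comment_line_character=None):
    """Yield groups of rows from tab-separated input (hypothetical sketch).

    Each group is a list of rows; each row is a list of field strings.
    With the defaults, all rows end up in a single group.
    """
    group = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Skip comment lines, but only if a comment character is configured.
        if comment_line_character and line.startswith(comment_line_character):
            continue
        if not line.strip():
            # Assumption: blank lines never become rows; they only end the
            # current sentence when the option is enabled.
            if blank_lines_mark_sentence_boundaries and group:
                yield group
                group = []
            continue
        group.append(line.split("\t"))
    if group:
        yield group


sample = [
    "# text = Hi there",
    "1\tHi\thi",
    "2\tthere\tthere",
    "",
    "1\tBye\tbye",
]
sentences = list(read_tabular(sample, True, "#"))
```

With both options set as above, sentences holds two groups (two rows, then one row). With the defaults, the comment line is treated as an ordinary row and everything stays in one group, which matches the current TSV behaviour as described in this issue.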

For those wishing to contribute: DocIndexerTabular is the class that handles tabular formats like TSV and CSV. The two options could be added to the fileTypeOptions that can be specified in a .blf.yaml format definition file (see here).

Perhaps good to know: we developed https://bitbucket.org/fryske-akademy/taaldatabanken/src/master/udpipe-tdb/teitagger/. It converts CoNLL-U to TEI (using our namespace for linguistic attributes, which will be released soon).

I may contribute to this issue because of this idea (though I think going via TEI/XPath 3 is more powerful):
[image]

Sounds interesting! @JessedeDoes (who originally asked about this format), did you see this?