Support for CoNLL-U format
Closed this issue · 3 comments
(Requested by @JessedeDoes)
Expand TSV input type to be able to deal with the CoNLL-U format.
The format is basically TSV with two special features (points 2 and 3 below):
- Word lines containing the annotation of a word/token in 10 fields separated by single tab characters; see below.
- Blank lines marking sentence boundaries.
- Comment lines starting with hash (#).
So we should probably add two options, e.g.:
- blankLinesMarkSentenceBoundaries (default: false): treat each blank line as a sentence boundary.
- commentLineCharacter (default: none): if a line starts with this character, skip that line.
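To make the intended behaviour of the two proposed options concrete, here is a minimal sketch in Python. It is not BlackLab code: the function `read_tabular` and its parameters are hypothetical and only mirror the option names suggested above. The column names come from the CoNLL-U specification.

```python
from typing import Iterable, Iterator, List, Optional

# The 10 tab-separated columns of a CoNLL-U word line.
CONLLU_FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                 "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def read_tabular(
    lines: Iterable[str],
    blank_lines_mark_sentence_boundaries: bool = False,  # hypothetical option
    comment_line_character: Optional[str] = None,        # hypothetical option
) -> Iterator[List[List[str]]]:
    """Yield sentences, each a list of rows (a row is a list of field values)."""
    sentence: List[List[str]] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Skip comment lines only when a comment character is configured.
        if comment_line_character and line.startswith(comment_line_character):
            continue
        if line.strip() == "":
            # A blank line closes the current sentence if the option is on;
            # otherwise it is simply ignored.
            if blank_lines_mark_sentence_boundaries and sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(line.split("\t"))
    if sentence:
        yield sentence


sample = [
    "# text = Hello world\n",
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_\n",
    "2\tworld\tworld\tNOUN\t_\t_\t1\tvocative\t_\t_\n",
    "\n",
    "1\tBye\tbye\tINTJ\t_\t_\t0\troot\t_\t_\n",
]

for sent in read_tabular(sample, True, "#"):
    # Pair each value with its CoNLL-U column name.
    print([dict(zip(CONLLU_FIELDS, row))["FORM"] for row in sent])
```

With both options left at their defaults, the same input comes back as a single block of rows, and the comment line is treated as ordinary data. That is why both options would need to be set for CoNLL-U input.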
For those wishing to contribute: DocIndexerTabular is the class that handles tabular formats like TSV and CSV. The two options could be added to the fileTypeOptions that can be specified in a .blf.yaml format definition file (see here).
It may be good to know that we developed https://bitbucket.org/fryske-akademy/taaldatabanken/src/master/udpipe-tdb/teitagger/. It converts CoNLL-U to TEI (using our namespace for linguistic attributes, which will be released soon).
I may contribute to this issue because of this idea (though I think going via TEI/XPath 3 is more powerful):
Sounds interesting! @JessedeDoes (who originally asked about this format), did you see this?