In CoNLL-X output format, some lines are broken into multiple lines
strubell opened this issue · 4 comments
strubell commented
I ran the command-line tool to extract basic Stanford dependencies and found that, in numerous files, some output lines are broken into multiple lines.
For example, running
edu.jhu.agiga.AgigaPrinter basic-deps /path/to/LDC2012T21/data/xml/afp_eng_199512.xml.gz
splits the first line containing the token GOLDEN
into the following three lines:
1 GOLDEN GOLDEN NNP
NNP
_ 2 nn _ _
strubell commented
I see. It's likely because the original data file has a stray newline after the POS tag, inside the POS element:
<token id="1">
<word>GOLDEN</word>
<lemma>GOLDEN</lemma>
<CharacterOffsetBegin>383</CharacterOffsetBegin>
<CharacterOffsetEnd>389</CharacterOffsetEnd>
<POS>NNP
</POS>
<NER>O</NER>
</token>
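To illustrate the suspected cause, here is a minimal Python sketch (not the AgigaPrinter code, which is Java). It parses the token above and joins its fields with tabs, with and without stripping whitespace. The helper name `conll_fields`, the `strip` flag, and the ten-column layout are assumptions inferred from the output shown earlier, and the head `2` and relation `nn` are copied from that output rather than read from the XML.

```python
import xml.etree.ElementTree as ET

# The token from the data file, including the stray newline inside <POS>.
TOKEN_XML = """<token id="1">
<word>GOLDEN</word>
<lemma>GOLDEN</lemma>
<CharacterOffsetBegin>383</CharacterOffsetBegin>
<CharacterOffsetEnd>389</CharacterOffsetEnd>
<POS>NNP
</POS>
<NER>O</NER>
</token>"""


def conll_fields(token, head, deprel, strip=True):
    """Build ten tab-separated columns for one token (hypothetical helper).

    Column order is assumed from the output in this issue:
    id, form, lemma, coarse POS, POS, feats, head, deprel, phead, pdeprel.
    """
    def text(tag):
        value = token.findtext(tag, default="_")
        # Stripping removes the trailing newline that splits the output line.
        return value.strip() if strip else value

    pos = text("POS")
    return [token.get("id"), text("word"), text("lemma"),
            pos, pos, "_", str(head), deprel, "_", "_"]


token = ET.fromstring(TOKEN_XML)

# Without stripping, the newline in the POS text ends up inside the row.
broken = "\t".join(conll_fields(token, 2, "nn", strip=False))

# With stripping, the row stays on a single line.
fixed = "\t".join(conll_fields(token, 2, "nn"))

print(len(broken.splitlines()))
print(fixed)
```

Because the POS text is written twice, the unstripped row carries two embedded newlines, which would account for one token being spread over three lines. Trimming each field where it is read from the XML seems like the most direct fix, though a post-processing pass over the output would also work.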
mgormley commented
Thanks for catching this!