In ConLLX output format, some lines are broken into multiple

Question

strubell opened this issue 9 years ago · 4 comments

I ran the command-line tool to extract basic Stanford dependencies and found that, in numerous files, some output lines are broken into multiple lines.

For example, running
edu.jhu.agiga.AgigaPrinter basic-deps /path/to/LDC2012T21/data/xml/afp_eng_199512.xml.gz
splits the first line containing the token GOLDEN into the following three lines:

1       GOLDEN  GOLDEN  NNP
        NNP
        _       2       nn      _       _
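A quick way to find affected files is to count the fields on each output line. The sketch below is only an illustration, not part of agiga: it assumes the output is tab-separated with the 10 standard CoNLL-X columns and that blank lines separate sentences. The function name `find_broken_lines` and the `sample` string (a reconstruction of the three lines above, assuming the column gaps are tabs) are hypothetical.

```python
# Assumption: each token line has the 10 tab-separated CoNLL-X columns,
# and sentences are separated by blank lines.
EXPECTED_FIELDS = 10


def find_broken_lines(text):
    """Return (line_number, field_count) for every non-blank line
    that does not have exactly EXPECTED_FIELDS tab-separated fields."""
    broken = []
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # blank line: sentence boundary, nothing to check
        count = len(line.split("\t"))
        if count != EXPECTED_FIELDS:
            broken.append((number, count))
    return broken


# Hypothetical reconstruction of the broken output shown above.
sample = "1\tGOLDEN\tGOLDEN\tNNP\n\tNNP\n\t_\t2\tnn\t_\t_\n"
print(find_broken_lines(sample))
```

On a well-formed line the function reports nothing, so a non-empty result for a file suggests it contains split tokens.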

Answer 1 · 2016-03-29T16:55:45.000Z

I see. It's likely because the original data file has a stray newline after the POS tag:

      <token id="1">
        <word>GOLDEN</word>
        <lemma>GOLDEN</lemma>
        <CharacterOffsetBegin>383</CharacterOffsetBegin>
        <CharacterOffsetEnd>389</CharacterOffsetEnd>
        <POS>NNP
</POS>
        <NER>O</NER>
      </token>
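To illustrate the likely cause, here is a minimal Python sketch (not the agiga code, which is Java, and not necessarily what any fix does) showing that an XML parser keeps the newline inside `<POS>` and that stripping whitespace from the field text yields a clean tag. The helper `clean_field` and the shortened `TOKEN_XML` snippet are hypothetical; the tab-separated row is an assumption about the output layout.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the token above, keeping the stray newline inside <POS>.
TOKEN_XML = """<token id="1">
  <word>GOLDEN</word>
  <lemma>GOLDEN</lemma>
  <POS>NNP
</POS>
  <NER>O</NER>
</token>"""


def clean_field(element):
    """Return the element's text with surrounding whitespace removed."""
    if element is None or element.text is None:
        return ""
    return element.text.strip()


token = ET.fromstring(TOKEN_XML)

# The parser preserves the newline, which is what breaks the output line.
raw_pos = token.find("POS").text

# Stripping each field before printing keeps the token on one line.
pos = clean_field(token.find("POS"))
row = "\t".join([
    token.get("id"),
    clean_field(token.find("word")),
    clean_field(token.find("lemma")),
    pos,  # coarse POS column
    pos,  # fine POS column
])

print(repr(raw_pos))
print(repr(row))
```

In Java, the analogous step would be calling `String.trim()` on the tag text before writing it out.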

Answer 2 · 2016-03-29T17:16:45.000Z

I submitted PR #2, which should resolve this issue :)

Answer 3 · 2016-03-30T13:16:50.000Z

Thanks for catching this!

Answer 4 · 2016-03-30T14:17:35.000Z

This issue is resolved, but it led to a follow-up in issue #3.