mgormley/agiga

In CoNLL-X output format, some lines are broken into multiple lines

strubell opened this issue · 4 comments

I ran the command line tool to extract basic Stanford dependencies and found that in numerous files some lines got broken into multiple lines.

For example, running
edu.jhu.agiga.AgigaPrinter basic-deps /path/to/LDC2012T21/data/xml/afp_eng_199512.xml.gz
splits the first line containing the token GOLDEN into the following three lines:

1       GOLDEN  GOLDEN  NNP
        NNP
        _       2       nn      _       _     

I see, it's likely because the original data file has a stray newline after the POS tag, inside the <POS> element:

      <token id="1">
        <word>GOLDEN</word>
        <lemma>GOLDEN</lemma>
        <CharacterOffsetBegin>383</CharacterOffsetBegin>
        <CharacterOffsetEnd>389</CharacterOffsetEnd>
        <POS>NNP
</POS>
        <NER>O</NER>
      </token>
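
One way to guard against a stray newline like this is to trim each field before writing the tab-separated line. The following is only a minimal sketch under that assumption, not the actual agiga code and not necessarily what the fix does; the class `ConllFieldSketch` and the methods `cleanField` and `conllLine` are hypothetical names made up for illustration.

```java
import java.util.StringJoiner;

public class ConllFieldSketch {

    // Hypothetical helper: strips leading/trailing whitespace (including
    // newlines) and collapses any internal whitespace run to a single space,
    // so a field can never contain a line break or a tab.
    static String cleanField(String raw) {
        return raw.trim().replaceAll("\\s+", " ");
    }

    // Hypothetical helper: joins the cleaned fields with tabs, one token per line.
    static String conllLine(String... fields) {
        StringJoiner joiner = new StringJoiner("\t");
        for (String field : fields) {
            joiner.add(cleanField(field));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // "NNP\n" mimics the POS text read from the XML element shown above.
        String line = conllLine("1", "GOLDEN", "GOLDEN", "NNP\n", "NNP\n",
                                "_", "2", "nn", "_", "_");
        System.out.println(line);
    }
}
```

With the fields cleaned this way, the GOLDEN token is written as a single ten-column line instead of three.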

I submitted PR #2 which should resolve this issue :)

Thanks for catching this!

This issue is resolved, but it led to a follow-up in issue #3.