TabbenBenchmark/tabben

Words with multiple lemmas in `duolingo-categorical` are parsed incorrectly

Currently, entries in the `duolingo-categorical` dataset that have 2 or more lemmas are parsed incorrectly from the original data. For example,

`des/de<pr>+le<det><def><mf><pl>`

breaks down into the surface form `des` with 2 lemmas, `de` (part of speech `pr`) and `le` (part of speech `det`), plus the additional modifiers `def`, `mf`, and `pl`.
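To make the intended breakdown concrete, here is a minimal sketch of a parser for this token format. It is not the tabben implementation; `parse_token` is a hypothetical helper, and it assumes that `/` separates the surface form from the analysis, that `+` separates lemmas, and that the modifiers following a lemma's part of speech belong to that lemma.

```python
import re

def parse_token(token):
    """Split a token like 'des/de<pr>+le<det><def><mf><pl>' into its parts.

    Hypothetical helper: assumes '+' never occurs inside a lemma or tag.
    """
    surface, _, analysis = token.partition("/")
    lemmas = []
    for chunk in analysis.split("+"):
        # Everything before the first '<' is the lemma itself.
        lemma = chunk.split("<", 1)[0]
        # The first tag is taken as the part of speech, the rest as modifiers.
        tags = re.findall(r"<([^>]+)>", chunk)
        lemmas.append({
            "lemma": lemma,
            "part_of_speech": tags[0] if tags else None,
            "modifiers": tags[1:],
        })
    return surface, lemmas

surface, lemmas = parse_token("des/de<pr>+le<det><def><mf><pl>")
```

For the example above, this gives the surface form `des` and two lemma entries, `de` with part of speech `pr` and `le` with part of speech `det` and modifiers `def`, `mf`, `pl`.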

Right now, this would be parsed as:

| surface_form | lemma | part_of_speech | def | mf | pl | det |
| --- | --- | --- | --- | --- | --- | --- |
| des | de | pr | 1 | 1 | 1 | 1 |

where only the first lemma and part of speech are parsed correctly. The second lemma (`le`) is ignored and its part of speech (`det`) is incorporated as another modifier/tag.
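One way this result can arise is sketched below. This is a guess at the failure mode for illustration only, not the actual parsing code in tabben; `naive_parse` is a hypothetical name. It reads the lemma up to the first `<` and then collects every tag in the analysis without looking at `+`.

```python
import re

def naive_parse(token):
    """Hypothetical parser that ignores '+' and so mishandles multiple lemmas."""
    surface, _, analysis = token.partition("/")
    # Only the text before the very first '<' is kept as the lemma.
    lemma = analysis.split("<", 1)[0]
    # All tags are collected across the whole analysis, including later lemmas.
    tags = re.findall(r"<([^>]+)>", analysis)
    row = {
        "surface_form": surface,
        "lemma": lemma,
        "part_of_speech": tags[0] if tags else None,
    }
    # Every remaining tag becomes a one-hot modifier column.
    row.update({tag: 1 for tag in tags[1:]})
    return row

row = naive_parse("des/de<pr>+le<det><def><mf><pl>")
```

Here `row` has `lemma` set to `de`, `part_of_speech` set to `pr`, and `det`, `def`, `mf`, `pl` all set to 1, with no trace of `le`. A check like this could double as a regression test once the parsing is changed.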

Possible solutions:

  1. replicate each of these entries and treat the lemmas separately (with other attributes the same)
  2. add other attributes for "second lemma" and "second part of speech"
  3. something else?
  4. nothing? maybe this isn't a big deal since this is supposed to be a naïve categorical mapping, and most of the affected words are stop words?
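For comparison, options 1 and 2 could look roughly like the sketch below. All names here (`split_analysis`, `rows_replicated`, `row_with_second_lemma`, and the `lemma_2` / `part_of_speech_2` columns) are hypothetical, and rows are plain dicts rather than whatever structure the dataset builder actually uses.

```python
import re

def split_analysis(token):
    """Return (surface, [(lemma, part_of_speech, modifiers), ...])."""
    surface, _, analysis = token.partition("/")
    parts = []
    for chunk in analysis.split("+"):
        tags = re.findall(r"<([^>]+)>", chunk)
        parts.append((chunk.split("<", 1)[0], tags[0] if tags else None, tags[1:]))
    return surface, parts

def rows_replicated(token):
    """Option 1: one row per lemma, with the other attributes the same."""
    surface, parts = split_analysis(token)
    shared = {m: 1 for _, _, mods in parts for m in mods}
    return [
        {"surface_form": surface, "lemma": lemma, "part_of_speech": pos, **shared}
        for lemma, pos, _ in parts
    ]

def row_with_second_lemma(token):
    """Option 2: one row with extra columns for the second lemma.

    Lemmas beyond the second are dropped here; their modifiers are kept.
    """
    surface, parts = split_analysis(token)
    lemma1, pos1, _ = parts[0]
    lemma2, pos2, _ = parts[1] if len(parts) > 1 else (None, None, [])
    row = {
        "surface_form": surface,
        "lemma": lemma1,
        "part_of_speech": pos1,
        "lemma_2": lemma2,
        "part_of_speech_2": pos2,
    }
    row.update({m: 1 for _, _, mods in parts for m in mods})
    return row
```

Option 1 keeps the existing columns but changes the row count (the `des` example becomes two rows), while option 2 keeps one row per entry but adds columns that are empty for most words and still loses anything past a second lemma.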