Words with multiple lemmas in `duolingo-categorical` are parsed incorrectly
tmthyln commented
Currently, entries in the `duolingo-categorical` dataset are parsed incorrectly from the original when they have 2 or more lemmas. For example, `des/de<pr>+le<det><def><mf><pl>` breaks down into the surface form `des`, with 2 lemmas `de` (part of speech is `pr`) and `le` (part of speech is `det`) and additional modifiers `def`, `mf`, and `pl`.
Right now, this would be parsed as:
| surface_form | lemma | part_of_speech | def | mf | pl | det |
|---|---|---|---|---|---|---|
| des | de | pr | 1 | 1 | 1 | 1 |
where only the first lemma and part of speech are parsed correctly. The second lemma is ignored and its part of speech is incorporated as another modifier/tag.
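For reference, here is a rough Python sketch of how an entry like this could be split so that each lemma keeps its own part of speech. This is not the package's actual parser: `parse_entry` is a hypothetical name, and it assumes that `/` separates the surface form from the analysis, that `+` separates lemmas, and that the first tag after each lemma is its part of speech while any remaining tags are modifiers.

```python
import re

# Matches a single tag such as <pr> or <det> and captures its name.
TAG_PATTERN = re.compile(r"<([^<>]+)>")


def parse_entry(entry):
    """Split one raw entry into surface form, (lemma, pos) pairs, and modifiers.

    Hypothetical helper, assuming the layout
    surface/lemma<pos><mod>...+lemma<pos><mod>...
    """
    surface_form, _, analysis = entry.partition("/")
    lemmas = []
    modifiers = []
    for segment in analysis.split("+"):
        # Everything before the first "<" is the lemma itself.
        lemma = segment.split("<", 1)[0]
        tags = TAG_PATTERN.findall(segment)
        # Assumption: the first tag is the part of speech for this lemma.
        part_of_speech = tags[0] if tags else None
        lemmas.append((lemma, part_of_speech))
        # Assumption: all later tags are modifiers, pooled for the whole entry.
        modifiers.extend(tags[1:])
    return {
        "surface_form": surface_form,
        "lemmas": lemmas,
        "modifiers": modifiers,
    }


parsed = parse_entry("des/de<pr>+le<det><def><mf><pl>")
```

With this split, `det` ends up paired with `le` instead of being counted as a modifier. The sketch does not handle surface forms that contain `/` or lemmas that contain `+`.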
Possible solutions:
- replicate each of these entries and treat the lemmas separately (with other attributes the same)
- add other attributes for "second lemma" and "second part of speech"
- something else?
- nothing? maybe this isn't a big deal since this is supposed to be a naïve categorical mapping, and most of the affected words are stop words?
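To make the first two options concrete, here is a small Python sketch of what each output shape might look like for the `des` example. The function names `replicate_rows` and `widen_row` and the `_2` column suffix are hypothetical, chosen only for illustration, and the input is assumed to be already split into (lemma, part of speech) pairs plus a list of modifiers.

```python
def replicate_rows(surface_form, lemmas, modifiers):
    """Option 1: one row per lemma, with the other attributes copied."""
    flags = {modifier: 1 for modifier in modifiers}
    return [
        {
            "surface_form": surface_form,
            "lemma": lemma,
            "part_of_speech": part_of_speech,
            **flags,
        }
        for lemma, part_of_speech in lemmas
    ]


def widen_row(surface_form, lemmas, modifiers):
    """Option 2: one row, with extra columns for the second (and later) lemma."""
    row = {"surface_form": surface_form}
    for index, (lemma, part_of_speech) in enumerate(lemmas, start=1):
        # The first lemma keeps the existing column names; later ones get _2, _3, ...
        suffix = "" if index == 1 else f"_{index}"
        row[f"lemma{suffix}"] = lemma
        row[f"part_of_speech{suffix}"] = part_of_speech
    row.update({modifier: 1 for modifier in modifiers})
    return row


lemmas = [("de", "pr"), ("le", "det")]
modifiers = ["def", "mf", "pl"]

rows = replicate_rows("des", lemmas, modifiers)
wide = widen_row("des", lemmas, modifiers)
```

Option 1 keeps the current column set but means `surface_form` is no longer unique per row; option 2 keeps one row per surface form but adds columns that are empty for most words.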