Words with multiple lemmas in `duolingo-categorical` are parsed incorrectly

Question

Opened this issue 3 years ago · 0 comments

Currently, entries in the `duolingo-categorical` dataset that have two or more lemmas are parsed incorrectly from the original data. For example,

des/de<pr>+le<det><def><mf><pl>

breaks down into the surface form `des` with two lemmas, `de` (part of speech `pr`) and `le` (part of speech `det`), plus the additional modifiers `def`, `mf`, and `pl`.

Right now, this would be parsed as:

surface_form	lemma	part_of_speech	def	mf	pl	det
des	de	pr	1	1	1	1

where only the first lemma and part of speech are parsed correctly. The second lemma is ignored and its part of speech is incorporated as another modifier/tag.
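
For illustration, here is a minimal sketch of a parser that keeps every lemma together with its own part of speech. It assumes each entry has the shape `surface/lemma<tag>...+lemma<tag>...`, that `+` only ever separates lemmas, and that the first tag after a lemma is its part of speech. The name `parse_entry` is hypothetical and not taken from the existing code. The sketch keeps the remaining tags with the lemma they follow; whether they belong to that lemma or to the word as a whole is an open question.

```python
import re

# Assumed shape of one lemma segment: a lemma followed by zero or more <tag> groups.
_SEGMENT = re.compile(r"^(?P<lemma>[^<]*)(?P<tags>(?:<[^<>]+>)*)$")
_TAG = re.compile(r"<([^<>]+)>")


def parse_entry(entry):
    """Split 'surface/lemma<pos><tag>+lemma<pos><tag>' into its parts.

    Hypothetical helper: the first tag after each lemma is assumed to be
    its part of speech, and any further tags are kept as modifiers.
    """
    surface, _, analysis = entry.partition("/")
    lemmas = []
    for segment in analysis.split("+"):
        match = _SEGMENT.match(segment)
        if match is None:
            raise ValueError(f"unrecognised segment: {segment!r}")
        tags = _TAG.findall(match.group("tags"))
        lemmas.append({
            "lemma": match.group("lemma"),
            "part_of_speech": tags[0] if tags else None,
            "modifiers": tags[1:],
        })
    return {"surface_form": surface, "lemmas": lemmas}


parsed = parse_entry("des/de<pr>+le<det><def><mf><pl>")
```

With this approach, `det` is recorded as the part of speech of `le` rather than being flattened into the modifier columns.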

Possible solutions:

- replicate each of these entries and treat the lemmas separately (with other attributes the same)
- add other attributes for "second lemma" and "second part of speech"
- something else?
- nothing? maybe this isn't a big deal since this is supposed to be a naïve categorical mapping, and most of the affected words are stop words?
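
To make the first two options concrete, here is a rough sketch of the rows each would produce for `des`. The function names and the `lemma_2` / `part_of_speech_2` column names are hypothetical, and the input is assumed to be already split into (lemma, part of speech) pairs plus a shared list of modifiers.

```python
def rows_per_lemma(surface, lemma_pos_pairs, modifiers):
    """Option 1: one row per lemma, sharing the surface form and modifiers."""
    flags = {m: 1 for m in modifiers}
    return [
        {"surface_form": surface, "lemma": lemma, "part_of_speech": pos, **flags}
        for lemma, pos in lemma_pos_pairs
    ]


def row_with_extra_columns(surface, lemma_pos_pairs, modifiers):
    """Option 2: one row, with numbered columns for lemmas after the first."""
    row = {"surface_form": surface}
    for i, (lemma, pos) in enumerate(lemma_pos_pairs, start=1):
        suffix = "" if i == 1 else f"_{i}"  # hypothetical column naming
        row[f"lemma{suffix}"] = lemma
        row[f"part_of_speech{suffix}"] = pos
    row.update({m: 1 for m in modifiers})
    return row


pairs = [("de", "pr"), ("le", "det")]
mods = ["def", "mf", "pl"]
option_1 = rows_per_lemma("des", pairs, mods)
option_2 = row_with_extra_columns("des", pairs, mods)
```

Option 1 keeps the column set unchanged but duplicates the surface form across rows; option 2 keeps one row per surface form but adds columns that are empty for most words.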