kuhumcst/DanNet

Add Supersenses to CoNLL-U file

Closed this issue · 4 comments

(the final part of #141, split out into its own task)

Also: need to write a bit of documentation about how this result was achieved.

One issue I have run into is that since I have corrected some senses that were appearing in multiple synsets by splitting them, these no longer resolve using the old sense IDs, e.g. Aserbajdsjan is both a country and the people in the country.

The only way to resolve this is to compare the definition too.
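
A minimal sketch of that kind of lookup, assuming each candidate sense carries its definition as plain text. The record layout, the IDs, the definitions, and the `resolve_sense` helper are all hypothetical and not taken from the DanNet code base:

```python
from typing import Optional

# Hypothetical data: one old sense ID that now points at several synsets
# after the split. IDs and definitions are made up for illustration.
CANDIDATES = {
    "sense-1": [
        {"synset": "synset-country", "definition": "a country in the Caucasus"},
        {"synset": "synset-people", "definition": "the people of that country"},
    ],
}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(text.lower().split())


def resolve_sense(sense_id: str, definition: str) -> Optional[str]:
    """Return the single synset whose definition matches, else None."""
    wanted = normalize(definition)
    matches = [
        candidate["synset"]
        for candidate in CANDIDATES.get(sense_id, [])
        if normalize(candidate["definition"]) == wanted
    ]
    # Refuse to guess when nothing matches or when the match is ambiguous.
    return matches[0] if len(matches) == 1 else None
```

Returning `None` instead of the first candidate keeps unresolved cases visible so they can be checked by hand.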

Another issue: many synsets do not have supersenses assigned since the mapping only had e.g. a noun supersense, while the group of synsets also included verbs. In such cases no supersenses can be assigned. In the Elexis dataset, this amounts to ~700 synsets.
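
To illustrate why such synsets end up without a supersense, here is a rough sketch. It assumes supersense labels are prefixed with their part of speech (as in `noun.location`) and that the mapping is keyed by a group of synsets; the group name, synset IDs, and field names are hypothetical:

```python
from typing import Optional

# Hypothetical mapping: each group of synsets has one supersense label.
GROUP_SUPERSENSE = {"group-1": "noun.location"}

# Hypothetical synsets; the group mixes nouns and verbs.
SYNSETS = [
    {"id": "synset-a", "group": "group-1", "pos": "noun"},
    {"id": "synset-b", "group": "group-1", "pos": "verb"},
]


def assign_supersense(synset: dict) -> Optional[str]:
    """Return the group's supersense only if its part of speech fits the synset."""
    supersense = GROUP_SUPERSENSE.get(synset["group"])
    if supersense is None:
        return None
    # Assumption: the part of speech is the label's prefix before the dot.
    pos_prefix = supersense.split(".", 1)[0]
    return supersense if pos_prefix == synset["pos"] else None


# Synsets that the mapping cannot cover and that need separate handling.
unassigned = [s["id"] for s in SYNSETS if assign_supersense(s) is None]
```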

These remaining synsets have now been added in bca42f8, apart from 30 synsets which have sense IDs but do not exist in the DanNet dataset and whose descriptions are all {hyponymOf someLabel}. Since they reference their hypernyms only by label rather than by ID, mapping them programmatically is no easy task and should probably be done manually.

Talked to Bolette. The remaining missing supersenses should not be added directly to the index; rather, a list should be produced based on what actually appears in the Elexis corpus. So the next task is to run through this corpus, collect every ID in use, and compare those to the sense IDs in DanNet.
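
That comparison could look roughly like the sketch below. CoNLL-U token lines have ten tab-separated columns with MISC last, but storing the sense ID there under a `SenseID` key is purely an assumption; the sample lines, the IDs, and the `collect_sense_ids` helper are hypothetical as well:

```python
from typing import Iterable, Set


def collect_sense_ids(lines: Iterable[str], key: str = "SenseID") -> Set[str]:
    """Collect every value of `key` found in the MISC column of CoNLL-U token lines."""
    ids: Set[str] = set()
    for line in lines:
        line = line.rstrip("\n")
        # Skip sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 10:
            continue  # not a well-formed token line
        misc = columns[9]
        if misc == "_":
            continue  # empty MISC column
        for pair in misc.split("|"):
            name, _, value = pair.partition("=")
            if name == key and value:
                ids.add(value)
    return ids


# Hypothetical corpus excerpt; a real run would iterate over the corpus file.
SAMPLE = [
    "# sent_id = 1",
    "1\tAserbajdsjan\tAserbajdsjan\tPROPN\t_\t_\t0\troot\t_\tSenseID=s1",
    "2\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_",
    "",
    "# sent_id = 2",
    "1\tAserbajdsjan\tAserbajdsjan\tPROPN\t_\t_\t0\troot\t_\tSenseID=s2|SpaceAfter=No",
]

used_ids = collect_sense_ids(SAMPLE)
dannet_ids = {"s1"}  # hypothetical: sense IDs known to exist in DanNet
missing = sorted(used_ids - dannet_ids)  # IDs in the corpus but not in DanNet
```

A set difference keeps the resulting list limited to IDs that actually occur in the corpus, which is the point of building the list from the corpus rather than from the index.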