nltk/nltk_data

Fix WordNet 3.0 gloss inconsistencies

genericallyterrible opened this issue · 5 comments

@fcbond, @stevenbird There are several consistency issues in the gloss portions of WordNet 3.0 that make parsing difficult. Would it be possible for us to fix these issues manually without breaking word associations, as is happening with the problems currently facing the update to WordNet 3.1?

Would it be possible for us to manually fix these issues without breaking word associations [...]

Replying specifically to this: it is very difficult to alter WNDB data without breaking things, because the synset IDs are byte offsets into the file, so any modified gloss must occupy the same number of bytes as before. Secondly, we're not allowed to change the Princeton WordNet data and still call it Princeton WordNet (it would have to be called the "NLTK Wordnet of English" or something similar).
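To make the byte-offset constraint concrete, here is a toy illustration (not real WNDB parsing; the record strings below are made up): each record's "ID" is the byte offset at which it starts, so any edit that changes a record's byte length shifts the offset of every record after it, silently invalidating all existing IDs.

```python
def offsets(records):
    """Byte offset of each record when they are concatenated with newlines."""
    out, pos = [], 0
    for rec in records:
        out.append(pos)
        pos += len(rec.encode("utf-8")) + 1  # +1 for the trailing newline
    return out

original = [
    "00001740 | first gloss",
    "00002098 | second gloss",
]
edited = [
    "00001740 | first gloss, now longer",  # gloss grew by 12 bytes
    "00002098 | second gloss",
]

print(offsets(original))
print(offsets(edited))  # the second record has moved; its old byte-offset ID is stale
```

Any in-place gloss fix therefore has to be byte-for-byte length-preserving, which rules out most meaningful corrections.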

the problems currently facing the update to WordNet 3.1?

That issue was closed 2 years ago, which suggests to me that there are no plans to add WordNet 3.1 to the NLTK. There was an attempt at adding next-generation wordnet support to, or alongside, the NLTK (see https://github.com/nltk/wordnet), and it included WordNet 3.1 data as an option. Development stalled, however, so I took over the effort (and package name on PyPI) with an entirely new module, which Francis has linked above.

@goodmami, thanks for the update. This sounds like a more sustainable option. How easily could a user of the NLTK wordnet package port their code to use your package? Does it include the similarity metrics?

Thanks, @fcbond!

@stevenbird, Wn has the similarity metrics, information content (it even reads the wordnet_ic files from nltk_data), Morphy, etc. Some features that are absent but may be desired are lookup by sense key (e.g., eat%2:34:02::; a workaround exists) and lookup by the NLTK's shorthand synset identifiers (e.g., feed.v.06). If you wish to discuss a plan for deprecating the NLTK's wordnet module in favor of Wn, we should open separate issues to track the necessary changes to the code, data, documentation, and book.
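For porting code that uses the NLTK's shorthand identifiers, one could bridge the gap with a small (hypothetical) helper that splits a name like feed.v.06 into its parts; note that the sense numbering reflects the NLTK's ordering, which a Wn lexicon is not guaranteed to reproduce, so this is only a sketch:

```python
def parse_nltk_synset_name(name):
    """Split an NLTK shorthand synset identifier like 'feed.v.06' into
    (lemma, pos, zero-based sense index).

    rsplit is used so lemmas that themselves contain dots still parse.
    """
    lemma, pos, num = name.rsplit(".", 2)
    return lemma, pos, int(num) - 1  # the shorthand counts senses from 1

# Hypothetical bridge to Wn (assumes an installed lexicon; sense ordering
# may differ from the NLTK's, so treat this as illustrative only):
# import wn
# lemma, pos, i = parse_nltk_synset_name("feed.v.06")
# synset = wn.synsets(lemma, pos=pos)[i]
```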

Back to the current issue: in the modern WN-LMF format for wordnets, Definition and Example elements are structurally separate, having been split out of WNDB's combined "gloss" line during format conversion. That conversion, however, may not account for the inconsistencies noted by @genericallyterrible, who did a nice and thorough analysis in nltk/nltk#2527. So as not to let that effort go to waste, it would be worth comparing it against the WNDB-to-LMF converter. The relevant code is here.
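The split being discussed can be sketched as follows. This assumes the common gloss convention (definition first, then double-quoted examples separated by "; "); real glosses contain exactly the edge cases this issue is about, so this is only the happy path, not the converter's actual logic:

```python
import re

def split_gloss(gloss):
    """Split a WNDB-style gloss into (definition, [examples]).

    Assumes examples are double-quoted segments following the
    definition, separated by '; '. Glosses that deviate from this
    convention (the inconsistencies reported here) will not split
    cleanly.
    """
    parts = [p.strip() for p in gloss.split("; ")]
    definition_parts, examples = [], []
    for part in parts:
        m = re.fullmatch(r'"(.*)"', part)
        if m and definition_parts:
            examples.append(m.group(1))  # a quoted example sentence
        else:
            definition_parts.append(part)  # still inside the definition
    return "; ".join(definition_parts), examples
```

Running the analysis from nltk/nltk#2527 against a splitter like this would surface exactly the glosses where the two disagree.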