dumitrescustefan/RoWordNet

Incorrect notion of a synset

SoimulPatriei opened this issue · 7 comments

This API does not implement the correct notion of a synset.

wn.synsets('sacru')
['ENG30-06430385-n', 'ENG30-06429590-n', 'ENG30-02055062-a', 'ENG30-02172518-n', 'ENG30-11717399-n', 'ENG30-02054610-a', 'ENG30-02587261-a']
wn.synset('ENG30-06430385-n').literals
['text_sacru', 'text', 'sacru']
but 'text_sacru' is not synonymous with 'sacru'. What this API seems to be doing is splitting the multi-word expressions into their component words and adding those components to the synset.
That this is the case is shown by my next example, where I search for the word 'de' (a preposition in Romanian) and get 3868 synsets.
len(wn.synsets(literal='de'))
3868

Hi, this is actually a matter of searching. Because literals are often multi-word expressions, if we limited the search to strict equality we would miss a lot of synsets (there are synsets that contain only multi-word literals - unless you know exactly what you're searching for, those would always remain undiscoverable).

Maybe we should add an option like wn.synsets('sacru', strict=True) that searches for the full string and not substrings. This way a search for 'de' will yield 0 results.
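To make the distinction concrete, here is a minimal sketch of the two search modes over a synset's literal list (the helper below is purely illustrative, not the library's internals):

def matches(literals, query, strict=False):
    # Illustrative only: strict keeps exact literal matches,
    # relaxed also finds the query inside multi-word literals.
    if strict:
        return query in literals
    return any(query in literal for literal in literals)

literals = ['club_de_carte']  # a synset whose only literal is a multi-word expression
print(matches(literals, 'de'))                           # True: relaxed search finds it
print(matches(literals, 'de', strict=True))              # False: strict search skips it
print(matches(literals, 'club_de_carte', strict=True))   # True: exact literal still found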

Regarding synset ENG30-06430385-n, unfortunately this is an issue with the data in the wordnet, not with the API. As the Romanian wordnet was created by translating the English one, some design choices were made, and very rarely you get this kind of behavior, even though the wordnet was manually curated.

Thank you for the answer. Maybe I did not explain well enough:

  1. First, there is clearly an artificial word expansion done by the API, not by the design of the Romanian Wordnet. Let me give you some examples:

wn.synset('ENG30-08228538-n')
Synset(id='ENG30-08228538-n', literals=['club_de_carte', 'club', 'de', 'carte'], definition='club în care membrii beneficiază de reduceri de prețuri pentru anumite cărți')

So there is a synset ['club_de_carte', 'club', 'de', 'carte'], but this synset is not in the Romanian Wordnet. What you seem to do is split club_de_carte into its components and add them as members of the synset. That is why you get 3868 synsets containing the preposition "de".

Another example:

wn.synset('ENG30-06389109-n')
Synset(id='ENG30-06389109-n', literals=['text_dactilografiat', 'copie_dactilografiată', 'text', 'dactilografiat', 'copie', 'dactilografiată'], definition='Manuscris dactilografiat')
Again, this synset is surely not in the Romanian Wordnet. ['dactilografiat', 'copie', 'dactilografiată'] do not represent a meaning in Romanian; you are splitting the words 'text_dactilografiat' and 'copie_dactilografiată' and adding the parts as members of the synset.

  2. The correct behavior is to return the synsets containing the words requested by the user. I want all the senses of the word "carte", for example, or all the synsets containing "club_de_carte". You might also add an option allowing the user to search the members of a synset by regular expression (e.g. give me all the synsets containing "club"), but that is a different issue; see the sketch below for what I mean.
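A regex-based member search could be layered on top of the public calls already used in this thread. A rough sketch, assuming the rowordnet.RoWordNet() constructor from the README and that wn.synsets() with no arguments returns all synset ids (both assumptions, please correct me if wrong):

import re
import rowordnet

wn = rowordnet.RoWordNet()  # assumed constructor, as in the README

def synsets_matching(wn, pattern):
    # Return ids of synsets that have at least one literal matching the regex.
    regex = re.compile(pattern)
    return [sid for sid in wn.synsets()  # assumed: no-arg call lists all synset ids
            if any(regex.search(lit) for lit in wn.synset(sid).literals)]

# e.g. all synsets with a literal containing 'club' as a whole component word
club_ids = synsets_matching(wn, r'(^|_)club(_|$)')
print(len(club_ids))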

Investigated, you are correct, this is a bug. Will return with a fix.

Please check pip install rowordnet==1.0.0. I have implemented the strict=True option in the synset search.
So your search becomes wn.synsets('sacru', strict=True) if you want exact matches, or wn.synsets('sacru') (the default is False, to be backwards compatible) if you want all synsets having sacru as a subword.
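For reference, a quick check of the two modes against the 'de' example from the top of the thread (expected counts per the discussion above; the exact numbers depend on the wordnet version, and the constructor call is assumed from the README):

import rowordnet

wn = rowordnet.RoWordNet()

# default (relaxed) search: 'de' matches inside multi-word literals -> 3868 synsets above
print(len(wn.synsets(literal='de')))

# strict search: only synsets where 'de' is itself a literal -> expected 0
print(len(wn.synsets(literal='de', strict=True)))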

The bug with the split parts of the literals should be fixed now. Please confirm.

Thanks! That was quick... I confirm the bugs have been fixed. There is a guy you might know, Andrei-Marius Avram, who implemented semantic similarity measures for the Romanian Wordnet using your API. I'm using these measures in a project. But although there is a "Demo on similarity metrics available as a Jupyter notebook" on the main page, the measures cannot be accessed through the API. Thanks again!

Hi, I think that is me. I implemented the similarity measures, but I did not upgrade the version of the API on PyPI, so they were not available unless you installed directly from GitHub.

The similarity measures should be available now in the new version (1.0.0).

Thanks,
They are working now. I'm writing the code for my project using the similarity measures. If I notice anything, I will let you know.