dumitrescustefan/RoWordNet

The Similarity computation is unusable ...

SoimulPatriei opened this issue · 14 comments

As it stands, the similarity computation is not usable. The task I'm trying to accomplish is the following: given around 800 pairs of words, English and Romanian (the Romanian words are translations of the English ones), generate all senses for the words in each pair, take the Cartesian product of the senses, and compute similarity measures between the resulting synsets. The task completes for the English side in under a minute. For Romanian, I started the script on the server yesterday at 19:00, and it hung (apparently stuck in a loop) this morning at 07:00 after computing 263 pairs. I have updated my previous script on Google Drive so you can see the problem. For some synset pairs it takes 5 minutes or more to compute the similarity (sometimes even 30 minutes, and sometimes it hangs). Here are the timings I got:
English Total time: 23.05 seconds#### Total Similarity Computations 765
Romanian Total time: 1300.31 seconds#### Total Similarity Computations 441
So, for English it finished 765 similarity computations in 23 seconds (in fact much less, because initializing the English wordnet takes about 15 seconds), while for Romanian fewer computations took almost 22 minutes.
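
For readers who want to reproduce this kind of measurement, here is a minimal sketch of the benchmark described above: for each word pair, take the Cartesian product of the two sense lists and time every similarity call. The function name, the toy sense inventory, and the toy scores are hypothetical stand-ins; this is not the wordnet_test.py script from the thread, and no real wordnet is loaded.

```python
import itertools
import time


def time_pairwise_similarity(word_pairs, senses_of, similarity):
    """Score every sense combination of each word pair and time each call.

    senses_of(word) returns a list of synset ids (hypothetical callable);
    similarity(id1, id2) returns a score (hypothetical callable).
    """
    results = []
    start = time.perf_counter()
    for left, right in word_pairs:
        # Cartesian product of the senses of the two words in the pair.
        for s1, s2 in itertools.product(senses_of(left), senses_of(right)):
            t0 = time.perf_counter()
            score = similarity(s1, s2)
            results.append((s1, s2, score, time.perf_counter() - t0))
    total = time.perf_counter() - start
    return results, total


# Toy stand-ins so the sketch runs without any wordnet installed.
toy_senses = {"cal": ["syn-a"], "iapa": ["syn-b", "syn-c"]}
toy_scores = {("syn-a", "syn-b"): 0.5, ("syn-a", "syn-c"): 0.25}

results, total = time_pairwise_similarity(
    [("cal", "iapa")],
    lambda word: toy_senses.get(word, []),
    lambda a, b: toy_scores.get((a, b), 0.0),
)
print(f"Total time: {total:.2f} seconds, computations: {len(results)}")
```

Recording the per-call duration alongside the score makes it easy to spot the individual synset pairs that take minutes rather than milliseconds. With a real wordnet object, the similarity argument would presumably be a method such as the wn.path_similarity call that appears later in this thread.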

Hi guys,
I was wondering whether someone is working on speeding up the similarity computation to a reasonable time. Please let me know as soon as possible.
Thanks!

It's on the list. I will have some time late next week. Maybe Andrei can handle it sooner?

Great!

I have been extremely busy lately and haven't had time for any additional tasks. Most likely I will also take a look at the end of next week.

Hi, we have created a new branch, 'optim_similarities', where we optimized the computation of all similarities. We also ran your 'wordnet_test.py' script on this new version, and our API takes a total of ~0.46 seconds for 441 similarity computations. The similarity scores also seem to be okay. Could you test this version from the branch and confirm that everything works as expected?

Sorry for the long update period, we've been very busy the last few weeks. Thank you!

Thank you! I have scheduled a test of the similarity computation for tomorrow. I will let you know.

Hi,
I ran this:
pip install git+https://github.com/dumitrescustefan/RoWordNet.git@optim_similarities
Then I ran my test script, but I'm getting the same slow times as before. Did I do something wrong?

Hi,
That's very weird. Try uninstalling the API first and then reinstalling it from the branch. I believe pip skipped installing from the branch because it has the same version as the non-optimized variant.

Sorry for the late answer!
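
Because both variants reportedly share a version number, the version string alone cannot tell them apart. A rough sketch of one way to check where an installed copy came from, assuming a reasonably recent pip that writes the PEP 610 direct_url.json record for installs from a URL or VCS; the helper name and the distribution name "rowordnet" are assumptions:

```python
import json
from importlib import metadata


def install_origin(package):
    """Return version, source URL and requested revision of an installed package.

    Returns None if the package is not installed. The URL and revision are
    None when the installer left no direct_url.json record (for example,
    when the package was installed from an index rather than from a URL).
    """
    try:
        dist = metadata.distribution(package)
    except metadata.PackageNotFoundError:
        return None
    raw = dist.read_text("direct_url.json")
    if raw is None:
        return {"version": dist.version, "url": None, "revision": None}
    info = json.loads(raw)
    vcs = info.get("vcs_info", {})
    return {
        "version": dist.version,
        "url": info.get("url"),
        "revision": vcs.get("requested_revision"),
    }


# "rowordnet" is the assumed distribution name; prints None if not installed.
print(install_origin("rowordnet"))
```

If the reported revision is not the expected branch, uninstalling and reinstalling, as suggested above, is the straightforward remedy.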

I reinstalled it, and the test now runs very fast. However, there are still bugs. Here is one:

>>> wn.synset('ENG30-03624767-n').literals
['cal']
>>> wn.synset('ENG30-04548613-n').literals
['coșoroabă', 'iapă']
>>> wn.path_similarity('ENG30-03624767-n', 'ENG30-04548613-n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/eduardbarbu/anaconda3/lib/python3.6/site-packages/rowordnet/rowordnet.py", line 792, in path_similarity
    shortest_path_distance = len(self.shortest_path(synset_id1, synset_id2, relations={"hypernym", "hyponym"}))
  File "/Users/eduardbarbu/anaconda3/lib/python3.6/site-packages/rowordnet/rowordnet.py", line 757, in shortest_path
    return nx.shortest_path(self._hypernym_graph, synset_id1, synset_id2)
  File "/Users/eduardbarbu/anaconda3/lib/python3.6/site-packages/networkx/algorithms/shortest_paths/generic.py", line 170, in shortest_path
    paths = nx.bidirectional_shortest_path(G, source, target)
  File "/Users/eduardbarbu/anaconda3/lib/python3.6/site-packages/networkx/algorithms/shortest_paths/unweighted.py", line 223, in bidirectional_shortest_path
    raise nx.NodeNotFound(msg.format(source, target))
networkx.exception.NodeNotFound: Either source ENG30-03624767-n or target ENG30-04548613-n is not in G

I have modified the wordnet_test.py script so you can run it with our main file. If that script runs without errors, then as far as we are concerned there are no bugs.
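
The NodeNotFound error in the traceback above means that at least one of the two synset ids is not a node in the graph being searched. Here is a standard-library sketch of the kind of guard that avoids such a crash, using a plain adjacency dict and breadth-first search as a stand-in for the library's networkx graph. The function name and the toy graph are hypothetical, and this is not the fix the maintainers applied.

```python
from collections import deque


def shortest_path_or_none(graph, source, target):
    """Breadth-first shortest path that returns None instead of raising.

    graph is an adjacency dict {node: [neighbours]} in which every node
    appears as a key (a stand-in for a hypernym/hyponym graph).
    """
    # Guard: a missing endpoint is reported as "no path", not an exception.
    if source not in graph or target not in graph:
        return None
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the predecessor links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None  # both nodes exist, but they are not connected


# Toy undirected hierarchy; "isolated" has no hypernym or hyponym links.
toy_graph = {
    "animal": ["horse", "dog"],
    "horse": ["animal", "mare"],
    "mare": ["horse"],
    "dog": ["animal"],
    "isolated": [],
}

print(shortest_path_or_none(toy_graph, "mare", "dog"))
print(shortest_path_or_none(toy_graph, "mare", "not-in-graph"))
```

A caller can then decide what a missing path means for the similarity score (for example, returning None or 0) instead of letting the exception abort a long batch run.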

Hmm, I will look into it and I will come back as soon as I find a solution. Thanks for the scripts!

I fixed the bug that appeared in your test script, and the computation now takes ~22.5 seconds with our API, compared with ~15 seconds on PWN. Please confirm that everything now works fine for you too. Thank you!

Thanks! I have tested it and it appears to work without error. After we perform the data analysis, if we notice something, I will let you know.

@avramandrei in that case I think it's safe to merge to master, do the honors please. @SoimulPatriei please reopen if needed.