malllabiisc/WordGCN

Reproduction Problem

Punchwes opened this issue · 5 comments

Hi @svjan5 ,

Thanks for your paper, and for releasing the code.

I followed your current code with the default settings, but after several runs it seems hard to reproduce your reported results on the test set.

My results over five runs (mean ± standard deviation) are below:

Analogy task:

| | Google | MSR | SemEval2012_2 |
|---|---|---|---|
| ours | 45.16±1.61 | 49.41±0.60 | 16.28±1.73 |
| reported | – | 52.8 | 23.4 |

Similarity task:

| | MEN | WS353 | WS353R | WS353S | SimLex999 | RW | RG65 | MTurk | TR9856 |
|---|---|---|---|---|---|---|---|---|---|
| ours | 69.99±0.19 | 58.35±0.52 | 43.51±1.68 | 70.68±0.51 | 47.61±0.29 | 37.91±0.47 | 58.19±1.33 | 59.90±0.86 | 17.23±0.26 |
| reported | – | – | 45.7 | 73.2 | 45.5 | 33.7 | – | – | – |

Categorisation task:

| | AP | BLESS | Battig | ESSLI_2c | ESSLI_2b | ESSLI_1a |
|---|---|---|---|---|---|---|
| ours | 59.22±1.97 | 69.04±0.86 | 39.50±1.34 | 67.41±3.63 | 77.78±6.20 | 80.67±1.33 |
| reported | 69.3 | 85.2 | 45.2 | – | – | – |

As you can see, there is a large gap on SemEval2012_2 and the categorisation tasks. The deviations on several tasks are also fairly large.
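For clarity on how the numbers above were produced, here is a minimal sketch of the per-benchmark aggregation, mean ± sample standard deviation over runs; the five scores listed are illustrative placeholders, not my actual run outputs:

```python
# Aggregate one benchmark's score over five runs as mean ± standard deviation.
# The scores below are illustrative placeholders, not my actual results.
import statistics

run_scores = [44.1, 45.9, 43.8, 46.5, 45.5]  # e.g. Google analogy accuracy per run

mean = statistics.mean(run_scores)
std = statistics.stdev(run_scores)  # sample standard deviation
print(f"{mean:.2f}±{std:.2f}")
```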

I wonder where I went wrong. Forgive my carelessness; is there anything I missed?

Hi @Punchwes,
Please make sure that you are using the same test/valid split that we have provided in our repo. With that, there should be no issue in reproducing the results.

Hi @svjan5 ,
Yes, I am using the provided web_data for the process: the valid split for model selection and the test split for testing. I run `python switch_evaluation_data.py -split test` to get the test data.

Also, I am wondering at which epoch you reached the best performance. I find that the best model is always obtained within 2 or 3 epochs. Is this the same for you?
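For reference, my model selection simply keeps the epoch with the best validation score; a minimal sketch, where `train_one_epoch` and `evaluate` are hypothetical stand-ins for the repo's own training and evaluation code:

```python
# Sketch of my selection loop: keep the epoch with the highest validation
# score. Both helpers below are hypothetical placeholders, not WordGCN code.
import random

def train_one_epoch(epoch):
    pass  # stand-in for the real training step

def evaluate(split):
    return random.uniform(40.0, 50.0)  # stand-in for the real evaluation

best_score, best_epoch = float("-inf"), -1
for epoch in range(10):
    train_one_epoch(epoch)
    score = evaluate("valid")
    if score > best_score:
        best_score, best_epoch = score, epoch

print(f"best validation score {best_score:.2f} at epoch {best_epoch}")
```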

By the way, is the model you released previously the best-performing one? The results on the test set do not seem very consistent with the reported ones.

Hi @Punchwes,
Let me know what performance you get with the provided pre-trained embeddings. We trained the model for around 4-5 days to get the reported performance, reducing the learning rate whenever the loss saturated.
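For anyone else trying to reproduce this, a minimal sketch of that kind of schedule, assuming a patience-based plateau check; the patience, decay factor, and threshold below are illustrative, not the exact values we used:

```python
# Reduce the learning rate once the loss stops improving ("saturates").
# patience, factor, and min_delta are illustrative assumptions, not the
# actual WordGCN training settings.
def plateau_lr(lr, losses, patience=3, factor=0.5, min_delta=1e-4):
    """Halve `lr` if the last `patience` losses show no real improvement."""
    if len(losses) <= patience:
        return lr
    best_before = min(losses[:-patience])
    recent_best = min(losses[-patience:])
    if recent_best > best_before - min_delta:
        lr *= factor
    return lr

# Example: the loss saturates at 0.5, so the rate is halved once.
lr, losses = 0.01, []
for loss in [1.0, 0.6, 0.5, 0.5, 0.5, 0.5]:
    losses.append(loss)
    lr = plateau_lr(lr, losses)
print(lr)  # 0.005
```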

I need to confirm whether the shared model checkpoint is the best performing model or not.

Hi @svjan5 , thanks for your information.

The provided embeddings give the following performance:

| | AP | BLESS | Battig | WS353R | WS353S | SimLex999 | RW | MSR | SemEval2012_2 |
|---|---|---|---|---|---|---|---|---|---|
| provided | 67.14 | 81.84 | 44.55 | 45.89 | 73.18 | 45.53 | 33.89 | 27.02 | 16.31 |
| reported | 69.3 | 85.2 | 45.2 | 45.7 | 73.2 | 45.5 | 33.7 | 52.8 | 23.4 |

Large gaps can be seen on the analogy tasks and BLESS.

According to your description, I think the only difference between our runs is that you further reduced the learning rate for several additional epochs, while I did not. I will try running several more epochs with a reduced learning rate to see whether that reproduces the reported results.

Thanks.

Hi @Punchwes,
It seems like you have almost obtained the reported results on the similarity tasks. I am not sure why the results are low on the analogy tasks.
I will check it and get back to you.