malllabiisc/WordGCN

Reproduction Problem

Punchwes opened this issue · 5 comments

Hi @svjan5 ,

Thanks for your paper, and for releasing the code.

I followed your current code with the default settings, but after several runs it seems hard to reproduce your reported results on the test set.

My results over five runs (mean ± standard deviation) are below:

Analogy task:

| | Google | MSR | SemEval2012_2 |
|---|---|---|---|
| ours | 45.16±1.61 | 49.41±0.60 | 16.28±1.73 |
| reported | – | 52.8 | 23.4 |

Similarity task:

| | MEN | WS353 | WS353R | WS353S | SimLex999 | RW | RG65 | MTurk | TR9856 |
|---|---|---|---|---|---|---|---|---|---|
| ours | 69.99±0.19 | 58.35±0.52 | 43.51±1.68 | 70.68±0.51 | 47.61±0.29 | 37.91±0.47 | 58.19±1.33 | 59.90±0.86 | 17.23±0.26 |
| reported | – | – | 45.7 | 73.2 | 45.5 | 33.7 | – | – | – |

Categorisation task:

| | AP | BLESS | Battig | ESSLI_2c | ESSLI_2b | ESSLI_1a |
|---|---|---|---|---|---|---|
| ours | 59.22±1.97 | 69.04±0.86 | 39.50±1.34 | 67.41±3.63 | 77.78±6.20 | 80.67±1.33 |
| reported | 69.3 | 85.2 | 45.2 | – | – | – |

As you can see, there is a large gap on SemEval2012_2 and the categorisation tasks. The deviations on several tasks are also fairly large.
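For clarity on how the numbers above were produced, here is a minimal sketch of the per-benchmark aggregation, mean ± sample standard deviation over runs; the five scores listed are illustrative placeholders, not my actual run outputs:

```python
# Aggregate one benchmark's score over five runs as mean ± standard deviation.
# The scores below are illustrative placeholders, not my actual results.
import statistics

run_scores = [44.1, 45.9, 43.8, 46.5, 45.5]  # e.g. Google analogy accuracy per run

mean = statistics.mean(run_scores)
std = statistics.stdev(run_scores)  # sample standard deviation
print(f"{mean:.2f}±{std:.2f}")
```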

I wonder where I went wrong. Forgive my carelessness; is there anything I missed?

Hi @Punchwes,
Please make sure that you are using the same test/valid split that we have provided in our repo. With that, there should be no issue in reproducing the results.

Hi @svjan5 ,
Yes, I am using the provided web_data for the process: the valid split for model selection and the test split for testing. I run `python switch_evaluation_data.py -split test` to get the test data.

Also, I am wondering at which epoch you reached the best performance. I find that the best model is always obtained within 2 or 3 epochs. Is this the same for you?
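For reference, my model selection simply keeps the epoch with the best validation score; a minimal sketch, where `train_one_epoch` and `evaluate` are hypothetical stand-ins for the repo's own training and evaluation code:

```python
# Sketch of my selection loop: keep the epoch with the highest validation
# score. Both helpers below are hypothetical placeholders, not WordGCN code.
import random

def train_one_epoch(epoch):
    pass  # stand-in for the real training step

def evaluate(split):
    return random.uniform(40.0, 50.0)  # stand-in for the real evaluation

best_score, best_epoch = float("-inf"), -1
for epoch in range(10):
    train_one_epoch(epoch)
    score = evaluate("valid")
    if score > best_score:
        best_score, best_epoch = score, epoch

print(f"best validation score {best_score:.2f} at epoch {best_epoch}")
```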

By the way, is the model you released previously the best-performing one? The results on the test set do not seem very consistent with the reported ones.

Hi @Punchwes,
Let me know what performance you get with the provided pre-trained embeddings. We trained the model for around 4-5 days to get the reported performance, reducing the learning rate whenever the loss saturated.
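For anyone else trying to reproduce this, a minimal sketch of that kind of schedule, assuming a patience-based plateau check; the patience, decay factor, and threshold below are illustrative, not the exact values we used:

```python
# Reduce the learning rate once the loss stops improving ("saturates").
# patience, factor, and min_delta are illustrative assumptions, not the
# actual WordGCN training settings.
def plateau_lr(lr, losses, patience=3, factor=0.5, min_delta=1e-4):
    """Halve `lr` if the last `patience` losses show no real improvement."""
    if len(losses) <= patience:
        return lr
    best_before = min(losses[:-patience])
    recent_best = min(losses[-patience:])
    if recent_best > best_before - min_delta:
        lr *= factor
    return lr

# Example: the loss saturates at 0.5, so the rate is halved once.
lr, losses = 0.01, []
for loss in [1.0, 0.6, 0.5, 0.5, 0.5, 0.5]:
    losses.append(loss)
    lr = plateau_lr(lr, losses)
print(lr)  # 0.005
```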

I need to confirm whether the shared model checkpoint is the best performing model or not.

Hi @svjan5 , thanks for your information.

The provided embeddings give the following performance:

| | AP | BLESS | Battig | WS353R | WS353S | SimLex999 | RW | MSR | SemEval2012_2 |
|---|---|---|---|---|---|---|---|---|---|
| provided | 67.14 | 81.84 | 44.55 | 45.89 | 73.18 | 45.53 | 33.89 | 27.02 | 16.31 |
| reported | 69.3 | 85.2 | 45.2 | 45.7 | 73.2 | 45.5 | 33.7 | 52.8 | 23.4 |

Large gaps can be seen on the analogy tasks and BLESS.

According to your description, I think the only difference between our runs is that you further reduced the learning rate for several additional epochs, while I did not. I will try running several more epochs with a reduced learning rate to see whether that reproduces the reported results.

Thanks.

Hi @Punchwes,
It seems like you have almost obtained the reported results on the similarity tasks. I am not sure why the results are low on the analogy tasks.
I will check it and get back to you.