Having trouble reproducing experiment results.
BenfengXu opened this issue · 7 comments
Hi authors, many thanks for sharing the code; the results in this paper are really good!
I've tried the code, but right now I'm getting different results from the paper, and I wonder if you can help me figure out what's wrong.
To be specific, I fine-tune on SemEval using three different checkpoints:
1. Raw BERT
I directly fine-tune raw BERT on the SemEval dataset (i.e., "None" for "path/to/ckpt", using the PyTorch model provided by Hugging Face). The test results are {0.887, 0.887, 0.877, 0.884, 0.883} over five runs with different seeds, so the median is 0.884, which is significantly higher than the 0.871 reported in the paper.
2. Downloaded CP model
I download your pretrained CP model and fine-tune it on SemEval. The results are {0.882, 0.891, 0.884, 0.886, 0.875}, so the median is 0.884, which is also higher than the 0.876 from the paper but shows no improvement over the baseline.
3. Reproduced CP model
I run CP pretraining with the same config as in the paper (i.e., batch size 32, gradient accumulation 16, 4 GPUs, which gives a total batch size of 2048), save the checkpoint at 3500 steps, and fine-tune it on SemEval. The results are {0.887, 0.887, 0.885, 0.887, 0.882}, so the median is 0.887. That is 0.003 higher than the baseline from my own run, but the improvement does not seem significant and is also unstable: if I train for more steps, the median results are 0.884 and 0.883 at 7000 and 10500 steps. I also tried other batch sizes such as 32*8=256, and the results were not good either.
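For clarity, the total batch size I quote above is just the product of the three settings (a trivial sketch; the variable names are my own, not from the repo):

```python
# Minimal sketch of the effective-batch-size arithmetic in my pretraining run.
# Variable names are my own, not taken from the repo's config.
per_gpu_batch_size = 32   # batch size per GPU
grad_accum_steps = 16     # gradient accumulation steps
num_gpus = 4              # GPUs used for CP pretraining

effective_batch_size = per_gpu_batch_size * grad_accum_steps * num_gpus
print(effective_batch_size)  # 2048, matching the setting in the paper
```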
Some details
I installed the dependencies as required, including the provided transformers and torch 1.4. I run the above experiments on a Tesla P40 machine with 8 GPUs, and most of the training parameters are left at their defaults, other than necessary ones like the checkpoint dir and GPU ids.
Really appreciate your attention and advice. ^_^
Thank you for your careful experiments! I forgot to point out that we don't use our own eval code for the SemEval dataset; we use the official evaluation script, see https://github.com/sahitya0000/Relation-Classification/tree/master/corpus/SemEval2010_task8_scorer-v1.2. If you have any problems, please let us know.
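For reference, here is a rough sketch of how one can call the official scorer from Python. The helper function, the file names, and the one-prediction-per-line "<sentence_id>\t<relation_label>" format are my assumptions about the scorer's expected input, not code from this repo:

```python
import subprocess

# Hypothetical helper: dump predictions and gold labels in the
# "<sentence_id>\t<relation_label>" format, then run the official Perl scorer
# and return its printed report (which contains the macro-averaged F1).
def run_official_scorer(pred_labels, gold_labels, start_id=8001,
                        scorer="semeval2010_task8_scorer-v1.2.pl"):
    with open("proposed_answers.txt", "w") as f:
        for i, label in enumerate(pred_labels, start=start_id):
            f.write("{}\t{}\n".format(i, label))
    with open("answer_key.txt", "w") as f:
        for i, label in enumerate(gold_labels, start=start_id):
            f.write("{}\t{}\n".format(i, label))
    result = subprocess.run(
        ["perl", scorer, "proposed_answers.txt", "answer_key.txt"],
        stdout=subprocess.PIPE, universal_newlines=True)
    return result.stdout
```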
I revised the evaluation code to output the prediction file and compute macro-F1 with the official scorer, but the results are still not right...
Using raw BERT, the result is 0.880, the median of {0.880, 0.880, 0.883, 0.880, 0.874};
using the downloaded CP model, the result is 0.877, the median of {0.876, 0.887, 0.877, 0.881, 0.870}.
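(For completeness, the medians above are taken over the five seeds like this:)

```python
from statistics import median

# Macro-F1 from the official scorer, five runs with different seeds.
raw_bert_runs = [0.880, 0.880, 0.883, 0.880, 0.874]
cp_runs = [0.876, 0.887, 0.877, 0.881, 0.870]

print(median(raw_bert_runs))  # 0.880
print(median(cp_runs))        # 0.877
```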
Could you let me know your experiment details? You should train on the train set, evaluate on the dev set, and then select the best epoch to test. Meanwhile, I will check my original experiment code.
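In pseudocode, the selection protocol is roughly the following (train_one_epoch and evaluate are hypothetical placeholders for your own training and scoring code, not functions from this repo):

```python
# Rough sketch of the selection protocol: train on the train set, track the
# best dev macro-F1 across epochs, and report the test macro-F1 of that epoch.
# train_one_epoch / evaluate are hypothetical placeholders.
best_dev_f1 = 0.0
best_test_f1 = 0.0
for epoch in range(num_epochs):
    train_one_epoch(model, train_loader)
    dev_f1 = evaluate(model, dev_loader)             # macro-F1 on the dev set
    if dev_f1 > best_dev_f1:
        best_dev_f1 = dev_f1
        best_test_f1 = evaluate(model, test_loader)  # macro-F1 on the test set
print("test macro-F1 at best dev epoch:", best_test_f1)
```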
My original results were: BERT {86.66, 87.02, 87.41, 87.11, 87.47}, so the result is 87.11; CP {88.57, 87.61, 87.52, 87.80, 87.57}, so the result is 87.57. I will re-run the evaluation. Thank you for your careful experiments.
I have re-run this code. The results are:
BERT: {88.02, 87.61, 87.36, 87.63, 88.11}, so the result is 0.876
CP: {87.95, 87.51, 88.08, 87.42, 87.97}, so the result is 0.880
When I did the experiments before, I did not add "[SEP]" to the end of the sentence, but in this repo it is added. I just ran the code with "[SEP]" excluded; the results are:
BERT: {87.89, 86.94, 86.64, 87.55, 87.36}, so the result is 0.874
CP: {87.86, 88.24, 87.71, 87.52, 87.90}, so the result is 0.879
You can see the detailed results in https://cloud.tsinghua.edu.cn/d/2016e7bf57a34eb3954e/ .
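Concretely, the two input variants are as follows (a minimal sketch using the Hugging Face tokenizer; the example sentence and entity markers are only for illustration, and the actual preprocessing code in this repo may differ):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
sentence = "The <e1> system </e1> produced a <e2> warning </e2> ."

# Variant currently in this repo: a trailing "[SEP]" appended to the sentence.
tokens_with_sep = ["[CLS]"] + tokenizer.tokenize(sentence) + ["[SEP]"]
# Variant used in my earlier experiments: no trailing "[SEP]".
tokens_without_sep = ["[CLS]"] + tokenizer.tokenize(sentence)

print(tokenizer.convert_tokens_to_ids(tokens_with_sep))
print(tokenizer.convert_tokens_to_ids(tokens_without_sep))
```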
The reason the results differ from those reported in the paper may be the influence of different devices, but the improvement of our model is consistent. I suggest you check your code carefully, and if your results are still abnormal, you can send your code to me by email.
To say more, our paper aims to propose a new pre-training framework for RE; we don't aim to achieve SOTA. On many supervised datasets it is hard to get a big improvement. Our method performs very well in low-resource scenarios and under few-shot settings, while it gains only a small improvement on supervised RE datasets.
Thank you for your attention and careful experiments again.
Thanks for your patient and comprehensive explanation!

- Results indeed fluctuate across different devices
  I ran the baseline experiment (fine-tuning raw BERT on SemEval): on P40 I got a median of 88.00, and on V100 I got 87.74, a gap of Δ=0.26, so the previously inconsistent performance of the CP checkpoint can be explained.
- Improvements of CP
  I reran the CP pretraining and got 88.04 on V100; compared to 87.74, that is an improvement of +0.3, which is roughly in line with the paper.
  I also tried the few-shot setting you mentioned, and the improvements are indeed more significant: using 10% of the SemEval data, the result is 82.27 compared to 80.41 from raw BERT, an improvement of +1.86.
I think all my troubles are now resolved. Thanks again, and hope you have a nice day!
I am glad to hear you have solved all the troubles. Thank you for your attention to our work and your careful experiments! Hope your research goes well.