Cannot reproduce the result for `bert-base-uncased`, `avg_first_last` setting
Closed this issue · 4 comments
@gaotianyu1350
Hi, thank you for the great work and for publishing such clean code!
I have some questions about reproducing the STS results for pre-trained BERT models.
When I run the following command in my environment, I get higher STS scores than the results reported in your paper.
Do you have any idea what might be causing the discrepancy?
Code executed
python evaluation.py \
--model_name_or_path bert-base-uncased \
--pooler avg_first_last \
--task_set sts \
--mode test
Results
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg. |
|-------|-------|-------|-------|-------|--------------|-----------------|------|
| 45.09 | 64.30 | 54.56 | 70.52 | 67.87 | 59.05 | 63.75 | 60.73 |
Expected results (scores shown in your paper)
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg. |
|-------|-------|-------|-------|-------|--------------|-----------------|------|
| 39.70 | 59.38 | 49.67 | 66.03 | 66.19 | 53.87 | 62.06 | 56.70 |
Strangely, I can fully reproduce the scores for the SimCSE models with the following command:
python evaluation.py \
--model_name_or_path princeton-nlp/sup-simcse-bert-base-uncased \
--pooler cls \
--task_set sts \
--mode test
Here is the output of pip freeze; I am running on a single NVIDIA RTX 6000 Ada GPU.
Thank you very much for your help!
pip freeze result
aiofiles==23.2.1
aiohappyeyeballs==2.4.3
aiohttp==3.10.10
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.5.0
async-timeout==4.0.3
attrs==24.2.0
certifi==2024.8.30
charset-normalizer==3.4.0
click==8.1.7
contourpy==1.1.1
cycler==0.12.1
datasets==3.0.1
dill==0.3.8
exceptiongroup==1.2.2
fastapi==0.115.2
ffmpy==0.4.0
filelock==3.16.1
fonttools==4.54.1
frozenlist==1.4.1
fsspec==2024.6.1
gradio==4.44.1
gradio-client==1.3.0
h11==0.14.0
httpcore==1.0.6
httpx==0.27.2
huggingface-hub==0.25.2
idna==3.10
importlib-resources==6.4.5
jinja2==3.1.4
joblib==1.4.2
kiwisolver==1.4.7
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.7.5
mdurl==0.1.2
multidict==6.1.0
multiprocess==0.70.17
numpy==1.24.4
orjson==3.10.7
packaging==24.1
pandas==2.0.3
pillow==10.4.0
prettytable==3.11.0
propcache==0.2.0
pyarrow==17.0.0
pydantic==2.9.2
pydantic-core==2.23.4
pydub==0.25.1
pygments==2.18.0
pyparsing==3.1.4
python-dateutil==2.9.0.post0
python-multipart==0.0.12
pytz==2024.2
PyYAML==6.0.2
regex==2024.9.11
requests==2.32.3
rich==13.9.2
ruff==0.6.9
sacremoses==0.1.1
safetensors==0.4.5
scikit-learn==1.3.2
scipy==1.10.1
semantic-version==2.10.0
shellingham==1.5.4
six==1.16.0
sniffio==1.3.1
starlette==0.39.2
threadpoolctl==3.5.0
tokenizers==0.9.4
tomlkit==0.12.0
torch==1.7.1+cu110
torchtyping==0.1.5
tqdm==4.66.5
transformers==4.2.1
typeguard==2.13.3
typer==0.12.5
typing-extensions==4.12.2
tzdata==2024.2
urllib3==2.2.3
uvicorn==0.31.1
wcwidth==0.2.13
websockets==12.0
xxhash==3.5.0
yarl==1.15.1
zipp==3.20.2
Hi,
It looks like your dependencies match our experimental setting, and the hardware shouldn't cause that much of a difference. Unfortunately, I am also not sure what is causing the discrepancy... have you tried the RoBERTa first-last avg setting?
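For reference, the analogous check would be something like the following (assuming evaluation.py accepts roberta-base the same way it accepts bert-base-uncased):
python evaluation.py \
    --model_name_or_path roberta-base \
    --pooler avg_first_last \
    --task_set sts \
    --mode test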
@gaotianyu1350
Thanks for the prompt response! I also could not reproduce the results for the RoBERTa first-last avg setting.
It turns out that, due to the change in the first-last avg pooling logic in this commit, the current codebase cannot reproduce the results for models evaluated with first-last avg pooling (such as BERT and RoBERTa).
After rolling back that change (i.e., averaging the static word-embedding layer output instead of the contextualized embeddings from the first transformer layer), I can successfully reproduce the STS results reported in the paper.
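For anyone else hitting this, here is a minimal sketch (not the repo's exact evaluation code) of the two variants, assuming the Hugging Face convention that hidden_states[0] is the embedding-layer output and hidden_states[1] is the first transformer layer:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

batch = tokenizer(["A sentence to embed."], return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**batch, output_hidden_states=True).hidden_states

mask = batch["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1)

def masked_avg(layer):
    # Average token vectors, ignoring padding positions.
    return (layer * mask).sum(1) / mask.sum(1)

# Paper setting: average the embedding-layer output and the last layer.
paper_emb = (masked_avg(hidden_states[0]) + masked_avg(hidden_states[-1])) / 2

# Current codebase (after the commit above): average the first transformer
# layer's contextualized output and the last layer instead.
current_emb = (masked_avg(hidden_states[1]) + masked_avg(hidden_states[-1])) / 2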
It would be very nice if you could add a note about this discrepancy to the README or the paper for those who are, or will be, trying to reproduce the results! :)
Hi,
Thanks for figuring it out! Yeah, it makes sense that using the contextualized embeddings improves the result. I'll add a note to the README.
Thank you for updating the README! Closing this issue.