timoschick/dino

Request for public STSb-DINO dataset

MatthewCYM opened this issue · 5 comments

Hi,

May I ask for the STSb-DINO dataset mentioned in the paper?

I tried to generate the dataset by running

python3 dino.py \
--output_dir output \
--task_file task_specs/sts-x2.json \
--input_file raw.in \
--input_file_type plain \
--num_entries_per_input_and_label 2 \
--max_output_length 40 \
--decay_constant 100 \
--top_p 0.9 \
--top_k 5 \
--remove_duplicates \
--remove_identical_pairs

where the task specification is
{
  "task_name": "sts",
  "labels": {
    "2": {
      "instruction": "Task: Write two sentences that mean the same thing.\nSentence 1: \"<X1>\"\nSentence 2: \"",
      "counter_labels": []
    },
    "1": {
      "instruction": "Task: Write two sentences that are somewhat similar.\nSentence 1: \"<X1>\"\nSentence 2: \"",
      "counter_labels": [
        "2"
      ]
    },
    "0": {
      "instruction": "Task: Write two sentences that are on completely different topics.\nSentence 1: \"<X1>\"\nSentence 2: \"",
      "counter_labels": [
        "2",
        "1"
      ]
    }
  }
}
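
(A quick sanity check before running, since the nested quotes in the instructions are easy to get wrong: the file should parse as valid JSON, e.g.

python3 -c "import json; json.load(open('task_specs/sts-x2.json'))"

)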

However, the result on STS-b is only around 60% when training SRoBERTa on the generated data.

May I ask what might be going wrong?

Regards,
Yiming

Hi @MatthewCYM, I currently do not have access to high-speed internet, which is why I cannot upload the dataset. I'll do so early next week. I hope I can also share our finetuning script by then (it still needs some "beautification" before it can be published).

Hi @MatthewCYM, the datasets are now available (see https://github.com/timoschick/dino/blob/main/README.md#-generated-dinos). To reproduce our results, you should use the postprocessed (pp) version (or do the postprocessing yourself using this script). When evaluating on any STS dataset, note that we use the same metrics as SentenceBERT, which differ from those used by SentEval. You may find this script helpful, but it requires some adaptation (for example, it assumes that labels are in the range 0 to 5). Let me know if you are still unable to reproduce our results; then I'll prioritize the "beautification" of our finetuning script.
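
To make the metric concrete, here is a minimal evaluation sketch (not our exact script; the file path and the column order sentence1, sentence2, score are assumptions): it computes the Spearman correlation between the cosine similarity of the sentence embeddings and the gold scores, which is the SentenceBERT metric.

import csv

import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

# Load the finetuned model (the path is a placeholder)
model = SentenceTransformer("path/to/finetuned-model")

# Read the evaluation data; the column layout is an assumption
s1, s2, gold = [], [], []
with open("stsb-test.tsv") as f:
    for row in csv.reader(f, delimiter="\t"):
        s1.append(row[0])
        s2.append(row[1])
        gold.append(float(row[2]))

e1 = model.encode(s1, convert_to_numpy=True)
e2 = model.encode(s2, convert_to_numpy=True)

# Cosine similarity per pair, then Spearman correlation with the gold scores
cos = (e1 * e2).sum(axis=1) / (np.linalg.norm(e1, axis=1) * np.linalg.norm(e2, axis=1))
print(f"Spearman: {spearmanr(cos, gold).correlation:.4f}")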

Hi @timoschick,

We still cannot reproduce the results on the STS benchmark (only around 66% on STS-b). Could you please publish the finetuning script?

Thank you.

Hi @MatthewCYM,

You can now find a slightly modified version of the training script that we used here: https://github.com/timoschick/dino/blob/main/scripts/sts/run_training.py

After installing sentence-transformers (see requirements.txt for the correct version), you can run it as follows:

python3 run_training.py --input_file <PATH_TO_YOUR_DINO_DATASET> --output_dir <SOME_OUTPUT_DIR>

Using STS‑🦕‑x2 (pp) (https://www.cis.uni-muenchen.de/~schickt/dino/sts-dino-x2-postprocessed.jsonl) as the input file, this should give a Spearman correlation of 77.97 on STSb. There is a slight deviation from the number reported in the paper (77.82) because one step of our postprocessing involves converting a list into a set and back, which doesn't always give identical results, and we didn't save the original dataset.
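
For anyone curious about what the script does at its core, here is a minimal sketch (not the script itself; the JSONL field names text_a, text_b and label, and the mapping of the labels {0, 1, 2} to similarity scores, are assumptions):

import json

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# SRoBERTa-style model: a RoBERTa encoder with mean pooling on top
word_embedding = models.Transformer("roberta-base", max_seq_length=128)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding, pooling])

# Read the DINO dataset and map the labels {0, 1, 2} to scores in [0, 1]
examples = []
with open("sts-dino-x2-postprocessed.jsonl") as f:
    for line in f:
        entry = json.loads(line)
        examples.append(InputExample(
            texts=[entry["text_a"], entry["text_b"]],
            label=float(entry["label"]) / 2.0))

# Train with a cosine similarity objective, as in SentenceBERT
loader = DataLoader(examples, shuffle=True, batch_size=32)
loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("output")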

If you manage to find the difference between your finetuning script and ours, I'd be very happy to learn about it (if for example there's a single hyperparameter that causes such a large difference in performance, that would be a very important finding that we should add to the paper).

Hi @MatthewCYM, there seems to be an issue with the dependencies (sentence-transformers requires torch>=1.6.0, whereas the rest of DINO requires torch==1.5.0). You can refer to this comment for a workaround.
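
If the linked workaround doesn't apply to your setup, one common approach for this kind of conflict (not necessarily the one described in that comment) is to install sentence-transformers without its pinned dependencies, keeping DINO's torch version:

pip install --no-deps sentence-transformers

You then have to make sure its remaining dependencies (e.g. transformers, scipy, scikit-learn) are already present in the environment.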