[Question] Some Questions
Hey there, I'm really interested in this terrific work, and I ran into some questions while trying to reproduce the results in the paper:
- Q1: Do the checkpoints released on Hugging Face (three FlanT5 models) correspond to the "soft domain shift" models?
- Q2: How can I compute the final evaluation results, including `bertscore-f1`, `rouge-l`, and `bleu`, as reported in the paper? After running `python3 code/modeling/generate_t5.py --model ltg/flan-t5-definition-en-xl --testdata wordnet/`, I got the predicted definition for each item of a word in the test data, and I then used `python code/definition_pair_similarity.py --data_path predicted.tsv --output_path "result.tsv"` to compute each word's metrics. Should I average each line of the results in `result.tsv` to obtain the final mean values?

Thanks so much!
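(For what it's worth, averaging the per-item scores in `result.tsv` would look roughly like the sketch below; this assumes pandas and that the script writes one numeric column per metric, which is not guaranteed.)

```python
import pandas as pd

# Hypothetical sketch of the averaging asked about above: take the mean of
# every numeric metric column in result.tsv. Column names depend on the script.
df = pd.read_csv("result.tsv", sep="\t")
print(df.mean(numeric_only=True))
```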
Hi,

- The models on HF are fine-tuned on three datasets (Oxford, WordNet and CoDWoE), so yes, they fall into the "soft domain shift" category from the paper.
- If you want to evaluate the generated definitions against gold definitions, you don't need the `definition_pair_similarity.py` script. Just make sure you have a tab-separated file with three columns, "Targets", "Definition" and "Generated_Definition", and run [evaluate_simple.py](https://github.com/ltgoslo/definition_modeling/blob/main/code/evaluation/evaluate_simple.py) on it.
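A quick way to check that a file matches the expected format before running the script (a minimal sketch, assuming pandas; the column names are the ones listed above, the file path is hypothetical):

```python
import pandas as pd

# Sanity check: the file passed to evaluate_simple.py must be tab-separated
# and contain the "Targets", "Definition" and "Generated_Definition" columns.
df = pd.read_csv("predicted.tsv", sep="\t")
required = {"Targets", "Definition", "Generated_Definition"}
missing = required - set(df.columns)
if missing:
    raise ValueError(f"predicted.tsv is missing columns: {missing}")
print(f"{len(df)} rows ready for evaluate_simple.py")
```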
Really helpful, thank you so much.
Thanks again for your previous response!
Now I wonder what decoding parameters (e.g., `temperature`, `beam_size`) I can use to reproduce the `BLEU`, `ROUGE-L`, and `BERT-F1` results reported in the paper.
Hi,
As mentioned in the paper, we used simple greedy decoding, thus no need to tune temperature or beam size.
The only trick we employed is target word filtering: that is, prohibiting the model from generating the target word itself, to avoid circular definitions.
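For illustration, greedy decoding with target-word filtering can be approximated with Hugging Face's `bad_words_ids` argument. This is only a sketch: the prompt format is illustrative, and the exact filtering logic of `generate_t5.py` may differ.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("ltg/flan-t5-definition-en-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("ltg/flan-t5-definition-en-xl")

target = "frailty"  # hypothetical target word
prompt = f"He showed great frailty. What is the definition of {target}?"

# Ban the target word (and its capitalized form) from the generated definition.
bad_words_ids = tokenizer(
    [target, target.capitalize()], add_special_tokens=False
).input_ids

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=False,        # greedy decoding: no temperature, no sampling
    num_beams=1,            # no beam search
    max_new_tokens=60,
    bad_words_ids=bad_words_ids,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```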
Thank you! I guess running `evaluate_simple.py` directly is okay, right?
Yes, sure.
Somehow, my reproduced results are inconsistent with the numbers reported in the paper.

My reproduction steps:

- Run `generate_t5.py` for generation:
  `python code/modeling/generate_t5.py --model ltg/flan-t5-definition-en-xl --testdata wordnet/ --save predicted_wordnet_flan-t5-xl.tsv`
- Run `evaluate_simple.py` for evaluation:
  `python code/evaluation/evaluate_simple.py --data_path predicted_wordnet_flan-t5-xl.tsv --output results_wordnet_flant5-xl.tsv`
- The results I got are shown below.

I would like to know if there's anything I missed. Thank you so much!
What are you testing it on, what split of the Wordnet dataset?
The test split of WordNet, including 1,775 examples.
We will look into it, but most probably it is related to different implementations of Rouge and BLEU scores.
Hi, did you solve / look into it? I faced the same issue. Which script should I consider for evaluation: `code/evaluation/evaluate_simple.py` or `code/definition_pair_similarity.py`?
Thanks for your assistance.
Bests,
Francesco
Hi Francesco,
On the one hand, you can use `evaluate_simple.py` to compute the metrics, but I'm afraid it will not reproduce the exact results in the paper: the metrics used in that script are the Hugging Face implementations.
On the other hand, you can use `definition_pair_similarity.py` to get results closer to the BLEU, ROUGE, and BERTScore-F1 values reported in the paper.
IMO, it would also be worth looking at several earlier research works on definition modeling (DM) and their evaluation implementations; I believe they will help you:
- Learning to Describe Unknown Phrases with Local and Global Contexts, Ishiwatari et al. (2019)
- Definition Modelling for Appropriate Specificity, Huang et al. (2021)
- Multitasking Framework for Unsupervised Simple Definition Generation, Kong et al. (2022)
By the way, feel free to contact me (yangliu.real@gmail.com) if you have any ideas on the DM task or do further research on it. (I am working on it :)
Best,
Yang
Dear Yang, thank you for your response :) I am working on something related and will probably reach out to you.
How did you obtain more comparable results for BLEU, ROUGE, and BERTScore? Could you confirm the following about `definition_pair_similarity.py`?
- The BLEU score is replaced with SacreBLEU
- BERTScore is replaced with sentence-embedding similarity
- ROUGE is not present
- METEOR is used with alpha=0.5
Thanks for your assistance.
Bests,
Francesco
You're welcome!
- I recommend trying NLTK's `sentence_bleu` to reproduce the BLEU score from `definition_pair_similarity.py` (see the sketch below): sentence-level BLEU (NLTK) somehow yields higher scores, while other implementations such as SacreBLEU or the Hugging Face (corpus-level) BLEU give lower values. To summarize, I think the NLTK `sentence_bleu` used in `definition_pair_similarity.py` is the most likely way to reproduce the BLEU results in the paper.
- BERTScore should directly follow the evaluation method in `evaluate_simple.py`. The sentence-embedding similarity computed by SentenceTransformer will be much lower than vanilla BERTScore.
- Yes, ROUGE is not present in `definition_pair_similarity.py` but exists in `evaluate_simple.py`; I think using the one in `evaluate_simple.py` is okay.
- I think alpha should keep its default value, rather than the 0.5 specified in `definition_pair_similarity.py`.
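A minimal sketch of the sentence-level BLEU computation referred to above, assuming NLTK; the example strings are made up and this is not the exact code of `definition_pair_similarity.py`:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# One gold definition and one generated definition, whitespace-tokenized.
gold = "the quality of being weak or fragile".split()
generated = "the state of being weak".split()

smooth = SmoothingFunction().method1  # avoids zero scores for short sentences
score = sentence_bleu([gold], generated, smoothing_function=smooth)
print(f"sentence-level BLEU: {score:.4f}")
```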
I have to say that in the end I failed to reproduce the reported results in the paper. And I think it would be better to chat about this elsewhere, rather than in this closed issue :)
Hi @FrancescoPeriti and @jacklanda
Sorry for the delay with this issue.
As promised, we have looked into it.
- The evaluation script can aggregate the scores by unique target words or by unique senses. The results in Table 3 of the paper for WordNet and Oxford were obtained by aggregating by senses, but since then we have moved on to other resources which do not provide explicit sense ids, only target words, examples and definitions. Because of that, the evaluation script was modified to always use the `Targets` column as quasi-senses, which of course leads to artificially low results if there are many senses per word on average. The evaluation script is now updated so that it uses the `Sense` column if it is present and `Targets` otherwise.
- The evaluation script offers several strategies for dealing with situations where the gold data has multiple definitions for one and the same sense (the `multiple_definitions_same_sense_id` argument). The default option is `mean`, that is, take the average of the evaluation scores between a given generated definition and all possible gold definitions. But Table 3 in the paper uses the `max` option, that is, take the maximum evaluation score across all gold definitions for a specific generated definition.
- Finally, by default the evaluation script uses whitespace tokenization for ROUGE, because the default tokenizer of `rouge_scorer` is tuned for English and plays badly with other languages. But in the paper we dealt with English, so we used its default tokenizer.
To sum up: if you want to reproduce Table 3 from the paper, run the evaluation script with the following parameters:
`./evaluate_simple.py --data_path ${TEST} --output ${OUTPUT} --multiple_definitions_same_sense_id=max --whitespace 0`
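For intuition, here is a rough illustration (not the actual `evaluate_simple.py` code) of what the `max` strategy does when several gold definitions share a sense id; the data below is made up:

```python
import pandas as pd

# Made-up per-row scores: rows sharing a Sense id correspond to the same
# generated definition scored against different gold definitions of that sense.
scores = pd.DataFrame({
    "Sense": ["bank%1", "bank%1", "bank%2"],
    "rougeL": [0.41, 0.58, 0.33],
})

# --multiple_definitions_same_sense_id=max: keep the best score per sense ...
best_per_sense = scores.groupby("Sense")["rougeL"].max()
# ... and average over senses for the final metric.
print(best_per_sense.mean())
```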
I've just re-evaluated our generated definitions from a year ago, and the results are:
sacrebleu 32.8088
rougeL 0.5221
bertscore 0.9122
exact_match 0.2328
Dear Andrey (et al.),
Your work has been quite inspiring :) I was just looking in that direction.
Thanks for your time and explanation.
Bests
"The sun will rise, and the truth will come to light."
@FrancescoPeriti Hey Francesco, have you already reproduced the results Andrey provided on the WordNet / Oxford datasets?
@jacklanda note that you probably won't be able to generate exactly the same definitions with up-to-date versions of HF Transformers, since generation methods are being changed and updated all the time.
But the scores should be in approximately the same range. Also, if you need the exact definitions on which the evaluation scores from the paper were computed, we can publish those.