ltgoslo/definition_modeling

[Question] Some Questions

Closed this issue · 18 comments

Hey there, I am very interested in this terrific work, and I ran into some questions while trying to reproduce the results in the paper:

  • Q1: Do the checkpoints released on Hugging Face (the 3 FlanT5 models) correspond to the soft domain shift models?
  • Q2: How can I compute the final evaluation results (BERTScore-F1, ROUGE-L, and BLEU) reported in the paper? After running python3 code/modeling/generate_t5.py --model ltg/flan-t5-definition-en-xl --testdata wordnet/, I got the predicted definition for each item of a word in the test data. I then ran python code/definition_pair_similarity.py --data_path predicted.tsv --output_path "result.tsv" to compute per-word metrics. Should I simply average the lines of result.tsv to get the final mean values?

Thanks so much!

Hi,

1. The models on HF are fine-tuned on three datasets (Oxford, WordNet and CoDWoE), so yes, they fall into the "soft domain shift category" from the paper.

2. If you want to evaluate the generated definitions against gold definitions, you don't need the `definition_pair_similarity.py` script. Just make sure you have a tab-separated file with three columns: "Targets", "Definition" and "Generated_Definition", and run [evaluate_simple.py](https://github.com/ltgoslo/definition_modeling/blob/main/code/evaluation/evaluate_simple.py) on it.
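
For reference, a minimal sketch of how such a file could be assembled with pandas (the file names here are hypothetical, and the prediction file produced by generate_t5.py may already contain all three columns, in which case this step is unnecessary):

# Sketch only: build the tab-separated input expected by evaluate_simple.py.
# "wordnet_test.tsv" and "predicted.tsv" are hypothetical file names, and the two
# files are assumed to be row-aligned.
import pandas as pd

gold = pd.read_csv("wordnet_test.tsv", sep="\t")    # gold data with "Targets" and "Definition" columns
preds = pd.read_csv("predicted.tsv", sep="\t")      # model output with a "Generated_Definition" column

merged = gold[["Targets", "Definition"]].copy()
merged["Generated_Definition"] = preds["Generated_Definition"].values
merged.to_csv("eval_input.tsv", sep="\t", index=False)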

Extremely helpful, thank you so much.

Thanks again for your previous response!

Now I wonder what decoding parameters (e.g., temperature, beam size) I should use to reproduce the BLEU, ROUGE-L, and BERTScore-F1 results reported in the paper.

Hi,
As mentioned in the paper, we used simple greedy decoding, thus no need to tune temperature or beam size.
The only trick we employed is target word filtering: that is, prohibiting the model from generating the target word itself, to avoid circular definitions.
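
For illustration, target word filtering with greedy decoding might look roughly like the sketch below in Hugging Face Transformers. This is not the repository's exact generation code, and the prompt format is only an assumption; code/modeling/generate_t5.py is the authoritative implementation.

# Rough sketch: greedy decoding while forbidding the target word via bad_words_ids.
# The prompt shape is an assumption, not the exact one used in the repository.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "ltg/flan-t5-definition-en-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

target = "cat"
prompt = "The cat sat on the mat. What is the definition of cat?"  # assumed prompt shape

# Token ids of the target word (also with a leading space), so generate() never emits them
bad_words = tokenizer([target, " " + target], add_special_tokens=False).input_ids

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=False,        # greedy decoding
    num_beams=1,
    bad_words_ids=bad_words,
    max_new_tokens=64,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))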

Thank you! I guess running evaluate_simple.py directly is okay, right?

Yes, sure.

Somehow, I got the reproduced results below, which are inconsistent with the numbers reported in the paper.

My Reproduction Steps

  1. Run generate_t5.py for generation:
python code/modeling/generate_t5.py --model ltg/flan-t5-definition-en-xl --testdata wordnet/ --save predicted_wordnet_flan-t5-xl.tsv
  2. Run evaluate_simple.py for evaluation:
python code/evaluation/evaluate_simple.py --data_path predicted_wordnet_flan-t5-xl.tsv --output results_wordnet_flant5-xl.tsv
  3. I got the results below:

[screenshot of the reproduced evaluation scores]

I would like to know if there's anything I missed. Thank you so much!

What are you testing it on, what split of the WordNet dataset?

The test split of WordNet, which contains 1,775 examples.

We will look into it, but most probably it is related to different implementations of Rouge and BLEU scores.

Hi, did you solve / look into it? I faced the same issue. Which script should I consider for evaluation?

code/evaluation/evaluate_simple.py or
code/definition_pair_similarity...py

Thanks for your assistance.
Bests,
Francesco

Hi Francesco,

On the one hand, you can use evaluate_simple.py to compute the metrics, but I am afraid it will not reproduce the exact results from the paper: the metrics in that script are the Hugging Face implementations.

On the other hand, you can use definition_pair_similarity.py to get results that are more comparable to the BLEU, ROUGE, and BERTScore-F1 numbers reported in the paper.

IMO, it is also worth having a look at earlier research on definition modeling (DM) from several years ago and its evaluation implementations; I believe they will help you.

By the way, feel free to contact me (yangliu.real@gmail.com) if you have any ideas on the DM task or do further research on it. (I am working on it :)

Best,
Yang

Dear Yang, thank you for your response :) I am working on something related and will probably reach out to you.
How did you obtain more comparable results for BLEU, ROUGE, and BERTScore? Could you confirm the following?
In definition_pair_similarity.py:

  • BLEU score is replaced with SacreBLEU
  • BERTScore is replaced with sentence-embedding similarity
  • ROUGE is not present
  • METEOR is used with alpha=0.5

Thanks for your assistance.
Bests,
Francesco

You're welcome!

  • For BLEU, I recommend trying NLTK's sentence_bleu, as used in definition_pair_similarity.py: sentence-level BLEU (NLTK) somehow yields higher scores, while other implementations such as SacreBLEU or Hugging Face's corpus-level BLEU give noticeably lower ones (see the sketch after this list). In short, I think the NLTK sentence-BLEU in definition_pair_similarity.py is the most likely way to reproduce the BLEU results in the paper.
  • BERTScore should directly follow the evaluation method in evaluate_simple.py; the sentence-embedding similarity computed by SentenceTransformer will be much lower than vanilla BERTScore.
  • Yes, ROUGE is not present in definition_pair_similarity.py, but it is in evaluate_simple.py, and using the one in evaluate_simple.py is fine.
  • I think alpha should keep its default value rather than the 0.5 specified in definition_pair_similarity.py.
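
To make the first point concrete, here is a small sketch (with an invented definition pair) comparing NLTK's sentence-level BLEU to SacreBLEU; it only illustrates that the two implementations are not directly comparable, not how either evaluation script computes its final numbers:

# Sketch: score the same hypothesis/reference pair with NLTK sentence BLEU and SacreBLEU.
# The definitions below are made up purely for illustration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import sacrebleu

reference = "a small domesticated carnivorous mammal"
hypothesis = "a small domestic carnivorous animal"

nltk_bleu = sentence_bleu(
    [reference.split()],                    # list of tokenized references
    hypothesis.split(),                     # tokenized hypothesis
    smoothing_function=SmoothingFunction().method1,
)
sacre = sacrebleu.sentence_bleu(hypothesis, [reference]).score

print(f"NLTK sentence BLEU: {nltk_bleu:.4f}")  # on a 0-1 scale
print(f"SacreBLEU:          {sacre:.2f}")      # on a 0-100 scale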

In the end, I have to say I failed to reproduce the results reported in the paper. I also think it would be better to chat about this elsewhere rather than in this closed issue :)

Hi @FrancescoPeriti and @jacklanda
Sorry for the delay with this issue.

As promised, we have looked into it.

  1. The evaluation script can aggregate the scores over unique target words or over unique senses. The results in Table 3 of the paper for WordNet and Oxford were obtained by aggregating over senses, but since then we have moved on to other resources which do not provide explicit sense ids, only target words, examples and definitions. Because of that, the evaluation script was modified to always use the Targets column as quasi-senses, which of course leads to artificially low results if there are many senses per word on average.
     The evaluation script is now updated so that it uses the Sense column if it is present and Targets otherwise.
  2. The evaluation script offers several strategies for dealing with situations where the gold data has multiple definitions for one and the same sense (the multiple_definitions_same_sense_id argument). The default option is mean, that is, take the average of the evaluation scores between a given generated definition and all possible gold definitions. But Table 3 in the paper uses the max option, that is, take the maximum evaluation score across all gold definitions for a given generated definition (see the sketch after this list).
  3. Finally, by default the evaluation script uses whitespace tokenization for ROUGE, because the default tokenizer of rouge_scorer is tuned for English and does not play well with other languages. In the paper we dealt with English, so we used the default tokenizer.
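
The sense aggregation from points 1 and 2 boils down to roughly the following (a sketch only, not the script's actual code; the column names follow the TSV format discussed above, and the similarity function is a stand-in for the real metrics):

# Sketch of the aggregation logic: one generated definition per sense, possibly
# several gold definitions; keep the mean or the max score per sense, then average.
# SequenceMatcher is only a stand-in for BLEU / ROUGE-L / BERTScore.
from difflib import SequenceMatcher

import pandas as pd

def similarity(generated, gold):
    return SequenceMatcher(None, generated, gold).ratio()

def aggregate(df: pd.DataFrame, strategy: str = "max") -> float:
    per_sense = []
    for _, group in df.groupby("Sense"):
        generated = group["Generated_Definition"].iloc[0]
        scores = [similarity(generated, gold) for gold in group["Definition"]]
        per_sense.append(max(scores) if strategy == "max" else sum(scores) / len(scores))
    return sum(per_sense) / len(per_sense)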

To sum up: if you want to reproduce Table 3 from the paper, run the evaluation script with the following parameters:
./evaluate_simple.py --data_path ${TEST} --output ${OUTPUT} --multiple_definitions_same_sense_id=max --whitespace 0
I've just re-evaluated our generated definitions from a year ago, and the results are:

sacrebleu       32.8088
rougeL  0.5221
bertscore       0.9122
exact_match     0.2328

Dear Andrey (et al.),
Your work has been quite inspiring :) I was just looking in that direction.
Thanks for your time and explanation.
Bests

"The sun will rise, and the truth will come to light."

@FrancescoPeriti Hey Francesco, have you already reproduced the results Andrey provided on WordNet / Oxford?

@jacklanda note that you probably won't be able to generate exactly the same definitions with up-to-date versions of HF Transformers, since generation methods are being changed and updated all the time.
But the scores should be in approximately the same range. Also, if you need the exact definitions on which the evaluation scores from the paper were computed, we can publish those.