Discrepancy in BLEU Score During No Modification Evaluation
lan666as opened this issue · 0 comments
lan666as commented
Hi, I'm trying to replicate the No Modification evaluation result as described in your paper.
I've installed sacrebleu==1.4.14 and adapted the evaluation code as follows:
```python
import os
import subprocess

import sacrebleu

# MOSES_DETOKENIZER is the path to the Moses detokenizer.perl script (defined elsewhere).

def eval_bleu_moses(ref_file: str, sys_file: str, evaluation_dir: str = "eval"):
    os.makedirs(evaluation_dir, exist_ok=True)
    # Detokenize references and system output with the Moses detokenizer.
    subprocess.run(f"cat {ref_file} | {MOSES_DETOKENIZER} -l en > {evaluation_dir}/ref.txt", shell=True)
    subprocess.run(f"cat {sys_file} | {MOSES_DETOKENIZER} -l en > {evaluation_dir}/sys.txt", shell=True)
    with open(f"{evaluation_dir}/ref.txt") as file:
        refs = [file.read().split('\n')]
    with open(f"{evaluation_dir}/sys.txt") as file:
        sys = file.read().split('\n')
    bleu = sacrebleu.corpus_bleu(sys, refs)
    return bleu
```
and then running:

```python
eval_bleu_moses(ref_file='data/labelled/test.for', sys_file='data/labelled/test.inf')
```
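One thing I noticed while debugging (not sure if it affects the paper's numbers): reading the detokenized files with `read().split('\n')` keeps a trailing empty string when a file ends with a newline, so an empty "sentence" is fed to `corpus_bleu`. A minimal illustration, using made-up file contents:

```python
# Hypothetical file contents; assumes the file ends with a trailing
# newline, as the Moses detokenizer's output typically does.
text = "hello world\ngood morning\n"

naive = text.split('\n')               # keeps a trailing empty string
clean = text.rstrip('\n').split('\n')  # drops it (splitlines() also works)

print(naive)  # ['hello world', 'good morning', '']
print(clean)  # ['hello world', 'good morning']
```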
However, I'm noticing a discrepancy in the BLEU score. While the paper reports a BLEU score of 35.32, my implementation produces a BLEU score of 32.43: `65.3/42.0/28.7/20.3 (BP = 0.912, ratio = 0.916, hyp_len = 5398, ref_len = 5894)`.
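As a sanity check, the reported BP is consistent with the standard BLEU brevity-penalty formula applied to the length statistics in that output, so the length counts at least look internally consistent:

```python
import math

# Length statistics copied from the sacrebleu output above.
hyp_len, ref_len = 5398, 5894

# Standard BLEU brevity penalty: exp(1 - ref/hyp) when the hypothesis
# corpus is shorter than the reference corpus, else 1.0.
bp = math.exp(1 - ref_len / hyp_len) if hyp_len < ref_len else 1.0

print(round(bp, 3))                 # 0.912, matching the reported BP
print(round(hyp_len / ref_len, 3))  # 0.916, matching the reported ratio
```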
Could you confirm whether there is a specific reason for this discrepancy, or whether there is something I might be missing? Any advice or guidance would be greatly appreciated.
Thank you.