microsoft/ProphetNet

Wrong Tokenization in SquadQG Evaluation Scripts

hzhwcmhf opened this issue · 0 comments

Thanks for the great work.

I am reproducing the result reported in GLGE but find that the SquadQG evaluation script seem to use wrong tokenization.

In /script/evaluate/qg/eval_on_unilm_qg.py, the generated text are post-processed by fix_tokenization:

def fix_tokenization(text):
input_tokens = text.split()
output_tokens = []
has_left_quote = False
has_left_single_quote = False
i = 0
prev_dash = False
while i < len(input_tokens):
tok = input_tokens[i]
flag_prev_dash = False
if tok in _tok_dict.keys():
output_tokens.append(_tok_dict[tok])
i += 1
elif tok == "\"":
if has_left_quote:
output_tokens.append("''")
else:
output_tokens.append("``")
has_left_quote = not has_left_quote
i += 1
elif tok == "'" and len(output_tokens) > 0 and output_tokens[-1].endswith("n") and i < len(input_tokens) - 1 and input_tokens[i + 1] == "t":
output_tokens[-1] = output_tokens[-1][:-1]
output_tokens.append("n't")
i += 2
elif tok == "'" and i < len(input_tokens) - 1 and input_tokens[i + 1] in ("s", "d", "ll"):
output_tokens.append("'"+input_tokens[i + 1])
i += 2
elif tok == "'":
if has_left_single_quote:
output_tokens.append("'")
else:
output_tokens.append("`")
has_left_single_quote = not has_left_single_quote
i += 1
elif tok == "." and i < len(input_tokens) - 2 and input_tokens[i + 1] == "." and input_tokens[i + 2] == ".":
output_tokens.append("...")
i += 3
elif tok == "," and len(output_tokens) > 0 and _is_digit(output_tokens[-1]) and i < len(input_tokens) - 1 and _is_digit(input_tokens[i + 1]):
# $ 3 , 000 -> $ 3,000
output_tokens[-1] += ','+input_tokens[i + 1]
i += 2
elif tok == "." and len(output_tokens) > 0 and output_tokens[-1].isdigit() and i < len(input_tokens) - 1 and input_tokens[i + 1].isdigit():
# 3 . 03 -> $ 3.03
output_tokens[-1] += '.'+input_tokens[i + 1]
i += 2
elif tok == "." and len(output_tokens) > 0 and len(output_tokens[-1]) == 1 and output_tokens[-1].isupper() and i < len(input_tokens) - 2 and len(input_tokens[i + 1]) == 1 and input_tokens[i + 1].isupper() and input_tokens[i + 2] == '.':
# U . N . -> U.N.
k = i+3
while k+2 < len(input_tokens):
if len(input_tokens[k + 1]) == 1 and input_tokens[k + 1].isupper() and input_tokens[k + 2] == '.':
k += 2
else:
break
output_tokens[-1] += ''.join(input_tokens[i:k])
i += 2
elif tok == "-":
if i < len(input_tokens) - 1 and input_tokens[i + 1] == "-":
output_tokens.append("--")
i += 2
elif i == len(input_tokens) - 1 or i == 0:
output_tokens.append("-")
i += 1
elif output_tokens[-1] not in string.punctuation and input_tokens[i + 1][0] not in string.punctuation:
output_tokens[-1] += "-"
i += 1
flag_prev_dash = True
else:
output_tokens.append("-")
i += 1
elif prev_dash and len(output_tokens) > 0 and tok[0] not in string.punctuation:
output_tokens[-1] += tok
i += 1
else:
output_tokens.append(tok)
i += 1
prev_dash = flag_prev_dash
return " ".join(output_tokens)

For example, it turns . . . to ..., " to '', 1 , 000 to 1,000.

However, the original data do not like the sentence after fix_tokenization. Here are some samples from the test set:

What did Harff define as " short - lived outbursts by mobs . . . ? "
Who sang " Girls Love Beyoncé " in 2013 ?
What city in Montana has over 100 , 000 people ?

Moreover, I reproduce MASS-base and find the results are higher if we disable fix_tokenization:

BLEU METEOR ROUGE-L
MASS-base reported in GLGE 20.1 24.4 49.4
MASS-base reproduce with fix_tokenization 20.69 24.92 49.21
MASS-base reproduce without fix_tokenization 22.54 25.03 50.27

I wonder whether I miss somthing or the reported results use a wrong tokenization?
I also hope that, if possible, the model outputs can be released to support fair and detailed comparisons.

Looking forward to your reply