Thanks for the great work.
I am reproducing the results reported in GLGE, but I find that the SquadQG evaluation script seems to use incorrect tokenization.
In `script/evaluate/qg/eval_on_unilm_qg.py`, the generated text is post-processed by `fix_tokenization`:
```python
import string

# _tok_dict and _is_digit are helpers defined earlier in eval_on_unilm_qg.py.
def fix_tokenization(text):
    input_tokens = text.split()
    output_tokens = []
    has_left_quote = False
    has_left_single_quote = False

    i = 0
    prev_dash = False
    while i < len(input_tokens):
        tok = input_tokens[i]
        flag_prev_dash = False
        if tok in _tok_dict.keys():
            output_tokens.append(_tok_dict[tok])
            i += 1
        elif tok == "\"":
            if has_left_quote:
                output_tokens.append("''")
            else:
                output_tokens.append("``")
            has_left_quote = not has_left_quote
            i += 1
        elif tok == "'" and len(output_tokens) > 0 and output_tokens[-1].endswith("n") and i < len(input_tokens) - 1 and input_tokens[i + 1] == "t":
            output_tokens[-1] = output_tokens[-1][:-1]
            output_tokens.append("n't")
            i += 2
        elif tok == "'" and i < len(input_tokens) - 1 and input_tokens[i + 1] in ("s", "d", "ll"):
            output_tokens.append("'" + input_tokens[i + 1])
            i += 2
        elif tok == "'":
            if has_left_single_quote:
                output_tokens.append("'")
            else:
                output_tokens.append("`")
            has_left_single_quote = not has_left_single_quote
            i += 1
        elif tok == "." and i < len(input_tokens) - 2 and input_tokens[i + 1] == "." and input_tokens[i + 2] == ".":
            output_tokens.append("...")
            i += 3
        elif tok == "," and len(output_tokens) > 0 and _is_digit(output_tokens[-1]) and i < len(input_tokens) - 1 and _is_digit(input_tokens[i + 1]):
            # $ 3 , 000 -> $ 3,000
            output_tokens[-1] += ',' + input_tokens[i + 1]
            i += 2
        elif tok == "." and len(output_tokens) > 0 and output_tokens[-1].isdigit() and i < len(input_tokens) - 1 and input_tokens[i + 1].isdigit():
            # 3 . 03 -> 3.03
            output_tokens[-1] += '.' + input_tokens[i + 1]
            i += 2
        elif tok == "." and len(output_tokens) > 0 and len(output_tokens[-1]) == 1 and output_tokens[-1].isupper() and i < len(input_tokens) - 2 and len(input_tokens[i + 1]) == 1 and input_tokens[i + 1].isupper() and input_tokens[i + 2] == '.':
            # U . N . -> U.N.
            k = i + 3
            while k + 2 < len(input_tokens):
                if len(input_tokens[k + 1]) == 1 and input_tokens[k + 1].isupper() and input_tokens[k + 2] == '.':
                    k += 2
                else:
                    break
            output_tokens[-1] += ''.join(input_tokens[i:k])
            i = k
        elif tok == "-":
            if i < len(input_tokens) - 1 and input_tokens[i + 1] == "-":
                output_tokens.append("--")
                i += 2
            elif i == len(input_tokens) - 1 or i == 0:
                output_tokens.append("-")
                i += 1
            elif output_tokens[-1] not in string.punctuation and input_tokens[i + 1][0] not in string.punctuation:
                output_tokens[-1] += "-"
                i += 1
                flag_prev_dash = True
            else:
                output_tokens.append("-")
                i += 1
        elif prev_dash and len(output_tokens) > 0 and tok[0] not in string.punctuation:
            output_tokens[-1] += tok
            i += 1
        else:
            output_tokens.append(tok)
            i += 1
        prev_dash = flag_prev_dash
    return " ".join(output_tokens)
```
For example, it turns `. . .` into `...`, `"` into `''`, and `1 , 000` into `1,000`.
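As a quick sanity check, this rewriting can be seen by calling the function directly (illustration only; it assumes `fix_tokenization` and its helpers can be imported from the eval script, or copied into the current file):

```python
# Illustration only: assumes fix_tokenization (and its helpers) can be imported
# from script/evaluate/qg/eval_on_unilm_qg.py, or copied into the current file.
from eval_on_unilm_qg import fix_tokenization

print(fix_tokenization('he said " wait . . . "'))   # -> he said `` wait ... ''
print(fix_tokenization('$ 3 , 000'))                # -> $ 3,000
```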
However, the references in the original data are not tokenized like the output of `fix_tokenization`. Here are some samples from the test set:
```
What did Harff define as " short - lived outbursts by mobs . . . ? "
Who sang " Girls Love Beyoncé " in 2013 ?
What city in Montana has over 100 , 000 people ?
```
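Since the hypotheses are rewritten while these references keep the original tokenization, even a prediction identical to the reference no longer matches it exactly after post-processing. A minimal sketch, reusing the third reference above:

```python
# Assumes fix_tokenization is available as in the previous snippet.
ref = "What city in Montana has over 100 , 000 people ?"
hyp = "What city in Montana has over 100 , 000 people ?"  # model output identical to the reference

print(fix_tokenization(hyp))
# -> What city in Montana has over 100,000 people ?
# "100 , 000" becomes "100,000", so the n-gram overlap with the untouched
# reference is no longer perfect and BLEU / METEOR / ROUGE-L are penalized
# even for a correct prediction.
```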
Moreover, I reproduced MASS-base and found that the results are higher if `fix_tokenization` is disabled:
|  | BLEU | METEOR | ROUGE-L |
| --- | --- | --- | --- |
| MASS-base reported in GLGE | 20.1 | 24.4 | 49.4 |
| MASS-base reproduced with fix_tokenization | 20.69 | 24.92 | 49.21 |
| MASS-base reproduced without fix_tokenization | 22.54 | 25.03 | 50.27 |
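To be precise, "without fix_tokenization" means skipping the post-processing call on the generated lines before scoring, roughly as below (illustrative only; the switch is made up and not an actual option of the script):

```python
# Illustrative sketch only: APPLY_FIX is a made-up switch, not an option of
# eval_on_unilm_qg.py; everything else in the evaluation stays unchanged.
APPLY_FIX = False

def post_process(line):
    line = line.strip()
    return fix_tokenization(line) if APPLY_FIX else line
```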
I wonder whether I am missing something, or whether the reported results were computed with the wrong tokenization?
I also hope that, if possible, the model outputs can be released to support fair and detailed comparisons.
Looking forward to your reply!