Wrong Tokenization in SquadQG Evaluation Scripts

Thanks for the great work.

I am reproducing the results reported in GLGE, but I find that the SquadQG evaluation script seems to use the wrong tokenization.

In /script/evaluate/qg/eval_on_unilm_qg.py, the generated text is post-processed by fix_tokenization:

ProphetNet/GLGE_baselines/script/script/evaluate/qg/eval_on_unilm_qg.py

Lines 40 to 117 in 0a1b59c

    
    def fix_tokenization(text):
        input_tokens = text.split()
        output_tokens = []
        has_left_quote = False
        has_left_single_quote = False
        i = 0
        prev_dash = False
        while i < len(input_tokens):
            tok = input_tokens[i]
            flag_prev_dash = False
            if tok in _tok_dict.keys():
                output_tokens.append(_tok_dict[tok])
                i += 1
            elif tok == "\"":
                if has_left_quote:
                    output_tokens.append("''")
                else:
                    output_tokens.append("``")
                has_left_quote = not has_left_quote
                i += 1
            elif tok == "'" and len(output_tokens) > 0 and output_tokens[-1].endswith("n") and i < len(input_tokens) - 1 and input_tokens[i + 1] == "t":
                output_tokens[-1] = output_tokens[-1][:-1]
                output_tokens.append("n't")
                i += 2
            elif tok == "'" and i < len(input_tokens) - 1 and input_tokens[i + 1] in ("s", "d", "ll"):
                output_tokens.append("'"+input_tokens[i + 1])
                i += 2
            elif tok == "'":
                if has_left_single_quote:
                    output_tokens.append("'")
                else:
                    output_tokens.append("`")
                has_left_single_quote = not has_left_single_quote
                i += 1
            elif tok == "." and i < len(input_tokens) - 2 and input_tokens[i + 1] == "." and input_tokens[i + 2] == ".":
                output_tokens.append("...")
                i += 3
            elif tok == "," and len(output_tokens) > 0 and _is_digit(output_tokens[-1]) and i < len(input_tokens) - 1 and _is_digit(input_tokens[i + 1]):
                # $ 3 , 000 -> $ 3,000
                output_tokens[-1] += ','+input_tokens[i + 1]
                i += 2
            elif tok == "." and len(output_tokens) > 0 and output_tokens[-1].isdigit() and i < len(input_tokens) - 1 and input_tokens[i + 1].isdigit():
                # 3 . 03 -> $ 3.03
                output_tokens[-1] += '.'+input_tokens[i + 1]
                i += 2
            elif tok == "." and len(output_tokens) > 0 and len(output_tokens[-1]) == 1 and output_tokens[-1].isupper() and i < len(input_tokens) - 2 and len(input_tokens[i + 1]) == 1 and input_tokens[i + 1].isupper() and input_tokens[i + 2] == '.':
                # U . N . -> U.N.
                k = i+3
                while k+2 < len(input_tokens):
                    if len(input_tokens[k + 1]) == 1 and input_tokens[k + 1].isupper() and input_tokens[k + 2] == '.':
                        k += 2
                    else:
                        break
                output_tokens[-1] += ''.join(input_tokens[i:k])
                i += 2
            elif tok == "-":
                if i < len(input_tokens) - 1 and input_tokens[i + 1] == "-":
                    output_tokens.append("--")
                    i += 2
                elif i == len(input_tokens) - 1 or i == 0:
                    output_tokens.append("-")
                    i += 1
                elif output_tokens[-1] not in string.punctuation and input_tokens[i + 1][0] not in string.punctuation:
                    output_tokens[-1] += "-"
                    i += 1
                    flag_prev_dash = True
                else:
                    output_tokens.append("-")
                    i += 1
            elif prev_dash and len(output_tokens) > 0 and tok[0] not in string.punctuation:
                output_tokens[-1] += tok
                i += 1
            else:
                output_tokens.append(tok)
                i += 1
            prev_dash = flag_prev_dash
        return " ".join(output_tokens)

For example, it turns . . . into ..., " into `` or '', and 1 , 000 into 1,000.

However, the original data do not look like the sentences produced by fix_tokenization. Here are some samples from the test set:

What did Harff define as " short - lived outbursts by mobs . . . ? "
Who sang " Girls Love Beyoncé " in 2013 ?
What city in Montana has over 100 , 000 people ?
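
To make the mismatch concrete, here is a minimal sketch that applies three of the rewrites (double quotes, ellipsis, digit grouping) to the samples above. The helper name fix_subset is hypothetical, it uses str.isdigit in place of the script's _is_digit helper (an assumption), and it omits the dash, apostrophe, and _tok_dict rules, so it is not a substitute for the real function.

```python
def fix_subset(text):
    """Simplified stand-in for three of the rewrites in fix_tokenization."""
    tokens = text.split()
    out = []
    has_left_quote = False
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == '"':
            # Straight double quote -> `` (opening) or '' (closing)
            out.append("''" if has_left_quote else "``")
            has_left_quote = not has_left_quote
            i += 1
        elif tok == "." and tokens[i + 1:i + 3] == [".", "."]:
            # . . . -> ...
            out.append("...")
            i += 3
        elif (tok == "," and out and out[-1].isdigit()
              and i + 1 < len(tokens) and tokens[i + 1].isdigit()):
            # 100 , 000 -> 100,000 (str.isdigit is an assumption here)
            out[-1] += "," + tokens[i + 1]
            i += 2
        else:
            out.append(tok)
            i += 1
    return " ".join(out)


samples = [
    'What did Harff define as " short - lived outbursts by mobs . . . ? "',
    'Who sang " Girls Love Beyoncé " in 2013 ?',
    'What city in Montana has over 100 , 000 people ?',
]

for s in samples:
    print(fix_subset(s))
```

Every sample comes out different from its original form, so a model output that matched the reference exactly would no longer match it after this post-processing.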

Moreover, I reproduced MASS-base and found that the results are higher if fix_tokenization is disabled:

Setting                                          BLEU     METEOR   ROUGE-L
MASS-base reported in GLGE                       20.1     24.4     49.4
MASS-base reproduced with fix_tokenization       20.69    24.92    49.21
MASS-base reproduced without fix_tokenization    22.54    25.03    50.27
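
A toy illustration of why the tokenization could move overlap scores, assuming clipped unigram precision over whitespace tokens as a rough stand-in for the n-gram metrics. This is not the GLGE scorer, the function name is hypothetical, and the "model output" is assumed to be a perfect copy of the reference.

```python
from collections import Counter


def unigram_precision(hypothesis, reference):
    """Clipped unigram precision over whitespace-separated tokens."""
    hyp = Counter(hypothesis.split())
    ref = Counter(reference.split())
    # Each hypothesis token counts at most as often as it appears in the reference.
    matched = sum(min(count, ref[tok]) for tok, count in hyp.items())
    return matched / max(sum(hyp.values()), 1)


# Reference taken from the test-set samples quoted in this issue.
reference = "What city in Montana has over 100 , 000 people ?"
# Assumed perfect output, before and after the digit-grouping rewrite.
raw_output = "What city in Montana has over 100 , 000 people ?"
fixed_output = "What city in Montana has over 100,000 people ?"

print(unigram_precision(raw_output, reference))
print(unigram_precision(fixed_output, reference))
```

The merged token 100,000 has no counterpart in the reference, so only 8 of the 9 tokens in the rewritten output match, while the untouched output matches fully.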

Am I missing something, or were the reported results computed with the wrong tokenization?
I also hope that, if possible, the model outputs can be released to support fair and detailed comparisons.

Looking forward to your reply.
