preprocess_sharc.py: func filter_answer() and get_bullets() may not work as expected

When I run preprocess_sharc.py, I find that filter_answer never filters out any tokens:

Lines 48 to 49 in f9b5d1e

    
    def filter_answer(answer):
        return detokenize([a for a in answer if a['orig'] not in MATCH_IGNORE])

Also, get_bullets always returns an empty list:

e3/preprocess_sharc.py

Lines 107 to 118 in f9b5d1e

    
    def get_bullets(context):
        indices = [i for i, c in enumerate(context) if c == '*']
        pairs = list(zip(indices, indices[1:] + [len(context)]))
        cleaned = []
        for s, e in pairs:
            while not context[e-1].strip():
                e -= 1
            while not context[s].strip() or context[s] == '*':
                s += 1
            if e - s > 2 and e - 2 < 45:
                cleaned.append((s, e-1))
        return cleaned

I suspect the cause is that revtok.tokenize returns tokens with added whitespace (for example, a space-padded form of Hello,), so token['orig'] never matches the entries in MATCH_IGNORE, and the context tokens never match '*'.
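To illustrate the suspected mismatch, here is a minimal sketch. The MATCH_IGNORE contents and the token dicts are hypothetical stand-ins (the real ones live in preprocess_sharc.py), and the whitespace padding is an assumption about what the tokenizer produces, not verified revtok output. The stripped, lowercased comparison is one possible workaround, not necessarily the fix the repository uses.

```python
# Hypothetical stand-in for the set defined in preprocess_sharc.py.
MATCH_IGNORE = {'do', 'have', '?'}

# Assumed token shape: 'orig' keeps whitespace added by the tokenizer.
answer = [
    {'orig': ' Do '}, {'orig': ' you '}, {'orig': ' have '},
    {'orig': ' a '}, {'orig': ' job '}, {'orig': '? '},
]


def kept_exact(tokens):
    """Mirror the reported check: compare the raw 'orig' string."""
    return [t['orig'] for t in tokens if t['orig'] not in MATCH_IGNORE]


def kept_normalized(tokens):
    """Possible workaround: strip whitespace and lowercase before comparing."""
    return [t['orig'] for t in tokens
            if t['orig'].strip().lower() not in MATCH_IGNORE]


print(len(kept_exact(answer)))   # every token survives the exact comparison
print(kept_normalized(answer))   # only the non-ignored tokens survive
```

With padded tokens, the exact comparison keeps all six tokens, which matches the observation that nothing is ever filtered out.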

Yes, I think there is a bug here. The matching is supposed to be done as follows:

-    indices = [i for i, c in enumerate(context) if c == '*']
+    indices = [i for i, c in enumerate(context) if c['sub'] == '*']
     pairs = list(zip(indices, indices[1:] + [len(context)]))
     cleaned = []
     for s, e in pairs:
-        while not context[e-1].strip():
+        while not context[e-1]['sub'].strip():
             e -= 1
-        while not context[s].strip() or context[s] == '*':
+        while not context[s]['sub'].strip() or context[s]['sub'] == '*':
             s += 1
         if e - s > 2 and e - 2 < 45:
             cleaned.append((s, e-1))
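As a quick sanity check of the patch above, the following sketch applies it to a hand-built context. The token dicts are hypothetical: it is assumed that each token is a dict whose 'sub' key holds the whitespace-free sub-token text, as the patched lines imply. The length condition is copied unchanged from the original code.

```python
def get_bullets(context):
    """Return (start, end) token index pairs for each '*' bullet span."""
    # Compare the 'sub' field, not the token dict itself.
    indices = [i for i, c in enumerate(context) if c['sub'] == '*']
    pairs = list(zip(indices, indices[1:] + [len(context)]))
    cleaned = []
    for s, e in pairs:
        # Trim trailing whitespace-only tokens.
        while not context[e-1]['sub'].strip():
            e -= 1
        # Skip leading whitespace-only tokens and the bullet marker.
        while not context[s]['sub'].strip() or context[s]['sub'] == '*':
            s += 1
        if e - s > 2 and e - 2 < 45:
            cleaned.append((s, e-1))
    return cleaned


# Hypothetical tokenized rule text with two bullets.
words = ['you', 'qualify', 'if', ':',
         '*', 'you', 'are', 'over', '18',
         '*', 'you', 'live', 'in', 'the', 'uk']
context = [{'sub': w} for w in words]

# The unpatched check compares a dict to a string, so it never matches.
old_indices = [i for i, c in enumerate(context) if c == '*']

print(old_indices)           # []
print(get_bullets(context))  # [(5, 8), (10, 14)]
```

Because a dict never equals the string '*', the unpatched version finds no bullet markers at all, which explains the always-empty result.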

I'm rerunning the training scripts to see if this changes the results.

Can you add your preprocessed files to the Docker environment? I will see if I can replicate your results with your processed files.

OK, after I retrained the model for the retrieval:

Best dev
{'dev_bleu_1': 0.5175,
 'dev_bleu_2': 0.4804,
 'dev_bleu_3': 0.4532,
 'dev_bleu_4': 0.4288,
 'dev_combined': 0.31851264,
 'dev_macro_accuracy': 0.7428,
 'dev_micro_accuracy': 0.6881,
 'dev_span_f1': 0.7758598345984843,
 'epoch': 3,
 'train_bleu_1': 0.5795,
 'train_bleu_2': 0.548,
 'train_bleu_3': 0.5282,
 'train_bleu_4': 0.5141,
 'train_combined': 0.44541623999999996,
 'train_loss_clf': 0.40613971317808967,
 'train_loss_retrieve': 0.2112928715934799,
 'train_loss_span_end': 0.011150807250163897,
 'train_loss_span_start': 0.017885973322182485,
 'train_macro_accuracy': 0.8664,
 'train_micro_accuracy': 0.834,
 'train_span_f1': 1.0}

For the editor

Best dev
{'dev_f1': 0.7611304248405354,
 'epoch': 11,
 'train_f1': 0.8762811226505095,
 'train_loss_after': 0.4524998854104321,
 'train_loss_before': 0.640820378238715}

And for the joint inference

{'bleu_1': 0.6645,
 'bleu_2': 0.6045,
 'bleu_3': 0.566,
 'bleu_4': 0.5395,
 'combined': 0.3957772,
 'macro_accuracy': 0.7336,
 'micro_accuracy': 0.6802}

fixed in 0c6b771

> Can you add your preprocessed files to the Docker environment? I will see if I can replicate your results with your processed files.

Not sure if this helps but here's the content of my preprocessed sharc folder: https://drive.google.com/file/d/1hdmcEE5JRRvzB2qjFF1iEqL536XSowPt/view?usp=sharing

Thanks, now I can replicate your results with your trained model and processed files on a cuda8 environment.

{'bleu_1': 0.667,
 'bleu_2': 0.6041,
 'bleu_3': 0.5635,
 'bleu_4': 0.5362,
 'combined': 0.39335632000000004,
 'macro_accuracy': 0.7336,
 'micro_accuracy': 0.6802}

I am going to figure out where the problem is.
