
Factbank's align method fails on whitespace tokens

rudinger opened this issue · 3 comments

Fails with the following exception:

Traceback (most recent call last):
  File "./", line 659, in <module>
    os.path.join(inp_, "tokens_tml.txt"))
  File "./", line 136, in __init__
    self.conll_txt = self.convert(tokens_tml)
  File "./", line 215, in convert
    dep_feats = self.get_dep_feats(toks, cur_sent)
  File "./", line 236, in get_dep_feats
    self.align(toks, sent)
  File "./", line 297, in align
Exception: Unknown case: (  891102, '  ', ' ')

This is my temporary hack/fix, and I'm not sure if this actually results in the right behavior.

    def align(self, toks, sent):
        """
        Match the spaCy tokens in toks to the words in sent.
        Might merge spaCy tokens in place.
        """
        toks_ind = 0
        sent_ind = 0
        ret = []
        while sent_ind < len(sent):
 #           logging.debug("sent_ind = {}, toks_ind = {}".format(sent_ind, toks_ind))
            cur_tok = str(toks[toks_ind])
            cur_word = sent[sent_ind][1]
 #           logging.debug("{} vs. {}".format(cur_tok, cur_word))
 #           logging.debug("flag = {}".format(cur_word.endswith(cur_tok)))
            print "toks: ", toks #RR
            print "cur_tok: ", cur_tok #RR
            print "cur_word: ", cur_word #RR
            ### hacky bug fix next 3 lines (other github issue) ###
            if cur_tok == "." and cur_word == ". . .":
                toks[toks_ind : toks_ind + 3].merge()
            ### another hacky bug fix next 4 lines (this github issue) ###
            if cur_tok.isspace() and cur_word.isspace():
                toks_ind += 1
                sent_ind += 1
            if (cur_tok == cur_word) or \
               (cur_word.endswith(cur_tok) and \
                (toks_ind >= (len(toks) -1) or ((cur_tok + str(toks[toks_ind + 1])) not in cur_word))):
                # rest of method...
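
One thing to check in the hack above: after the whitespace branch advances both indices, `cur_tok` and `cur_word` still hold the old values when the next `if` runs. Below is a minimal, standalone sketch of the same idea with a `continue`, so the loop re-reads both values before comparing. It is an illustration under stated assumptions, not the project's method: `align_skipping_whitespace` is a hypothetical name, and `toks` and `words` are plain lists of strings standing in for the spaCy tokens and the `sent[sent_ind][1]` words. Merging and suffix matching from the real method are left out.

```python
def align_skipping_whitespace(toks, words):
    """Pair token indices with word indices, skipping whitespace-only pairs.

    Hypothetical simplification: toks and words are lists of str.
    """
    pairs = []
    tok_i = 0
    word_i = 0
    while tok_i < len(toks) and word_i < len(words):
        tok = toks[tok_i]
        word = words[word_i]
        if tok.isspace() and word.isspace():
            # Advance both sides and restart the loop, so the comparisons
            # below never see the stale tok/word values.
            tok_i += 1
            word_i += 1
            continue
        if tok == word:
            pairs.append((tok_i, word_i))
            tok_i += 1
            word_i += 1
            continue
        # Anything else is outside the scope of this sketch.
        raise ValueError("Unknown case: {!r} vs. {!r}".format(tok, word))
    return pairs
```

With the values from the exception above, `align_skipping_whitespace(["  ", "891102"], [" ", "891102"])` skips the whitespace pair and aligns only the number.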

See my response to #21. Can you please attach some example sentences?

It's this "sentence" (from sentences.txt):

'wsj_0006.tml'|||1|||'  891102'

I think the issue is with the two spaces before the number, which wind up in the tokenization.
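
To make that concrete, here is a small sketch that parses the line above and reports the leading whitespace in the text field. The field layout (`'file'|||sentence id|||'text'`) is inferred from this single example only, and `parse_sentence_line` and `leading_whitespace` are hypothetical helper names, not functions from the repository.

```python
def parse_sentence_line(line):
    """Split a 'file'|||sent_id|||'text' line into its three fields.

    Assumes the layout seen in the single example line from sentences.txt.
    """
    fname, sent_id, text = line.rstrip("\n").split("|||", 2)
    # Drop only the surrounding single quotes, so whitespace inside the
    # quoted text is preserved exactly.
    return fname[1:-1], int(sent_id), text[1:-1]


def leading_whitespace(text):
    """Return the run of whitespace at the start of text (may be empty)."""
    return text[: len(text) - len(text.lstrip())]


fname, sent_id, text = parse_sentence_line("'wsj_0006.tml'|||1|||'  891102'")
prefix = leading_whitespace(text)  # the two spaces before the number
```

A check like this could be used to list every sentence whose text begins with whitespace before running the conversion.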

It seems that this error occurs when using a slightly different version of FactBank that replaces some Uu labels with NA, a label that seems to be semantically identical.
Reverting to the Uu labels solves the problem.