Factbank's align method fails on whitespace tokens
rudinger opened this issue · 3 comments
rudinger commented
Fails with the following exception:
Traceback (most recent call last):
  File "./readers.py", line 659, in <module>
    os.path.join(inp_, "tokens_tml.txt"))
  File "./readers.py", line 136, in __init__
    self.conll_txt = self.convert(tokens_tml)
  File "./readers.py", line 215, in convert
    dep_feats = self.get_dep_feats(toks, cur_sent)
  File "./readers.py", line 236, in get_dep_feats
    self.align(toks, sent)
  File "./readers.py", line 297, in align
    cur_word)))
Exception: Unknown case: ( 891102, ' ', ' ')
This is my temporary hack/fix, and I'm not sure if this actually results in the right behavior.
def align(self, toks, sent):
    """
    Match between the spacy tokens in toks to the words in sent
    Might merge tokens in spacy in-place.
    """
    toks_ind = 0
    sent_ind = 0
    ret = []
    while sent_ind < len(sent):
        # logging.debug("sent_ind = {}, toks_ind = {}".format(sent_ind, toks_ind))
        cur_tok = str(toks[toks_ind])
        cur_word = sent[sent_ind][1]
        # logging.debug("{} vs. {}".format(cur_tok, cur_word))
        # logging.debug("flag = {}".format(cur_word.endswith(cur_tok)))
        print "toks: ", toks #RR
        print "cur_tok: ", cur_tok #RR
        print "cur_word: ", cur_word #RR
        ### hacky bug fix next 3 lines (other github issue) ###
        if cur_tok == "." and cur_word == ". . .":
            toks[toks_ind : toks_ind + 3].merge()
            continue
        ### another hacky bug fix next 4 lines (this github issue) ###
        if cur_tok.isspace() and cur_word.isspace():
            toks_ind += 1
            sent_ind += 1
            continue
        if (cur_tok == cur_word) or \
           (cur_word.endswith(cur_tok) and \
            (toks_ind >= (len(toks) - 1) or ((cur_tok + str(toks[toks_ind + 1])) not in cur_word))):
            # rest of method...
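To make the effect of the whitespace guard easier to check in isolation, here is a minimal standalone sketch of the same idea. It is not the code from readers.py: `align_skip_whitespace` is a hypothetical helper, and it assumes both inputs are plain lists of strings rather than spaCy tokens and FactBank rows. Note that, like the hack above, it records nothing for the whitespace pair, which may or may not be what the rest of the method expects.

```python
def align_skip_whitespace(toks, words):
    """Pair tokens with words by position, skipping whitespace-only pairs.

    Hypothetical simplification: toks and words are lists of str.
    Returns a list of (token_index, word_index) pairs.
    """
    pairs = []
    tok_i = 0
    word_i = 0
    while word_i < len(words):
        if tok_i >= len(toks):
            raise ValueError("ran out of tokens at word index {}".format(word_i))
        tok = toks[tok_i]
        word = words[word_i]
        if tok.isspace() and word.isspace():
            # Same guard as the hack: both sides are whitespace only,
            # so advance both indices without recording an alignment.
            tok_i += 1
            word_i += 1
            continue
        if tok == word:
            pairs.append((tok_i, word_i))
            tok_i += 1
            word_i += 1
            continue
        # Anything else is outside this sketch (no merging, no suffix matching).
        raise ValueError("Unknown case: ({!r}, {!r})".format(tok, word))
    return pairs


if __name__ == "__main__":
    print(align_skip_whitespace([" ", "891102"], [" ", "891102"]))
```

Without the `isspace()` branch the first pair would still compare equal here, so the sketch only shows where the skipped indices go; it does not reproduce the original failure.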
gabrielStanovsky commented
See my response to #21. Can you please attach some example sentences?
rudinger commented
It's this "sentence" (from sentences.txt):
'wsj_0006.tml'|||1|||' 891102'
I think the issue is with the two spaces before the number, which wind up in the tokenization.
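A rough way to see how leading spaces can surface as their own token, assuming the line format is three `|||`-separated fields with the text in single quotes. Both helpers are hypothetical, the two leading spaces are written out as described above, and the regex split is only a stand-in for a tokenizer that preserves whitespace; it is not spaCy's actual behavior.

```python
import re


def parse_sentence_line(line):
    """Split a line of the form 'file'|||sent_id|||'text' into its fields."""
    fname, sent_id, text = line.split("|||", 2)
    # Drop the surrounding single quotes but keep inner whitespace intact.
    return fname.strip("'"), int(sent_id), text[1:-1]


def whitespace_preserving_tokens(text):
    """Stand-in tokenizer: runs of whitespace and runs of non-whitespace."""
    return re.findall(r"\s+|\S+", text)


line = "'wsj_0006.tml'|||1|||'  891102'"
fname, sent_id, text = parse_sentence_line(line)
tokens = whitespace_preserving_tokens(text)

# Report which tokens consist of whitespace only.
for tok in tokens:
    print(repr(tok), tok.isspace())
```

If the real tokenizer behaves similarly, the first token is whitespace only, which would match the `' '` values in the exception message.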
gabrielStanovsky commented
It seems that this error occurs when using a slightly different version of FactBank, which replaces some Uu labels with NA, which seems to be semantically identical. Reverting to Uu labels solves the problem.
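For anyone who wants to script the revert, a minimal sketch follows. It makes several assumptions: the label sits in a known column of a `|||`-separated row, values are wrapped in single quotes, and every NA in that column was originally Uu (the thread only says "some" labels were replaced, so check this first). `revert_na_to_uu` and the example row are made up for illustration.

```python
def revert_na_to_uu(line, label_col, sep="|||"):
    """Replace a quoted NA label with Uu in one column of a separated row.

    Hypothetical helper: label_col is the zero-based index of the label
    field, which has to be looked up for the file being edited.
    """
    fields = line.rstrip("\n").split(sep)
    if fields[label_col] == "'NA'":
        fields[label_col] = "'Uu'"
    return sep.join(fields)


if __name__ == "__main__":
    # Made-up row, not taken from FactBank.
    print(revert_na_to_uu("'wsj_0006.tml'|||1|||'e1'|||'NA'", label_col=3))
```

Only exact matches in the chosen column are touched, so an NA that appears inside another field is left alone.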