Unknown token [UNK] interferes with stopword handling in word augmenter
vera-bernhard opened this issue
vera-bernhard commented
I think the [UNK] token, which the model uses for tokens it does not know, interferes with its other use: temporarily standing in for the provided stopwords. In example 1, the token 汉 is entirely unknown to the model and is then mistakenly replaced by the stopword test. If there are too many instances of the unknown token (see example 2), this results in an IndexError because there are no stopwords left to substitute back. Is there a workaround?
example 1:
Code:
```python
import nlpaug.augmenter.word as naw

model = 'bert-base-cased'
aug_p = 0.2
text = 'This is a test 汉'
stopwords = ['test']
aug = naw.ContextualWordEmbsAug(
    model_path=model, action="substitute", aug_p=aug_p, stopwords=stopwords)
print(aug.augment(text))
```
Output:
This is simple test test.
Expected behavior: don't replace 汉 at all.
example 2:
Code:
```python
import nlpaug.augmenter.word as naw

model = 'bert-base-cased'
aug_p = 0.2
text = 'This is a test 汉汉'
stopwords = ['test']
aug = naw.ContextualWordEmbsAug(
    model_path=model, action="substitute", aug_p=aug_p, stopwords=stopwords)
print(aug.augment(text))
```
Output:
```
File "lib/python3.8/site-packages/nlpaug/augmenter/word/context_word_embs.py", line 542, in substitute_back_reserved_stopwords
    doc.update_change_log(token_i, token=reserved_stopword_tokens[reserved_pos],
IndexError: list index out of range
```
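For what it's worth, the two failure modes above can be sketched with a much-simplified, hypothetical version of the placeholder logic (the function names and implementation here are mine for illustration, not nlpaug's actual code): once a stopword has been masked with [UNK], a genuine [UNK] emitted by the tokenizer is indistinguishable from the placeholder, so the substitute-back step either restores a stopword in the wrong place or runs out of reserved stopwords.

```python
# Hypothetical simplification of placeholder-based stopword protection;
# not nlpaug's actual implementation, just an illustration of the ambiguity.

UNK = "[UNK]"

def mask_stopwords(tokens, stopwords):
    """Swap each stopword for the UNK placeholder, remembering the originals."""
    masked, reserved = [], []
    for tok in tokens:
        if tok in stopwords:
            reserved.append(tok)
            masked.append(UNK)
        else:
            masked.append(tok)
    return masked, reserved

def unmask_stopwords(tokens, reserved):
    """Put the reserved stopwords back wherever UNK appears.

    A genuine [UNK] produced by the tokenizer (e.g. for 汉) looks exactly
    like a placeholder, so it either steals a reserved stopword (wrong
    substitution, as in example 1) or, once the UNK tokens outnumber the
    reserved list, triggers an IndexError (as in example 2).
    """
    out, i = [], 0
    for tok in tokens:
        if tok == UNK:
            out.append(reserved[i])  # IndexError once i >= len(reserved)
            i += 1
        else:
            out.append(tok)
    return out

# The tokenizer has already mapped 汉 to [UNK]:
tokens = ["This", "is", "a", "test", UNK]
masked, reserved = mask_stopwords(tokens, {"test"})
print(masked)  # two identical [UNK] tokens: one placeholder, one genuine
try:
    unmask_stopwords(masked, reserved)
except IndexError:
    print("IndexError: more [UNK] tokens than reserved stopwords")
```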