makcedward/nlpaug

Unknown Token interfering with stopwords with word augmenter

vera-bernhard opened this issue · 0 comments

I think the [UNK] token, which the model uses for tokens it does not know, interferes with the augmenter's use of that same token to temporarily mask the provided stopwords. In example 1, one token is entirely unknown to the model and is then mistakenly replaced by the stopword test. If there are too many unknown tokens (see example 2), it results in an IndexError because there aren't enough stopwords left to substitute back. Is there a workaround?
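For what it's worth, the collision can be checked directly with the model's tokenizer; a minimal check, assuming the standard transformers AutoTokenizer API:

from transformers import AutoTokenizer

# Assumption: same model as in the examples below.
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

# If the character is out of vocabulary (as the output below suggests),
# it is tokenized to the very same [UNK] token the augmenter uses
# internally to mask stopwords.
print(tokenizer.tokenize('This is a test 汉'))
print(tokenizer.unk_token)  # '[UNK]' for BERT tokenizers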

example 1:

Code:

import nlpaug.augmenter.word as naw

model = 'bert-base-cased'
aug_p = 0.2
text = 'This is a test 汉'

stopwords = ['test']
aug = naw.ContextualWordEmbsAug(
        model_path=model, action="substitute", aug_p=aug_p, stopwords=stopwords)
print(aug.augment(text))

Output:

This is simple test test.

Expected behavior: the unknown token should not be replaced by the stopword at all.

example 2:

Code:

import nlpaug.augmenter.word as naw

model = 'bert-base-cased'
aug_p = 0.2
text = 'This is a test 汉汉'

stopwords = ['test']
aug = naw.ContextualWordEmbsAug(
        model_path=model, action="substitute", aug_p=aug_p, stopwords=stopwords)
print(aug.augment(text))

Output:

  File "lib/python3.8/site-packages/nlpaug/augmenter/word/context_word_embs.py", line 542, in substitute_back_reserved_stopwords
    doc.update_change_log(token_i, token=reserved_stopword_tokens[reserved_pos], 
IndexError: list index out of range
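The only workaround I can think of so far is to drop words the tokenizer can only represent as [UNK] before augmenting, so they never collide with the reserved stopword placeholder. A rough sketch, assuming the transformers AutoTokenizer API (strip_unknown_tokens is just a hypothetical helper, not part of nlpaug, and it does lose the unknown tokens, so it is only a stopgap):

from transformers import AutoTokenizer
import nlpaug.augmenter.word as naw

model = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(model)

def strip_unknown_tokens(text, tokenizer):
    # Hypothetical helper: drop whitespace-separated words whose
    # tokenization consists only of [UNK] pieces.
    kept = []
    for word in text.split():
        pieces = tokenizer.tokenize(word)
        if pieces and all(p == tokenizer.unk_token for p in pieces):
            continue  # word is entirely unknown to the model
        kept.append(word)
    return ' '.join(kept)

aug = naw.ContextualWordEmbsAug(
        model_path=model, action="substitute", aug_p=0.2, stopwords=['test'])

cleaned = strip_unknown_tokens('This is a test 汉汉', tokenizer)
print(aug.augment(cleaned))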