makcedward/nlpaug

Unknown Token interfering with stopwords with word augmenter

vera-bernhard opened this issue · 0 comments

I think the [UNK] token, which the model uses for tokens it does not know, interferes with the augmenter's use of that same token to temporarily mask the provided stopwords. In example 1, one token is entirely unknown to the model and is then mistakenly replaced by the stopword test. If there are too many unknown tokens (see example 2), it results in an IndexError because there aren't enough stopwords left to substitute back. Is there a workaround?
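For what it's worth, the collision can be checked directly with the model's tokenizer; a minimal check, assuming the standard transformers AutoTokenizer API:

from transformers import AutoTokenizer

# Assumption: same model as in the examples below.
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

# If the character is out of vocabulary (as the output below suggests),
# it is tokenized to the very same [UNK] token the augmenter uses
# internally to mask stopwords.
print(tokenizer.tokenize('This is a test 汉'))
print(tokenizer.unk_token)  # '[UNK]' for BERT tokenizers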

example 1:

Code:

import nlpaug.augmenter.word as naw

model = 'bert-base-cased'
aug_p = 0.2
text = 'This is a test 汉'

stopwords = ['test']
aug = naw.ContextualWordEmbsAug(
        model_path=model, action="substitute", aug_p=aug_p, stopwords=stopwords)
print(aug.augment(text))

Output:

This is simple test test.

Expected behavior: the unknown token should not be replaced by the stopword at all.

example 2:

Code:

import nlpaug.augmenter.word as naw

model = 'bert-base-cased'
aug_p = 0.2
text = 'This is a test 汉汉'

stopwords = ['test']
aug = naw.ContextualWordEmbsAug(
        model_path=model, action="substitute", aug_p=aug_p, stopwords=stopwords)
print(aug.augment(text))

Output:

  File "lib/python3.8/site-packages/nlpaug/augmenter/word/context_word_embs.py", line 542, in substitute_back_reserved_stopwords
    doc.update_change_log(token_i, token=reserved_stopword_tokens[reserved_pos], 
IndexError: list index out of range
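The only workaround I can think of so far is to drop words the tokenizer can only represent as [UNK] before augmenting, so they never collide with the reserved stopword placeholder. A rough sketch, assuming the transformers AutoTokenizer API (strip_unknown_tokens is just a hypothetical helper, not part of nlpaug, and it does lose the unknown tokens, so it is only a stopgap):

from transformers import AutoTokenizer
import nlpaug.augmenter.word as naw

model = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(model)

def strip_unknown_tokens(text, tokenizer):
    # Hypothetical helper: drop whitespace-separated words whose
    # tokenization consists only of [UNK] pieces.
    kept = []
    for word in text.split():
        pieces = tokenizer.tokenize(word)
        if pieces and all(p == tokenizer.unk_token for p in pieces):
            continue  # word is entirely unknown to the model
        kept.append(word)
    return ' '.join(kept)

aug = naw.ContextualWordEmbsAug(
        model_path=model, action="substitute", aug_p=0.2, stopwords=['test'])

cleaned = strip_unknown_tokens('This is a test 汉汉', tokenizer)
print(aug.augment(cleaned))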