tigerchen52/LOVE

typo rate problem


Hi,
Recently, I used the IMDB movie review dataset for a BERT-with-LOVE classification task, but I found that classification accuracy does not decrease as the OOV (typo) proportion increases; it stays flat or even goes up. For example:
[screenshot of accuracy across typo_test_10 through typo_test_90]
The files typo_test_10 through typo_test_90 were generated with get_random_attack() from attacks.py.
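
Roughly, the generation loop looks like this (a minimal sketch; I'm assuming get_random_attack(word) takes a single word and returns a misspelled variant, so the exact call in attacks.py may differ):

```python
import random
from attacks import get_random_attack  # from the LOVE repo

def corrupt_text(text, typo_rate):
    """Attack a random fraction `typo_rate` of the words in `text`.

    Assumes get_random_attack(word) -> str returns a misspelled
    variant of one word (the real signature may differ).
    """
    words = text.split()
    n_attack = int(len(words) * typo_rate)
    # pick positions uniformly at random, regardless of word importance
    for i in random.sample(range(len(words)), n_attack):
        words[i] = get_random_attack(words[i])
    return " ".join(words)

# produce typo_test_10 ... typo_test_90 (input file name is illustrative)
for rate in range(10, 100, 10):
    with open("imdb_test.txt") as fin, open(f"typo_test_{rate}", "w") as fout:
        for line in fin:
            fout.write(corrupt_text(line.strip(), rate / 100) + "\n")
```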
Could you tell me the reason?
Thank you.

Sorry for the late reply! I was busy preparing a rebuttal this week.

Regarding your question, my guess is that this is caused by the characteristics of the IMDB dataset. I would suggest taking a look at the corrupted samples. Also, it helps to attack only the keywords in the input (corrupting stop words won't impact the prediction much).

Thanks for your answer. My approach randomly selects words in the dataset to attack, so the selected words are likely to have little impact on the classification. Could you guide me on how to choose keywords for each line of text?

Regarding the keywords:

1. You can take a look at TF-IDF keyword extraction (link); other keyword-extraction tools would also work.
2. You can use gradients to find significant words, as proposed by Adv-BERT.

Sketches of both ideas follow below.
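
For the first idea, here is a minimal TF-IDF sketch with scikit-learn (the vectorizer settings and the top-k cutoff are illustrative choices, not something from this repo):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def top_keywords(docs, k=5):
    """Return the k highest-TF-IDF words for each document."""
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(docs)          # shape: (n_docs, vocab_size)
    vocab = vec.get_feature_names_out()
    keywords = []
    for row in tfidf:                        # one sparse row per document
        scores = row.toarray().ravel()
        top = scores.argsort()[::-1][:k]     # indices of the largest scores
        keywords.append([vocab[i] for i in top if scores[i] > 0])
    return keywords

docs = ["the movie was absolutely wonderful and moving",
        "a dull plot and terrible acting ruined it"]
print(top_keywords(docs, k=3))
```

For the second, a sketch of the gradient idea: rank tokens by the gradient norm of the loss with respect to their input embeddings (this is my paraphrase of the Adv-BERT recipe, not its exact code; the model name is just an example):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

def saliency_ranking(text, label):
    """Rank tokens by the gradient norm of the loss w.r.t. their embeddings."""
    inputs = tok(text, return_tensors="pt")
    # look up word embeddings ourselves so we can grab their gradient;
    # BERT still adds position/type embeddings internally via inputs_embeds
    embeds = model.bert.embeddings.word_embeddings(inputs["input_ids"])
    embeds.retain_grad()
    out = model(inputs_embeds=embeds,
                attention_mask=inputs["attention_mask"],
                labels=torch.tensor([label]))
    out.loss.backward()
    scores = embeds.grad.norm(dim=-1).squeeze(0)   # one score per token
    tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0])
    return sorted(zip(tokens, scores.tolist()), key=lambda x: -x[1])

print(saliency_ranking("the movie was absolutely wonderful", label=1)[:5])
```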

Note that it's useful to repeat the experiment on another dataset to check whether you see the same issue.

Thank you very much for your guidance! I know what to do now.