jind11/TextFooler

The usage of '<oov>' is not consistent with the paper

plasmashen opened this issue · 3 comments

In the paper, the importance score of a word is calculated by removing that word, but your code replaces the word with '<oov>' to calculate the importance score in
https://github.com/jind11/TextFooler/blob/master/attack_classification.py#L216

Moreover, '<oov>' will be tokenized into 4 tokens, which may have attention effects on other words.
I'm wondering why such a nonsensical '<oov>' token is used?
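
To make the two variants concrete, here is a minimal sketch that computes a simplified importance score (the drop in the true-class score) both by deleting each word and by replacing it with '<oov>'. The names `importance_scores` and `toy_score` are hypothetical, and `toy_score` is only a stand-in for a real classifier; this is not the code from attack_classification.py, and it omits the extra term the paper adds when the predicted label changes.

```python
from typing import Callable, List

OOV = "<oov>"


def importance_scores(
    tokens: List[str],
    score_fn: Callable[[List[str]], float],
    mode: str = "delete",
) -> List[float]:
    """Return, for each position, the drop in score when that word is perturbed.

    mode="delete"  removes the word (as described in the paper).
    mode="replace" substitutes the OOV placeholder (as discussed in this issue).
    """
    base = score_fn(tokens)
    scores = []
    for i in range(len(tokens)):
        if mode == "delete":
            perturbed = tokens[:i] + tokens[i + 1:]
        elif mode == "replace":
            perturbed = tokens[:i] + [OOV] + tokens[i + 1:]
        else:
            raise ValueError(f"unknown mode: {mode}")
        scores.append(base - score_fn(perturbed))
    return scores


def toy_score(tokens: List[str]) -> float:
    """Hypothetical stand-in for a classifier: fraction of 'positive' words."""
    positive = {"great", "good"}
    if not tokens:
        return 0.0
    return sum(t in positive for t in tokens) / len(tokens)


tokens = ["the", "movie", "was", "great"]
by_delete = importance_scores(tokens, toy_score, mode="delete")
by_replace = importance_scores(tokens, toy_score, mode="replace")

# Position of the word each variant ranks as most important.
top_delete = max(range(len(tokens)), key=lambda i: by_delete[i])
top_replace = max(range(len(tokens)), key=lambda i: by_replace[i])
print(tokens[top_delete], tokens[top_replace])
```

With this toy scorer the two variants give different raw scores for the unimportant words (deletion changes the sentence length, replacement does not), yet they agree on which word ranks highest. Whether that holds for a real model is an empirical question.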

Hi, I have tested both methods, removing the word and replacing it with '<oov>', and the difference is not obvious. '<oov>' is in the vocab, so I don't think it can be tokenized into 4 tokens. Let me know if you have more questions.

Where is the embedding.npz file, please? Or how is it generated?

The README explains how to obtain the embeddings:
Run the following code to pre-compute the cosine similarity scores between word pairs based on the counter-fitting word embeddings [https://drive.google.com/file/d/1bayGomljWb6HeYDMTDKXrh0HackKtSlx/view].

python comp_cos_sim_mat.py [PATH_TO_COUNTER_FITTING_WORD_EMBEDDINGS]
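
For readers wondering what that pre-computation amounts to, below is a rough pure-Python sketch of building a pairwise cosine similarity matrix from word vectors. It is not the contents of comp_cos_sim_mat.py. The helper names `load_embeddings` and `cosine_similarity_matrix` are hypothetical, and the input format (one word per line followed by space-separated floats) is an assumption about the embeddings file.

```python
import math
from typing import Iterable, List, Tuple


def load_embeddings(lines: Iterable[str]) -> Tuple[List[str], List[List[float]]]:
    """Parse lines of the assumed form 'word v1 v2 ... vn'."""
    words, vectors = [], []
    for line in lines:
        parts = line.strip().split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        words.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])
    return words, vectors


def cosine_similarity_matrix(vectors: List[List[float]]) -> List[List[float]]:
    """Normalize each vector to unit length, then take all pairwise dot products."""
    unit = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            raise ValueError("zero vector has no defined cosine similarity")
        unit.append([x / norm for x in v])
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


# Tiny made-up vectors, only to show the shape of the computation.
sample = ["good 1.0 0.0", "great 1.0 1.0", "bad -1.0 0.0"]
words, vectors = load_embeddings(sample)
sim = cosine_similarity_matrix(vectors)

for word, row in zip(words, sim):
    print(word, [round(x, 3) for x in row])
```

For the full vocabulary this nested-loop version would be far too slow; a vectorized library such as NumPy would be the practical choice there.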