tigerchen52/LOVE

bert_text_classification

Hello, I have recently been trying to use BERT and LOVE for text classification. In your latest released code, I have some questions:

  1. Is the embedding vector of each word in the file love.emb generated by the pre-trained model love_bert_base_uncased?
  2. I found that love.emb has 57,459 lines, while vocab.txt has 64,083 lines. Why are they not consistent? I thought LOVE was used to generate an embedding vector for each word in the vocab.
  3. I would like to know how to obtain the bert_emb in the function get_emb_for_text(text, bert_emb=None, embeddings=None, max_len=50).
    Thank you!

Hi,

Thanks for asking!

  1. Yes, all the embeddings are created with the corresponding love_bert_base_uncased model.
  2. love.emb contains embeddings for only the OOV words, while vocab.txt has all the words used in this experiment.
  3. bert_emb is the token embedding matrix applied before the attention layers; you can get it with the following code:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
# Load the pretrained BERT model (TextClassification is the classifier defined in this repo's bert_text_classification code)
model = TextClassification.from_pretrained('bert-base-uncased')
# The input (token) embedding layer, i.e. the matrix applied before the attention layers
bert_emb = model.get_input_embeddings()
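
For example, once bert_emb is available it can be passed to get_emb_for_text() together with the LOVE vectors. The loader below is only a hedged sketch: it assumes love.emb uses a plain "word v1 v2 ... vd" text layout with no header line, and that the embeddings argument takes a word-to-vector dict; check get_emb.py for the format the repo actually expects.

# Load the LOVE vectors (assumed layout: one "word v1 v2 ... vd" entry per line)
love_emb = {}
with open('love.emb', encoding='utf8') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        love_emb[parts[0]] = [float(x) for x in parts[1:]]

# get_emb_for_text() comes from this repo's bert_text_classification code
text_matrix = get_emb_for_text("an example sentence", bert_emb=bert_emb,
                               embeddings=love_emb, max_len=50)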

Thank you very much for your reply. I still have two questions: 1. Does vocab.txt contain all the words in this dataset? 2. If I want to run text classification experiments on other datasets, is the process as follows: first build a vocab.txt, then filter out the words that BERT would split into subwords, and then use the LOVE model to generate a love.emb file for these words?

The vocab.txt isn't necessary here. To run on other datasets, you can refer to the get_emb.py file.

  1. Given an input text, you can find out which words are split into sub-tokens (i.e., unseen words) with the function get_token().
  2. Keep the original BERT embedding for words that map to a single token, and use LOVE embeddings for the words that get split; see get_emb_for_text().
    This is the logic behind how LOVE robustifies language models; a minimal sketch of the split is shown below.
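
For concreteness, here is an illustrative sketch of that split (not the repo's actual get_token(), which may differ in details such as how punctuation is handled): any whitespace-separated word that WordPiece breaks into several pieces is treated as unseen.

from transformers import BertTokenizer

def split_seen_unseen(text, tokenizer):
    # A word kept as a single WordPiece token keeps its BERT embedding;
    # a word split into several pieces becomes a candidate for a LOVE vector.
    seen, unseen = [], []
    for word in text.split():
        pieces = tokenizer.tokenize(word)
        (seen if len(pieces) == 1 else unseen).append(word)
    return seen, unseen

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
seen, unseen = split_seen_unseen("a sentnce with a mispelled word", tokenizer)
print(unseen)  # the words BERT would split, e.g. the misspellings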

Thank you for your answer. So when I experiment on other datasets, I first use get_emb.py to collect the unseen words of the dataset, then use LOVE to generate embeddings for these words, and finally produce the love.emb file. Is my understanding correct? Every new dataset needs its own love.emb file generated this way.

Your understanding is totally correct.
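
In outline, the pipeline for a new dataset could look like the sketch below. The LOVE inference step is only a labeled placeholder (love_vector() is hypothetical), since the exact call depends on the repo's scripts, and the love.emb layout is assumed to be "word v1 v2 ... vd"; adjust both to match get_emb.py.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
dataset_texts = ["an exmaple sentence", "another text from the new dataset"]  # your corpus

# 1. Collect the unseen words (those WordPiece splits into several pieces).
unseen = {w for text in dataset_texts for w in text.split()
          if len(tokenizer.tokenize(w)) > 1}

# 2. Generate a vector for each unseen word with the pretrained LOVE model.
#    love_vector() is a hypothetical placeholder for the repo's actual inference call.
vectors = {w: love_vector(w) for w in sorted(unseen)}

# 3. Write love.emb (assumed layout: one "word v1 v2 ... vd" entry per line).
with open('love.emb', 'w', encoding='utf8') as f:
    for word, vec in vectors.items():
        f.write(word + ' ' + ' '.join(f'{x:.6f}' for x in vec) + '\n')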

Thanks for your help, really appreciate it!