kohjingyu/fromage

Do you think bigscience/bloom can be a replacement of facebook/opt model ?

svjack opened this issue · 5 comments

svjack commented

If I want to replace the LM in the project, would you recommend bigscience/bloom as a multilingual replacement?
Or do you have other recommendations?
I want the replacement model to work on question-answering downstream tasks.
I'm also curious why the loss you use is not related to QA tasks, yet the model
still works on question-answering downstream tasks.
Does this rely only on the few-shot ability of facebook/opt?

I also see that you use "bert" as an option in an if-else block in models.py.
Does this mean you consider "bert" a possible replacement? Can you share a FrozenArgs
configuration for a "bert" model?

In principle, there is nothing special about OPT. Using BLOOM should also work, as long as you update the model API calls (if they are different, which they might not be).

The BERT models were something used early in development. We didn't train any BERT-like models in the final version, so I don't have any config files for them, sorry.
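For anyone attempting the swap, here is a minimal sketch (not part of the FROMAGe code) of loading BLOOM as the frozen LM through the standard transformers API. The checkpoint name bigscience/bloom-560m is just an example, and whether BLOOM works end to end without further changes is an assumption.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: load a BLOOM checkpoint instead of an OPT one.
lm = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

# FROMAGe keeps the language model frozen and trains only the small mapping
# layers and the [RET] embedding, so the same recipe should carry over:
for param in lm.parameters():
    param.requires_grad = False
lm.eval()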

svjack commented

> In principle, there is nothing special about OPT. Using BLOOM should also work, as long as you update the model API calls (if they are different, which they might not be).
>
> The BERT models were something used early in development. We didn't train any BERT-like models in the final version, so I don't have any config files for them, sorry.

Why does the BLOOM tokenizer place the padding tokens at the head of the sequence when padding to max_length?

from transformers import AutoTokenizer

native_tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m",
                                                 use_fast=False)
caption = "a bear in the woods."
tokenized_data = native_tokenizer(
    caption,
    return_tensors="pt",
    padding='max_length',
    truncation=True,
    max_length=56)
tokens = tokenized_data.input_ids[0]
tokens

will produce

tensor([     3,      3,      3,      3,      3,      3,      3,      3,      3,
             3,      3,      3,      3,      3,      3,      3,      3,      3,
             3,      3,      3,      3,      3,      3,      3,      3,      3,
             3,      3,      3,      3,      3,      3,      3,      3,      3,
             3,      3,      3,      3,      3,      3,      3,      3,      3,
             3,      3,      3,      3,      3,     68,  50507,    361,    368,
        165526,     17])

It puts the pad_token_id "3" at the head, not the tail.
This is different from other models.
Why does this happen?

I've never used the BLOOM models before, so I don't know what this issue is, sorry. I think this is something you will have to check with the authors of that model.
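For what it's worth, this looks like the tokenizer's default padding side rather than a bug: in transformers, the BLOOM tokenizer appears to default to padding_side="left" (a common default for batched decoder-only generation), whereas OPT-style tokenizers pad on the right. A minimal check under that assumption:

from transformers import AutoTokenizer

# Assumption: BLOOM's tokenizer defaults to left padding; padding_side is
# the standard transformers attribute controlling where pad tokens go.
tok = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
print(tok.padding_side)  # expected to print "left"

# Force right padding so the behaviour matches OPT-style tokenizers:
tok.padding_side = "right"
ids = tok("a bear in the woods.", return_tensors="pt",
          padding="max_length", truncation=True, max_length=56).input_ids[0]
print(ids)  # pad tokens should now appear at the tail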

svjack commented

After training, the code below initializes the [RET] embedding:

with torch.no_grad():
    model.model.input_embeddings.weight[model.model.retrieval_token_idx, :].copy_(
        checkpoint['state_dict']['ret_input_embeddings.weight'].cpu().detach())

Which naming rule is used to derive ret_input_embeddings from the network in the source code?

You can produce ret_input_embeddings by extracting the trained [RET] token embeddings as such:

state_dict['ret_input_embeddings.weight'] = state_dict['model.input_embeddings.weight'][args.retrieval_token_idx].clone()

The benefit of doing this is that we save space: we don't need to retain the frozen OPT embeddings, just the [RET] one.
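Putting the two snippets together, here is a minimal sketch of that save/restore round trip, assuming a FROMAGe-style model and args as in the snippets above; the checkpoint filename and the deletion of the frozen embedding key are illustrative, not a verbatim copy of the repo code.

import torch

# At save time: keep only the trained [RET] row instead of the full
# frozen embedding matrix.
state_dict = model.state_dict()
state_dict['ret_input_embeddings.weight'] = \
    state_dict['model.input_embeddings.weight'][args.retrieval_token_idx].clone()
del state_dict['model.input_embeddings.weight']  # drop the frozen embeddings
torch.save({'state_dict': state_dict}, 'fromage_ckpt.pth')

# At load time: rebuild the model (which reloads the frozen LM embeddings),
# then copy the saved [RET] row back into place.
checkpoint = torch.load('fromage_ckpt.pth', map_location='cpu')
with torch.no_grad():
    model.model.input_embeddings.weight[model.model.retrieval_token_idx, :].copy_(
        checkpoint['state_dict']['ret_input_embeddings.weight'].cpu().detach())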