How to fine-tune for the masked language modeling task?
Hi there,
I was reading through your example language modeling notebook and I'm not sure how to adapt that notebook (CLM) to MLM. Basically I was trying to do something like this:
model_cls = AutoModelForMaskedLM
pretrained_model_name = 'roberta-base' # cause gpt2 cannot be used for MLM
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR.get_hf_objects(pretrained_model_name, model_cls=model_cls)
However, I'm stuck on how to modify HF_CausalLMBeforeBatchTransform, and there's no equivalent transform for MaskedLM in the library:
blocks = (
    HF_Seq2SeqBlock(before_batch_tfm=HF_CausalLMBeforeBatchTransform(hf_arch, hf_config, hf_tokenizer, hf_model)),
    noop
)
Could you have a look at this? Thank you in advance!
Yah, that is in the TODO list at the moment :)
There are a number of masking strategies (prefix language modeling, BERT-style, deshuffling, MASS-style, replace spans, drop tokens, random spans, etc...). See the T5 paper, Table 3 here: https://arxiv.org/abs/1910.10683
What I envision is an HF_MLMBeforeBatchTransform that takes an MLM_{Whatever}Strategy class that knows how to modify the inputs/targets accordingly. That object would essentially do what you see here in the causal LM batch transform.
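Very roughly, the shape I'm picturing is something like the sketch below (every name and signature here is a placeholder for illustration, not a final API):

```python
class HF_MLMBeforeBatchTransform:
    """Placeholder sketch: tokenize as the causal LM transform does today,
    then hand the samples off to a pluggable masking strategy."""

    def __init__(self, hf_arch, hf_config, hf_tokenizer, hf_model, masking_strategy):
        self.hf_tokenizer = hf_tokenizer
        self.masking_strategy = masking_strategy  # e.g. an MLM_BertMaskingStrategy

    def encodes(self, samples):
        # tokenization would happen here, just like in HF_CausalLMBeforeBatchTransform;
        # the strategy then corrupts the inputs and builds the targets
        return self.masking_strategy(samples)
```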
You want to give it a shot?
If so, here are some tips for how you might go about implementing this:
- The notebook to modify is 01zb_data-seq2seq-language-modeling (see the Masked LM section at the bottom).
- Start by creating an abstract base class called MLM_MaskingStrategy. Anything we start finding in common across strategies, we just stick in here.
- Then start with the most common corruption strategy: an MLM_BertMaskingStrategy class that inherits from MLM_MaskingStrategy. We're going to want to pass it the samples and have it return to us our updated_samples, where the inputs have been corrupted (masked) and the targets are the original text (see the sketch after this list).
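To make that concrete, here's a rough sketch of the kind of thing I have in mind, written against a single 1-D input_ids tensor and using the standard BERT 80/10/10 corruption split. Nothing here is final blurr API; the class names just follow the suggestions above and the __call__ signature is only for illustration:

```python
import torch


class MLM_MaskingStrategy:
    """Abstract base class: given a 1-D `input_ids` tensor, return the
    corrupted inputs and the labels to train against."""

    def __init__(self, hf_tokenizer, mask_prob=0.15):
        self.hf_tokenizer, self.mask_prob = hf_tokenizer, mask_prob

    def __call__(self, input_ids):
        raise NotImplementedError


class MLM_BertMaskingStrategy(MLM_MaskingStrategy):
    """BERT-style corruption: of the ~15% selected tokens, 80% become the
    mask token, 10% become a random token, and 10% are left unchanged."""

    def __call__(self, input_ids):
        input_ids = input_ids.clone()
        labels = input_ids.clone()

        # never corrupt special tokens (<s>, </s>, padding, etc.)
        special = torch.tensor(
            self.hf_tokenizer.get_special_tokens_mask(
                input_ids.tolist(), already_has_special_tokens=True),
            dtype=torch.bool)

        # pick the positions to corrupt
        probs = torch.full(labels.shape, self.mask_prob)
        probs.masked_fill_(special, 0.0)
        selected = torch.bernoulli(probs).bool()

        # loss is only computed on the corrupted positions
        labels[~selected] = -100

        # 80% of the selected tokens -> mask token
        masked = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & selected
        input_ids[masked] = self.hf_tokenizer.mask_token_id

        # 10% of the selected tokens -> a random token (half of the remaining 20%)
        random_tok = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & selected & ~masked
        input_ids[random_tok] = torch.randint(len(self.hf_tokenizer), labels.shape, dtype=torch.long)[random_tok]

        # the remaining 10% stay unchanged
        return input_ids, labels
```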
I think the above approach will work. We can reduce the code further by making a CausalMaskingStrategy and then turning the batch transform into just HF_LMBeforeBatchTransform.
Lmk if you give this a try.
If not, I'll try to work up the basic infrastructure for it, and you can then add some of the other MLM strategies if you choose.
Sorry, I still couldn't make it work on my own based on your hints :(
Yes, I think a basic infrastructure would be very helpful; I can then modify it based on my needs. Thanks a lot!
Btw, I was also thinking of making use of the DataCollatorForLanguageModeling class in transformers.data.data_collator. Do you think it could be used in your HF_LMBeforeBatchTransform?
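For reference, this is roughly how I'd use that collator standalone with a recent transformers version (roberta-base is just for illustration; this isn't wired into blurr):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# mlm_probability=0.15 selects ~15% of tokens; the collator then applies the
# usual 80% mask / 10% random / 10% unchanged corruption and sets labels
# to -100 everywhere except the selected positions
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

examples = [tokenizer(t, return_special_tokens_mask=True)
            for t in ["Hello world", "Masked language modeling"]]
batch = collator(examples)
# batch["input_ids"] holds the corrupted inputs, batch["labels"] the targets
```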
Ok, take a look at the repo. You'll have to do a dev install, as I haven't pushed a new release out yet.
The notebooks to check out:
- https://github.com/ohmeow/blurr/blob/master/nbs/01zb_data-seq2seq-language-modeling.ipynb
- https://github.com/ohmeow/blurr/blob/master/nbs/02zb_modeling-seq2seq-language-modeling.ipynb
Things that would be helpful:
- Review the BERT-style masking code ... verify that it conforms to the paper, and see whether there's a way to make it more efficient.
- Add other masking strategies that derive from LMStrategy. The T5 paper referenced in the notebooks/docs describes the core ones ... give it a go! You can do it if I can :)
Closing this out.
Now that v.1 is out, feel free to PR any new masking strategies you want added. I'm hoping to get some time for this later this year. Thanks.