pbelcak/UltraFastBERT

Failure during evaluation after training

catid opened this issue · 9 comments

Following the training README I am using:

python pretrain.py name=amp_b8192_cb_o4_final arch=crammed-bert train=bert-o4  data=pile-readymade

Followed by:

python eval.py eval=GLUE_sane name=amp_b8192_cb_o4_final eval.checkpoint=latest impl.microbatch_size=16 impl.shuffle_in_dataloader=True impl.compile_torch=False

This fails:
[screenshot of the error traceback]

Searching around, it seems to have something to do with n_classes being too small, so I think the instructions are missing something.

It seems to fail during "Finetuning task mnli with 3 classes for 122715 steps." So I am guessing that the model has either more or fewer classes than expected?

A work-around is to edit training/cramming/config/eval/GLUE_sane.yaml and comment out the - mnli task... maybe something is out of sync with the dataset?

I also had to disable the stsb task; the edit is sketched below.
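For reference, the work-around just comments out the two task entries in GLUE_sane.yaml. The surrounding layout here is my guess at the config structure, not a verbatim copy of the file:

```yaml
# training/cramming/config/eval/GLUE_sane.yaml (surrounding layout is approximate)
tasks:
  - cola
  - sst2
  # - mnli    # commented out to work around the n_classes failure
  # - stsb    # likewise disabled
  - rte
  # (remaining tasks unchanged)
```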

Hey, I reverted the commit that made changes to this behaviour; could you please check if it makes a difference?

I merged your test branch and re-ran the full test and it succeeds. Thanks for fixing!

So I am trying to reproduce the results in the paper, but my numbers do not match.

I followed the instructions as shown above, and the result is that crammed-bert (baseline) gets "Overall average metric on evaluation GLUE is 0.32" => 32%?

This is significantly worse than the 79.3 baseline result reported in your paper.

The first time, when I ran the evaluation script without the mnli/stsb tasks, the results were much more in line with the paper. Can you help me set up the repo to reproduce your results? Maybe just sharing what commands to run would help.

I'm launching training with torchrun --nproc_per_node=2 to use two GPUs. Does that cause any issue?
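Concretely, the launch is just the pretraining command from above wrapped in torchrun (same arguments, only the launcher differs):

torchrun --nproc_per_node=2 pretrain.py name=amp_b8192_cb_o4_final arch=crammed-bert train=bert-o4 data=pile-readymade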

Heya,

Just invoking the fine-tuning YAML configurations for each task individually works.

Just for reference, an identity transformer (i.e. a transformer with all feedforward layers removed) trained under the crammedBERT schedule still achieves a GLUE score of 74.3 -- if you ever get less than that with FFs/FFFs in place, something is not right.

I've been trying to reproduce your positive results for the FFF layer structure. To simplify the comparison I've been using CIFAR-10 as a proxy problem.

Over the past week I put together a training framework for CIFAR-10 with a baseline transformer model (vit_tiny with mlp_dim=256). I then introduced a number of variants of the transformer model using your UltraFastBERT implementation of FFF, some tweaks to it, and a community version of FFF written from scratch. The results are here:

[screenshot of the CIFAR-10 validation-accuracy comparison]

So far we have yet to see the FFF layer improve upon a small mlp_dim=16 FFN network. The conditional computation does not seem to improve the network's ability to generalize to the validation set.

I currently suspect that the UltraFastBERT result could be improved by replacing the FFF layers with an MLP layer with mlp_dim=16, which is obviously much smaller and easier to train/evaluate than an FFF layer.
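For anyone following along, this is roughly the kind of layer we are comparing against the mlp_dim=16 FFN. It is a from-scratch sketch of the FFF idea with hard routing as used at inference, not the UltraFastBERT code, and the sizes in the shape check at the bottom are arbitrary:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FFFSketch(nn.Module):
    """Sketch of a fast-feedforward (FFF) layer: a binary tree of single
    neurons in which each token follows one root-to-leaf path, so only
    `depth` neurons fire per token (vs. every neuron in a dense FFN).
    Hard routing as used at inference; training normally relies on a
    soft/mixture version of the same tree."""

    def __init__(self, dim: int, depth: int):
        super().__init__()
        self.depth = depth
        n_nodes = 2 ** depth - 1  # internal nodes of a full binary tree
        self.w_in = nn.Parameter(torch.randn(n_nodes, dim) / dim ** 0.5)
        self.w_out = nn.Parameter(torch.randn(n_nodes, dim) / dim ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, dim)
        node = torch.zeros(x.shape[0], dtype=torch.long, device=x.device)
        y = torch.zeros_like(x)
        for _ in range(self.depth):
            pre = (x * self.w_in[node]).sum(dim=-1)        # node pre-activation
            y = y + F.gelu(pre).unsqueeze(-1) * self.w_out[node]
            node = 2 * node + 1 + (pre > 0).long()         # descend to left/right child
        return y


# Quick shape check: depth-4 tree -> only 4 of 15 node neurons fire per token.
layer = FFFSketch(dim=192, depth=4)
print(layer(torch.randn(8, 192)).shape)  # torch.Size([8, 192])
```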

Heya,

You can train BERT with an FFN with 16 neurons, but I am afraid you will be disappointed by the results.
Re your ViT: it is hard for me to follow what is going on without the details, but feel free to explore more in that direction.

I'm closing this since it is unrelated to the original issue.