arxyzan/data2vec-pytorch

Some Questions

bryanwong17 opened this issue · 3 comments

Hi @arxyzan,

  1. Can you tell me what parts I need to change if my input size is 256 instead of 224?

  2. Is it mandatory to load encoder_checkpoint? Can I train my model from scratch?

  3. Why is the config file named beit-pretraining.yaml for the vision task?

  4. Could you help me solve the problem below:

Epoch: 1/100   0%|          | 0/18001 [00:02<?, ?batch/s]
Traceback (most recent call last):
  File "/mnt/c/data2vec-pytorch/train.py", line 25, in <module>
    trainer.train()
  File "/mnt/c/data2vec-pytorch/vision/trainer.py", line 131, in train
    train_loss = self.train_epoch(epoch)
  File "/mnt/c/data2vec-pytorch/vision/trainer.py", line 97, in train_epoch
    loss = self.train_step(batch)
  File "/mnt/c/data2vec-pytorch/vision/trainer.py", line 56, in train_step
    x, y = self.model(src, trg, mask)
  File "/home/bryan/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/c/data2vec-pytorch/data2vec/data2vec.py", line 90, in forward
    y = self.ema.model(trg, ~mask, **kwargs)['encoder_states']  # fetch the last transformer layers outputs
  File "/home/bryan/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/c/data2vec-pytorch/vision/encoder.py", line 38, in forward
    outputs = self.encoder(pixel_values=inputs, output_hidden_states=True, output_attentions=True, **kwargs)
  File "/home/bryan/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bryan/.local/lib/python3.10/site-packages/transformers/models/beit/modeling_beit.py", line 681, in forward
    embedding_output = self.embeddings(pixel_values, bool_masked_pos)
  File "/home/bryan/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bryan/.local/lib/python3.10/site-packages/transformers/models/beit/modeling_beit.py", line 154, in forward
    embeddings = self.patch_embeddings(pixel_values)
  File "/home/bryan/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bryan/.local/lib/python3.10/site-packages/transformers/models/beit/modeling_beit.py", line 206, in forward
    embeddings = self.projection(pixel_values).flatten(2).transpose(1, 2)
  File "/home/bryan/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bryan/.local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/bryan/.local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

Hello @bryanwong17, and thanks for your feedback.
Just keep in mind that Data2Vec is itself a large pretrained model, and this code is only useful if you actually want to pretrain a model for vision/text/audio. If you want to finetune a model for a downstream task like image recognition, you have to take the pretrained weights and finetune from there. With that in mind:

  1. You can only change the input size if you are pretraining from scratch; you cannot change the model architecture for finetuning, because finetuning relies on the pretrained model, which was trained with a 224-pixel input size. (This is not the case for text and audio, since those encoders accept variable input sizes.) See the from-scratch sketch after this list.
  2. encoder_checkpoint is only there so the code knows which base model you are using, based on the HuggingFace Hub path you provide. No weights are actually assigned or loaded; it just reads the config file from that path on the Hub and figures out which model to build. This way I was able to provide a single general encoder class for vision using transformers.AutoModel and transformers.AutoConfig; otherwise one would have to write a separate encoder class for every base architecture they wanted to use (see the from-scratch sketch after this list).
  3. Because the default Data2Vec for vision uses BEiT as the base encoder model. If you want to use another model, you can provide a new config file for it.
  4. Thanks for reporting this issue. I just fixed it; you can try the code now and it works fine. (The error meant the inputs were on the GPU while the teacher model's weights were still on the CPU; see the device-placement sketch after this list.)
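For questions 1 and 2, here is a minimal sketch (not the repo's exact code) of what building the vision encoder from scratch for 256x256 inputs could look like with the standard transformers API. The checkpoint name is just an example, and only its config is used:

```python
from transformers import AutoConfig, AutoModel

# Only the config is fetched from the Hub path; no pretrained weights are downloaded or loaded.
config = AutoConfig.from_pretrained("microsoft/beit-base-patch16-224")  # example checkpoint path
config.image_size = 256                 # pretrain on 256x256 inputs instead of 224x224
model = AutoModel.from_config(config)   # randomly initialized BEiT encoder, ready to pretrain from scratch
```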
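And for question 4, the RuntimeError in the traceback means the input tensors live on the GPU while the EMA teacher's BEiT weights are still on the CPU. A minimal sketch of that kind of device fix, using the attribute names visible in the traceback (model.ema.model); the actual patch in this repo may look different:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# `model` stands for the Data2Vec wrapper built by the trainer (hypothetical usage).
# The EMA teacher is a separate copy of the encoder, so it has to be moved explicitly.
model = model.to(device)        # student encoder + regression head
model.ema.model.to(device)      # EMA teacher used to compute the targets
```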

Hi @arxyzan, thanks for the quick response. My goal in training Data2Vec is to replace the feature extractor in a Multiple Instance Learning (MIL) framework with Data2Vec, so that all extracted image patches can be fed through the trained embedder. Do you suggest training Data2Vec from scratch? So far, I have around 500k histopathology images (256 x 256).

Data2Vec and similar large image models are all trained on a huge amount of data from ImageNet. Considering that your data's domain is fairly different from what those models were trained on, I don't think using a large model designed for large-scale pretraining is the best choice unless you have a comparable amount of data from your target domain. I'm not familiar with histopathology data, but I think you'd be better off looking for a base model suited to that kind of data.