Error while training NLP

Question

swarada96 opened this issue a year ago · 5 comments

Traceback (most recent call last):
File "F:\study\UTA_PhD\Papers\data2vec-pytorch-main\train.py", line 24, in <module>
trainer = trainers_dict[modality](cfg)
File "F:\study\UTA_PhD\Papers\data2vec-pytorch-main\text\trainer.py", line 55, in __init__
self.test_loader = DataLoader(self.test_dataset, batch_size=cfg.train.val_batch_size,
File "C:\Users\User\anaconda3\lib\site-packages\omegaconf\dictconfig.py", line 355, in __getattr__
self._format_and_raise(
File "C:\Users\User\anaconda3\lib\site-packages\omegaconf\base.py", line 231, in _format_and_raise
format_and_raise(
File "C:\Users\User\anaconda3\lib\site-packages\omegaconf\_utils.py", line 899, in format_and_raise
_raise(ex, cause)
File "C:\Users\User\anaconda3\lib\site-packages\omegaconf\_utils.py", line 797, in _raise
raise ex.with_traceback(sys.exc_info()[2]) # set env var OC_CAUSE=1 for full trace
File "C:\Users\User\anaconda3\lib\site-packages\omegaconf\dictconfig.py", line 351, in __getattr__
return self._get_impl(
File "C:\Users\User\anaconda3\lib\site-packages\omegaconf\dictconfig.py", line 442, in _get_impl
node = self._get_child(
File "C:\Users\User\anaconda3\lib\site-packages\omegaconf\basecontainer.py", line 73, in _get_child
child = self._get_node(
File "C:\Users\User\anaconda3\lib\site-packages\omegaconf\dictconfig.py", line 480, in _get_node
raise ConfigKeyError(f"Missing key {key!s}")
omegaconf.errors.ConfigAttributeError: Missing key val_batch_size
full_key: train.val_batch_size
object_type=dict

Can you please help me with this omegaconf error? Which version of the package was used when training on the datasets?

Answer 1 · 2023-05-30T06:44:00.000Z

Hello @swarada96. Thanks for your feedback.
The problem is that the config property val_batch_size is missing from your config file. You can add it there yourself to fix this, but I have also pushed some changes that fix it. Update the repo and it should work fine.
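
As a rough illustration of the failure mode (not the repository's actual code), the sketch below checks a nested config for required dotted keys before any trainer is built. The function name `missing_keys`, the example values, and the plain-dict config are assumptions made for illustration; the project itself loads its config through omegaconf.

```python
def missing_keys(cfg, required):
    """Return the dotted keys from `required` that are absent in the nested dict `cfg`."""
    missing = []
    for dotted in required:
        node = cfg
        for part in dotted.split("."):
            # Stop at the first level where the path cannot be followed.
            if not isinstance(node, dict) or part not in node:
                missing.append(dotted)
                break
            node = node[part]
    return missing


# Hypothetical config mirroring the situation in the traceback:
# train.batch_size exists, train.val_batch_size does not.
cfg = {"train": {"batch_size": 16}}
required = ["train.batch_size", "train.val_batch_size"]

print(missing_keys(cfg, required))  # ['train.val_batch_size']

# Adding the key to the config makes the check pass.
cfg["train"]["val_batch_size"] = 16
print(missing_keys(cfg, required))  # []
```

Running a check like this at startup reports every missing key at once, instead of failing on the first attribute access deep inside a trainer constructor.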

Answer 2 · 2023-05-30T07:09:33.000Z

I updated the repo as per your suggestion and it made a difference. Thank you for that. However, with the updated files, CUDA now runs out of memory. What changes would you suggest? Thank you in advance.

PS F:\data2vec-pytorch-main> python train.py --config text/configs/roberta-pretraining.yaml
Found cached dataset wikitext (C:/Users/User/.cache/huggingface/datasets/wikitext/wikitext-103-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
100%|██████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00, 1.05it/s]
Found cached dataset wikitext (C:/Users/User/.cache/huggingface/datasets/wikitext/wikitext-103-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
100%|██████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 50.40it/s]

Epoch: 1/20 0%| | 0/56293 [00:00<?, ?batch/s]
You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
Epoch: 1/20 0%| | 0/56293 [00:09<?, ?batch/s]
Traceback (most recent call last):
File "F:\data2vec-pytorch-main\train.py", line 25, in <module>
trainer.train()
File "F:\data2vec-pytorch-main\text\trainer.py", line 145, in train
train_loss = self.train_epoch(epoch)
loss = self.train_step(batch)
File "F:\data2vec-pytorch-main\text\trainer.py", line 68, in train_step
x, y = self.model(src, trg, mask)
File "C:\Users\User\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "F:\data2vec-pytorch-main\data2vec\data2vec.py", line 83, in forward
x = self.encoder(src, mask, **kwargs)['encoder_out'] # fetch the last layer outputs
File "C:\Users\User\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "F:\data2vec-pytorch-main\text\encoder.py", line 38, in forward
outputs = self.encoder(inputs, output_hidden_states=True, output_attentions=True, **kwargs)
File "C:\Users\User\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\User\anaconda3\lib\site-packages\transformers\models\roberta\modeling_roberta.py", line 846, in forward
encoder_outputs = self.encoder(
File "C:\Users\User\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\User\anaconda3\lib\site-packages\transformers\models\roberta\modeling_roberta.py", line 520, in forward
layer_outputs = layer_module(
File "C:\Users\User\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
File "C:\Users\User\anaconda3\lib\site-packages\transformers\models\roberta\modeling_roberta.py", line 405, in forward
self_attention_outputs = self.attention(
File "C:\Users\User\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\User\anaconda3\lib\site-packages\transformers\models\roberta\modeling_roberta.py", line 332, in forward
self_outputs = self.self(
File "C:\Users\User\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\User\anaconda3\lib\site-packages\transformers\models\roberta\modeling_roberta.py", line 234, in forward
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: CUDA out of memory. Tried to allocate 170.00 MiB (GPU 0; 4.00 GiB total capacity; 2.94 GiB already allocated; 0 bytes free; 3.05 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Answer 3 · 2023-05-30T07:11:17.000Z

@swarada96
You've got only 4 GB of GPU memory in total, so it would be better to set a lower batch_size.
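
The out-of-memory error above was raised while computing the attention scores, a tensor of shape (batch, heads, seq, seq) whose size grows linearly with the batch size. Here is a back-of-the-envelope sketch, assuming float32 activations and hypothetical values of 12 attention heads and 512 tokens per sequence (neither is stated in this thread):

```python
def attention_scores_mib(batch_size, num_heads, seq_len, bytes_per_element=4):
    """Size in MiB of one (batch, heads, seq, seq) attention-score tensor."""
    n_elements = batch_size * num_heads * seq_len * seq_len
    return n_elements * bytes_per_element / 2**20


# Hypothetical settings; substitute the values from your own config.
for batch_size in (16, 8, 4):
    print(batch_size, attention_scores_mib(batch_size, num_heads=12, seq_len=512))
# 16 192.0
# 8 96.0
# 4 48.0
```

This counts a single tensor in a single layer. The full forward and backward pass keeps many such activations alive, so halving batch_size frees considerably more memory than these figures alone suggest.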

Answer 4 · 2023-06-05T21:40:40.000Z

Hello Aryan,

Do we have to create a file named dummy_data in order to save the split data? I am getting the following error when training the vision encoder.

python train.py --config vision/configs/beit-pretraining.yaml
Traceback (most recent call last):
File "/home1/08351/sak3951/Work/data2vec-pytorch/train.py", line 24, in <module>
trainer = trainers_dict[modality](cfg)
File "/home1/08351/sak3951/Work/data2vec-pytorch/vision/trainer.py", line 31, in __init__
self.train_dataset = MIMPretrainingDataset(cfg, split='train')
File "/home1/08351/sak3951/Work/data2vec-pytorch/vision/dataset.py", line 23, in __init__
super(MIMPretrainingDataset, self).__init__(root=cfg.dataset.path[split])
File "/home1/08351/sak3951/.local/lib/python3.9/site-packages/torchvision/datasets/folder.py", line 309, in __init__
super().__init__(
File "/home1/08351/sak3951/.local/lib/python3.9/site-packages/torchvision/datasets/folder.py", line 144, in __init__
classes, class_to_idx = self.find_classes(self.root)
File "/home1/08351/sak3951/.local/lib/python3.9/site-packages/torchvision/datasets/folder.py", line 218, in find_classes
return find_classes(directory)
File "/home1/08351/sak3951/.local/lib/python3.9/site-packages/torchvision/datasets/folder.py", line 40, in find_classes
classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
FileNotFoundError: [Errno 2] No such file or directory: 'vision/dummy_data'

Answer 5 · 2023-06-11T16:16:36.000Z

Hello again @swarada96, sorry for the delay.

dummy_data is just a placeholder name for the folder that contains all the image files. You can create your own directory of images and set the dataset path in the config to point to it.
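
The traceback above shows torchvision listing the subdirectories of the dataset root to find classes, so that directory has to exist before training starts. Below is a minimal sketch of preparing such a folder with the standard library. The `class_a` subfolder name and the temporary location are placeholders, and the exact layout the project expects is an assumption, not something stated in this thread.

```python
import os
import tempfile


def list_class_dirs(root):
    """List subdirectory names the same way the torchvision line in the traceback does."""
    return sorted(entry.name for entry in os.scandir(root) if entry.is_dir())


# Build a placeholder image root inside a temporary directory.
base = tempfile.mkdtemp()
root = os.path.join(base, "dummy_data")

# Images would be copied into at least one subfolder of the root.
os.makedirs(os.path.join(root, "class_a"), exist_ok=True)

print(os.path.isdir(root))    # True
print(list_class_dirs(root))  # ['class_a']
```

If `list_class_dirs` raises FileNotFoundError for the path in your config, the dataset constructor will fail in the same way as in the traceback.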