mittagessen/kraken

RuntimeError with tensor dimension using rotrain


Hello! I'm trying to train a reading order model. I'm running the same training script on both Google Colab and my university server. The kraken version is identical on both systems (5.2.9). Here is the script I'm using:

ketos rotrain \
  --level baselines \
  --device cuda:0 \
  --format-type page \
  --batch-size 512 \
  --epochs 1000 \
  --quit early \
  --lag 50 \
  --partition 0.9 \
  --logger tensorboard \
  --workers 0 \
  --precision 16 \
  --reading-order line_implicit \
  PATH_TO_GT

While the training starts successfully in Google Colab, it fails on the university server with the following error: RuntimeError: stack expects each tensor to be equal size, but got [18] at entry 0 and [16] at entry 8

I encountered a similar issue on Colab initially, and I thought it was due to inconsistencies in how baselines were defined (some baselines were defined with 2, some with 3, and some with 4 points). To resolve this, I filtered out all baselines except those defined by 2 points, and after that, the training started successfully in Colab. However, the same approach did not resolve the issue on the university server. When I try increasing the batch size on the server, I encounter a different error: RuntimeError: Trying to resize storage that is not resizable
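
A minimal sketch of how the number of points per baseline could be checked in a PAGE XML export, to confirm that the filtering left only 2-point baselines on both machines. It assumes that baselines are stored as `Baseline` elements with a space-separated `points` attribute; the function name `count_baseline_points` and the embedded sample document (including `sample.jpg`) are hypothetical and only serve as illustration. It does not touch kraken itself.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_baseline_points(page_xml: str) -> Counter:
    """Map 'points per baseline' to the number of baselines with that count."""
    counts = Counter()
    root = ET.fromstring(page_xml)
    for elem in root.iter():
        # Compare only the local tag name so any PAGE namespace version matches.
        if elem.tag.rsplit('}', 1)[-1] == 'Baseline':
            points = elem.get('points', '').split()
            counts[len(points)] += 1
    return counts


# Hypothetical sample document, not taken from the real ground truth.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="sample.jpg" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="l1"><Baseline points="10,20 90,20"/></TextLine>
      <TextLine id="l2"><Baseline points="10,40 50,42 90,40"/></TextLine>
      <TextLine id="l3"><Baseline points="10,60 90,60"/></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

print(dict(count_baseline_points(SAMPLE)))  # {2: 2, 3: 1}
```

Running the same count over every file in PATH_TO_GT on both systems would show whether the two copies of the data really contain the same baselines.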

Could you please help me understand why this tensor dimension mismatch error is persisting on the university server, even though the same adjustments worked in Google Colab? I would appreciate any guidance on overcoming these issues.

My Colab training just stopped with the same old error: TypeError: '>=' not supported between instances of 'int' and 'str'. Here is the link to the Colab: https://colab.research.google.com/drive/1mAzFlNBIDEslgZl3ZlyB-ILxgjvIM9N6?usp=sharing
UPD: maybe it was just early stopping?

Link to the data: https://msia.escriptorium.fr/media/users/334/export_doc2968_biblioteca_laurenziana_plutei_28_pagexml_202411081332.zip

And here are the university server tracebacks:

For batch = 512

stage 0/∞               0/1192 0:00:00 •    0.00it/s v_num: 11.000 early_stoppi…
                               -:--:--                             0/50 inf     
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/bin/ketos:8 in <module>      │
│                                                                              │
│   5 from kraken.ketos import cli                                             │
│   6 if __name__ == '__main__':                                               │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])     │
│ ❱ 8 │   sys.exit(cli())                                                      │
│   9                                                                          │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/click/core.py:1157 in __call__                                             │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/click/core.py:1078 in main                                                 │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/click/core.py:1688 in invoke                                               │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/click/core.py:1434 in invoke                                               │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/click/core.py:783 in invoke                                                │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/click/decorators.py:33 in new_func                                         │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/kraken/ketos/ro.py:262 in rotrain                                          │
│                                                                              │
│   259 │   │   │   │   │   │   │   **val_check_interval)                      │
│   260 │                                                                      │
│   261 │   with threadpool_limits(limits=threads):                            │
│ ❱ 262 │   │   trainer.fit(model)                                             │
│   263 │                                                                      │
│   264 │   if model.best_epoch == -1:                                         │
│   265 │   │   logger.warning('Model did not improve during training.')       │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/kraken/lib/train.py:129 in fit                                             │
│                                                                              │
│    126 │   │   with warnings.catch_warnings():                               │
│    127 │   │   │   warnings.filterwarnings(action='ignore', category=UserWar │
│    128 │   │   │   │   │   │   │   │   │   message='The dataloader,')        │
│ ❱  129 │   │   │   super().fit(*args, **kwargs)                              │
│    130                                                                       │
│    131                                                                       │
│    132 class KrakenFreezeBackbone(BaseFinetuning):                           │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/lightning/pytorch/trainer/trainer.py:544 in fit                            │
│                                                                              │
│    541 │   │   self.state.fn = TrainerFn.FITTING                             │
│    542 │   │   self.state.status = TrainerStatus.RUNNING                     │
│    543 │   │   self.training = True                                          │
│ ❱  544 │   │   call._call_and_handle_interrupt(                              │
│    545 │   │   │   self, self._fit_impl, model, train_dataloaders, val_datal │
│    546 │   │   )                                                             │
│    547                                                                       │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/lightning/pytorch/trainer/call.py:44 in _call_and_handle_interrupt         │
│                                                                              │
│    41 │   try:                                                               │
│    42 │   │   if trainer.strategy.launcher is not None:                      │
│    43 │   │   │   return trainer.strategy.launcher.launch(trainer_fn, *args, │
│ ❱  44 │   │   return trainer_fn(*args, **kwargs)                             │
│    45 │                                                                      │
│    46 │   except _TunerExitException:                                        │
│    47 │   │   _call_teardown_hook(trainer)                                   │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/lightning/pytorch/trainer/trainer.py:580 in _fit_impl                      │
│                                                                              │
│    577 │   │   │   model_provided=True,                                      │
│    578 │   │   │   model_connected=self.lightning_module is not None,        │
│    579 │   │   )                                                             │
│ ❱  580 │   │   self._run(model, ckpt_path=ckpt_path)                         │
│    581 │   │                                                                 │
│    582 │   │   assert self.state.stopped                                     │
│    583 │   │   self.training = False                                         │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/lightning/pytorch/trainer/trainer.py:987 in _run                           │
│                                                                              │
│    984 │   │   # ----------------------------                                │
│    985 │   │   # RUN THE TRAINER                                             │
│    986 │   │   # ----------------------------                                │
│ ❱  987 │   │   results = self._run_stage()                                   │
│    988 │   │                                                                 │
│    989 │   │   # ----------------------------                                │
│    990 │   │   # POST-Training CLEAN UP                                      │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/lightning/pytorch/trainer/trainer.py:1033 in _run_stage                    │
│                                                                              │
│   1030 │   │   │   with isolate_rng():                                       │
│   1031 │   │   │   │   self._run_sanity_check()                              │
│   1032 │   │   │   with torch.autograd.set_detect_anomaly(self._detect_anoma │
│ ❱ 1033 │   │   │   │   self.fit_loop.run()                                   │
│   1034 │   │   │   return None                                               │
│   1035 │   │   raise RuntimeError(f"Unexpected state {self.state}")          │
│   1036                                                                       │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/lightning/pytorch/loops/fit_loop.py:205 in run                             │
│                                                                              │
│   202 │   │   while not self.done:                                           │
│   203 │   │   │   try:                                                       │
│   204 │   │   │   │   self.on_advance_start()                                │
│ ❱ 205 │   │   │   │   self.advance()                                         │
│   206 │   │   │   │   self.on_advance_end()                                  │
│   207 │   │   │   │   self._restarting = False                               │
│   208 │   │   │   except StopIteration:                                      │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/lightning/pytorch/loops/fit_loop.py:363 in advance                         │
│                                                                              │
│   360 │   │   │   )                                                          │
│   361 │   │   with self.trainer.profiler.profile("run_training_epoch"):      │
│   362 │   │   │   assert self._data_fetcher is not None                      │
│ ❱ 363 │   │   │   self.epoch_loop.run(self._data_fetcher)                    │
│   364 │                                                                      │
│   365 │   def on_advance_end(self) -> None:                                  │
│   366 │   │   trainer = self.trainer                                         │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/lightning/pytorch/loops/training_epoch_loop.py:140 in run                  │
│                                                                              │
│   137 │   │   self.on_run_start(data_fetcher)                                │
│   138 │   │   while not self.done:                                           │
│   139 │   │   │   try:                                                       │
│ ❱ 140 │   │   │   │   self.advance(data_fetcher)                             │
│   141 │   │   │   │   self.on_advance_end(data_fetcher)                      │
│   142 │   │   │   │   self._restarting = False                               │
│   143 │   │   │   except StopIteration:                                      │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/lightning/pytorch/loops/training_epoch_loop.py:212 in advance              │
│                                                                              │
│   209 │   │   │   batch_idx = data_fetcher._batch_idx                        │
│   210 │   │   else:                                                          │
│   211 │   │   │   dataloader_iter = None                                     │
│ ❱ 212 │   │   │   batch, _, __ = next(data_fetcher)                          │
│   213 │   │   │   # TODO: we should instead use the batch_idx returned by th │
│   214 │   │   │   # fetcher state so that the batch_idx is correct after res │
│   215 │   │   │   batch_idx = self.batch_idx + 1                             │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/lightning/pytorch/loops/fetchers.py:133 in __next__                        │
│                                                                              │
│   130 │   │   │   │   self.done = not self.batches                           │
│   131 │   │   elif not self.done:                                            │
│   132 │   │   │   # this will run only when no pre-fetching was done.        │
│ ❱ 133 │   │   │   batch = super().__next__()                                 │
│   134 │   │   else:                                                          │
│   135 │   │   │   # the iterator is empty                                    │
│   136 │   │   │   raise StopIteration                                        │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/lightning/pytorch/loops/fetchers.py:60 in __next__                         │
│                                                                              │
│    57 │   │   assert self.iterator is not None                               │
│    58 │   │   self._start_profiler()                                         │
│    59 │   │   try:                                                           │
│ ❱  60 │   │   │   batch = next(self.iterator)                                │
│    61 │   │   except StopIteration:                                          │
│    62 │   │   │   self.done = True                                           │
│    63 │   │   │   raise                                                      │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/lightning/pytorch/utilities/combined_loader.py:341 in __next__             │
│                                                                              │
│   338 │                                                                      │
│   339 │   def __next__(self) -> _ITERATOR_RETURN:                            │
│   340 │   │   assert self._iterator is not None                              │
│ ❱ 341 │   │   out = next(self._iterator)                                     │
│   342 │   │   if isinstance(self._iterator, _Sequential):                    │
│   343 │   │   │   return out                                                 │
│   344 │   │   out, batch_idx, dataloader_idx = out                           │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/lightning/pytorch/utilities/combined_loader.py:78 in __next__              │
│                                                                              │
│    75 │   │   out = [None] * n  # values per iterator                        │
│    76 │   │   for i in range(n):                                             │
│    77 │   │   │   try:                                                       │
│ ❱  78 │   │   │   │   out[i] = next(self.iterators[i])                       │
│    79 │   │   │   except StopIteration:                                      │
│    80 │   │   │   │   self._consumed[i] = True                               │
│    81 │   │   │   │   if all(self._consumed):                                │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/torch/utils/data/dataloader.py:630 in __next__                             │
│                                                                              │
│    627 │   │   │   if self._sampler_iter is None:                            │
│    628 │   │   │   │   # TODO(https://github.com/pytorch/pytorch/issues/7675 │
│    629 │   │   │   │   self._reset()  # type: ignore[call-arg]               │
│ ❱  630 │   │   │   data = self._next_data()                                  │
│    631 │   │   │   self._num_yielded += 1                                    │
│    632 │   │   │   if self._dataset_kind == _DatasetKind.Iterable and \      │
│    633 │   │   │   │   │   self._IterableDataset_len_called is not None and  │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/torch/utils/data/dataloader.py:674 in _next_data                           │
│                                                                              │
│    671 │                                                                     │
│    672 │   def _next_data(self):                                             │
│    673 │   │   index = self._next_index()  # may raise StopIteration         │
│ ❱  674 │   │   data = self._dataset_fetcher.fetch(index)  # may raise StopIt │
│    675 │   │   if self._pin_memory:                                          │
│    676 │   │   │   data = _utils.pin_memory.pin_memory(data, self._pin_memor │
│    677 │   │   return data                                                   │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/torch/utils/data/_utils/fetch.py:54 in fetch                               │
│                                                                              │
│   51 │   │   │   │   data = [self.dataset[idx] for idx in possibly_batched_i │
│   52 │   │   else:                                                           │
│   53 │   │   │   data = self.dataset[possibly_batched_index]                 │
│ ❱ 54 │   │   return self.collate_fn(data)                                    │
│   55                                                                         │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/torch/utils/data/_utils/collate.py:265 in default_collate                  │
│                                                                              │
│   262 │   │   │   >>> default_collate_fn_map.update(CustoType, collate_custo │
│   263 │   │   │   >>> default_collate(batch)  # Handle `CustomType` automati │
│   264 │   """                                                                │
│ ❱ 265 │   return collate(batch, collate_fn_map=default_collate_fn_map)       │
│   266                                                                        │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/torch/utils/data/_utils/collate.py:127 in collate                          │
│                                                                              │
│   124 │                                                                      │
│   125 │   if isinstance(elem, collections.abc.Mapping):                      │
│   126 │   │   try:                                                           │
│ ❱ 127 │   │   │   return elem_type({key: collate([d[key] for d in batch], co │
│   128 │   │   except TypeError:                                              │
│   129 │   │   │   # The mapping type may not support `__init__(iterable)`.   │
│   130 │   │   │   return {key: collate([d[key] for d in batch], collate_fn_m │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/torch/utils/data/_utils/collate.py:127 in <dictcomp>                       │
│                                                                              │
│   124 │                                                                      │
│   125 │   if isinstance(elem, collections.abc.Mapping):                      │
│   126 │   │   try:                                                           │
│ ❱ 127 │   │   │   return elem_type({key: collate([d[key] for d in batch], co │
│   128 │   │   except TypeError:                                              │
│   129 │   │   │   # The mapping type may not support `__init__(iterable)`.   │
│   130 │   │   │   return {key: collate([d[key] for d in batch], collate_fn_m │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/torch/utils/data/_utils/collate.py:119 in collate                          │
│                                                                              │
│   116 │                                                                      │
│   117 │   if collate_fn_map is not None:                                     │
│   118 │   │   if elem_type in collate_fn_map:                                │
│ ❱ 119 │   │   │   return collate_fn_map[elem_type](batch, collate_fn_map=col │
│   120 │   │                                                                  │
│   121 │   │   for collate_type in collate_fn_map:                            │
│   122 │   │   │   if isinstance(elem, collate_type):                         │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/torch/utils/data/_utils/collate.py:162 in collate_tensor_fn                │
│                                                                              │
│   159 │   │   numel = sum(x.numel() for x in batch)                          │
│   160 │   │   storage = elem._typed_storage()._new_shared(numel, device=elem │
│   161 │   │   out = elem.new(storage).resize_(len(batch), *list(elem.size()) │
│ ❱ 162 │   return torch.stack(batch, 0, out=out)                              │
│   163                                                                        │
│   164                                                                        │
│   165 def collate_numpy_array_fn(batch, *, collate_fn_map: Optional[Dict[Uni │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: stack expects each tensor to be equal size, but got [18] at entry 
0 and [16] at entry 8
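
A rough illustration of how this message can be read: the default collate function stacks the per-sample tensors of a batch, and the error names entry 0 together with the first entry whose size differs from it. The sketch below is a pure-Python stand-in for that check, not PyTorch code; the helper `first_size_mismatch` and the list of sizes are made up to mirror the numbers in the message.

```python
from typing import Optional, Sequence, Tuple


def first_size_mismatch(sizes: Sequence[int]) -> Optional[Tuple[int, int, int]]:
    """Return (reference_size, index, size) for the first entry that differs
    from entry 0, or None if all entries have the same size."""
    if not sizes:
        return None
    reference = sizes[0]
    for idx, size in enumerate(sizes):
        if size != reference:
            return reference, idx, size
    return None


# Hypothetical batch: feature vectors of length 18, except entry 8 with length 16.
sizes = [18] * 8 + [16] + [18] * 3
print(first_size_mismatch(sizes))  # (18, 8, 16)
```

Read this way, the message says that at least one sample in the batch has a 16-element feature vector while the first sample has 18 elements, so the samples are not all encoded with the same length.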

And for batch = 1000 (I believe that nothing else changed):

stage 0/∞               0/1197 0:00:00 •     0.00it/s v_num: 9.000 early_stoppi…
                               -:--:--                             0/50 inf     
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/bin/ketos:8 in <module>      │
│                                                                              │
│   5 from kraken.ketos import cli                                             │
│   6 if __name__ == '__main__':                                               │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])     │
│ ❱ 8 │   sys.exit(cli())                                                      │
│   9                                                                          │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/click/core.py:1157 in __call__                                             │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/click/core.py:1078 in main                                                 │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/click/core.py:1688 in invoke                                               │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/click/core.py:1434 in invoke                                               │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/click/core.py:783 in invoke                                                │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/click/decorators.py:33 in new_func                                         │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/kraken/ketos/ro.py:262 in rotrain                                          │
│                                                                              │
│   259 │   │   │   │   │   │   │   **val_check_interval)                      │
│   260 │                                                                      │
│   261 │   with threadpool_limits(limits=threads):                            │
│ ❱ 262 │   │   trainer.fit(model)                                             │
│   263 │                                                                      │
│   264 │   if model.best_epoch == -1:                                         │
│   265 │   │   logger.warning('Model did not improve during training.')       │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/kraken/lib/train.py:129 in fit                                             │
│                                                                              │
│    126 │   │   with warnings.catch_warnings():                               │
│    127 │   │   │   warnings.filterwarnings(action='ignore', category=UserWar │
│    128 │   │   │   │   │   │   │   │   │   message='The dataloader,')        │
│ ❱  129 │   │   │   super().fit(*args, **kwargs)                              │
│    130                                                                       │
│    131                                                                       │
│    132 class KrakenFreezeBackbone(BaseFinetuning):                           │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/lightning/pytorch/trainer/trainer.py:544 in fit                            │
│                                                                              │
│    541 │   │   self.state.fn = TrainerFn.FITTING                             │
│    542 │   │   self.state.status = TrainerStatus.RUNNING                     │
│    543 │   │   self.training = True                                          │
│ ❱  544 │   │   call._call_and_handle_interrupt(                              │
│    545 │   │   │   self, self._fit_impl, model, train_dataloaders, val_datal │
│    546 │   │   )                                                             │
│    547                                                                       │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/lightning/pytorch/trainer/call.py:44 in _call_and_handle_interrupt         │
│                                                                              │
│    41 │   try:                                                               │
│    42 │   │   if trainer.strategy.launcher is not None:                      │
│    43 │   │   │   return trainer.strategy.launcher.launch(trainer_fn, *args, │
│ ❱  44 │   │   return trainer_fn(*args, **kwargs)                             │
│    45 │                                                                      │
│    46 │   except _TunerExitException:                                        │
│    47 │   │   _call_teardown_hook(trainer)                                   │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/lightning/pytorch/trainer/trainer.py:580 in _fit_impl                      │
│                                                                              │
│    577 │   │   │   model_provided=True,                                      │
│    578 │   │   │   model_connected=self.lightning_module is not None,        │
│    579 │   │   )                                                             │
│ ❱  580 │   │   self._run(model, ckpt_path=ckpt_path)                         │
│    581 │   │                                                                 │
│    582 │   │   assert self.state.stopped                                     │
│    583 │   │   self.training = False                                         │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/lightning/pytorch/trainer/trainer.py:987 in _run                           │
│                                                                              │
│    984 │   │   # ----------------------------                                │
│    985 │   │   # RUN THE TRAINER                                             │
│    986 │   │   # ----------------------------                                │
│ ❱  987 │   │   results = self._run_stage()                                   │
│    988 │   │                                                                 │
│    989 │   │   # ----------------------------                                │
│    990 │   │   # POST-Training CLEAN UP                                      │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/lightning/pytorch/trainer/trainer.py:1033 in _run_stage                    │
│                                                                              │
│   1030 │   │   │   with isolate_rng():                                       │
│   1031 │   │   │   │   self._run_sanity_check()                              │
│   1032 │   │   │   with torch.autograd.set_detect_anomaly(self._detect_anoma │
│ ❱ 1033 │   │   │   │   self.fit_loop.run()                                   │
│   1034 │   │   │   return None                                               │
│   1035 │   │   raise RuntimeError(f"Unexpected state {self.state}")          │
│   1036                                                                       │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/lightning/pytorch/loops/fit_loop.py:205 in run                             │
│                                                                              │
│   202 │   │   while not self.done:                                           │
│   203 │   │   │   try:                                                       │
│   204 │   │   │   │   self.on_advance_start()                                │
│ ❱ 205 │   │   │   │   self.advance()                                         │
│   206 │   │   │   │   self.on_advance_end()                                  │
│   207 │   │   │   │   self._restarting = False                               │
│   208 │   │   │   except StopIteration:                                      │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/lightning/pytorch/loops/fit_loop.py:363 in advance                         │
│                                                                              │
│   360 │   │   │   )                                                          │
│   361 │   │   with self.trainer.profiler.profile("run_training_epoch"):      │
│   362 │   │   │   assert self._data_fetcher is not None                      │
│ ❱ 363 │   │   │   self.epoch_loop.run(self._data_fetcher)                    │
│   364 │                                                                      │
│   365 │   def on_advance_end(self) -> None:                                  │
│   366 │   │   trainer = self.trainer                                         │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/lightning/pytorch/loops/training_epoch_loop.py:140 in run                  │
│                                                                              │
│   137 │   │   self.on_run_start(data_fetcher)                                │
│   138 │   │   while not self.done:                                           │
│   139 │   │   │   try:                                                       │
│ ❱ 140 │   │   │   │   self.advance(data_fetcher)                             │
│   141 │   │   │   │   self.on_advance_end(data_fetcher)                      │
│   142 │   │   │   │   self._restarting = False                               │
│   143 │   │   │   except StopIteration:                                      │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/lightning/pytorch/loops/training_epoch_loop.py:212 in advance              │
│                                                                              │
│   209 │   │   │   batch_idx = data_fetcher._batch_idx                        │
│   210 │   │   else:                                                          │
│   211 │   │   │   dataloader_iter = None                                     │
│ ❱ 212 │   │   │   batch, _, __ = next(data_fetcher)                          │
│   213 │   │   │   # TODO: we should instead use the batch_idx returned by th │
│   214 │   │   │   # fetcher state so that the batch_idx is correct after res │
│   215 │   │   │   batch_idx = self.batch_idx + 1                             │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/lightning/pytorch/loops/fetchers.py:133 in __next__                        │
│                                                                              │
│   130 │   │   │   │   self.done = not self.batches                           │
│   131 │   │   elif not self.done:                                            │
│   132 │   │   │   # this will run only when no pre-fetching was done.        │
│ ❱ 133 │   │   │   batch = super().__next__()                                 │
│   134 │   │   else:                                                          │
│   135 │   │   │   # the iterator is empty                                    │
│   136 │   │   │   raise StopIteration                                        │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/lightning/pytorch/loops/fetchers.py:60 in __next__                         │
│                                                                              │
│    57 │   │   assert self.iterator is not None                               │
│    58 │   │   self._start_profiler()                                         │
│    59 │   │   try:                                                           │
│ ❱  60 │   │   │   batch = next(self.iterator)                                │
│    61 │   │   except StopIteration:                                          │
│    62 │   │   │   self.done = True                                           │
│    63 │   │   │   raise                                                      │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/lightning/pytorch/utilities/combined_loader.py:341 in __next__             │
│                                                                              │
│   338 │                                                                      │
│   339 │   def __next__(self) -> _ITERATOR_RETURN:                            │
│   340 │   │   assert self._iterator is not None                              │
│ ❱ 341 │   │   out = next(self._iterator)                                     │
│   342 │   │   if isinstance(self._iterator, _Sequential):                    │
│   343 │   │   │   return out                                                 │
│   344 │   │   out, batch_idx, dataloader_idx = out                           │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/lightning/pytorch/utilities/combined_loader.py:78 in __next__              │
│                                                                              │
│    75 │   │   out = [None] * n  # values per iterator                        │
│    76 │   │   for i in range(n):                                             │
│    77 │   │   │   try:                                                       │
│ ❱  78 │   │   │   │   out[i] = next(self.iterators[i])                       │
│    79 │   │   │   except StopIteration:                                      │
│    80 │   │   │   │   self._consumed[i] = True                               │
│    81 │   │   │   │   if all(self._consumed):                                │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/torch/utils/data/dataloader.py:630 in __next__                             │
│                                                                              │
│    627 │   │   │   if self._sampler_iter is None:                            │
│    628 │   │   │   │   # TODO(https://github.com/pytorch/pytorch/issues/7675 │
│    629 │   │   │   │   self._reset()  # type: ignore[call-arg]               │
│ ❱  630 │   │   │   data = self._next_data()                                  │
│    631 │   │   │   self._num_yielded += 1                                    │
│    632 │   │   │   if self._dataset_kind == _DatasetKind.Iterable and \      │
│    633 │   │   │   │   │   self._IterableDataset_len_called is not None and  │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/torch/utils/data/dataloader.py:1345 in _next_data                          │
│                                                                              │
│   1342 │   │   │   │   self._task_info[idx] += (data,)                       │
│   1343 │   │   │   else:                                                     │
│   1344 │   │   │   │   del self._task_info[idx]                              │
│ ❱ 1345 │   │   │   │   return self._process_data(data)                       │
│   1346 │                                                                     │
│   1347 │   def _try_put_index(self):                                         │
│   1348 │   │   assert self._tasks_outstanding < self._prefetch_factor * self │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/torch/utils/data/dataloader.py:1371 in _process_data                       │
│                                                                              │
│   1368 │   │   self._rcvd_idx += 1                                           │
│   1369 │   │   self._try_put_index()                                         │
│   1370 │   │   if isinstance(data, ExceptionWrapper):                        │
│ ❱ 1371 │   │   │   data.reraise()                                            │
│   1372 │   │   return data                                                   │
│   1373 │                                                                     │
│   1374 │   def _mark_worker_as_unavailable(self, worker_id, shutdown=False): │
│                                                                              │
│ /sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-package │
│ s/torch/_utils.py:694 in reraise                                             │
│                                                                              │
│   691 │   │   │   # If the exception takes multiple arguments, don't try to  │
│   692 │   │   │   # instantiate since we don't know how to                   │
│   693 │   │   │   raise RuntimeError(msg) from None                          │
│ ❱ 694 │   │   raise exception                                                │
│   695                                                                        │
│   696                                                                        │
│   697 def _get_available_device_type():                                      │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 265, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 127, in collate
    return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
  File "/sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 127, in <dictcomp>
    return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
  File "/sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 119, in collate
    return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
  File "/sps/humanum/user/syatsyk/HTR_kraken/kraken_env/lib64/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 161, in collate_tensor_fn
    out = elem.new(storage).resize_(len(batch), *list(elem.size()))
RuntimeError: Trying to resize storage that is not resizable
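
The last frames show `default_collate` failing inside `collate_tensor_fn` while it batches the per-key tensors of dict-shaped samples. One plausible reading, not confirmed in this thread, is that the unequal-length tensors behind the earlier `stack expects each tensor to be equal size` error are involved here too. A minimal sketch for checking that, assuming each sample is a dict whose values support `len()`; the key name `"sample"` and the stand-in data are hypothetical and not taken from kraken:

```python
from collections import defaultdict


def group_by_length(samples, key):
    """Map each observed feature length to the indices of the samples that have it."""
    groups = defaultdict(list)
    for idx, sample in enumerate(samples):
        groups[len(sample[key])].append(idx)
    return dict(groups)


# Hypothetical stand-in data: eight feature vectors of length 18 and one of length 16.
# In a real check, iterate over the actual dataset object instead.
samples = [{"sample": [0.0] * 18} for _ in range(8)] + [{"sample": [0.0] * 16}]

lengths = group_by_length(samples, "sample")
if len(lengths) > 1:
    # More than one length means default_collate cannot stack these into one batch.
    for length, indices in sorted(lengths.items()):
        print(f"length {length}: {len(indices)} sample(s), first index {indices[0]}")
```

If more than one length is reported, the listed indices point to the samples whose source lines are worth inspecting.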