mees/calvin

Training error && multi GPU error


Here are the problems I ran into and what I tried in order to solve them.

Environment:

  • ubuntu: 20.04.3 LTS
  • miniconda: 23.10.0
  • CUDA: 2 A800
    • 3
  • pip:
    aiohttp                   3.9.1
    aiosignal                 1.3.1
    antlr4-python3-runtime    4.8
    appdirs                   1.4.4
    async-timeout             4.0.3
    attrs                     23.1.0
    Calvin                    0.0.1        /fuyujie/calvin/calvin_models
    calvin-env                0.0.1        /fuyujie/calvin/calvin_env
    certifi                   2023.11.17
    cffi                      1.16.0
    charset-normalizer        3.3.2
    click                     8.1.7
    cloudpickle               3.0.0
    cmake                     3.18.4
    colorlog                  6.7.0
    contourpy                 1.1.1
    cycler                    0.12.1
    decorator                 4.4.2
    docker-pycreds            0.4.0
    filelock                  3.13.1
    fonttools                 4.45.1
    freetype-py               2.4.0
    frozenlist                1.4.0
    fsspec                    2023.10.0
    gitdb                     4.0.11
    GitPython                 3.1.40
    gym                       0.26.2
    gym-notices               0.0.8
    huggingface-hub           0.19.4
    hydra-colorlog            1.2.0
    hydra-core                1.1.1
    idna                      3.6
    imageio                   2.33.0
    imageio-ffmpeg            0.4.9
    importlib-metadata        6.8.0
    importlib-resources       6.1.1
    joblib                    1.3.2
    kiwisolver                1.4.5
    lightning-lite            1.8.6
    lightning-utilities       0.10.0
    llvmlite                  0.41.1
    lxml                      4.9.3
    markdown-it-py            3.0.0
    matplotlib                3.7.4
    mdurl                     0.1.2
    moviepy                   1.0.3
    MulticoreTSNE             0.1
    multidict                 6.0.4
    networkx                  2.2
    nltk                      3.8.1
    numba                     0.58.1
    numpy                     1.24.4
    numpy-quaternion          2022.4.3
    nvidia-cublas-cu11        11.10.3.66
    nvidia-cuda-nvrtc-cu11    11.7.99
    nvidia-cuda-runtime-cu11  11.7.99
    nvidia-cudnn-cu11         8.5.0.96
    omegaconf                 2.1.2
    opencv-python             4.8.1.78
    packaging                 23.2
    pandas                    2.0.3
    Pillow                    10.1.0
    pip                       23.3.1
    plotly                    5.18.0
    proglog                   0.1.10
    protobuf                  4.25.1
    psutil                    5.9.6
    pybullet                  3.2.5
    pycollada                 0.6
    pycparser                 2.21
    pyglet                    2.0.10
    Pygments                  2.17.2
    pyhash                    0.9.3
    PyOpenGL                  3.1.0
    pyparsing                 3.1.1
    pyrender                  0.1.45
    python-dateutil           2.8.2
    pytorch-lightning         1.8.6
    pytz                      2023.3.post1
    PyYAML                    6.0.1
    regex                     2023.10.3
    requests                  2.31.0
    rich                      13.7.0
    safetensors               0.4.0
    scikit-learn              1.3.2
    scipy                     1.10.1
    sentence-transformers     2.2.2
    sentencepiece             0.1.99
    sentry-sdk                1.37.1
    setproctitle              1.3.3
    setuptools                57.5.0
    six                       1.16.0
    smmap                     5.0.1
    tacto                     0.0.3        /fuyujie/calvin/calvin_env/tacto
    tenacity                  8.2.3
    tensorboardX              2.6.2.2
    termcolor                 2.3.0
    threadpoolctl             3.2.0
    tokenizers                0.15.0
    torch                     1.13.1
    torchmetrics              1.2.0
    torchvision               0.14.1
    tqdm                      4.66.1
    transformers              4.35.2
    trimesh                   4.0.5
    typing_extensions         4.8.0
    tzdata                    2023.3
    urdfpy                    0.0.22
    urllib3                   2.1.0
    wandb                     0.16.0
    wheel                     0.41.2
    yarl                      1.9.3
    zipp                      3.17.0
  • command: python training.py datamodule.root_data_dir=/fuyujie/calvin/dataset/calvin_debug_dataset datamodule/datasets=vision_lang_shm

    1. Wandb error

    Error:

  [2023-12-04 01:09:51,860][calvin_env.envs.play_table_env][INFO] - Using calvin_env with commit 1431a46bd36bde5903fb6345e68b5ccc30def666.
  [2023-12-04 01:09:51,861][calvin_agent.wrappers.calvin_env_wrapper][INFO] - Initialized PlayTableEnv for device cuda:0
  [2023-12-04 01:09:51,876][calvin_agent.evaluation.multistep_sequences][INFO] - Start generating evaluation sequences.
  [2023-12-04 01:10:07,176][calvin_agent.evaluation.multistep_sequences][INFO] - Done generating evaluation sequences.
  [2023-12-04 01:10:07,180][calvin_agent.models.mcil][INFO] - Start validation epoch 0
  Exception in thread IntMsgThr:
  Traceback (most recent call last):
    File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 932, in _bootstrap_inner
   self.run()
    File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 870, in run
   self._target(*self._args, **self._kwargs)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 300, in check_internal_messages
   self._loop_check_status(
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status
   local_handle = request()
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 766, in deliver_internal_messages
   return self._deliver_internal_messages(internal_message)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 490, in _deliver_internal_messages
   return self._deliver_record(record)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 437, in _deliver_record
   handle = mailbox._deliver_record(record, interface=self)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
   interface._publish(record)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
   self._sock_client.send_record_publish(record)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
   self.send_server_request(server_req)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
  Exception in thread NetStatThr:
   self._send_message(msg)
  Traceback (most recent call last):
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 932, in _bootstrap_inner
   self._sendall_with_error_handle(header + data)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
   sent = self._sock.send(data)
   self.run()
  BrokenPipeError: [Errno 32] Broken pipe
    File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 870, in run
   self._target(*self._args, **self._kwargs)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 268, in check_network_status
   self._loop_check_status(
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status
   local_handle = request()
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 758, in deliver_network_status
   return self._deliver_network_status(status)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 484, in _deliver_network_status
   return self._deliver_record(record)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 437, in _deliver_record
   handle = mailbox._deliver_record(record, interface=self)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
   interface._publish(record)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
   self._sock_client.send_record_publish(record)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
   self.send_server_request(server_req)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
   self._send_message(msg)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
   self._sendall_with_error_handle(header + data)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
   sent = self._sock.send(data)
  BrokenPipeError: [Errno 32] Broken pipe
  Exception in thread ChkStopThr:
  Traceback (most recent call last):
    File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 932, in _bootstrap_inner
   self.run()
    File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 870, in run
   self._target(*self._args, **self._kwargs)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 286, in check_stop_status
   self._loop_check_status(
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status
   local_handle = request()
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 750, in deliver_stop_status
   return self._deliver_stop_status(status)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 468, in _deliver_stop_status
   return self._deliver_record(record)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 437, in _deliver_record
   handle = mailbox._deliver_record(record, interface=self)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
   interface._publish(record)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
   self._sock_client.send_record_publish(record)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
   self.send_server_request(server_req)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
   self._send_message(msg)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
   self._sendall_with_error_handle(header + data)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
   sent = self._sock.send(data)
  BrokenPipeError: [Errno 32] Broken pipe
  Sanity Checking DataLoader 0:   0%|                                       | 0/2 [00:00<?, ?it/s]
  Error executing job with overrides: ['datamodule.root_data_dir=/fuyujie/calvin/dataset/calvin_debug_dataset', 'datamodule/datasets=vision_lang_shm']
  Traceback (most recent call last):
    File "training.py", line 68, in train
   trainer.fit(model, datamodule=datamodule, ckpt_path=chk)  # type: ignore
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
   call._call_and_handle_interrupt(
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
   return trainer_fn(*args, **kwargs)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
   self._run(model, ckpt_path=self.ckpt_path)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
   results = self._run_stage()
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
   self._run_train()
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1190, in _run_train
   self._run_sanity_check()
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1262, in _run_sanity_check
   val_loop.run()
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
   self.advance(*args, **kwargs)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
   dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
   self.advance(*args, **kwargs)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 137, in advance
   output = self._evaluation_step(**kwargs)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 234, in _evaluation_step
   output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1480, in _call_strategy_hook
   output = fn(*args, **kwargs)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 390, in validation_step
   return self.model.validation_step(*args, **kwargs)
    File "/fuyujie/calvin/calvin_models/calvin_agent/models/mcil.py", line 345, in validation_step
   else self.language_goal(dataset_batch["lang"])
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1215, in _call_impl
   hook_result = hook(self, input, result)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/wandb_torch.py", line 349, in after_forward_hook
   wandb.run.summary["graph_%i" % graph_idx] = self
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_summary.py", line 52, in __setitem__
   self.update({key: val})
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_summary.py", line 74, in update
   self._update(record)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_summary.py", line 128, in _update
   self._update_callback(record)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 371, in wrapper_fn
   return func(self, *args, **kwargs)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1388, in _summary_update_callback
   self._backend.interface.publish_summary(summary_record)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 259, in publish_summary
   pb_summary_record = self._make_summary(summary_record)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 237, in _make_summary
   json_value = self._summary_encode(item.value, path_from_root)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 210, in _summary_encode
   val_to_json(self._run, path_from_root, value, namespace="summary")
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/data_types/utils.py", line 164, in val_to_json
   val.bind_to_run(run, key, namespace)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/data_types.py", line 1452, in bind_to_run
   super().bind_to_run(*args, **kwargs)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/data_types/base_types/media.py", line 134, in bind_to_run
   _datatypes_callback(media_path)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/_globals.py", line 19, in _datatypes_callback
   _glob_datatypes_callback(fname)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1417, in _datatypes_callback
   self._backend.interface.publish_files(files)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 276, in publish_files
   self._publish_files(files)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 378, in _publish_files
   self._publish(rec)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
   self._sock_client.send_record_publish(record)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
   self.send_server_request(server_req)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
   self._send_message(msg)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
   self._sendall_with_error_handle(header + data)
    File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
   sent = self._sock.send(data)
  BrokenPipeError: [Errno 32] Broken pipe

What I tried:

① I'm in China and use Clash as a proxy on my server, so I first suspected a network problem. To check, I ran the demo from the official wandb website:

import random
import wandb

wandb.login()

# Launch 5 simulated experiments
total_runs = 5
for run in range(total_runs):
  # 🐝 1️⃣ Start a new run to track this script
  wandb.init(
      # Set the project where this run will be logged
      project="basic-intro", 
      # We pass a run name (otherwise it’ll be randomly assigned, like sunshine-lollypop-10)
      name=f"experiment_{run}", 
      # Track hyperparameters and run metadata
      config={
      "learning_rate": 0.02,
      "architecture": "CNN",
      "dataset": "CIFAR-100",
      "epochs": 10,
      })
  
  # This simple block simulates a training loop logging metrics
  epochs = 10
  offset = random.random() / 5
  for epoch in range(2, epochs):
      acc = 1 - 2 ** -epoch - random.random() / epoch - offset
      loss = 2 ** -epoch + random.random() / epoch + offset
      
      # 🐝 2️⃣ Log metrics from your script to W&B
      wandb.log({"acc": acc, "loss": loss})
      
  # Mark the run as finished
  wandb.finish()

This works fine.

② Then I modified training.py.

I commented out the two places that use the logger:

[screenshot 1]

Training then started successfully. Epoch 0 ran fine, but during epoch 1 it got slower and slower, and once the progress bar reached 100% it hung there indefinitely (at least 15 minutes):

[screenshot 2]

2. Multi-GPU error

command: python training.py datamodule.root_data_dir=/fuyujie/calvin/dataset/calvin_debug_dataset datamodule/datasets=vision_lang_shm trainer.devices=-1

Error:

[rank: 0] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:50001 (errno: 98 - Address already in use).
[W socket.cpp:426] [c10d] The server socket has failed to bind to 0.0.0.0:50001 (errno: 98 - Address already in use).
[E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
Error executing job with overrides: ['datamodule.root_data_dir=/fuyujie/calvin/dataset/calvin_debug_dataset', 'datamodule/datasets=vision_lang_shm', 'trainer.devices=-1']
Traceback (most recent call last):
  File "training.py", line 68, in train
    trainer.fit(model, datamodule=datamodule, ckpt_path=chk)  # type: ignore
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 90, in launch
    return function(*args, **kwargs)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1034, in _run
    self.strategy.setup_environment()
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 153, in setup_environment
    self.setup_distributed()
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 204, in setup_distributed
    _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/lightning_lite/utilities/distributed.py", line 237, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 177, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:50001 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:50001 (errno: 98 - Address already in use).

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
/root/miniconda3/envs/calvin/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 8 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Thanks so much for your attention and help!

  1. This doesn't seem to be caused by calvin. Did you try running wandb in dryrun? Setting the environment variable WANDB_MODE="dryrun" should turn off the sync. Alternatively you can also use the tensorboard logger by adding the argument logger=tb_logger when you start a training.

By default, rollout callbacks are enabled and run during validation, which could be why training seemed to get stuck. Try disabling all rollout callbacks by setting the arguments ~callbacks/rollout and ~callbacks/rollout_lh. I also recommend not using the shared memory dataloader when debugging, so set datamodule/datasets=vision_lang as well.
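For reference, the suggestions above can be collected into one debugging launch. The following is only a sketch: the helper names `build_debug_command` and `debug_env` are made up for illustration, and putting all the overrides into a single command is an assumption rather than something the CALVIN docs prescribe. It only builds the command and environment; it does not start training.

```python
import os
import shlex


def build_debug_command(root_data_dir, use_tensorboard=False):
    """Build the training command with the debugging overrides from the reply.

    Hypothetical helper: it only assembles the argument list.
    """
    cmd = [
        "python",
        "training.py",
        f"datamodule.root_data_dir={root_data_dir}",
        # Plain dataloader instead of the shared memory one (vision_lang_shm).
        "datamodule/datasets=vision_lang",
        # Disable both rollout callbacks that run during validation.
        "~callbacks/rollout",
        "~callbacks/rollout_lh",
    ]
    if use_tensorboard:
        # Alternative to wandb mentioned in the reply.
        cmd.append("logger=tb_logger")
    return cmd


def debug_env(base_env=None):
    """Return a copy of the environment with wandb syncing turned off."""
    env = dict(os.environ if base_env is None else base_env)
    env["WANDB_MODE"] = "dryrun"
    return env


if __name__ == "__main__":
    cmd = build_debug_command("/fuyujie/calvin/dataset/calvin_debug_dataset")
    # Print a shell-ready line; shlex.quote protects the leading "~".
    print("WANDB_MODE=dryrun " + " ".join(shlex.quote(part) for part in cmd))
```

To actually launch from Python, something like `subprocess.run(cmd, env=debug_env(), check=True)` from the calvin_models directory should work, assuming that is where training.py lives.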

  2. This again doesn't seem to be caused by our code. Have you successfully run other PyTorch projects with distributed training using DDP?
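Since the log reports "Address already in use" for port 50001, one quick check that does not depend on CALVIN is to test whether that port can be bound at all. A minimal sketch using only the standard library follows; `port_is_free` and `find_free_port` are hypothetical helper names, not part of PyTorch or Lightning.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically errno 98 (EADDRINUSE) on Linux when the port is taken.
            return False
        return True


def find_free_port(host="0.0.0.0"):
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    print("port 50001 free:", port_is_free(50001))
    print("a currently free port:", find_free_port())
```

If 50001 turns out to be taken (for example by a leftover process from an earlier run), the `env://` rendezvous shown in the traceback reads `MASTER_PORT` from the environment, so exporting a free port before launching may help, assuming nothing in the training setup overrides it. Note that a port returned by `find_free_port` can in principle be claimed by another process before training starts.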