Training error && multi GPU error
Let me describe the problems I encountered and what I tried in order to solve them.
Environment:
aiohttp 3.9.1
aiosignal 1.3.1
antlr4-python3-runtime 4.8
appdirs 1.4.4
async-timeout 4.0.3
attrs 23.1.0
Calvin 0.0.1 /fuyujie/calvin/calvin_models
calvin-env 0.0.1 /fuyujie/calvin/calvin_env
certifi 2023.11.17
cffi 1.16.0
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 3.0.0
cmake 3.18.4
colorlog 6.7.0
contourpy 1.1.1
cycler 0.12.1
decorator 4.4.2
docker-pycreds 0.4.0
filelock 3.13.1
fonttools 4.45.1
freetype-py 2.4.0
frozenlist 1.4.0
fsspec 2023.10.0
gitdb 4.0.11
GitPython 3.1.40
gym 0.26.2
gym-notices 0.0.8
huggingface-hub 0.19.4
hydra-colorlog 1.2.0
hydra-core 1.1.1
idna 3.6
imageio 2.33.0
imageio-ffmpeg 0.4.9
importlib-metadata 6.8.0
importlib-resources 6.1.1
joblib 1.3.2
kiwisolver 1.4.5
lightning-lite 1.8.6
lightning-utilities 0.10.0
llvmlite 0.41.1
lxml 4.9.3
markdown-it-py 3.0.0
matplotlib 3.7.4
mdurl 0.1.2
moviepy 1.0.3
MulticoreTSNE 0.1
multidict 6.0.4
networkx 2.2
nltk 3.8.1
numba 0.58.1
numpy 1.24.4
numpy-quaternion 2022.4.3
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
omegaconf 2.1.2
opencv-python 4.8.1.78
packaging 23.2
pandas 2.0.3
Pillow 10.1.0
pip 23.3.1
plotly 5.18.0
proglog 0.1.10
protobuf 4.25.1
psutil 5.9.6
pybullet 3.2.5
pycollada 0.6
pycparser 2.21
pyglet 2.0.10
Pygments 2.17.2
pyhash 0.9.3
PyOpenGL 3.1.0
pyparsing 3.1.1
pyrender 0.1.45
python-dateutil 2.8.2
pytorch-lightning 1.8.6
pytz 2023.3.post1
PyYAML 6.0.1
regex 2023.10.3
requests 2.31.0
rich 13.7.0
safetensors 0.4.0
scikit-learn 1.3.2
scipy 1.10.1
sentence-transformers 2.2.2
sentencepiece 0.1.99
sentry-sdk 1.37.1
setproctitle 1.3.3
setuptools 57.5.0
six 1.16.0
smmap 5.0.1
tacto 0.0.3 /fuyujie/calvin/calvin_env/tacto
tenacity 8.2.3
tensorboardX 2.6.2.2
termcolor 2.3.0
threadpoolctl 3.2.0
tokenizers 0.15.0
torch 1.13.1
torchmetrics 1.2.0
torchvision 0.14.1
tqdm 4.66.1
transformers 4.35.2
trimesh 4.0.5
typing_extensions 4.8.0
tzdata 2023.3
urdfpy 0.0.22
urllib3 2.1.0
wandb 0.16.0
wheel 0.41.2
yarl 1.9.3
zipp 3.17.0
command: python training.py datamodule.root_data_dir=/fuyujie/calvin/dataset/calvin_debug_dataset datamodule/datasets=vision_lang_shm
1. Wandb error
Error:
[2023-12-04 01:09:51,860][calvin_env.envs.play_table_env][INFO] - Using calvin_env with commit 1431a46bd36bde5903fb6345e68b5ccc30def666.
[2023-12-04 01:09:51,861][calvin_agent.wrappers.calvin_env_wrapper][INFO] - Initialized PlayTableEnv for device cuda:0
[2023-12-04 01:09:51,876][calvin_agent.evaluation.multistep_sequences][INFO] - Start generating evaluation sequences.
[2023-12-04 01:10:07,176][calvin_agent.evaluation.multistep_sequences][INFO] - Done generating evaluation sequences.
[2023-12-04 01:10:07,180][calvin_agent.models.mcil][INFO] - Start validation epoch 0
Exception in thread IntMsgThr:
Traceback (most recent call last):
File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 300, in check_internal_messages
self._loop_check_status(
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status
local_handle = request()
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 766, in deliver_internal_messages
return self._deliver_internal_messages(internal_message)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 490, in _deliver_internal_messages
return self._deliver_record(record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 437, in _deliver_record
handle = mailbox._deliver_record(record, interface=self)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
interface._publish(record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
self._sock_client.send_record_publish(record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
self.send_server_request(server_req)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
Exception in thread NetStatThr:
self._send_message(msg)
Traceback (most recent call last):
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self._sendall_with_error_handle(header + data)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
sent = self._sock.send(data)
self.run()
BrokenPipeError: [Errno 32] Broken pipe
File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 268, in check_network_status
self._loop_check_status(
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status
local_handle = request()
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 758, in deliver_network_status
return self._deliver_network_status(status)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 484, in _deliver_network_status
return self._deliver_record(record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 437, in _deliver_record
handle = mailbox._deliver_record(record, interface=self)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
interface._publish(record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
self._sock_client.send_record_publish(record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
self.send_server_request(server_req)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
self._send_message(msg)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
self._sendall_with_error_handle(header + data)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
Exception in thread ChkStopThr:
Traceback (most recent call last):
File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 286, in check_stop_status
self._loop_check_status(
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status
local_handle = request()
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 750, in deliver_stop_status
return self._deliver_stop_status(status)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 468, in _deliver_stop_status
return self._deliver_record(record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 437, in _deliver_record
handle = mailbox._deliver_record(record, interface=self)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
interface._publish(record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
self._sock_client.send_record_publish(record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
self.send_server_request(server_req)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
self._send_message(msg)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
self._sendall_with_error_handle(header + data)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]Error executing job with overrides: ['datamodule.root_data_dir=/fuyujie/calvin/dataset/calvin_debug_dataset', 'datamodule/datasets=vision_lang_shm']
Traceback (most recent call last):
File "training.py", line 68, in train
trainer.fit(model, datamodule=datamodule, ckpt_path=chk) # type: ignore
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
call._call_and_handle_interrupt(
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
results = self._run_stage()
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
self._run_train()
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1190, in _run_train
self._run_sanity_check()
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1262, in _run_sanity_check
val_loop.run()
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 137, in advance
output = self._evaluation_step(**kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 234, in _evaluation_step
output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1480, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 390, in validation_step
return self.model.validation_step(*args, **kwargs)
File "/fuyujie/calvin/calvin_models/calvin_agent/models/mcil.py", line 345, in validation_step
else self.language_goal(dataset_batch["lang"])
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1215, in _call_impl
hook_result = hook(self, input, result)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/wandb_torch.py", line 349, in after_forward_hook
wandb.run.summary["graph_%i" % graph_idx] = self
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_summary.py", line 52, in __setitem__
self.update({key: val})
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_summary.py", line 74, in update
self._update(record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_summary.py", line 128, in _update
self._update_callback(record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 371, in wrapper_fn
return func(self, *args, **kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1388, in _summary_update_callback
self._backend.interface.publish_summary(summary_record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 259, in publish_summary
pb_summary_record = self._make_summary(summary_record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 237, in _make_summary
json_value = self._summary_encode(item.value, path_from_root)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 210, in _summary_encode
val_to_json(self._run, path_from_root, value, namespace="summary")
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/data_types/utils.py", line 164, in val_to_json
val.bind_to_run(run, key, namespace)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/data_types.py", line 1452, in bind_to_run
super().bind_to_run(*args, **kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/data_types/base_types/media.py", line 134, in bind_to_run
_datatypes_callback(media_path)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/_globals.py", line 19, in _datatypes_callback
_glob_datatypes_callback(fname)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1417, in _datatypes_callback
self._backend.interface.publish_files(files)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 276, in publish_files
self._publish_files(files)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 378, in _publish_files
self._publish(rec)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
self._sock_client.send_record_publish(record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
self.send_server_request(server_req)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
self._send_message(msg)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
self._sendall_with_error_handle(header + data)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
Attempted method:
① Because I'm in China, I use Clash as a proxy on my server, so I first suspected a network problem. I tried the demo from the official wandb website:
import random

import wandb

wandb.login()

# Launch 5 simulated experiments
total_runs = 5
for run in range(total_runs):
    # 🐝 1️⃣ Start a new run to track this script
    wandb.init(
        # Set the project where this run will be logged
        project="basic-intro",
        # We pass a run name (otherwise it'll be randomly assigned, like sunshine-lollypop-10)
        name=f"experiment_{run}",
        # Track hyperparameters and run metadata
        config={
            "learning_rate": 0.02,
            "architecture": "CNN",
            "dataset": "CIFAR-100",
            "epochs": 10,
        },
    )

    # This simple block simulates a training loop logging metrics
    epochs = 10
    offset = random.random() / 5
    for epoch in range(2, epochs):
        acc = 1 - 2 ** -epoch - random.random() / epoch - offset
        loss = 2 ** -epoch + random.random() / epoch + offset
        # 🐝 2️⃣ Log metrics from your script to W&B
        wandb.log({"acc": acc, "loss": loss})

    # Mark the run as finished
    wandb.finish()
This demo runs without any problem.
② Then I tried modifying training.py. I commented out the two places that use the logger:
Training then started successfully, but from epoch 1 onward (epoch 0 was fine) it became slower and slower, and when it reached 100% it hung there indefinitely (for at least 15 minutes), like this:
2. Multi-GPU error
command: python training.py datamodule.root_data_dir=/fuyujie/calvin/dataset/calvin_debug_dataset datamodule/datasets=vision_lang_shm trainer.devices=-1
Error:
[rank: 0] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:50001 (errno: 98 - Address already in use).
[W socket.cpp:426] [c10d] The server socket has failed to bind to 0.0.0.0:50001 (errno: 98 - Address already in use).
[E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
Error executing job with overrides: ['datamodule.root_data_dir=/fuyujie/calvin/dataset/calvin_debug_dataset', 'datamodule/datasets=vision_lang_shm', 'trainer.devices=-1']
Traceback (most recent call last):
File "training.py", line 68, in train
trainer.fit(model, datamodule=datamodule, ckpt_path=chk) # type: ignore
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
call._call_and_handle_interrupt(
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 90, in launch
return function(*args, **kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1034, in _run
self.strategy.setup_environment()
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 153, in setup_environment
self.setup_distributed()
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 204, in setup_distributed
_init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/lightning_lite/utilities/distributed.py", line 237, in _init_dist_connection
torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 177, in _create_c10d_store
return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:50001 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:50001 (errno: 98 - Address already in use).
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
/root/miniconda3/envs/calvin/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 8 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Thanks so much for your attention and help!
- This doesn't seem to be caused by calvin. Did you try running wandb in dryrun mode? Setting the environment variable WANDB_MODE="dryrun" should turn off the sync. Alternatively, you can use the tensorboard logger by adding the argument logger=tb_logger when you start a training run.
By default, rollout callbacks are enabled and run during validation, which could be why training seemed to get stuck. Try disabling all rollout callbacks by setting the arguments ~callbacks/rollout and ~callbacks/rollout_lh. I also recommend not using the shared memory dataloader when debugging, so also set datamodule/datasets=vision_lang.
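For reference, the suggestions above can be combined into a single debug launch. The following is only a sketch: the helper name build_debug_run is hypothetical, the data path is the one from the original command, and every override is copied from this reply rather than checked against the calvin configs.

```python
import os
import shlex


def build_debug_run(root_data_dir, use_tensorboard=False):
    """Assemble the command and environment for a debug training run.

    Hypothetical helper: it only collects the overrides suggested in this
    thread and does not start anything itself.
    """
    env = dict(os.environ)  # copy, so the current process is left untouched
    overrides = [
        f"datamodule.root_data_dir={root_data_dir}",
        "datamodule/datasets=vision_lang",  # no shared memory dataloader
        "~callbacks/rollout",               # disable rollout callbacks
        "~callbacks/rollout_lh",
    ]
    if use_tensorboard:
        # Alternative from the reply: switch to the tensorboard logger.
        overrides.append("logger=tb_logger")
    else:
        # Keep the wandb logger but turn off syncing.
        env["WANDB_MODE"] = "dryrun"
    return ["python", "training.py", *overrides], env


cmd, env = build_debug_run("/fuyujie/calvin/dataset/calvin_debug_dataset")
# Print a shell-quoted version of the command for copy and paste.
print(shlex.join(cmd))
```

To actually start the run, pass both values to subprocess.run(cmd, env=env, check=True) from the directory that contains training.py.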
- This again doesn't seem to be caused by our code. Did you successfully run other PyTorch projects with distributed training using ddp?
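The errno 98 in the log means that something on the machine already holds port 50001 when the rendezvous store tries to bind it. One possible cause (an assumption, not confirmed in this thread) is a leftover process from an earlier run. A minimal sketch for checking this with Python's standard socket module follows; the helper names port_is_free and find_free_port are hypothetical.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP server socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Includes errno 98 (address already in use) on Linux.
            return False
        return True


def find_free_port():
    """Ask the OS for a currently unused TCP port and return its number."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))  # port 0 lets the OS pick a free port
        return sock.getsockname()[1]


if __name__ == "__main__":
    print("port 50001 free:", port_is_free(50001))
    print("a free port:", find_free_port())
```

If port 50001 is reported as busy, finding and stopping the process that holds it is the first thing to try. Exporting a different MASTER_PORT before launching may also help, but whether that value is picked up depends on how the cluster environment is configured, so treat it as something to verify.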