Issues with multiple samplers on torch 1.13
isratnisa opened this issue · 8 comments
🐛 Bug
The training script for link prediction does not work with multiple samplers on PyTorch 1.13. So far, three different bugs have been found. In summary:
- MAG-LSC (4 partitions), #sampler=1: throws a CUDA OOM error
- MAG-LSC (4 partitions), #sampler=4: throws a KeyError: 'dataloader-0' from dgl/distributed/dist_context.py
- OGBN-MAG (4 partitions), #sampler=1: throws an IndexError: index out of range in self
Note:
- Both datasets work fine for #sampler=0
- Both datasets work fine on PyTorch 1.12 with #sampler=0/1/4
- OGBN-MAG (1 partition) works fine with 0, 1 or 4 samplers
Details
Bug 1:
Run command:
python3 -u ~/dgl/tools/launch.py \
  --workspace /graph-storm/python/graphstorm/run/gsgnn_lp \
  --num_trainers 4 \
  --num_servers 1 \
  --num_samplers 1 \
  --part_config /data/mag-lsc-lp-4p/mag-lsc.json \
  --ip_config /data/ip_list_p4_metal.txt \
  --ssh_port 2222 \
  --graph_format csc,coo \
  "python3 gsgnn_lp.py --cf /data/mag_lsc_lp_p4.yaml --node-feat-name paper:feat"
Error:
Traceback (most recent call last):
File "gsgnn_lp.py", line 154, in <module>
main(args)
File "gsgnn_lp.py", line 120, in main
trainer.fit(train_loader=dataloader, val_loader=val_dataloader,
File "/graph-storm/python/graphstorm/trainer/lp_trainer.py", line 140, in fit
loss = model(blocks, pos_graph, neg_graph,
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 1040, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/graph-storm/python/graphstorm/model/lp_gnn.py", line 93, in forward
encode_embs = self.compute_embed_step(blocks, node_feats)
File "/graph-storm/python/graphstorm/model/gnn.py", line 527, in compute_embed_step
embs = self.node_input_encoder(input_feats, input_nodes)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/graph-storm/python/graphstorm/model/embed.py", line 239, in forward
emb = input_feats[ntype].float() @ self.input_projs[ntype]
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.14 GiB (GPU 3; 14.58 GiB total capacity; 1.81 GiB already allocated; 1.94 GiB free; 11.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Bug 2:
Run command:
python3 -u ~/dgl/tools/launch.py --workspace /graph-storm/python/graphstorm/run/gsgnn_lp --num_trainers 4 --num_servers 1 --num_samplers 4 --part_config /data/mag-lsc-lp-4p/mag-lsc.json --ip_config /data/ip_list_p4_metal.txt --ssh_port 2222 --graph_format csc,coo "python3 gsgnn_lp.py --cf /data/mag_lsc_lp_p4.yaml --node-feat-name paper:feat"
Error:
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
obj = _ForkingPickler.dumps(obj)
File "/usr/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
obj = _ForkingPickler.dumps(obj)
File "/usr/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/reductions.py", line 367, in reduce_storage
df = multiprocessing.reduction.DupFd(fd)
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/reductions.py", line 367, in reduce_storage
df = multiprocessing.reduction.DupFd(fd)
File "/usr/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
obj = _ForkingPickler.dumps(obj)
File "/usr/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/usr/lib/python3.8/multiprocessing/reduction.py", line 198, in DupFd
return resource_sharer.DupFd(fd)
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/reductions.py", line 367, in reduce_storage
df = multiprocessing.reduction.DupFd(fd)
File "/usr/lib/python3.8/multiprocessing/resource_sharer.py", line 48, in __init__
new_fd = os.dup(fd)
OSError: [Errno 9] Bad file descriptor
Client [64406] waits on 172.31.31.233:60149
Machine (0) group (0) client (79) connect to server successfuly!
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/dist_context.py", line 101, in init_process
collate_fn_dict[dataloader_name](collate_args),
KeyError: 'dataloader-0'
Bug 3:
Command:
python3 -u ~/dgl/tools/launch.py --workspace /graph-storm/python/graphstorm/run/gsgnn_lp --num_trainers 4 --num_servers 1 --num_samplers 4 --part_config /data/ogbn-mag-lp-4p/ogbn-mag.json --ip_config /data/ip_list_p4_metal.txt --ssh_port 2222 --graph_format csc,coo "python3 gsgnn_lp.py --cf /graph-storm/tests/regression-tests/OGBN-MAG/mag_lp_4p.yaml --node-feat-name paper:feat"
Error:
Machine (2) group (0) client (52) connect to server successfuly!
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/dist_context.py", line 101, in init_process
collate_fn_dict[dataloader_name](collate_args),
File "/usr/local/lib/python3.8/dist-packages/dgl/dataloading/dist_dataloader.py", line 516, in collate
return self._collate_with_negative_sampling(items)
File "/usr/local/lib/python3.8/dist-packages/dgl/dataloading/dist_dataloader.py", line 441, in _collate_with_negative_sampling
pair_graph = self.g.edge_subgraph(items, relabel_nodes=False)
File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/dist_graph.py", line 1283, in edge_subgraph
subg[etype] = self.find_edges(edge, etype)
File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/dist_graph.py", line 1235, in find_edges
edges = gpb.map_to_homo_eid(edges, etype)
File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/graph_partition_book.py", line 781, in map_to_homo_eid
end_diff = F.gather_row(typed_max_eids, partids) - ids
File "/usr/local/lib/python3.8/dist-packages/dgl/backend/pytorch/tensor.py", line 238, in gather_row
return th.index_select(data, 0, row_index.long())
IndexError: index out of range in self
Process SpawnProcess-2:
Environment
- DGL Version: 1.0.1
- PyTorch: 1.13
- CUDA/cuDNN version: 11.6
Related issues: dmlc/dgl#5480, dmlc/dgl#5528
For bug 1, I think it's worth trying the suggestion from the error message, i.e. setting max_split_size_mb:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.14 GiB (GPU 3; 14.58 GiB total capacity; 1.81 GiB already allocated; 1.94 GiB free; 11.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
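A minimal sketch of that workaround, assuming the variable is set before the CUDA allocator initializes in the trainer process (the 128 MB value is only an illustrative starting point, not a tuned number):

```python
# Sketch: apply the allocator setting suggested in the OOM message.
# It must take effect before the first CUDA allocation, so set it at the very
# top of the training script (or export it in the launch environment).
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # illustrative value

import torch  # imported after the env var so the caching allocator picks it up
```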
Hi @isratnisa , to verify whether it is the same issue as dmlc/dgl#5480, could you please try reverting the problematic commit (pytorch/pytorch@b25a1ce) or rebuilding PyTorch from top of tree (TOT) to see if it works?
Hi, I am facing the same error as #199 (comment). I have tried both torch=1.12.1 and torch=1.12.0; the error remains whenever num_samplers>0.
Run command:
root@ip-172-31-5-112:/graphstorm# python3 -m graphstorm.run.launch --workspace /data --part-config /data/7days_subsample_1000_dense_construct/Cramer.json --ip-config /data/ip_list_1_machine.txt --num-trainers 1 --num-servers 1 --num-samplers 1 --ssh-port 2222 main.py --cf /data/code_dev_tmp_dir/local_machine_test_nc.yaml
Error:
Client[0] in group[0] is exiting...
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
File "/usr/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/reductions.py", line 358, in reduce_storage
fd, size = storage._share_fd_cpu_()
RuntimeError: unable to open shared memory object </torch_3796335_4048025375_792> in read-write mode: Too many open files (24)
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
File "/usr/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/reductions.py", line 359, in reduce_storage
df = multiprocessing.reduction.DupFd(fd)
File "/usr/lib/python3.8/multiprocessing/reduction.py", line 198, in DupFd
return resource_sharer.DupFd(fd)
File "/usr/lib/python3.8/multiprocessing/resource_sharer.py", line 48, in __init__
OSError: [Errno 24] Too many open files
/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3796200) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-06-20_01:56:16
host : ip-172-31-5-112.us-west-2.compute.internal
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3796200)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 127.0.0.1 'cd /data; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=1 DGL_NUM_CLIENT=2 DGL_CONF_PATH=/data/7days_subsample_1000_dense_construct/Cramer.json DGL_IP_CONFIG=/data/ip_list_1_machine.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=96 DGL_GROUP_ID=0 PYTHONPATH=/graphstorm/python/: ; /usr/bin/python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=1234 main.py --cf /data/code_dev_tmp_dir/local_machine_test_nc.yaml --ip-config /data/ip_list_1_machine.txt --part-config /data/7days_subsample_1000_dense_construct/Cramer.json --verbose False)'' returned non-zero exit status 1.
Task failed
Too many open files (24)
could you check the limit of open files on all machines with ulimit -n?
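For reference, a minimal sketch of checking (and, where the hard limit allows, raising) the per-process limit from inside the Python worker itself, in case the container or SSH session overrides what ulimit -n reports on the host:

```python
# Sketch: inspect and optionally raise the open-file limit from inside the
# training process; containers and non-interactive SSH sessions can override
# the host's ulimit -n value.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft}, hard={hard}")

if soft < hard:
    # Raise the soft limit up to the hard limit.
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```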
Hi @isratnisa , I wonder if you are using Docker when getting this error? I tried without Docker using pytorch=1.13.1 and it works well. However, inside the Docker container I hit this error. In fact, I also tried pytorch=1.12.0 and even older versions, and I still get the error inside Docker. So this may not be a PyTorch issue?
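One thing that might be worth ruling out: the Bad file descriptor / Too many open files errors above come from PyTorch's default file_descriptor sharing strategy, which keeps an open fd for each tensor passed from the sampler workers back to the trainer. Switching to the file_system strategy is a common workaround; a sketch, not verified against this setup:

```python
# Sketch: switch inter-process tensor sharing from the default
# "file_descriptor" strategy to "file_system", which avoids holding one open
# fd per shared tensor. Place this before the sampler processes are spawned,
# e.g. near the top of the training script.
import torch.multiprocessing as mp

mp.set_sharing_strategy("file_system")
```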
Too many open files (24)
could you check the limit of open files on all machines with ulimit -n?
I get the following numbers when checking the open-file limit. It seems this is not the root cause, since the limit is already quite large:
root@ip-172-31-5-112:/# ulimit -n
1048576
Besides, I have also tried torch==1.12.0 without Docker; distributed training with multiple samplers works for a few epochs. However, I still get a timeout error when using num_trainers=4, num_samplers=8, num_servers=1 on a single machine (i.e., the standalone mode):
WARNING: We do not export the state of sparse optimizer
Part 0 | Epoch 00002 | Batch 000 | Loss: 1.4303 | Time: 1.4251
Part 0 | Epoch 00002 | Batch 020 | Loss: 1.3768 | Time: 1.2607
Part 0 | Epoch 00002 | Batch 040 | Loss: 1.3559 | Time: 3.6379
Part 0 | Epoch 00002 | Batch 060 | Loss: 1.3488 | Time: 1.1648
Part 0 | Epoch 00002 | Batch 080 | Loss: 1.3189 | Time: 6.6632
Part 0 | Epoch 00002 | Batch 100 | Loss: 1.3186 | Time: 1.1465
Epoch 2 take 234.96398901939392
{'precision_recall': 0.6888486811967163}
successfully save the model to ~/workspaceresults/models_pretrain/epoch-2
Time on save model 54.03713631629944
WARNING: We do not export the state of sparse optimizer
Part 0 | Epoch 00003 | Batch 000 | Loss: 1.3021 | Time: 1.8931
Part 0 | Epoch 00003 | Batch 020 | Loss: 1.3017 | Time: 1.2372
Part 0 | Epoch 00003 | Batch 040 | Loss: 1.2778 | Time: 3.9902
Part 0 | Epoch 00003 | Batch 060 | Loss: 1.2740 | Time: 1.2194
Part 0 | Epoch 00003 | Batch 080 | Loss: 1.2633 | Time: 5.3252
Part 0 | Epoch 00003 | Batch 100 | Loss: 1.2589 | Time: 1.1126
Epoch 3 take 234.03102159500122
{'precision_recall': 0.634166373923565}
Part 0 | Epoch 00004 | Batch 000 | Loss: 1.2546 | Time: 1.7848
Part 0 | Epoch 00004 | Batch 020 | Loss: 1.2476 | Time: 1.2330
Part 0 | Epoch 00004 | Batch 040 | Loss: 1.2364 | Time: 5.1134
Part 0 | Epoch 00004 | Batch 060 | Loss: 1.2418 | Time: 1.1508
Part 0 | Epoch 00004 | Batch 080 | Loss: 1.2219 | Time: 6.7140
Part 0 | Epoch 00004 | Batch 100 | Loss: 1.2376 | Time: 1.2580
Epoch 4 take 235.376788854599
{'precision_recall': 0.6728968917865032}
Traceback (most recent call last):
File "main_ssl.py", line 156, in <module>
main(args)
File "main_ssl.py", line 134, in main
trainer.fit(
File "/home/ubuntu/workspace/ssl_utils.py", line 147, in fit
loss.backward()
File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backwar
d
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [172.31.5.11
2]:10154
Traceback (most recent call last):
File "main_ssl.py", line 156, in <module>
Client[26] in group[0] is exiting...
main(args)
File "main_ssl.py", line 134, in main
trainer.fit(
File "/home/ubuntu/workspace/ssl_utils.py", line 147, in fit
loss.backward()
File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backwar
d
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [172.31.5.11
2]:911
Client[12] in group[0] is exiting...
Traceback (most recent call last):
File "main_ssl.py", line 156, in <module>
main(args)
File "main_ssl.py", line 134, in main
trainer.fit(
File "/home/ubuntu/workspace/ssl_utils.py", line 204, in fit
self.save_topk_models(model, epoch, None, score, save_model_path)
File "/usr/local/lib/python3.8/dist-packages/graphstorm-0.1.0.post1-py3.8.egg/graphstorm/trainer/gsgnn_
trainer.py", line 263, in save_topk_models
File "/usr/local/lib/python3.8/dist-packages/graphstorm-0.1.0.post1-py3.8.egg/graphstorm/trainer/gsgnn_
trainer.py", line 190, in save_model
File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2791, in barrier
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
This seems like a deadlock when saving models, not really a multi-sampler issue, though.
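If the checkpoint write is merely slow rather than truly hung, raising the 30-minute collective timeout on the gloo process group would at least distinguish the two cases. A sketch, assuming the script controls process-group creation (GraphStorm may initialize the group internally, in which case the timeout has to be plumbed through its own setup):

```python
# Sketch: raise the default 30-minute collective timeout so a slow (but not
# deadlocked) model save on rank 0 does not time out the other ranks waiting
# at the barrier.
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(backend="gloo", timeout=timedelta(hours=2))
```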
Torch 2.0.1 resolves the issue. Verified with 2.0.1+cu117.