awslabs/graphstorm

Issues with multiple samplers on torch 1.13

isratnisa opened this issue · 8 comments

🐛 Bug

The training script for link prediction does not work with multiple samplers on PyTorch 1.13. So far, three different bugs have been found. In summary:

  1. MAG-LSC (4 partitions), #sampler=1: throws a CUDA OOM error
  2. MAG-LSC (4 partitions), #sampler=4: throws KeyError: 'dataloader-0' from dgl/distributed/dist_context.py
  3. OGBN-MAG (4 partitions), #sampler=1: throws IndexError: index out of range in self

Note:

  • Both datasets work fine for #sampler=0
  • Both datasets work fine for PyTorch 1.12 with #sampler=0/1/4
  • OGBN-MAG (1 partition) works fine with 0, 1 or 4 samplers

Details

Bug 1:

Run command:

python3 -u ~/dgl/tools/launch.py \
        --workspace /graph-storm/python/graphstorm/run/gsgnn_lp \
        --num_trainers 4 \
        --num_servers 1 \
        --num_samplers 1 \
        --part_config /data/mag-lsc-lp-4p/mag-lsc.json \
        --ip_config /data/ip_list_p4_metal.txt \
        --ssh_port 2222 \
        --graph_format csc,coo \
        "python3 gsgnn_lp.py --cf /data/mag_lsc_lp_p4.yaml --node-feat-name paper:feat"

Error:

Traceback (most recent call last):
  File "gsgnn_lp.py", line 154, in <module>
    main(args)
  File "gsgnn_lp.py", line 120, in main
    trainer.fit(train_loader=dataloader, val_loader=val_dataloader,
  File "/graph-storm/python/graphstorm/trainer/lp_trainer.py", line 140, in fit
    loss = model(blocks, pos_graph, neg_graph,
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/graph-storm/python/graphstorm/model/lp_gnn.py", line 93, in forward
    encode_embs = self.compute_embed_step(blocks, node_feats)
  File "/graph-storm/python/graphstorm/model/gnn.py", line 527, in compute_embed_step
    embs = self.node_input_encoder(input_feats, input_nodes)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/graph-storm/python/graphstorm/model/embed.py", line 239, in forward
    emb = input_feats[ntype].float() @ self.input_projs[ntype]
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.14 GiB (GPU 3; 14.58 GiB total capacity; 1.81 GiB already allocated; 1.94 GiB free; 11.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Bug 2:

Run command:

python3 -u ~/dgl/tools/launch.py \
        --workspace /graph-storm/python/graphstorm/run/gsgnn_lp \
        --num_trainers 4 \
        --num_servers 1 \
        --num_samplers 4 \
        --part_config /data/mag-lsc-lp-4p/mag-lsc.json \
        --ip_config /data/ip_list_p4_metal.txt \
        --ssh_port 2222 \
        --graph_format csc,coo \
        "python3 gsgnn_lp.py --cf /data/mag_lsc_lp_p4.yaml --node-feat-name paper:feat"

Error:

Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/reductions.py", line 367, in reduce_storage
    df = multiprocessing.reduction.DupFd(fd)
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/reductions.py", line 367, in reduce_storage
    df = multiprocessing.reduction.DupFd(fd)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/usr/lib/python3.8/multiprocessing/reduction.py", line 198, in DupFd
    return resource_sharer.DupFd(fd)
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/reductions.py", line 367, in reduce_storage
    df = multiprocessing.reduction.DupFd(fd)
  File "/usr/lib/python3.8/multiprocessing/resource_sharer.py", line 48, in __init__
    new_fd = os.dup(fd)
OSError: [Errno 9] Bad file descriptor
Client [64406] waits on 172.31.31.233:60149
Machine (0) group (0) client (79) connect to server successfuly!
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/dist_context.py", line 101, in init_process
    collate_fn_dict[dataloader_name](collate_args),
KeyError: 'dataloader-0'

Bug 3:

Run command:

python3 -u ~/dgl/tools/launch.py \
        --workspace /graph-storm/python/graphstorm/run/gsgnn_lp \
        --num_trainers 4 \
        --num_servers 1 \
        --num_samplers 4 \
        --part_config /data/ogbn-mag-lp-4p/ogbn-mag.json \
        --ip_config /data/ip_list_p4_metal.txt \
        --ssh_port 2222 \
        --graph_format csc,coo \
        "python3 gsgnn_lp.py --cf /graph-storm/tests/regression-tests/OGBN-MAG/mag_lp_4p.yaml --node-feat-name paper:feat"

Error:

Machine (2) group (0) client (52) connect to server successfuly!
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/dist_context.py", line 101, in init_process
    collate_fn_dict[dataloader_name](collate_args),
  File "/usr/local/lib/python3.8/dist-packages/dgl/dataloading/dist_dataloader.py", line 516, in collate
    return self._collate_with_negative_sampling(items)
  File "/usr/local/lib/python3.8/dist-packages/dgl/dataloading/dist_dataloader.py", line 441, in _collate_with_negative_sampling
    pair_graph = self.g.edge_subgraph(items, relabel_nodes=False)
  File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/dist_graph.py", line 1283, in edge_subgraph
    subg[etype] = self.find_edges(edge, etype)
  File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/dist_graph.py", line 1235, in find_edges
    edges = gpb.map_to_homo_eid(edges, etype)
  File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/graph_partition_book.py", line 781, in map_to_homo_eid
    end_diff = F.gather_row(typed_max_eids, partids) - ids
  File "/usr/local/lib/python3.8/dist-packages/dgl/backend/pytorch/tensor.py", line 238, in gather_row
    return th.index_select(data, 0, row_index.long())
IndexError: index out of range in self
Process SpawnProcess-2:

Environment

  • DGL Version: 1.01
  • PyTorch: 1.13
  • CUDA/cuDNN version: 11.6

For bug 1, I think it's worth trying the suggestion from the error message, max_split_size_mb:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.14 GiB (GPU 3; 14.58 GiB total capacity; 1.81 GiB already allocated; 1.94 GiB free; 11.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
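
One way to try this is to set PYTORCH_CUDA_ALLOC_CONF inside the trainer process before it makes its first CUDA allocation. The following is a minimal sketch only: the helper name set_max_split_size_mb and the value 128 are illustrative assumptions, not settings tested in this thread.

```python
import os


def set_max_split_size_mb(size_mb: int = 128) -> str:
    """Set max_split_size_mb in PYTORCH_CUDA_ALLOC_CONF for this process.

    Hypothetical helper: call it at the top of the training script,
    before any CUDA memory is allocated. 128 is only a starting guess.
    Note that this overwrites any existing value of the variable.
    """
    conf = f"max_split_size_mb:{size_mb}"
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = conf
    return conf


# Call before the first CUDA allocation in the trainer script.
print(set_max_split_size_mb(128))
```

Since the launcher starts trainers over ssh, the environment of the launching shell is not necessarily what the remote trainer sees, which is why this sketch sets the variable from inside the script.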

Hi @isratnisa, to verify whether this is the same issue as dmlc/dgl#5480, can you please try reverting the problematic commit (pytorch/pytorch@b25a1ce) or rebuilding PyTorch from top of tree (TOT) to see if it works?

Hi, I am facing the same error as in #199 (comment).

I have tried both torch=1.12.1 and torch=1.12.0; the error remains when setting num_samplers>0.

Run command:

root@ip-172-31-5-112:/graphstorm# python3 -m graphstorm.run.launch \
           --workspace /data \
           --part-config /data/7days_subsample_1000_dense_construct/Cramer.json \
           --ip-config /data/ip_list_1_machine.txt \
           --num-trainers 1 \
           --num-servers 1 \
           --num-samplers 1 \
           --ssh-port 2222 \
           main.py --cf /data/code_dev_tmp_dir/local_machine_test_nc.yaml

Error:

Client[0] in group[0] is exiting...
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
  File "/usr/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/reductions.py", line 358, in reduce_storage
    fd, size = storage._share_fd_cpu_()
RuntimeError: unable to open shared memory object </torch_3796335_4048025375_792> in read-write mode: Too many open files (24)
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
  File "/usr/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/reductions.py", line 359, in reduce_storage
    df = multiprocessing.reduction.DupFd(fd)
  File "/usr/lib/python3.8/multiprocessing/reduction.py", line 198, in DupFd
    return resource_sharer.DupFd(fd)
  File "/usr/lib/python3.8/multiprocessing/resource_sharer.py", line 48, in __init__
OSError: [Errno 24] Too many open files
/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3796200) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-20_01:56:16
  host      : ip-172-31-5-112.us-west-2.compute.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3796200)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 127.0.0.1 'cd /data; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=1 DGL_NUM_CLIENT=2 DGL_CONF_PATH=/data/7days_subsample_1000_dense_construct/Cramer.json DGL_IP_CONFIG=/data/ip_list_1_machine.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=96 DGL_GROUP_ID=0 PYTHONPATH=/graphstorm/python/: ; /usr/bin/python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=1234 main.py --cf /data/code_dev_tmp_dir/local_machine_test_nc.yaml --ip-config /data/ip_list_1_machine.txt --part-config /data/7days_subsample_1000_dense_construct/Cramer.json --verbose False)'' returned non-zero exit status 1.
Task failed

"Too many open files (24)": could you check the limit on open files on all machines with ulimit -n?

Hi @isratnisa, are you running inside Docker when you get this error? Without Docker, pytorch=1.13.1 works well for me. Inside Docker, however, I get this error ...

In fact, I also tried pytorch=1.12.0 and even older versions, and I still get this error inside Docker. So I think this may not be a PyTorch issue?

"Too many open files (24)": could you check the limit on open files on all machines with ulimit -n?

I get the following numbers by checking the open files limits. It seems like this is not the root cause given that those numbers are quite large:

root@ip-172-31-5-112:/# ulimit -n
1048576
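
The limit reported by the interactive shell is not necessarily the one that applies to the trainer or sampler process, for example when it is started over ssh or inside Docker. As a rough sketch, the limit and the current number of open descriptors can be read from inside the Python process itself. The function name open_file_report is hypothetical, and the /proc/self/fd count is Linux-only:

```python
import os
import resource


def open_file_report() -> dict:
    """Return the RLIMIT_NOFILE limits and the open fd count of this process."""
    # Soft and hard limits as seen by this process (what `ulimit -n` reports
    # for a shell is the soft limit of that shell, not of this process).
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # On Linux, each entry in /proc/self/fd is one open descriptor.
    fd_dir = "/proc/self/fd"
    open_fds = len(os.listdir(fd_dir)) if os.path.isdir(fd_dir) else None
    return {"soft": soft, "hard": hard, "open_fds": open_fds}


print(open_file_report())
```

Logging this from the training script shortly before the failure point would show whether the process is actually approaching its own limit.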

Besides, I have also tried torch==1.12.0 without Docker, and distributed training with multiple samplers works for a few epochs. However, I still get a timeout error when using num_trainer=4, num_sampler=8, num_server=1 on a single machine (i.e., standalone mode):

WARNING: We do not export the state of sparse optimizer
Part 0 | Epoch 00002 | Batch 000 | Loss: 1.4303 | Time: 1.4251
Part 0 | Epoch 00002 | Batch 020 | Loss: 1.3768 | Time: 1.2607
Part 0 | Epoch 00002 | Batch 040 | Loss: 1.3559 | Time: 3.6379
Part 0 | Epoch 00002 | Batch 060 | Loss: 1.3488 | Time: 1.1648
Part 0 | Epoch 00002 | Batch 080 | Loss: 1.3189 | Time: 6.6632
Part 0 | Epoch 00002 | Batch 100 | Loss: 1.3186 | Time: 1.1465
Epoch 2 take 234.96398901939392
{'precision_recall': 0.6888486811967163}
successfully save the model to ~/workspaceresults/models_pretrain/epoch-2
Time on save model 54.03713631629944
WARNING: We do not export the state of sparse optimizer
Part 0 | Epoch 00003 | Batch 000 | Loss: 1.3021 | Time: 1.8931
Part 0 | Epoch 00003 | Batch 020 | Loss: 1.3017 | Time: 1.2372
Part 0 | Epoch 00003 | Batch 040 | Loss: 1.2778 | Time: 3.9902
Part 0 | Epoch 00003 | Batch 060 | Loss: 1.2740 | Time: 1.2194
Part 0 | Epoch 00003 | Batch 080 | Loss: 1.2633 | Time: 5.3252
Part 0 | Epoch 00003 | Batch 100 | Loss: 1.2589 | Time: 1.1126
Epoch 3 take 234.03102159500122
{'precision_recall': 0.634166373923565}
Part 0 | Epoch 00004 | Batch 000 | Loss: 1.2546 | Time: 1.7848
Part 0 | Epoch 00004 | Batch 020 | Loss: 1.2476 | Time: 1.2330
Part 0 | Epoch 00004 | Batch 040 | Loss: 1.2364 | Time: 5.1134
Part 0 | Epoch 00004 | Batch 060 | Loss: 1.2418 | Time: 1.1508
Part 0 | Epoch 00004 | Batch 080 | Loss: 1.2219 | Time: 6.7140
Part 0 | Epoch 00004 | Batch 100 | Loss: 1.2376 | Time: 1.2580
Epoch 4 take 235.376788854599
{'precision_recall': 0.6728968917865032}
Traceback (most recent call last):
  File "main_ssl.py", line 156, in <module>
    main(args)
  File "main_ssl.py", line 134, in main
    trainer.fit(
  File "/home/ubuntu/workspace/ssl_utils.py", line 147, in fit
    loss.backward()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [172.31.5.112]:10154
Traceback (most recent call last):
  File "main_ssl.py", line 156, in <module>
Client[26] in group[0] is exiting...
    main(args)
  File "main_ssl.py", line 134, in main
    trainer.fit(
  File "/home/ubuntu/workspace/ssl_utils.py", line 147, in fit
    loss.backward()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [172.31.5.112]:911
Client[12] in group[0] is exiting...
Traceback (most recent call last):
  File "main_ssl.py", line 156, in <module>
    main(args)
  File "main_ssl.py", line 134, in main
    trainer.fit(
  File "/home/ubuntu/workspace/ssl_utils.py", line 204, in fit
    self.save_topk_models(model, epoch, None, score, save_model_path)
  File "/usr/local/lib/python3.8/dist-packages/graphstorm-0.1.0.post1-py3.8.egg/graphstorm/trainer/gsgnn_trainer.py", line 263, in save_topk_models
  File "/usr/local/lib/python3.8/dist-packages/graphstorm-0.1.0.post1-py3.8.egg/graphstorm/trainer/gsgnn_trainer.py", line 190, in save_model
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2791, in barrier
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete

This looks like a deadlock when saving models rather than a multi-sampler issue.

Torch 2.0.1 resolves the issue. Verified with 2.0.1+cu117.
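
Based only on the reports in this thread, a launcher or training script could warn when samplers are requested on an older torch. The sketch below is an assumption-laden illustration: both function names are hypothetical, and it conservatively treats every version below 2.0.1 as suspect even though 1.12 was reported to work in some setups.

```python
import re


def parse_torch_version(version: str) -> tuple:
    """Turn a version string such as '2.0.1+cu117' into (2, 0, 1)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component (e.g. '1.13') is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def samplers_look_safe(version: str, num_samplers: int) -> bool:
    """Assumption from this thread only: samplers > 0 are fine on torch >= 2.0.1."""
    return num_samplers == 0 or parse_torch_version(version) >= (2, 0, 1)


if not samplers_look_safe("1.13.0", num_samplers=4):
    print("warning: num_samplers > 0 was reported to fail on this torch version")
```

In a real script the version string would come from torch.__version__ rather than a literal.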