LukeLIN-web/DSP-PPOPP2023

Running the single-GPU version of DSP fails with "no kernel image is available for execution on the device"


1
--------------Running DSP sampler on ogb-product with 1 GPUs--------------------
1
Using backend: pytorch
[03:08:42] /root/projects/dsdgl/src/ds/core.cc:59: Rank [0] initializing DS context
[03:08:42] /root/projects/dsdgl/src/ds/coordinator.cc:20: Coordinator initializing port: 12210
[03:08:42] /root/projects/dsdgl/src/ds/coordinator.cc:47: Rank 0 try to connect to root on addr tcp://gpunode1:33147
[03:08:42] /root/projects/dsdgl/src/ds/coordinator.cc:68: Get the info of peer 0 on address tcp://gpunode1:33147
[03:08:42] /root/projects/dsdgl/src/ds/coordinator.cc:86: My rank is 0, my device id is 0
[03:08:42] /root/projects/dsdgl/src/ds/coordinator.cc:20: Coordinator initializing port: 17211
[03:08:42] /root/projects/dsdgl/src/ds/coordinator.cc:47: Rank 0 try to connect to root on addr tcp://gpunode1:36719
[03:08:42] /root/projects/dsdgl/src/ds/coordinator.cc:68: Get the info of peer 0 on address tcp://gpunode1:36719
[03:08:42] /root/projects/dsdgl/src/ds/coordinator.cc:86: My rank is 0, my device id is 0
[03:08:42] /root/projects/dsdgl/src/ds/core.cc:73: Enable kernel control? 0
[03:08:42] /root/projects/dsdgl/src/ds/core.cc:100: Rank 0 successfully builds nccl communicator
[03:08:42] /root/projects/dsdgl/src/ds/core.cc:106: Enable profiler? 0
[03:08:47] /root/projects/dsdgl/src/ds/sampling.cc:241: Rank: 0, # train ids before rebalance: 196615
[03:08:52] /root/projects/dsdgl/src/ds/cache_graph.cc:22: Cache graph ratio: 100
[03:08:53] /root/projects/dsdgl/src/ds/cache_graph.cc:66: [Rank] 0 Cached nodes: 2449029 Cached edges: 126167309
[03:08:53] /root/projects/dsdgl/src/ds/cache_graph.cc:67: [Rank] 0 Host nodes: 0 host edges: 0
[03:08:54] /root/projects/dsdgl/src/ds/core.cc:130: Set local stream: 0x560657ae26f0
[03:08:54] /root/projects/dsdgl/src/ds/core.cc:130: Set local stream: 0x560657ae26f0
Start rank 0 with args: Namespace(dataset=None, graph_cache_gb=-1, graph_cache_ratio=100, graph_name='test', in_feats=256, n_ranks=1, part_config='/data/dsp/ogb-product1/ogb-product.json', sample_only=True)
loaded node feats
loaded graph
Host memory usage after load partition 7.528128512 GB
rank 0, # global: 2449029, # local: 2449029

in feats: 100

#labels: 47
Rank 0, subgraph nodes 0.002449029 B, subgraph edges 0.126167309 B
Rank 0, pytorch memory usage after move train_g to device : 1.069690368 GB
current thread: 139691514598336
current thread: 139679106066176
Using backend: pytorch
/root/miniconda3/envs/dsp/lib/python3.8/site-packages/dgl-0.7.2-py3.8-linux-x86_64.egg/dgl/ds/utils.py:209: UserWarning: masked_select received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at /opt/conda/conda-bld/pytorch_1631630839582/work/aten/src/ATen/native/TensorAdvancedIndexing.cpp:1187.)
train_nid = th.masked_select(
Traceback (most recent call last):
File "/root/projects/DSP_AE/dsp/sampling.py", line 133, in <module>
mp.spawn(run, args=(args,), nprocs=args.n_ranks, join=True)
File "/root/miniconda3/envs/dsp/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/root/miniconda3/envs/dsp/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/root/miniconda3/envs/dsp/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/root/miniconda3/envs/dsp/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/root/projects/DSP_AE/dsp/sampling.py", line 86, in run
for step, (input_nodes, seeds, blocks) in enumerate(dataloader):
File "/root/miniconda3/envs/dsp/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/root/miniconda3/envs/dsp/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 561, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/root/miniconda3/envs/dsp/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 35, in fetch
return self.collate_fn(data)
File "/root/miniconda3/envs/dsp/lib/python3.8/site-packages/dgl-0.7.2-py3.8-linux-x86_64.egg/dgl/dataloading/dataloader.py", line 513, in collate
blocks = self.block_sampler.sample_blocks(self.g, items)
File "/root/miniconda3/envs/dsp/lib/python3.8/site-packages/dgl-0.7.2-py3.8-linux-x86_64.egg/dgl/ds/utils.py", line 41, in sample_blocks
frontier, seeds = self.sample_neighbors(self.g, self.num_vertices,
File "/root/miniconda3/envs/dsp/lib/python3.8/site-packages/dgl-0.7.2-py3.8-linux-x86_64.egg/dgl/ds/sampling.py", line 25, in sample_neighbors
rets = _CAPI_DGLDSSampleNeighbors(None, num_vertices, device_min_vids, device_min_eids, nodes,
File "/root/miniconda3/envs/dsp/lib/python3.8/site-packages/dgl-0.7.2-py3.8-linux-x86_64.egg/dgl/_ffi/_ctypes/function.py", line 188, in __call__
check_call(_LIB.DGLFuncCall(
File "/root/miniconda3/envs/dsp/lib/python3.8/site-packages/dgl-0.7.2-py3.8-linux-x86_64.egg/dgl/_ffi/base.py", line 64, in check_call
raise DGLError(py_str(_LIB.DGLGetLastError()))
dgl._ffi.base.DGLError: [03:08:54] /root/projects/dsdgl/src/array/cuda/array_op_impl.cu:219: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA kernel launch error: no kernel image is available for execution on the device
Stack trace:
[bt] (0) /root/miniconda3/envs/dsp/lib/python3.8/site-packages/dgl-0.7.2-py3.8-linux-x86_64.egg/dgl/libdgl.so(+0x114c892) [0x7f0a91c25892]
[bt] (1) /root/miniconda3/envs/dsp/lib/python3.8/site-packages/dgl-0.7.2-py3.8-linux-x86_64.egg/dgl/libdgl.so(dgl::runtime::NDArray dgl::aten::impl::Full<(DLDeviceType)2, long>(long, long, DLContext)+0x1c1) [0x7f0a91c2d261]
[bt] (2) /root/miniconda3/envs/dsp/lib/python3.8/site-packages/dgl-0.7.2-py3.8-linux-x86_64.egg/dgl/libdgl.so(dgl::runtime::NDArray dgl::aten::Full(long, long, DLContext)+0xda) [0x7f0a910eac7a]
[bt] (3) /root/miniconda3/envs/dsp/lib/python3.8/site-packages/dgl-0.7.2-py3.8-linux-x86_64.egg/dgl/libdgl.so(dgl::ds::Partition(dgl::runtime::NDArray, dgl::runtime::NDArray)+0x4d) [0x7f0a924112bd]
[bt] (4) /root/miniconda3/envs/dsp/lib/python3.8/site-packages/dgl-0.7.2-py3.8-linux-x86_64.egg/dgl/libdgl.so(+0x8cfb6f) [0x7f0a913a8b6f]
[bt] (5) /root/miniconda3/envs/dsp/lib/python3.8/site-packages/dgl-0.7.2-py3.8-linux-x86_64.egg/dgl/libdgl.so(+0x8d0244) [0x7f0a913a9244]
[bt] (6) /root/miniconda3/envs/dsp/lib/python3.8/site-packages/dgl-0.7.2-py3.8-linux-x86_64.egg/dgl/libdgl.so(DGLFuncCall+0x73) [0x7f0a91a672a3]
[bt] (7) /root/miniconda3/envs/dsp/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f0c770279dd]
[bt] (8) /root/miniconda3/envs/dsp/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7f0c77027067]
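
The "no kernel image is available for execution on the device" error generally means the binary contains no code built for the GPU's compute capability. Below is a minimal sketch of how that compatibility can be checked. The helper names `parse_arch` and `arch_supported` are hypothetical and not part of DSP or DGL. The `__main__` part only inspects the architectures PyTorch itself was built for; the list for the custom `libdgl.so` depends on how it was compiled and would have to be passed in by hand.

```python
import re


def parse_arch(tag):
    """Split a tag such as 'sm_70' or 'compute_80' into (kind, major, minor).

    Returns None for tags that do not follow this plain pattern.
    """
    m = re.fullmatch(r"(sm|compute)_(\d+)", tag)
    if m is None:
        return None
    number = int(m.group(2))
    return m.group(1), number // 10, number % 10


def arch_supported(capability, arch_list):
    """Return True if a device with `capability` can run code built for `arch_list`.

    Rules assumed here:
      * 'sm_XY' (binary code) runs on devices with the same major version
        and a minor version >= Y.
      * 'compute_XY' (PTX) can be JIT-compiled for any capability >= X.Y.
    """
    major, minor = capability
    for tag in arch_list:
        parsed = parse_arch(tag)
        if parsed is None:
            continue  # skip tags this sketch does not understand
        kind, a_major, a_minor = parsed
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()  # architectures PyTorch was built for
        print("device capability:", cap)
        print("PyTorch arch list:", archs)
        print("PyTorch build covers this GPU:", arch_supported(cap, archs))
    else:
        print("PyTorch with CUDA is not available; skipping the device query")
```

For example, `arch_supported((8, 6), ["sm_70", "sm_75"])` is false, which is the situation that would produce this error if `libdgl.so` had been built only for those architectures; whether that is the case here is an assumption that still needs to be confirmed against the actual build flags.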