LinkNeighborLoader Fails W/ MultiProcessing
puririshi98 opened this issue · 7 comments
🐛 Describe the bug
Mysterious error output:
E File "/opt/conda/lib/python3.8/multiprocessing/util.py", line 452, in spawnv_passfds
E return _posixsubprocess.fork_exec(
E ValueError: bad value(s) in fds_to_keep
Environment
NVIDIA Container (in slack msg due to NDA)
test_tabformer_pyg.zip
The dataset can be downloaded here; extract transactions.tgz to get the desired dataset. It should be placed at /workspace/data/tabformer/card_transaction.v1.csv
Then the test can be run w/:
py.test -s test_tabformer_pyg.py -v
It should work out of the box, since our container is on an older PyG version that still uses the original torch-sparse implementation of hetero neighbor sampling. If you update to the latest master-branch source builds of pyg-lib and PyG, it will start using pyg-lib for hetero neighbor sampling, and then you get the bad value(s) in fds_to_keep error. Not sure what's going on.
Synced offline. Is this error still present?
I continued to mess w/ the PyG distributed sampling example: I modified it to use LinkNeighborLoader instead of NeighborSampler so that it is closer to the gnn tool trainer code. Testing w/ the latest pyg-lib and PyG.
Findings: If I switch num_workers from 0 to anything > 0, I can replicate the failure for both homogeneous and heterogeneous synthetic data:
Homo repro: https://github.com/puririshi98/rgcn_pyg_lib_forward_bench/blob/main/homo_linkneighbor.py
Hetero repro: https://github.com/puririshi98/rgcn_pyg_lib_forward_bench/blob/main/hetero_linkneighbor.py
This implies the error comes from the interaction between torch.multiprocessing.spawn, pyg-lib neighbor sampling, and the num_workers setting of the LinkNeighborLoader.
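For context on the message itself: as far as I can tell from the CPython 3.8 source, multiprocessing.util.spawnv_passfds sorts the file descriptors it is asked to pass to the child, and _posixsubprocess.fork_exec then rejects the tuple unless every entry is a non-negative int and the sequence is strictly increasing. So a duplicated descriptor is one way to get "bad value(s) in fds_to_keep". Below is a minimal pure-Python sketch of that check. The function names are hypothetical, this is a reimplementation for illustration rather than the CPython code, and it is not established that duplicates are the trigger in this issue.

```python
INT_MAX = 2**31 - 1  # assumption: 32-bit C int, as on common Linux builds


def prepare_passfds(passfds):
    """Normalize descriptors the way spawnv_passfds does before fork_exec."""
    return tuple(sorted(map(int, passfds)))


def fds_to_keep_is_valid(fds):
    """Hypothetical mirror of the sanity check applied to fds_to_keep.

    Returns False where CPython 3.8 would raise
    ValueError("bad value(s) in fds_to_keep").
    """
    prev = -1
    for fd in fds:
        if not isinstance(fd, int):
            return False
        # Negative, too large, or not strictly greater than the previous
        # entry (i.e. unsorted or duplicated).
        if fd < 0 or fd <= prev or fd > INT_MAX:
            return False
        prev = fd
    return True


if __name__ == "__main__":
    # Distinct descriptors pass once sorted.
    print(fds_to_keep_is_valid(prepare_passfds([7, 3, 5])))
    # The same descriptor listed twice survives sorting and fails the check.
    print(fds_to_keep_is_valid(prepare_passfds([3, 5, 5])))
```

If that is what happens here, something on the spawn path would be handing the same descriptor to the worker more than once, which would fit the failure only showing up with num_workers > 0.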
And you have no trouble running normal NeighborLoader with num_workers>0? This suggests that this isn't an issue with the pyg-lib implementation (they both re-use the same logic here). I will try to look into it.
And you have no trouble running normal NeighborLoader with num_workers>0
Yes, that is the case.
No bug (NeighborLoader): https://github.com/puririshi98/rgcn_pyg_lib_forward_bench/blob/main/hetero_neighbor.py
Bug (LinkNeighborLoader): https://github.com/puririshi98/rgcn_pyg_lib_forward_bench/blob/main/hetero_linkneighbor.py
Can reproduce, looking into it.
Fixed via pyg-team/pytorch_geometric#5978.