pyg-team/pyg-lib

LinkNeighborLoader Fails W/ MultiProcessing

puririshi98 opened this issue · 7 comments

๐Ÿ› Describe the bug

Mysterious error output:

E         File "/opt/conda/lib/python3.8/multiprocessing/util.py", line 452, in spawnv_passfds
E           return _posixsubprocess.fork_exec(
E       ValueError: bad value(s) in fds_to_keep

bad_fds_to_keep.txt
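For readers unfamiliar with this message, here is a minimal, standard-library-only sketch of one way the same ValueError can be raised by the function in the traceback. It is not a diagnosis of this issue: it assumes CPython's check in fork_exec rejects a descriptor list that is not strictly increasing, so a descriptor passed twice is refused. Whether the pyg-lib code path produces duplicate descriptors is not established in this thread. The helper names try_spawn_with_fds and demo are hypothetical, and multiprocessing.util.spawnv_passfds is an internal CPython helper available on POSIX only.

```python
import multiprocessing.util as mp_util
import os
import sys


def try_spawn_with_fds(fds):
    """Call spawnv_passfds with `fds`.

    Returns the ValueError text if the call is rejected,
    or None if a child process was spawned successfully.
    """
    try:
        pid = mp_util.spawnv_passfds(
            sys.executable, [sys.executable, "-c", "pass"], fds
        )
    except ValueError as exc:
        return str(exc)
    os.waitpid(pid, 0)  # reap the short-lived child process
    return None


def demo():
    """Compare a list of unique descriptors with one containing a duplicate."""
    read_fd, write_fd = os.pipe()
    try:
        unique_result = try_spawn_with_fds([read_fd])
        # Assumption: the same descriptor listed twice fails the sanity check.
        duplicate_result = try_spawn_with_fds([read_fd, read_fd])
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return unique_result, duplicate_result


if __name__ == "__main__":
    unique_result, duplicate_result = demo()
    print("unique fds   ->", unique_result)
    print("duplicate fd ->", duplicate_result)
```

If the duplicate case reports the same message as the traceback above, it may be worth checking which file descriptors are handed to the spawned workers.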

Environment

NVIDIA Container (in slack msg due to NDA)
test_tabformer_pyg.zip
The dataset can be downloaded here; unpack transactions.tgz to get the desired dataset. The extracted file should be placed at /workspace/data/tabformer/card_transaction.v1.csv

Then the test can be run w/:
py.test -s test_tabformer_pyg.py -v

It should work out of the box, since our container is on an older PyG version that still uses the original torch-sparse implementation of hetero neighbor sampling. If you update to the latest master-branch source builds of pyg-lib and PyG, it starts using pyg-lib for hetero neighbor sampling, and you then get the "bad value(s) in fds_to_keep" error. Not sure what's going on.

Synced offline. Is this error still present?

I continued to mess w/ the PyG distributed sampling example: I modified it to use LinkNeighborLoader instead of NeighborSampler, so that it is closer to the GNN tool trainer code. Tested w/ the latest pyg-lib and PyG.
Findings: if I switch num_workers from 0 to anything > 0, I can replicate the failure for both homogeneous and heterogeneous synthetic data:

Homo repro: https://github.com/puririshi98/rgcn_pyg_lib_forward_bench/blob/main/homo_linkneighbor.py
Hetero repro: https://github.com/puririshi98/rgcn_pyg_lib_forward_bench/blob/main/hetero_linkneighbor.py

This suggests the error comes from the interaction between torch.multiprocessing.spawn, pyg-lib neighbor sampling, and the num_workers setting of the LinkNeighborLoader.

And you have no trouble running the normal NeighborLoader with num_workers > 0? If so, that suggests this isn't an issue with the pyg-lib implementation (both loaders re-use the same logic here). I will try to look into it.

Can reproduce, looking into it.