microsoft/DeepGNN

AttributeError: Can't pickle local object 'CDLL.__init__.<locals>._FuncPtr'

nabihach opened this issue · 2 comments

Environment

  • Python version: (python -V) 3.7.7
  • deepgnn-ge Version: (python -m pip show deepgnn-ge) 0.1.55.1
  • deepgnn-torch Version: (python -m pip show deepgnn-torch) 0.1.55.1
  • deepgnn-tf Version: (python -m pip show deepgnn-tf) not installed
  • OS: (Windows, Linux, ...) Windows 10 Enterprise

Issue Details

  • What you did - code sample or commands run

I installed deepgnn-torch via pip in a virtual environment, cloned the DeepGNN repository, changed into examples/pytorch/gat/, and ran bash run.sh.

  • Expected behavior

I expected the training script to run without issues.

  • Actual behavior

I see the error:

  File "c:\users\myid\appdata\local\programs\python\python37\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'CDLL.__init__.<locals>._FuncPtr'

Full stack trace:

$ bash run.sh
+ DEVICE=cpu
++ dirname run.sh
+ DIR_NAME=.
+ GRAPH=/tmp/cora
+ python -m deepgnn.graph_engine.data.citation --data_dir /tmp/cora
c:\users\myid\appdata\local\programs\python\python37\lib\runpy.py:125: RuntimeWarning: 'deepgnn.graph_engine.data.citation' found in sys.modules after import of package 'deepgnn.graph_engine.data', but prior to execution of 'deepgnn.graph_engine.data.citation'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
[2022-09-09 14:46:04,150] {convert.py:100} INFO - worker 0 try to generate partition: 0 - 1
[2022-09-09 14:46:04,151] {_adl_reader.py:124} INFO - [1,0] Input files: ['C:/Users/myid/AppData/Local/Temp/cora\\graph.json']
[2022-09-09 14:46:04,782] {dispatcher.py:143} INFO - record processed: 1000
[2022-09-09 14:46:05,257] {dispatcher.py:143} INFO - record processed: 2000
[2022-09-09 14:46:05,657] {local.py:44} INFO - Graph data path: C:/Users/myid/AppData/Local/Temp/cora. Partitions [0]. Storage type 0. Config path . Stream False.
[2022-09-09 14:46:05,707] {local.py:52} INFO - Loaded snark graph. Node counts: [140, 500, 1000, 1068]. Edge counts: [10556]
graph data: C:/Users/myid/AppData/Local/Temp/cora
+ MODEL_DIR=/tmp/model_fix
+ rm -rf /tmp/model_fix
+ [[ cpu == \g\p\u ]]
+ python ./main.py --data_dir /tmp/cora --mode train --seed 123 --backend snark --graph_type local --converter skip --batch_size 140 --learning_rate 0.005 --num_epochs 180 --sample_file /tmp/cora/train.nodes --node_type 0 --model_dir /tmp/model_fix --metric_dir /tmp/model_fix --save_path /tmp/model_fix --eval_file /tmp/cora/test.nodes --eval_during_train_by_steps 1 --feature_idx 0 --feature_dim 1433 --label_idx 1 --label_dim 1 --head_num 8,1 --num_classes 7 --neighbor_edge_types 0 --attn_drop 0.6 --ffd_drop 0.6 --log_by_steps 1 --use_per_step_metrics
[2022-09-09 14:46:08,646] {factory.py:38} INFO - GE_OMP_NUM_THREADS=1
[2022-09-09 14:46:08,647] {factory.py:38} INFO - apex_opt_level=O2
[2022-09-09 14:46:08,647] {factory.py:38} INFO - attn_drop=0.6
[2022-09-09 14:46:08,647] {factory.py:38} INFO - backend=snark
[2022-09-09 14:46:08,647] {factory.py:38} INFO - batch_size=140
[2022-09-09 14:46:08,647] {factory.py:38} INFO - client_rank=None
[2022-09-09 14:46:08,647] {factory.py:38} INFO - clip_grad=False
[2022-09-09 14:46:08,647] {factory.py:38} INFO - config_path=
[2022-09-09 14:46:08,647] {factory.py:38} INFO - converter=skip
[2022-09-09 14:46:08,647] {factory.py:38} INFO - data_dir=C:/Users/myid/AppData/Local/Temp/cora
[2022-09-09 14:46:08,647] {factory.py:38} INFO - data_parallel_num=2
[2022-09-09 14:46:08,647] {factory.py:38} INFO - dim=256
[2022-09-09 14:46:08,647] {factory.py:38} INFO - disable_ib=False
[2022-09-09 14:46:08,647] {factory.py:38} INFO - enable_adl_uploader=False
[2022-09-09 14:46:08,647] {factory.py:38} INFO - enable_ssl=False
[2022-09-09 14:46:08,647] {factory.py:38} INFO - eval_during_train_by_steps=1
[2022-09-09 14:46:08,647] {factory.py:38} INFO - eval_file=C:/Users/myid/AppData/Local/Temp/cora/test.nodes
[2022-09-09 14:46:08,647] {factory.py:38} INFO - fanouts=[10, 10]
[2022-09-09 14:46:08,647] {factory.py:38} INFO - featenc_config=None
[2022-09-09 14:46:08,647] {factory.py:38} INFO - feature_dim=1433
[2022-09-09 14:46:08,648] {factory.py:38} INFO - feature_idx=0
[2022-09-09 14:46:08,648] {factory.py:38} INFO - feature_type=float
[2022-09-09 14:46:08,648] {factory.py:38} INFO - ffd_drop=0.6
[2022-09-09 14:46:08,648] {factory.py:38} INFO - fp16=amp
[2022-09-09 14:46:08,648] {factory.py:38} INFO - ge_start_timeout=30
[2022-09-09 14:46:08,648] {factory.py:38} INFO - gpu=False
[2022-09-09 14:46:08,648] {factory.py:38} INFO - grad_max_norm=1.0
[2022-09-09 14:46:08,648] {factory.py:38} INFO - graph_type=local
[2022-09-09 14:46:08,648] {factory.py:38} INFO - head_num=[8, 1]
[2022-09-09 14:46:08,648] {factory.py:38} INFO - hidden_dim=8
[2022-09-09 14:46:08,648] {factory.py:38} INFO - job_id=aa812d6f
[2022-09-09 14:46:08,648] {factory.py:38} INFO - l2_coef=0.0005
[2022-09-09 14:46:08,648] {factory.py:38} INFO - label_dim=1
[2022-09-09 14:46:08,648] {factory.py:38} INFO - label_idx=1
[2022-09-09 14:46:08,648] {factory.py:38} INFO - learning_rate=0.005
[2022-09-09 14:46:08,648] {factory.py:38} INFO - local_rank=0
[2022-09-09 14:46:08,648] {factory.py:38} INFO - log_by_steps=1
[2022-09-09 14:46:08,648] {factory.py:38} INFO - max_id=None
[2022-09-09 14:46:08,648] {factory.py:38} INFO - max_samples=0
[2022-09-09 14:46:08,648] {factory.py:38} INFO - max_saved_ckpts=0
[2022-09-09 14:46:08,648] {factory.py:38} INFO - meta_dir=
[2022-09-09 14:46:08,648] {factory.py:38} INFO - metric_dir=C:/Users/myid/AppData/Local/Temp/model_fix
[2022-09-09 14:46:08,649] {factory.py:38} INFO - mode=train
[2022-09-09 14:46:08,649] {factory.py:38} INFO - model_args=
[2022-09-09 14:46:08,649] {factory.py:38} INFO - model_dir=C:/Users/myid/AppData/Local/Temp/model_fix
[2022-09-09 14:46:08,649] {factory.py:38} INFO - neighbor_count=10
[2022-09-09 14:46:08,649] {factory.py:38} INFO - neighbor_edge_types=[0]
[2022-09-09 14:46:08,649] {factory.py:38} INFO - node_type=0
[2022-09-09 14:46:08,649] {factory.py:38} INFO - num_classes=7
[2022-09-09 14:46:08,649] {factory.py:38} INFO - num_epochs=180
[2022-09-09 14:46:08,649] {factory.py:38} INFO - num_ge=0
[2022-09-09 14:46:08,649] {factory.py:38} INFO - num_negs=5
[2022-09-09 14:46:08,649] {factory.py:38} INFO - num_parallel=2
[2022-09-09 14:46:08,649] {factory.py:38} INFO - partitions=[0]
[2022-09-09 14:46:08,649] {factory.py:38} INFO - prefetch_factor=2
[2022-09-09 14:46:08,649] {factory.py:38} INFO - prefetch_size=16
[2022-09-09 14:46:08,649] {factory.py:38} INFO - sample_file=C:/Users/myid/AppData/Local/Temp/cora/train.nodes
[2022-09-09 14:46:08,649] {factory.py:38} INFO - save_ckpt_by_epochs=1
[2022-09-09 14:46:08,649] {factory.py:38} INFO - save_ckpt_by_steps=0
[2022-09-09 14:46:08,649] {factory.py:38} INFO - save_path=C:/Users/myid/AppData/Local/Temp/model_fix
[2022-09-09 14:46:08,649] {factory.py:38} INFO - seed=123
[2022-09-09 14:46:08,649] {factory.py:38} INFO - server_idx=None
[2022-09-09 14:46:08,649] {factory.py:38} INFO - servers=
[2022-09-09 14:46:08,649] {factory.py:38} INFO - skip_ge_start=False
[2022-09-09 14:46:08,649] {factory.py:38} INFO - sort_ckpt_by_mtime=False
[2022-09-09 14:46:08,649] {factory.py:38} INFO - ssl_cert=
[2022-09-09 14:46:08,650] {factory.py:38} INFO - storage_type=0
[2022-09-09 14:46:08,650] {factory.py:38} INFO - strategy=RandomWithoutReplacement
[2022-09-09 14:46:08,650] {factory.py:38} INFO - stream=False
[2022-09-09 14:46:08,650] {factory.py:38} INFO - sync_dir=
[2022-09-09 14:46:08,650] {factory.py:38} INFO - trainer=base
[2022-09-09 14:46:08,650] {factory.py:38} INFO - uploader_process_num=1
[2022-09-09 14:46:08,650] {factory.py:38} INFO - uploader_store_name=
[2022-09-09 14:46:08,650] {factory.py:38} INFO - uploader_threads_num=12
[2022-09-09 14:46:08,650] {factory.py:38} INFO - use_per_step_metrics=True
[2022-09-09 14:46:08,650] {factory.py:38} INFO - user_name=10.0.0.200
[2022-09-09 14:46:08,650] {factory.py:38} INFO - warmup=0.0002
[2022-09-09 14:46:08,654] {local.py:44} INFO - Graph data path: C:/Users/myid/AppData/Local/Temp/cora. Partitions [0]. Storage type 0. Config path . Stream False.
[2022-09-09 14:46:08,666] {local.py:52} INFO - Loaded snark graph. Node counts: [140, 500, 1000, 1068]. Edge counts: [10556]
[2022-09-09 14:46:08,666] {main.py:37} INFO - Creating GAT model with seed:123.
[2022-09-09 14:46:08,668] {base_model.py:39} INFO - [BaseModel] feature_type: FeatureType.FLOAT, feature_idx:0, feature_dim:0.
[2022-09-09 14:46:08,672] {trainer.py:472} INFO - [1,0] Max steps per epoch:-1
[2022-09-09 14:46:08,672] {utils.py:107} INFO - 0, input_layer.att_head-0.bias: torch.Size([8]), cpu
[2022-09-09 14:46:08,672] {utils.py:107} INFO - 1, input_layer.att_head-0.w.weight: torch.Size([8, 1433]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 2, input_layer.att_head-0.attn_l.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 3, input_layer.att_head-0.attn_l.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 4, input_layer.att_head-0.attn_r.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 5, input_layer.att_head-0.attn_r.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 6, input_layer.att_head-1.bias: torch.Size([8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 7, input_layer.att_head-1.w.weight: torch.Size([8, 1433]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 8, input_layer.att_head-1.attn_l.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 9, input_layer.att_head-1.attn_l.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 10, input_layer.att_head-1.attn_r.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 11, input_layer.att_head-1.attn_r.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 12, input_layer.att_head-2.bias: torch.Size([8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 13, input_layer.att_head-2.w.weight: torch.Size([8, 1433]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 14, input_layer.att_head-2.attn_l.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 15, input_layer.att_head-2.attn_l.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 16, input_layer.att_head-2.attn_r.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 17, input_layer.att_head-2.attn_r.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 18, input_layer.att_head-3.bias: torch.Size([8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 19, input_layer.att_head-3.w.weight: torch.Size([8, 1433]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 20, input_layer.att_head-3.attn_l.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 21, input_layer.att_head-3.attn_l.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 22, input_layer.att_head-3.attn_r.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 23, input_layer.att_head-3.attn_r.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 24, input_layer.att_head-4.bias: torch.Size([8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 25, input_layer.att_head-4.w.weight: torch.Size([8, 1433]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 26, input_layer.att_head-4.attn_l.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 27, input_layer.att_head-4.attn_l.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 28, input_layer.att_head-4.attn_r.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 29, input_layer.att_head-4.attn_r.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 30, input_layer.att_head-5.bias: torch.Size([8]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 31, input_layer.att_head-5.w.weight: torch.Size([8, 1433]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 32, input_layer.att_head-5.attn_l.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 33, input_layer.att_head-5.attn_l.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 34, input_layer.att_head-5.attn_r.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 35, input_layer.att_head-5.attn_r.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 36, input_layer.att_head-6.bias: torch.Size([8]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 37, input_layer.att_head-6.w.weight: torch.Size([8, 1433]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 38, input_layer.att_head-6.attn_l.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 39, input_layer.att_head-6.attn_l.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 40, input_layer.att_head-6.attn_r.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 41, input_layer.att_head-6.attn_r.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 42, input_layer.att_head-7.bias: torch.Size([8]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 43, input_layer.att_head-7.w.weight: torch.Size([8, 1433]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 44, input_layer.att_head-7.attn_l.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 45, input_layer.att_head-7.attn_l.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 46, input_layer.att_head-7.attn_r.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 47, input_layer.att_head-7.attn_r.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 48, out_layer.att_head-0.bias: torch.Size([7]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 49, out_layer.att_head-0.w.weight: torch.Size([7, 64]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 50, out_layer.att_head-0.attn_l.weight: torch.Size([1, 7]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 51, out_layer.att_head-0.attn_l.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 52, out_layer.att_head-0.attn_r.weight: torch.Size([1, 7]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 53, out_layer.att_head-0.attn_r.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,675] {utils.py:116} INFO - parameter count: 92391
[2022-09-09 14:46:08,675] {logging_utils.py:84} INFO - Training worker started. Model: GAT.
Traceback (most recent call last):
  File "./main.py", line 126, in <module>
    _main()
  File "./main.py", line 121, in _main
    init_args_fn=init_args,
  File "C:\Users\myid\Downloads\DeepGNN\venv\lib\site-packages\deepgnn\pytorch\training\factory.py", line 134, in run_dist
    eval_dataloader_for_training,
  File "C:\Users\myid\Downloads\DeepGNN\venv\lib\site-packages\deepgnn\pytorch\training\trainer.py", line 100, in run
    self._train(model)
  File "C:\Users\myid\Downloads\DeepGNN\venv\lib\site-packages\deepgnn\pytorch\training\trainer.py", line 171, in _train
    self._train_one_epoch(model, epoch)
  File "C:\Users\myid\Downloads\DeepGNN\venv\lib\site-packages\deepgnn\pytorch\training\trainer.py", line 174, in _train_one_epoch
    for i, data in enumerate(self.dataset):
  File "C:\Users\myid\Downloads\DeepGNN\venv\lib\site-packages\torch\utils\data\dataloader.py", line 444, in __iter__
    return self._get_iterator()
  File "C:\Users\myid\Downloads\DeepGNN\venv\lib\site-packages\torch\utils\data\dataloader.py", line 390, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\myid\Downloads\DeepGNN\venv\lib\site-packages\torch\utils\data\dataloader.py", line 1077, in __init__
    w.start()
  File "c:\users\myid\appdata\local\programs\python\python37\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "c:\users\myid\appdata\local\programs\python\python37\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "c:\users\myid\appdata\local\programs\python\python37\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "c:\users\myid\appdata\local\programs\python\python37\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
    reduction.dump(process_obj, to_child)
  File "c:\users\myid\appdata\local\programs\python\python37\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'CDLL.__init__.<locals>._FuncPtr'

Thank you for reporting this bug. I have reproduced the issue: it occurs only on Windows and only with the GAT example. The cause is the use of FileNodeSampler in the TorchDeepGNNDataset. I'll work on a fix on Monday. If you want to get this code running before the fix is merged, you can edit the create_dataset function in examples/pytorch/gat/main.py as follows:

def create_dataset(
    args: argparse.Namespace,
    model: BaseModel,
    rank: int = 0,
    world_size: int = 1,
    backend: GraphEngineBackend = None,
):
    # Workaround: sample nodes with GENodeSampler instead of FileNodeSampler.
    return TorchDeepGNNDataset(
        sampler_class=GENodeSampler,
        backend=backend,
        sample_num=-1,
        num_workers=world_size,
        worker_index=rank,
        node_types=np.array([args.node_type], dtype=np.int32),
        batch_size=args.batch_size,
        query_fn=model.q.query_training,
        prefetch_queue_size=10,
        prefetch_worker_size=2,
        strategy=SamplingStrategy.RandomWithoutReplacement,
    )
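
For context on the error itself: the traceback ends in multiprocessing's popen_spawn_win32.py, where the worker process object is pickled before being sent to the child. A ctypes library handle stores a class that is defined locally inside CDLL.__init__, so anything that holds such a handle cannot be pickled. The snippet below is a minimal sketch of that failure outside of DeepGNN. It uses ctypes.pythonapi as a stand-in library, and try_pickle is a hypothetical helper name; it is not how DeepGNN loads its native code.

```python
import ctypes
import pickle


def try_pickle(obj):
    """Return None if obj pickles cleanly, otherwise the exception raised."""
    try:
        pickle.dumps(obj)
        return None
    except Exception as exc:  # the exact exception type varies by Python version
        return exc


# ctypes.pythonapi is a PyDLL, a subclass of CDLL, so it is used here
# only as a convenient stand-in for any natively loaded library.
lib_error = try_pickle(ctypes.pythonapi)
print("library handle:", type(lib_error).__name__, lib_error)

# Plain data such as a path string pickles fine, which is why passing a
# path to a worker and opening the library there avoids the problem.
path_error = try_pickle("C:/Users/myid/AppData/Local/Temp/cora")
print("plain path:", path_error)
```

On Linux the default start method for worker processes is fork, which does not pickle the process object, so this kind of object only becomes a problem with the spawn start method that Windows uses.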

Thanks @coledie, that worked like a charm.
Just FYI: I'm running into the same problem with other examples too (e.g. examples/pytorch/graphsage and examples/pytorch/hetgnn). It would be great if you could push a fix for those as well.