graph4ai/graph4nlp

Running machine translation using different GNNs

smith-co opened this issue · 10 comments

❓ Questions and Help

I am running the NMT example on the same dataset with the following GNN variants:

  • GCN
  • GGNN
  • GraphSage

Training runs with GCN, but I get an out-of-memory (OOM) error for GGNN and GraphSage. Can anyone help me with this?

Please try a smaller batch_size, or use another GPU with more memory.
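
As a quick sanity check before retrying, the snippet below lists each visible GPU and how much memory the current process has allocated on it. It uses only standard torch.cuda calls and nothing graph4nlp-specific; it is just to help pick a device and a starting batch size:

import torch

# List every visible GPU with its total capacity and the memory this process
# has already allocated on it. All calls here are standard torch.cuda APIs.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    total_gib = props.total_memory / 1024 ** 3
    used_gib = torch.cuda.memory_allocated(i) / 1024 ** 3
    print(f"GPU {i} ({props.name}): {total_gib:.1f} GiB total, "
          f"{used_gib:.2f} GiB allocated by this process")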

@AlanSwift I already tried a smaller batch size. What I find surprising is:

  • It runs for GCN and GAT.
  • But it gives Out-of-Memory (OOM) for GGNN and GraphSage.

It's the same dataset, but GGNN and GraphSage fail to run while GCN and GAT work.

So GGNN/GraphSage need more resources for some reason? I'm very curious to know why.

We haven't investigated the memory efficiency of the DGL modules :).
It seems that GGNN and GraphSage simply need more GPU memory.
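
If it helps to quantify the gap rather than guess, the peak-memory counters in torch.cuda can compare the encoders on one identical batch. A rough sketch, where run_forward_backward is a placeholder callable for whatever runs one forward+backward pass of the model on a fixed batch (it is not graph4nlp API; only the torch.cuda calls are standard):

import torch

def peak_gib(run_forward_backward, device="cuda:0"):
    """Return the peak GPU memory (GiB) used by one forward+backward pass.

    `run_forward_backward` is a placeholder supplied by the caller; the
    torch.cuda counters themselves are standard PyTorch APIs.
    """
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(device)
    run_forward_backward()
    return torch.cuda.max_memory_allocated(device) / 1024 ** 3

# Usage idea: build the same Graph2Seq model once per encoder ("gcn", "gat",
# "ggnn", "graphsage"), call peak_gib with one identical batch for each, and
# compare the numbers; the backward pass is usually where the OOM appears.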

@AlanSwift I get this OOM error at runtime for GGNN:

  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/graph4nlp_cu111-0.4.0-py3.9.egg/graph4nlp/pytorch/models/graph2seq.py", line 226, in forward
    return self.encoder_decoder(batch_graph=batch_graph, oov_dict=oov_dict, tgt_seq=tgt_seq)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/graph4nlp_cu111-0.4.0-py3.9.egg/graph4nlp/pytorch/models/graph2seq.py", line 173, in encoder_decoder
    batch_graph = self.gnn_encoder(batch_graph)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/graph4nlp_cu111-0.4.0-py3.9.egg/graph4nlp/pytorch/modules/graph_embedding/ggnn.py", line 557, in forward
    h = self.models(dgl_graph, (feat_in, feat_out), etypes, edge_weight)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/graph4nlp_cu111-0.4.0-py3.9.egg/graph4nlp/pytorch/modules/graph_embedding/ggnn.py", line 442, in forward
    return self.model(graph, node_feats, etypes, edge_weight)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/graph4nlp_cu111-0.4.0-py3.9.egg/graph4nlp/pytorch/modules/graph_embedding/ggnn.py", line 210, in forward
    graph_in.apply_edges(
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/dgl_cu111-0.7a210520-py3.9-linux-x86_64.egg/dgl/heterograph.py", line 4300, in apply_edges
    edata = core.invoke_edge_udf(g, eid, etype, func)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/dgl_cu111-0.7a210520-py3.9-linux-x86_64.egg/dgl/core.py", line 85, in invoke_edge_udf
    return func(ebatch)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/graph4nlp_cu111-0.4.0-py3.9.egg/graph4nlp/pytorch/modules/graph_embedding/ggnn.py", line 212, in <lambda>
    "W_e*h": self.linears_in[i](edges.src["h"])
RuntimeError: CUDA out of memory. Tried to allocate 1.12 GiB (GPU 3; 14.76 GiB total capacity; 11.83 GiB already allocated; 447.75 MiB free; 12.95 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Any idea?
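
One cheap thing to try is the allocator hint at the end of the message. It only helps when reserved memory is much larger than allocated memory (here the gap is about 1 GiB, so the gain may be small), but it costs nothing to test. The option name is quoted from the PyTorch error text itself; 128 below is only an example value:

import os

# Must be set before the first CUDA allocation in the process, e.g. at the very
# top of the training script. 128 MiB is an arbitrary example split size.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the env var so the caching allocator picks it up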

@AlanSwift I came across this related DGL discussion: Memory consumption of the GGNN module

It seems that DGL sacrifices memory efficiency for time efficiency. We will pay attention to this problem. Thank you for letting us know!
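
For anyone hitting this later, the frame in the traceback above (ggnn.py, the lambda around line 212) shows where the memory goes: the layer applies self.linears_in[i] inside an edge UDF, i.e. on edges.src["h"], which materializes a (num_edges x hidden_size) tensor per edge type and direction. On NMT graphs with many edges, that dwarfs the node features. The usual DGL remedy is to apply the linear on node features first (num_nodes x hidden_size) and then move messages with a built-in function. The sketch below shows only the general pattern, not a drop-in replacement for the graph4nlp GGNN (which keeps separate weights per edge type, so the refactor is less trivial there):

import dgl
import dgl.function as fn
import torch
import torch.nn as nn

class NodeFirstPropagation(nn.Module):
    """Illustrative layer, NOT the graph4nlp GGNN: transform on nodes (N x d),
    then let DGL's fused copy_u/sum message passing do the aggregation, instead
    of calling a Linear inside an edge UDF (which allocates an E x d tensor)."""

    def __init__(self, hidden_size):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)

    def forward(self, g, h):
        with g.local_scope():
            g.ndata["Wh"] = self.linear(h)          # N x d, computed once
            g.update_all(fn.copy_u("Wh", "m"),      # built-in message function
                         fn.sum("m", "agg"))        # fused, no per-edge UDF tensor
            return g.ndata["agg"]

# Quick check on a toy graph.
g = dgl.rand_graph(5, 12)
layer = NodeFirstPropagation(8)
print(layer(g, torch.randn(5, 8)).shape)  # torch.Size([5, 8])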

@AlanSwift can you please provide me with a fix/suggestion 🙏

@AlanSwift, this is interesting. I faced the same problem. Do you have any solution for this?

@AlanSwift do you have a plan to address the GGNN implementation limitation?

Currently, this is not on my roadmap since the issue originates in DGL.
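
Until that changes, the usual generic workarounds apply on the training side: a smaller batch size combined with gradient accumulation to keep the effective batch size, or mixed precision. A standard-PyTorch accumulation sketch (no graph4nlp-specific names; the model is assumed to return a scalar loss):

def train_epoch(model, loader, optimizer, accum_steps=4):
    # Generic gradient-accumulation sketch: run `accum_steps` small batches per
    # optimizer step, so the effective batch size stays the same while peak
    # activation memory drops roughly by a factor of `accum_steps`.
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(loader):
        loss = model(batch)                 # assumed to return a scalar loss
        (loss / accum_steps).backward()     # scale so gradients match the large batch
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()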