PaddlePaddle/PGL

[PGLbox] how is data sampled?

EugeneNasonov opened this issue · 6 comments

How do you partition nodes into batches and sample neighbors when training? Neighbor sampling could be done with methods of GraphGpuWrapper that you use in your graph.py file. However, it is never called explicitly in the code.

Hi, our data processing code, including batch partitioning and neighbor sampling, is actually written in C++. You can see the code at: https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/data_feed.cu#L1530

Thanks! Your answer is very on point, and it almost answers my question. I do understand that the back-end logic is implemented in C++/CUDA. However, I'm also trying to understand what interface you use to call this sampler. For example, from what I see, the only Paddle class related to GPU Graphs that is used in your project is GraphGpuWrapper: https://github.com/PaddlePaddle/PGL/blob/main/apps/PGLBox/src/graph.py#L84
Its C++ implementation is seen here:
https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/fleet/heter_ps/graph_gpu_wrapper.cu
I don't see GraphDataGenerator being used.
Therefore, I guess that there is some other class that uses this GraphDataGenerator. Which one?
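To make the question concrete, here is a small pure-Python sketch of what "sample neighbors for a batch of nodes" means conceptually. This is only a CPU toy model of the operation the C++/CUDA sampler performs; the function name `sample_neighbors` and the dict-of-lists adjacency format are made up for illustration and are not the actual GraphGpuWrapper API:

```python
import random

def sample_neighbors(adj, batch_nodes, num_samples, seed=0):
    """Uniformly sample up to `num_samples` neighbors for each node in a batch.

    `adj` maps node id -> list of neighbor ids. This mimics, in plain Python,
    the kind of per-batch neighbor sampling done on the GPU.
    """
    rng = random.Random(seed)
    sampled = {}
    for node in batch_nodes:
        neighbors = adj.get(node, [])
        if len(neighbors) <= num_samples:
            # Fewer neighbors than requested: take them all.
            sampled[node] = list(neighbors)
        else:
            sampled[node] = rng.sample(neighbors, num_samples)
    return sampled

# Toy undirected graph: edges 0-1, 0-2, 0-3, 1-2.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
print(sample_neighbors(adj, [0, 3], num_samples=2))
```

The real sampler additionally has to manage GPU memory and batching, but the input/output contract (batch of node ids in, sampled neighbor lists out) is the same idea.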

When I examine the training logic:
https://github.com/PaddlePaddle/PGL/blob/main/apps/PGLBox/src/cluster_train_and_infer.py#L69
the pass generator doesn't seem to do that job:
https://github.com/PaddlePaddle/PGL/blob/main/apps/PGLBox/src/dataset.py#L304
Incidentally, could you describe what the pass generator does? It's not obvious to me, but I think it allocates memory/threads reserved for the actual data sampling.

I don't really see where the data comes from that is fed into the executor class, so I don't see how a batch is prepared and passed to the learning part of the code. I think this is important to understand, because it's one of the main points of interest of this work: how data is stored and moved around.

You are right: in our Python code, we only directly use the GraphGpuWrapper class. This object is mainly used for graph processing, graph storage, and other basic sampling functions. As for GraphDataGenerator, it is mainly used for pass dataset generation. You can see it being used in the DataFeed class at https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/data_feed.h#L1339.

As for the pass generator, you can think of one pass as multiple batches. GraphDataGenerator has a member function called GenerateBatch; we use it to generate a pass (multiple batches) and feed the raw tensor memory into the data holder that the Python training code consumes.
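The pass-versus-batch relationship described above can be sketched in plain Python. This is only a conceptual illustration: the real GenerateBatch runs in C++ and fills GPU tensors, and the pass/batch sizes here are invented:

```python
def iter_passes(node_ids, pass_size, batch_size):
    """Split training nodes into passes; each pass is itself a list of batches.

    One pass is generated at a time, and the trainer then consumes its
    batches one by one -- mirroring the "pass = multi-batches" idea.
    """
    for p in range(0, len(node_ids), pass_size):
        pass_nodes = node_ids[p:p + pass_size]
        # A pass is materialized as a list of fixed-size batches.
        batches = [pass_nodes[b:b + batch_size]
                   for b in range(0, len(pass_nodes), batch_size)]
        yield batches

nodes = list(range(10))
for i, batches in enumerate(iter_passes(nodes, pass_size=6, batch_size=2)):
    print(f"pass {i}: {batches}")
```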

I personally suggest that, if you want to figure out the whole graph data logic, you start from apps/PGLBox/src/dataset.py in our code to understand what the dataset means on the C++ side, how load_into_memory is called from C++, its internal execution logic, and so on. I believe you will then better understand the entire data flow.
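As a rough mental model of that data flow, the following hedged Python sketch shows a load-a-pass-then-train-on-its-batches loop. Apart from the method name load_into_memory, every class and function name here is invented for illustration; in PGLBox the loading and batch generation are backed by C++:

```python
class PassDataset:
    """Toy stand-in for a pass-level dataset wrapper (hypothetical class).

    Mirrors the flow: load one pass into memory, then hand its batches
    to the trainer one at a time. Entirely Python, for illustration only.
    """

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self._nodes = []

    def load_into_memory(self, pass_nodes):
        # In PGLBox this step is backed by C++: the pass data is generated
        # (e.g. via GraphDataGenerator) and materialized in memory.
        self._nodes = list(pass_nodes)

    def batches(self):
        # Yield the pass's contents as fixed-size batches.
        for i in range(0, len(self._nodes), self.batch_size):
            yield self._nodes[i:i + self.batch_size]

def train_one_pass(dataset, pass_nodes, step_fn):
    """Load a pass, then run one training step per batch (step_fn is a
    placeholder for whatever the executor does with a batch)."""
    dataset.load_into_memory(pass_nodes)
    for batch in dataset.batches():
        step_fn(batch)

seen = []
train_one_pass(PassDataset(batch_size=3), range(7), seen.append)
print(seen)
```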

Thank you for your patience in answering my questions! After your answers and some independent digging, I more or less understand how the code works. I think I can figure out the rest on my own. Thank you again!