[RFC] vectorize reducer
chenqin opened this issue · 11 comments
Hi Folks,
We got some first-hand results from running the reducer in parallel with OpenMP. In light of the hacky nature and thread "abuse" in the previously proposed solution, I think it is worth a bit of discussion before we move forward.
Why do we address this problem, really?
Short answer: we have seen users running larger reducers, exceeding the generally guided depth of 6-8. Per-iteration speed gets hit hard in distributed training since we sync at every depth on each iteration.
What is the problem, precisely?
The compiler has a hard time figuring out data flow and alignment when it tries to vectorize it.
How do we plan to address this problem?
We have 3 options:
- OpenMP simd pragma. Pro: it is a proven, well-supported solution, so we are less likely to hit bugs or issues. Con: it introduces an OpenMP 4 dependency. More detail here: http://hpac.rwth-aachen.de/teaching/pp-16/material/08.OpenMP-4.pdf
- Use the vector class library https://github.com/vectorclass/version2. Pro: we can use this as a testing ground and optimize xgb as a whole. Con: it is a pretty new project; given its experimental nature, we might run into various bugs and issues.
- Hand-tune the loop and rely on compiler vectorization. Pro: we control how we do it under gcc. Con: this makes the code less maintainable across platforms. https://stackoverflow.com/questions/32000917/c-loop-optimization-help-for-final-assignment-with-compiler-optimization-disabl
For now, I would propose we take the path of least resistance and go with the OpenMP simd pragma.
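For concreteness, here is a minimal sketch of what the OpenMP simd route could look like on an element-wise reducer loop. The GradPair type and ReduceBuffer function below are illustrative stand-ins, not the actual rabit/xgboost code:

```cpp
#include <cstddef>

// Hypothetical gradient-pair type standing in for GradStats:
// two doubles that are accumulated element-wise during Allreduce.
struct GradPair {
  double sum_grad;
  double sum_hess;
};

// Element-wise reduction dst[i] += src[i] over the whole buffer.
// With OpenMP 4 the simd pragma asks the compiler to vectorize the
// loop even though each element is a two-field struct.
void ReduceBuffer(GradPair *dst, const GradPair *src, std::size_t n) {
#pragma omp simd
  for (std::size_t i = 0; i < n; ++i) {
    dst[i].sum_grad += src[i].sum_grad;
    dst[i].sum_hess += src[i].sum_hess;
  }
}
```

With gcc this builds under -fopenmp, or -fopenmp-simd if we only want the simd semantics without pulling in the OpenMP threading runtime.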
Thanks,
Chen
Can we try simple async first?
Can you elaborate on the asynchronous suggestion a bit more? We might be looking at different things.
Allreduce semantics are a synchronized, blocking call. It generally performs better if we allow more out-of-order execution or simply drop the slowest updates, like a PS does.
However, this puts us at risk of missing key updates, which affects a single tree. Yes, those trees complement each other, so it might eventually be okay, but we need to call it out loudly if we decide to do so.
For the networking layer, we can definitely do the reduce in a streaming fashion, if that is what you suggested.
I'm not sure which loop you want to vectorize. Could you be more specific? We can use async to achieve something similar to omp. In my experience in some cases it's faster than omp because we have explicit control over blocking.
I'm not sure which loop you want to vectorize. Could you be more specific?
The loop I want to vectorize does 152,524,800 x 2 simple customized reductions per node. Underneath, it implements a customized reducer function that does a simple numerical operation:
[94] allreduce (xgboost_build/src/tree/updater_histmaker.cc::739::CreateHist#16x152524800) finished version 12, seq 21, take 31.749563 seconds
https://github.com/chenqin/xgboost/blob/release_0.81/src/tree/updater_histmaker.cc#L739
this->histred_.Allreduce(dmlc::BeginPtr(this->wspace_.hset[0].data),
this->wspace_.hset[0].data.size());
/*! \brief same as add, reduce is used in All Reduce */
inline static void Reduce(GradStats& a, const GradStats& b) { // NOLINT(*)
a.Add(b);
}
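For context, the custom reducer ends up being applied element by element over the receive buffer. Roughly speaking (a simplified sketch, not the exact rabit Reducer code), the hot loop looks like the following, and the indirect call through the registered reduce function is part of what keeps the compiler from vectorizing it:

```cpp
#include <cstddef>

// Simplified sketch of how a registered custom reducer gets applied
// (not the exact rabit implementation): the user-supplied Reduce(a, b)
// is invoked for every element of the chunk. With ~152M GradStats per
// node, this loop is where the Allreduce time goes.
template <typename DType, void (*freduce)(DType &dst, const DType &src)>
void ApplyReducer(DType *dst, const DType *src, std::size_t len) {
  for (std::size_t i = 0; i < len; ++i) {
    freduce(dst[i], src[i]);  // e.g. GradStats::Reduce -> a.Add(b)
  }
}
```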
It's a bit tricky to run a full test without a large-scale job; some findings below:
- a small performance increase by enabling this on my local machine, with speed_test running the pre-defined OP (Ubuntu, AMD Ryzen 3)
- ran gcc -fopt-info-loop-optimized=loop.miss and compared whether the loop was vectorized
We can use async to achieve something similar to omp.
Are you suggesting we chunk the data and do async collection? I can see it benefiting relatively sizeable workloads.
Let's look at how the rabit network layer works. It might be another place to implement async.
Every node has two sets of connections, a tree topology and a ring topology, at the same time. When all parties do an allreduce, depending on which topology the user chooses, every node receives and sends data to its connected peers based on rank number.
Each given link has a buffer of 256MB by default. We use socket polling to check whether a socket has data ingress or egress; if so, the data is written to the link buffer.
Depending on the link with the least data, it runs the reduce only on that chunk of data and goes back to listening mode until all data is reduced.
So it's event driven with socket polling and is not blocked by the slowest peer when traffic congestion happens.
I think we might add async and reduce out of sequence. Polling already happens in the kernel at a low level, where userspace only reads a flag from time to time, copies the payload to the buffer, and runs the reduction on whatever it gets.
The catch is that the sendrecv buffer is shared, and if we run async per link, it might contend for the same piece of memory with other async links.
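To make that control flow concrete, here is a much-simplified, in-memory model of the event-driven reduce described above (plain vectors instead of sockets; Link and ReduceAvailable are made-up names for illustration, not rabit's actual code): we can only reduce up to the shortest prefix that every link has delivered, fold that prefix in, and go back to polling.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy model of the event-driven reduce: each "link" has received some
// prefix of its peer's buffer; we can only reduce up to the shortest
// prefix that all links have delivered so far.
struct Link {
  std::vector<double> recv_buf;  // data received from this peer so far
  std::size_t size_read = 0;     // how many elements have arrived
};

// Fold the newly available prefix of every link into `local`, starting
// from `reduced` (how far we have already reduced); return the new
// high-water mark of reduced elements.
std::size_t ReduceAvailable(std::vector<double> *local,
                            const std::vector<Link> &links,
                            std::size_t reduced) {
  // The reducible prefix is bounded by the link with the least data.
  std::size_t max_reduce = local->size();
  for (const Link &l : links) {
    max_reduce = std::min(max_reduce, l.size_read);
  }
  // Reduce everything between `reduced` and `max_reduce`.
  for (std::size_t i = reduced; i < max_reduce; ++i) {
    for (const Link &l : links) {
      (*local)[i] += l.recv_buf[i];  // stand-in for the custom Reduce
    }
  }
  return max_reduce;
}
```

In rabit the same bookkeeping is driven by socket polling, which is why a node keeps making progress on whatever has already arrived instead of blocking on the slowest peer.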
@chenqin Thanks for your detailed explanation. Looking through your PR helps. Will take a closer look later.
Updates:
We might also consider optimizing the serialized reduce as well, but that will require code changes on the xgboost side in order to boost speed. We can do another PR after this one gets in.
Moved to https://github.com/chenqin/rabit/tree/vectorize; will send a PR after. We can merge #102 and dmlc/xgboost#4808 sequentially, one at a time.
The code leading to "We got some first hand results from running reducer in parallel with openmp" has been removed due to the correctness issue. What's the current data support as the motivation?
Good point, working on getting data points. It's not urgent compared to the bootstrap_cache one.
Closing, as rabit has been moved into dmlc/xgboost. See discussion in dmlc/xgboost#5995.