Infiniband AllReduceHalvingAndDoubling error
Closed this issue · 23 comments
Hi! Thanks for your time! I'm running into this error when using InfiniBand:
what(): [enforce fail at /sampa/home/liangluo/gloo/gloo/transport/ibverbs/pair.cc:470] wc->status == IBV_WC_SUCCESS. 8 vs 0. Memory region send for slot 0: local access error
while doing this:
pContext = std::make_shared<gloo::rendezvous::Context>(Postoffice::Get()->my_rank(), Postoffice::Get()->num_workers());
gloo::rendezvous::FileStore fileSync("/shared/Gloo");
pContext->connectFullMesh(fileSync, dev);
Key key = keys[0];
std::vector<float*> ptrs(1,(float*)vKeyAddress.at(key));
int elementCount = keySize.at(key) / sizeof(float);
gloo::AllreduceHalvingDoubling<float> reducer(pContext, ptrs, elementCount);
reducer.run();
This error is generally associated with incorrect memory registration, but first I just need to know whether I am using Gloo correctly.
Thanks!
Hi! Everything looks good, but the float* cast looks suspicious. I don't know what the Key class in your code base does, but if it's anything but an address to a float, then it won't work.
You can verify it should work by passing a pointer to a float on the stack, as long as you run the algorithm from that same stack frame. For example:
float value = 1.0f;
std::vector<float*> ptrs(1, &value);
gloo::AllreduceHalvingDoubling<float> reducer(pContext, ptrs, 1);
reducer.run();
Thanks for your suggestion.
std::vector<float> test(10, 1.0f);
std::vector<float*> ptrs(1,test.data());
int elementCount = test.size();
printf("ptr[0] == %p. key = %d, sz = %d\n", ptrs[0], key, elementCount);
gloo::AllreduceHalvingDoubling<float> reducer(pContext, ptrs, elementCount);
reducer.run();
This piece of code is in a function that's called many times, and the pContext is reused. It only works the first time I call it; the second time it throws a Remote Access Error.
Is there any restriction on how this should be used? Is there anything wrong with this code?
Thanks for your patience.
Correct me if I'm wrong: I see the signature of AllreduceHalvingDoubling accepts a pointer vector, and given that Gloo is used in Caffe2, does that mean the parameter exchange in Caffe2 happens all at once after the backward pass?
Many thanks.
In the piece of code you list you can also reuse the algorithm instance, i.e. reducer. One of the goals for Gloo is to avoid memory copies if at all possible, so it will happily have run() called many times on the same instance. This also means the input buffers must be long lived, of course, which is typically not done by keeping them on the stack :)
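For example, a minimal sketch of the reuse pattern I mean (the buffer size and iteration count here are just placeholders):

std::vector<float> gradients(elementCount, 0.0f);   // long-lived buffer, not a short-lived stack frame
std::vector<float*> ptrs(1, gradients.data());
gloo::AllreduceHalvingDoubling<float> reducer(pContext, ptrs, elementCount);

for (int iter = 0; iter < numIterations; iter++) {
  // ... write the new gradients into `gradients` in place ...
  reducer.run();  // same instance every iteration; reduces in place, no copies
}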
That said, it should not give you an error when you call it the second time around. The context supports any number of algorithms using the same communication context, so this should work. Can you include a stack trace of the error in case there is any useful information in there?
The elements in the pointer vector all participate in the algorithm. An allreduce that takes 4 pointers and works with a context of size 8 will perform allreduce over all 32 pointers across all 8 processes. This is how we can pass a pointer for every instance of a layer in a data parallel mode in Caffe2 and not have to do a 3-stage allreduce in Caffe2 itself (first local reduce, inter-machine allreduce, local broadcast). For Caffe2 we run allreduce for every layer independently.
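As a sketch (the buffer names are made up): with 4 local buffers per process and a context of size 8, a single run() reduces over all 32 buffers, and afterwards every buffer holds the global sum:

std::vector<std::vector<float>> localBuffers(4, std::vector<float>(count, 0.0f));
std::vector<float*> ptrs;
for (auto& buf : localBuffers) {
  ptrs.push_back(buf.data());
}
gloo::AllreduceHalvingDoubling<float> reducer(pContext, ptrs, count);
reducer.run();  // the 4 local buffers (and the 28 remote ones) now hold the same sum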
Thanks for your continued help! That Remote Access Error problem seems to be gone.
Let me give more context:
I am trying to port Gloo to another framework (MXNET), and here is what I am doing:
- I'm creating a Context object (backed by a FileStore) for each layer in the network, and using that Context to create an AllReduce object for each layer:

for (int i = 0; i < Layers; i++) {
  auto pContext = std::make_shared<gloo::rendezvous::Context>(Postoffice::Get()->my_rank(), Postoffice::Get()->num_workers());
  std::stringstream ss;
  ss << "/shared/Gloo/layer" << i;
  boost::filesystem::create_directory(ss.str());
  gloo::rendezvous::FileStore fileSync(ss.str());
  pContext->connectFullMesh(fileSync, dev);
  std::vector<float*> layerAddr;
  layerAddr.push_back(vLayerAddress.at(i)); // address of the gradient for that layer
  Reducers.emplace_back(pContext, layerAddr, keySize.at(i) / sizeof(float));
}
- MXNET uses Push and Pull to transfer data to/from the parameter server. During a Push, I use Gloo to do an AllReduce, so the data is aggregated across workers instead of being sent to servers:

Push(int layer, float* val, size_t count) {
  ...
  Reducers.at(layer).run();
  ...
}
I noticed that during step 2 there is sometimes heap corruption. Since MXNET can push many keys at once, I'm wondering how many concurrent AllReduces Gloo can do. By that I mean: can the AllReduce of layer 1 happen concurrently with the AllReduce of layer 2?
Also please correct me if I am still misusing Gloo.
Many thanks :)
Thanks for the context. And you're using it correctly. One note: if the order of algorithm creation is identical across all participants, you can also have all algorithms share the same context.
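Roughly like this, reusing the names from your snippet (just a sketch, I haven't compiled it): one rendezvous, one connectFullMesh, and every algorithm created on the shared context in the same order on every rank:

auto pContext = std::make_shared<gloo::rendezvous::Context>(
    Postoffice::Get()->my_rank(), Postoffice::Get()->num_workers());
gloo::rendezvous::FileStore fileSync("/shared/Gloo");
pContext->connectFullMesh(fileSync, dev);

std::vector<std::unique_ptr<gloo::AllreduceHalvingDoubling<float>>> Reducers;
for (int i = 0; i < Layers; i++) {
  std::vector<float*> layerAddr(1, vLayerAddress.at(i));
  // Creation order must be identical across ranks so slot numbers line up.
  Reducers.emplace_back(new gloo::AllreduceHalvingDoubling<float>(
      pContext, layerAddr, keySize.at(i) / sizeof(float)));
}
// Push would then call Reducers.at(layer)->run().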
Regarding the heap corruption: can the float ptr change after initialization has run? I.e. can the algorithm try to use a stale pointer?
Hmm, I just verified that they don't change.
When I comment out the .run() call, everything seems to be fine. :/
I don't know if this is relevant, but I also get these when running with >1 machines:
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at /****/home/liangluo/gloo/gloo/transport/ibverbs/pair.cc:409] recvCompletionHandlers_[slot] != nullptr.
Sometimes I also get "retry counter exceeded" errors:
what(): [enforce fail at /****/home/liangluo/gloo/gloo/transport/ibverbs/pair.cc:448] wc->status == IBV_WC_SUCCESS. 12 vs 0. Memory region recv for slot 2: transport retry counter exceeded
Does this ring any bells for you? The symbol names are mangled, but we can still tell they are from Gloo:
//home/liangluo/train/mxnet/libmxnet.so(_ZNSt8_Rb_treeIiSt4pairIKiSt10unique_ptrIN4gloo9transport7ibverbs12MemoryRegionESt14default_deleteIS6_EEESt10_Select1stISA_ESt4lessIiESaISA_EE8_M_eraseEPSt13_Rb_tree_nodeISA_E+0x2f4)[0x7f860b860314]
//home/liangluo/train/mxnet/libmxnet.so(_ZN4gloo9transport7ibverbs4Pair16handleCompletionEP6ibv_wc+0x6e6)[0x7f860b85a056]
//home/liangluo/train/mxnet/libmxnet.so(_ZN4gloo9transport7ibverbs4Pair15pollCompletionsEv+0x6a)[0x7f860b85aa3a]
//home/liangluo/train/mxnet/libmxnet.so(_ZN4gloo9transport7ibverbs4Pair21handleCompletionEventEv+0x14c)[0x7f860b85bdec]
/****/home/liangluo/train/mxnet/libmxnet.so(_ZN4gloo9transport7ibverbs6Device4loopEv+0x309)[0x7f860b855e19]
/usr/lib64/libstdc++.so.6(+0xb5230)[0x7f85f7ccd230]
/usr/lib64/libpthread.so.0(+0x7dc5)[0x7f8621815dc5]
/usr/lib64/libc.so.6(clone+0x6d)[0x7f8620e3b76d]
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at /sampa/home/liangluo/gloo/gloo/transport/ibverbs/buffer.cc:167] offset + length <= size_. 294912 vs 33
These definitely look like a race condition.
One specific question I have: can two algorithms run concurrently? Is algorithm.run() a blocking call? Thanks for your patience.
Sorry, I am providing as many details as I can to give you clues...
Valgrind seems very unhappy with pollCompletions in the ibverbs routines:
==25481== Use of uninitialised value of size 8
==25481== at 0x46D29A8C: ??? (in /usr/lib64/libmlx4-rdmav2.so)
==25481== by 0x46D1C5F8: ??? (in /usr/lib64/libmlx4-rdmav2.so)
==25481== by 0x1591CBDD: gloo::transport::ibverbs::Pair::pollCompletions() (in /.../home/liangluo/mxnet/lib/libmxnet.so)
==25481== by 0x1591DFCB: gloo::transport::ibverbs::Pair::handleCompletionEvent() (in /.../home/liangluo/mxnet/lib/libmxnet.so)
==25481== by 0x15917FF8: gloo::transport::ibverbs::Device::loop() (in /.../home/liangluo/mxnet/lib/libmxnet.so)
==25481== by 0x293CC22F: ??? (in /usr/lib64/libstdc++.so.6.0.19)
==25481== by 0x5207DC4: start_thread (in /usr/lib64/libpthread-2.17.so)
==25481== by 0x5C1C76C: clone (in /usr/lib64/libc-2.17.so)
Thanks for all the details. I'm thinking we're dealing with more than 1 issue here, so let's try and untangle.
- Are you sure keySize.at(i)/sizeof(float) is the correct size and there is not another dereference hiding somewhere? I.e., does the float* point to keySize.at(i) bytes of memory?
- Which commit hash of Gloo are you using? The latest?
- These algorithms are not made to work with context->size == 1. I believe most algorithms have an assert on the context size being >= 2, but it is possible some don't.
- The failure related to the retransmit timeout, or the spurious recv, can be explained by you using the same path for rendezvous for every run (or so it seems). After rendezvous, the files will stick around, and when you run the program another time with the same path, it will use information from the previous run to try and connect to its peers. This is something we should improve, because it's not the first time I've seen an issue related to it. One way around it is sketched after this list.
- The uninitialized value in pollCompletions seems benign. I don't spot a real problem in that function.
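One way around the stale-rendezvous problem, sketched here (assuming your checkout has gloo::rendezvous::PrefixStore; the run id source is only an example), is to namespace every run so files from a previous run can never be picked up:

std::string runId = std::to_string(::time(nullptr));  // or a job id from your scheduler
gloo::rendezvous::FileStore fileSync("/shared/Gloo");
gloo::rendezvous::PrefixStore store(runId, fileSync);  // prefixes every key with the run id
pContext->connectFullMesh(store, dev);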
Thanks for your patience!
- Yes, this is the correct length of the buffer.
- Yes! It's the latest.
- Yes, workers > 2. By >1 machine I mean running multiple processes with the same card.
- This shouldn't be a problem; I always remove the folders prior to starting. In fact, Gloo complains about the old folders if a new context is created.
- Thanks! However, I see this problem during Context creation as well:
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at /sampa/home/liangluo/gloo/gloo/transport/ibverbs/pair.cc:409] recvCompletionHandlers_[slot] != nullptr.
I also tried your suggestion of creating algorithms in the same order but reusing the same context, but I got this:
what(): [enforce fail at /..../home/liangluo/gloo/gloo/transport/ibverbs/pair.cc:448] wc->status == IBV_WC_SUCCESS. 9 vs 0. Memory region recv for slot 15: remote invalid request error
Specifically, I noticed this: if many threads in the same process call reducer_for_layer1.run() at the same time, it works fine, but as soon as there are mixed calls to reducer_for_layer1.run() and reducer_for_layer2.run(), it segfaults.
I malloc'd a huge buffer for Gloo to use, so it should not touch any actual buffer in MXNET. So the buffer problem you were concerned about should not exist.
Thanks for the info. This is starting to get pretty weird... I haven't seen the "remote invalid request error" error before. Can you share which version of ibverbs and OFED you're using? Also, can you paste the output of show_gids on your machines?
Re: parallelism: a single algorithm instance is not expected to be run in parallel. So if, in your example, reducer_for_layer1.run() is called in parallel from multiple threads, that indicates misuse. Since the algorithm is pinned to some piece of memory, and uses that memory as workspace, there is no way this can run from 2 threads and have predictable behavior.
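If multiple MXNET threads can end up pushing the same layer at the same time, you would need to serialize calls to the same instance, for example with one lock per reducer (just a sketch; the names mirror your Push code):

#include <mutex>

std::vector<std::mutex> reducerLocks(Layers);  // one lock per algorithm instance

Push(int layer, float* val, size_t count) {
  // ...
  std::lock_guard<std::mutex> guard(reducerLocks.at(layer));
  Reducers.at(layer).run();  // the same instance never runs from two threads at once
  // ...
}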
Can you try running the benchmark tool with 2 or more threads? I'd like to rule out problems with the underlying network/system. If the benchmark tool runs without problems then I expect the problem to be related to how you're using Gloo.
Thank you for your help!
Are you suggesting that running reducer_for_layer1.run() and reducer_for_layer2.run() at the same time is fine? I get problems whenever there is more than one reducer running.
I just need to make sure I'm doing the expected things first! Thanks again.
I managed to get Gloo to run with a single reducer instance.
I'm happy to help! After all, if this is a bug in Gloo, we should fix it, and if it is a problem with error reporting that should be improved, we should fix it as well :)
Yes, you can run different algorithm instances at the same time. This is the case even if they share a context. If they don't share a context, they are completely independent: there is no state whatsoever being shared, or interaction of any kind, between the two.
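So this pattern is fine (sketch only): two different instances, possibly created on the same context, each driven by its own thread:

#include <thread>

std::thread t1([&] { reducer_for_layer1.run(); });
std::thread t2([&] { reducer_for_layer2.run(); });
t1.join();
t2.join();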
Do you have any update for this issue? Has the issue persisted, also with newer code?
I decided to move on to Caffe2, and the new errors are reported in new issues. Thanks! (I'm using CentOS, which makes things a bit different.)
Thanks for the update. Closing this issue.