Bug: MinibatchSource.GetNextMinibatch hangs (multiple serializers?)
nietras opened this issue · 12 comments
We often experience hangs when calling e.g.:
var minibatchData = trainMinibatchSource.GetNextMinibatch(minibatchSize, d);
At which point the whole training stops and never resumes. This is incredibly frustrating and seems to occur primarily when using multiple ImageDeserializers and/or CTFDeserializers. The exact circumstances are not known, and the hang is not deterministic: sometimes it happens, sometimes it doesn't.
Seems like a threading/synchronization bug.
We would very much like this bug to be resolved ASAP as we do multi-channel multi-target learning where we often have multiple ImageDeserializers and a CTFDeserializer with a targets vector and mask vector.
cc: @mdabros
Hi,
I would like to help. Is it possible to get some repro mechanism? E.g. a script that produces the error, even sporadically?
Hi @jaliyae,
Thanks for the response.
We are trying to reproduce the issue in this repo: CntkSerializerIssue.
We have not succeeded yet, but at least you can get an idea of how we use CNTK when the problem occurs. We will continue to extend the example and hopefully find a repro mechanism.
I tripped over the following comment (in #3029):
OpenCV 3.4 support is added because this version fixes an issue where exif parsing errors would cause the application to hang when decoding a jpeg file.
Could there be a similar issue with png files? We use png files. I think OpenCV is still version 3.1?
Although, if this was a hang on exif parsing it should be deterministic, so probably not that.
A partial call stack (without pdb, since I don't have Cntk pdb files and not sure where they could be? Does CNTK have a symbol server?):
ntdll.dll!NtWaitForAlertByThreadId()
ntdll.dll!RtlSleepConditionVariableSRW()
KernelBase.dll!SleepConditionVariableSRW()
[Inline Frame] msvcp140.dll!Concurrency::details::stl_condition_variable_win7::wait_for(Concurrency::details::stl_critical_section_interface *) Line 216
at f:\dd\vctools\crt\crtw32\stdcpp\thr\primitives.h(216)
msvcp140.dll!Concurrency::details::stl_condition_variable_win7::wait(Concurrency::details::stl_critical_section_interface * lock) Line 210
at f:\dd\vctools\crt\crtw32\stdcpp\thr\primitives.h(210)
msvcp140.dll!do_wait(_Cnd_internal_imp_t * cond, _Mtx_internal_imp_t * mtx, const xtime * target) Line 77
at f:\dd\vctools\crt\crtw32\stdcpp\thr\cond.c(77)
Cntk.Core-2.5.1.dll!00007ffe5ffd0e2f()
Cntk.Core-2.5.1.dll!00007ffe5ffd5fbb()
Cntk.Core-2.5.1.dll!00007ffe5ffd37fc()
Cntk.Core-2.5.1.dll!00007ffe6031dcc1()
Cntk.Core-2.5.1.dll!00007ffe6031e203()
Cntk.Core.CSBinding-2.5.1.dll!00007ffe60b42dd1()
[Managed to Native Transition]
Cntk.Core.Managed-2.5.1.dll!CNTK.MinibatchSource.GetNextMinibatch(uint minibatchSizeInSamples, CNTK.DeviceDescriptor device)
from the call in C# being:
var minibatchData = trainMinibatchSource.GetNextMinibatch(minibatchSize, d);
UPDATE: Better stack trace of external code.
In ReaderShim there is a wait() call that might be relevant (or not); I would need the pdb to be sure:
Below seems to indicate one can disable prefetching, perhaps something we could try.
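To make the prefetch idea concrete, here is a minimal sketch of how a reader can prefetch the next batch on a background task and where a blocking wait comes into play. This is not CNTK's ReaderShim: the class PrefetchingReader, its prefetch flag, and the timeout parameter are all hypothetical names invented for illustration. The only standard pieces are std::async and std::future::wait_for.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a reader with optional prefetching.
// It is NOT CNTK's ReaderShim; it only illustrates the pattern.
class PrefetchingReader {
public:
    explicit PrefetchingReader(bool prefetch) : prefetch_(prefetch) {
        if (prefetch_) StartPrefetch();
    }

    // Returns the next batch. With prefetching enabled, this waits for the
    // background read with a timeout, so a stuck read surfaces as an
    // exception instead of a silent hang.
    std::vector<int> GetNext(std::chrono::milliseconds timeout) {
        if (!prefetch_) return ReadBatch();  // synchronous path, no waiting on a future

        assert(pending_.valid());
        if (pending_.wait_for(timeout) != std::future_status::ready)
            throw std::runtime_error("prefetch did not complete in time");

        std::vector<int> batch = pending_.get();
        StartPrefetch();  // kick off the read for the following call
        return batch;
    }

private:
    void StartPrefetch() {
        pending_ = std::async(std::launch::async, [this] { return ReadBatch(); });
    }

    // Dummy read: batch number n is four copies of n.
    std::vector<int> ReadBatch() {
        std::vector<int> batch(4, next_);
        ++next_;
        return batch;
    }

    bool prefetch_;
    int next_ = 0;
    std::future<std::vector<int>> pending_;  // declared last, so it is destroyed first
};
```

With prefetch set to false, every call reads synchronously and no future is ever waited on, which is why disabling prefetching would be a useful experiment for narrowing this down. Note that a future returned by std::async blocks in its destructor until the task finishes, so the timeout in this sketch makes a hang visible but does not recover from it.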
Thank you for the debug information. We have fixed one prefetch issue in the reader recently and it could be related. Is it possible to try the latest version from here?
Sounds good that you already fixed one prefetch issue.
I have made a build with the latest version of the nightly packages: 2.6.0-rc0.dev20180731.
It is currently running a long training session. I will report back with the findings.
@jaliyae Sadly, the problem is still there even when using version 2.6.0-rc0.dev20180731
@jaliyae I wanted to try to run with prefetch = false, so we can verify prefetch is the issue, but I have no idea how to set this parameter. Any advice on that? It seems to have to come from MinibatchSourceConfig, but this isn't a CNTKDictionary itself, and Internal::ToDictionary does not seem to create any "prefetch" entry, so how can we specify this?
Links to relevant code, assuming CompositeMinibatchSource is the correct place:
And the ToDictionary implementation:
This dictionary is then converted to a ConfigParameters via:
And this is then used to create and initialize the ReaderShim:
Hi all, I have been hitting this issue since March of 2018 :( The deadlock occurs when the bundling chunk futures are waited on, for example here: https://github.com/Microsoft/CNTK/blob/master/Source/Readers/ReaderLib/Bundler.cpp#L284
I think the root cause is a bug in the Windows implementation of std::async (see https://stackoverflow.com/questions/50898954/possible-stdasync-implementation-bug-windows#50899052), so it might be worth implementing our own lightweight thread-pool here or using an existing third-party library (Boost maybe?)
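As a rough idea of what such a lightweight thread pool could look like, here is a minimal sketch using only the C++11 standard library. The class name ThreadPool and its Submit method are hypothetical and not part of CNTK; this only shows the general technique of replacing std::async with a fixed set of worker threads fed from a queue of std::packaged_task objects.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical fixed-size thread pool; a sketch, not CNTK code.
class ThreadPool {
public:
    explicit ThreadPool(std::size_t threadCount) {
        for (std::size_t i = 0; i < threadCount; ++i)
            workers_.emplace_back([this] { WorkerLoop(); });
    }

    // Lets the workers drain the queue, then joins them.
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (auto& worker : workers_) worker.join();
    }

    ThreadPool(const ThreadPool&) = delete;
    ThreadPool& operator=(const ThreadPool&) = delete;

    // Queues a callable and returns a future for its result,
    // mirroring how std::async is typically used.
    template <typename F>
    auto Submit(F task) -> std::future<decltype(task())> {
        using R = decltype(task());
        // shared_ptr because std::function requires a copyable callable.
        auto packaged = std::make_shared<std::packaged_task<R()>>(std::move(task));
        std::future<R> result = packaged->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!stopping_);
            tasks_.push([packaged] { (*packaged)(); });
        }
        wake_.notify_one();
        return result;
    }

private:
    void WorkerLoop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and nothing left to run
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so other workers can proceed
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};
```

Unlike futures from std::async, futures obtained from a std::packaged_task do not block in their destructor, and the number of threads is fixed up front rather than left to the runtime. Whether this would actually avoid the hang described here is untested.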