microsoft/CNTK

Bug: MinibatchSource.GetNextMinibatch hangs (multiple serializers?)

nietras opened this issue · 12 comments

We often experience hangs when calling e.g.:

var minibatchData = trainMinibatchSource.GetNextMinibatch(minibatchSize, d);

At which point the whole training stops and never resumes. This is incredibly frustrating and seems to occur primarily when using multiple ImageDeserializers and/or CTFDeserializers. The exact circumstances are not known, nor is it deterministic. Sometimes it happens, sometimes it doesn't.

Seems like a threading/synchronization bug.

We would very much like this bug to be resolved ASAP as we do multi-channel multi-target learning where we often have multiple ImageDeserializers and a CTFDeserializer with a targets vector and mask vector.

cc: @mdabros

Hi,

I would like to help. Is it possible to get some repro mechanism? E.g. a script that produces the error, even sporadically?

Hi @jaliyae,

Thanks for the response.
We are trying to reproduce the issue in this repo: CntkSerializerIssue.
We have not succeeded yet, but at least you can get an idea of how we use CNTK when the problem occurs. We will continue to extend the example and hopefully find a repro mechanism.

I tripped over the following comment (in #3029):

OpenCV 3.4 support is added because this version fixes an issue where exif parsing errors would cause the application to hang when decoding a jpeg file.

Could there be a similar issue with png files? We use png files. I think OpenCV is still version 3.1?

Although, if this was a hang on exif parsing it should be deterministic, so probably not that.

A partial call stack (without symbols, since I don't have the CNTK pdb files and am not sure where to find them. Does CNTK have a symbol server?):

ntdll.dll!NtWaitForAlertByThreadId()
ntdll.dll!RtlSleepConditionVariableSRW()
KernelBase.dll!SleepConditionVariableSRW()
[Inline Frame] msvcp140.dll!Concurrency::details::stl_condition_variable_win7::wait_for(Concurrency::details::stl_critical_section_interface *) Line 216
	at f:\dd\vctools\crt\crtw32\stdcpp\thr\primitives.h(216)
msvcp140.dll!Concurrency::details::stl_condition_variable_win7::wait(Concurrency::details::stl_critical_section_interface * lock) Line 210
	at f:\dd\vctools\crt\crtw32\stdcpp\thr\primitives.h(210)
msvcp140.dll!do_wait(_Cnd_internal_imp_t * cond, _Mtx_internal_imp_t * mtx, const xtime * target) Line 77
	at f:\dd\vctools\crt\crtw32\stdcpp\thr\cond.c(77)
Cntk.Core-2.5.1.dll!00007ffe5ffd0e2f()
Cntk.Core-2.5.1.dll!00007ffe5ffd5fbb()
Cntk.Core-2.5.1.dll!00007ffe5ffd37fc()
Cntk.Core-2.5.1.dll!00007ffe6031dcc1()
Cntk.Core-2.5.1.dll!00007ffe6031e203()
Cntk.Core.CSBinding-2.5.1.dll!00007ffe60b42dd1()
[Managed to Native Transition]
Cntk.Core.Managed-2.5.1.dll!CNTK.MinibatchSource.GetNextMinibatch(uint minibatchSizeInSamples, CNTK.DeviceDescriptor device)

from the call in C# being:

var minibatchData = trainMinibatchSource.GetNextMinibatch(minibatchSize, d);

UPDATE: Better stack trace of external code.

In ReaderShim there is a wait() call that might be relevant (or not); we would need the pdb files to be sure:

https://github.com/Microsoft/CNTK/blob/624bf7d82b341863a282c416110df71a0b3ea302/Source/Readers/ReaderLib/ReaderShim.cpp#L127

The line below seems to indicate that prefetching can be disabled, which is perhaps something we could try:

https://github.com/Microsoft/CNTK/blob/624bf7d82b341863a282c416110df71a0b3ea302/Source/Readers/ReaderLib/ReaderShim.cpp#L74

Thank you for the debug information. We recently fixed a prefetch issue in the reader, and it could be related. Is it possible to try the latest version from here?

Sounds good that you already fixed one prefetch issue.

I have made a build with the latest version of the nightly packages: 2.6.0-rc0.dev20180731.

It is currently running a long training session. I will report back with the findings.

@jaliyae Sadly, the problem is still there even when using version 2.6.0-rc0.dev20180731.

@jaliyae I wanted to try running with prefetch = false so we can verify that prefetch is the issue, but I have no idea how to set this parameter. Any advice on that? It seems it has to come from MinibatchSourceConfig, but that isn't a CNTKDictionary itself, and Internal::ToDictionary does not seem to create any "prefetch" entry, so how can we specify this?

Hi all, I have been hitting this issue since March of 2018 :( The deadlock occurs when the bundling chunk futures are waited on, for example here: https://github.com/Microsoft/CNTK/blob/master/Source/Readers/ReaderLib/Bundler.cpp#L284
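
One way to check which future never completes would be to replace an unbounded wait with a timed one while debugging. The following is only a diagnostic sketch in standard C++; `FirstStuckFuture` is a hypothetical helper, not something that exists in CNTK, and the timeout value is an arbitrary choice.

```cpp
#include <chrono>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical diagnostic helper (not part of CNTK).
// Waits on each future in order, for at most `timeout` per future.
// Returns the index of the first future that is not ready in time,
// or futures.size() if every future became ready.
// Note: a future created with std::launch::deferred reports
// future_status::deferred immediately and is therefore counted as not ready.
template <typename T>
std::size_t FirstStuckFuture(std::vector<std::future<T>>& futures,
                             std::chrono::milliseconds timeout)
{
    for (std::size_t i = 0; i < futures.size(); ++i)
    {
        if (futures[i].wait_for(timeout) != std::future_status::ready)
            return i; // this one did not finish in time
    }
    return futures.size();
}
```

Logging the returned index (instead of blocking forever) would at least show whether the hang is always on the same chunk or on an arbitrary one.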

I think the root cause is a bug in the Windows implementation of std::async (see https://stackoverflow.com/questions/50898954/possible-stdasync-implementation-bug-windows#50899052), so it might be worth implementing our own lightweight thread pool here or using an existing third-party library (Boost maybe?).
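
To make the "lightweight thread-pool" idea concrete, here is a minimal sketch using only the C++ standard library. It is not CNTK code: `SimpleThreadPool`, `Submit`, and `SumOfSquaresOnPool` are hypothetical names chosen for illustration, and I have not verified that swapping something like this in for std::async would actually remove the hang.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical fixed-size pool (illustration only, not part of CNTK).
class SimpleThreadPool
{
public:
    explicit SimpleThreadPool(std::size_t threadCount)
    {
        assert(threadCount > 0);
        for (std::size_t i = 0; i < threadCount; ++i)
            m_workers.emplace_back([this] { WorkerLoop(); });
    }

    // Drains the queued tasks, then joins all workers.
    ~SimpleThreadPool()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_stopping = true;
        }
        m_wake.notify_all();
        for (auto& worker : m_workers)
            worker.join();
    }

    SimpleThreadPool(const SimpleThreadPool&) = delete;
    SimpleThreadPool& operator=(const SimpleThreadPool&) = delete;

    // Queues a callable and returns a future for its result,
    // similar in spirit to std::async(std::launch::async, f).
    template <typename F>
    auto Submit(F&& f) -> std::future<decltype(f())>
    {
        using R = decltype(f());
        auto task = std::make_shared<std::packaged_task<R()>>(std::forward<F>(f));
        std::future<R> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_tasks.emplace([task] { (*task)(); });
        }
        m_wake.notify_one();
        return result;
    }

private:
    void WorkerLoop()
    {
        for (;;)
        {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_mutex);
                m_wake.wait(lock, [this] { return m_stopping || !m_tasks.empty(); });
                if (m_tasks.empty())
                    return; // stopping and nothing left to run
                job = std::move(m_tasks.front());
                m_tasks.pop();
            }
            job(); // run outside the lock
        }
    }

    std::mutex m_mutex;
    std::condition_variable m_wake;
    std::queue<std::function<void()>> m_tasks;
    std::vector<std::thread> m_workers;
    bool m_stopping = false;
};

// Usage example: submit n tasks, then wait on all their futures.
int SumOfSquaresOnPool(int n)
{
    SimpleThreadPool pool(2);
    std::vector<std::future<int>> futures;
    for (int i = 1; i <= n; ++i)
        futures.push_back(pool.Submit([i] { return i * i; }));
    int sum = 0;
    for (auto& f : futures)
        sum += f.get();
    return sum;
}
```

One caveat with a fixed-size pool like this: if a task blocks waiting on another task queued in the same pool, all workers can end up blocked, so it would only be safe here if the chunk-loading tasks do not wait on each other.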