klusta-team/klustakwik

Concurrency error in OpenMP sections of EStep/MStep

Closed this issue · 8 comments

In KK::EStep and KK:MStep, the loop counter i is shared among all of the parallel threads. (See, for instance, the loop counter 'i' used in the loop on line 891 of klustakwik.cpp.) This should be private.

Making them shared has several consequences:

  1. Correctness -- for instance, when computing Mahal, 'i' may be updated by another thread leading to some values in Root being skipped.
  2. klustakwik may access memory beyond the end of the vectors, since another thread may increment the counter between evaluation of the loop guard and access to the vector elements. Aside from correctness, klustakwik will occasionally issue an assertion failure if compiled in debug mode with a compiler that does bounds checking on vector accesses (such as the Microsoft C++ compiler)

The problematic accesses are, I believe on lines 491, 493, and 891 of the current version of klustakwik.cpp.

Thanks for this!

As far as I can tell you are right, but I'm surprised that we missed this because no problems came up in testing and we haven't had any reports of problems with the OpenMP version. Have you actually experienced crashes / incorrect results using this version? Could it be that the compiler was being smarter than we think and making them private?

I've actually seen assertion failures when compiling with debugging enabled
using MSVC++.

As for whether the compiler might be making them private, while it's
possible, it seems unlikely -- the compiler has no way of knowing your
intentions, so it generally shouldn't deviate from the semantics defined by
the OpenMP spec.

It seems highly likely that this problem could have gone undetected: this
will not generally cause crashes unless you have compiled with a version of
the C++ standard library that does bounds checking (e.g., MSVC++ with
debugging enabled). Even when it doesn't cause crashes, it has the
potential to cause some of the computations to be incorrect, but given that
the 3 relevant loops (that use the shared counter) are small and will run
quickly, it's also likely that this symptom would be rare enough or small
enough in magnitude to escape detection.

On Mon, Apr 20, 2015 at 8:39 AM, Dan Goodman notifications@github.com
wrote:

Thanks for this!

As far as I can tell you are right, but I'm surprised that we missed this
because no problems came up in testing and we haven't had any reports of
problems with the OpenMP version. Have you actually experienced crashes /
incorrect results using this version? Could it be that the compiler was
being smarter than we think and making them private?


Reply to this email directly or view it on GitHub
#38 (comment)
.

  • jed

Yeah I'm pretty sure you're right. If you happen to have a version of this where you've corrected the problem, could you submit a patch or pull request? Otherwise, I'll do this later today or tomorrow. I think the changes should be pretty minor. I'll probably remove the declarations of i and j etc. at the top of the methods so that this sort of mistake can't reoccur.

I don't have access to the patched source code right now, but it was a
pretty minimal change -- I could probably recreate the patch and send it
this evening if you like. But your change sounds like a good idea either
way.

On Mon, Apr 20, 2015 at 9:17 AM, Dan Goodman notifications@github.com
wrote:

Yeah I'm pretty sure you're right. If you happen to have a version of this
where you've corrected the problem, could you submit a patch or pull
request? Otherwise, I'll do this later today or tomorrow. I think the
changes should be pretty minor. I'll probably remove the declarations of i
and j etc. at the top of the methods so that this sort of mistake can't
reoccur.


Reply to this email directly or view it on GitHub
#38 (comment)
.

  • jed

No don't worry about it - it's a small change that won't take long to do, was just asking in case you had it easily to hand.

I've made a pull request to fix this issue (#39), could you take a look? I'll merge tomorrow if I don't hear anything.

Looks good to me -- I don't see any more potential concurrency issues after
that change.

On Tue, Apr 21, 2015 at 6:54 AM, Dan Goodman notifications@github.com
wrote:

I've made a pull request to fix this issue (#39
#39), could you take a
look? I'll merge tomorrow if I don't hear anything.


Reply to this email directly or view it on GitHub
#38 (comment)
.

  • jed

OK, gonna go ahead and merge that then.