ocaml-multicore/domainslib

Possible data races in Task.parallel_scan

OlivierNicole opened this issue · 8 comments

Once again I report a ThreadSanitizer data race warning. I haven't had time to flesh it out yet, but I intend to look at it more closely soon.

How to reproduce

Pin Domainslib to revision b8de1f71. Pin qcheck to revision f09f42d0. Switch to the ThreadSanitizer-instrumenting compiler and build and execute the following test (from https://github.com/jmid/multicoretests/):

open QCheck
open Domainslib

let test_parallel_scan =
  Test.make ~name:"test Task.parallel_scan" ~count:1000
    (pair (int_bound 10) small_nat)
    (fun (num_additional_domains,array_size) ->
       let pool = Task.setup_pool ~num_additional_domains () in
       let a = Task.run pool (fun () -> Task.parallel_scan pool (+) (Array.make array_size 1)) in
       Task.teardown_pool pool;
       a = Array.init array_size (fun i -> i + 1))
;;
QCheck_runner.run_tests_main [ test_parallel_scan ]

You get the following two reports, both in Task.parallel_scan:

  1. The line numbers are always a little off; the code address offset reported for the item #0 of backtraces is always off by one; here, it seems that the possibly racy accesses are the writes in prefix_s.
  2. A current limitation of the the ThreadSanitizer instrumentation is that backtrace information is not recorded in OCaml code, so the “previous” backtraces are probably incomplete and may be incorrect.

The first race is between:

prefix_sum.(i) <- op prefix_sum.(i - 1) elements.(i)

and the initializing write at
let prefix_s = Array.copy elements in

so, according to the memory model, not a real race, as there is supposed to be a synchronization between an initializing write and the sharing of a value, according to the PLDI'18 paper. I don't understand yet why TSan didn't see this synchronization, though.

The second race is between two writes in the prefix_s array, at

domainslib/lib/task.ml

Lines 231 to 235 in 3a74279

parallel_for pool ~chunk_size:1 ~start:0 ~finish:(p - 1)
~body:(fun i ->
let s = (i * n) / (p ) in
let e = (i + 1) * n / (p ) - 1 in
scan_part op elements prefix_s s e);
and

domainslib/lib/task.ml

Lines 238 to 242 in 3a74279

for i = 2 to p do
let ind = i * n / p - 1 in
x := op prefix_s.(ind) !x;
prefix_s.(ind) <- !x
done;

The thing is that the parallel_for actually waits for all operations to complete (via await), so this is probably another false alarm. Good for Domainslib!

The first race is between:

prefix_sum.(i) <- op prefix_sum.(i - 1) elements.(i)

and the initializing write at

let prefix_s = Array.copy elements in

so, according to the memory model, not a real race, as there is supposed to be a synchronization between an initializing write and the sharing of a value, according to the PLDI'18 paper. I don't understand yet why TSan didn't see this synchronization, though.

It might be worthwhile to check whether the Array.copy primitive has the necessary synchronisation. A release fence.

The copying is performed in caml_array_gather; in the case of small arrays (which is the case in this test), the array is allocated in the minor heap and a plain memcpy is used.
https://github.com/ocaml/ocaml/blob/50b09531f658c1ebe502cb566dee52c3e32d3c61/runtime/array.c#L448-L460

Maybe some code paths do not go through the fence when that array is promoted to the major heap. (Is it always promoted?) I don't know where the code that handles this is in the runtime.

There is in fact no question of promotion, as it was pointed out to me that it is no longer necessary in the stop-the-world minor collection design that was retained. We still need a fence, though, and indeed there isn't one in caml_array_gather (used by Array.copy, append and concat).

I'm taking the liberty of pasting here an explanation by @kayceesrk from a discussion on Slack:

Generally, there are only two ways an object can be shared with another domain.

  1. Through the closure that is used in Domain.spawn
  2. Assigning to a shared reference. This goes through caml_modify which has the release fence.

Since caml_initialize will precede a caml_modify or Domain.spawn, which has the necessary fence, caml_initialize doesn’t need the fence.

Looking closer, the race here is between the initializing write (a plain store) in Array.copy and a read in caml_modify. Indeed, before doing a store, caml_modify does a plain load (argument *fp):
https://github.com/ocaml/ocaml/blob/357b42accc160c699219575ab8b952be9594e1d9/runtime/memory.c#L151
and that is done before the fence.

From my (not very experienced) point of view this one is technically a real race… maybe @sadiqj, @maranget or @damiendoligez have an opinion about it?

No longer having this race report from the latest versions of ocaml-tsan, I’ll blame it on an insufficient instrumentation of the runtime back then, and close this.