ROCm/rocSOLVER

Behaviour with `hipStreamNonBlocking` streams in handles

msimberg opened this issue · 7 comments

Is it safe to use streams created with hipStreamNonBlocking in rocblas_handles?

We have a test case using rocsolver_potrf where it seems like using a stream with hipStreamNonBlocking leads to incorrect results. Unfortunately the test case is not particularly small or self-contained at the moment, so before I attempt to minimize the test case I wanted to ask here if it's possible that some (or all) rocsolver functions actually assume that the stream synchronizes with the default stream (as it would without hipStreamNonBlocking)? Possibly for the workspace allocations or similar?

cgmb commented

The null stream should not be used for anything within rocSOLVER. If you've identified a case where that is required, I believe it is a bug. Though, note that the stream restrictions discussed ROCm/rocBLAS#1253 also apply to use with rocSOLVER.

Thanks @cgmb. Yeah, I figured ROCm/rocBLAS#1253 applies here as well. However, in this case, there's no switching of streams going on. In that case I'll try to reduce the test case to something a bit more manageable.

@cgmb here's a fairly minimal reproducing test case: https://gist.github.com/msimberg/3e708e6a55bb18a1479fedfd5da9299a. It only ever fails with hipStreamNonBlocking and tiles bigger than 64×64.

If you have any questions about the test please ask. Questions about the test input I'll have to defer to @rasolca or @albestro, but I think that shouldn't be very important as the test passes without hipStreamNonBlocking.

Small update: I can reproduce the failure on the above test case with HIP 5.0.2, but no longer with 5.1.0 or 5.1.3. I have not tried 5.2.0 yet because it's not available on spack.

However, I can still reproduce failures consistently in our miniapp which uses potrf (the miniapp does a blocked cholesky decomposition). Not using hipStreamNonBlocking gives correct results.

I unfortunately do not have another small reproducer, but while I try to create one I have another couple of questions:

  • Do you know if something was explicitly fixed in 5.1.0 that might have changed the behaviour of the above test, or is it likely just an "accidental fix" i.e. something unrelated changed which just happened to have an effect on the test?
  • Regarding this comment: "The null stream should not be used for anything within rocSOLVER". Since I'm unsure how closely tied the rocBLAS and rocSOLVER development is, is this comment explicitly only about rocSOLVER or are you implicitly including rocBLAS there as well?
cgmb commented

I don't have all the answers, but @qjojo will take a look into this.

Regarding this comment: "The null stream should not be used for anything within rocSOLVER". Since I'm unsure how closely tied the rocBLAS and rocSOLVER development is, is this comment explicitly only about rocSOLVER or are you implicitly including rocBLAS there as well?

We work closely with the rocBLAS team. The libraries are very tightly coupled and we apply the same design patterns across both. Neither rocBLAS nor rocSOLVER should be using the default stream for anything. In fact, I was reminded of this issue when I saw a bug report for that exact problem in rocblas dot.

Hi! Just commenting to note that I also came across this while trying to port a blocked cholesky miniapp to ROCm.
(My minimal example is the backlinked cupy/cupy#7050 above, which reproduces this with a random matrix that's SPD with high probability)

The issue appears most frequently with matrices exceeding (500, 500)

qjojo commented

When building against the lastest ROCm version I'm not able to reproduce this issue, so it's likely been fixed by some recent changes to synchronization in the ROCm math libraries. Additionally the provided example dereferences an allocation from hipMalloc which is undefined behaviour, to use pinned host memory on an accelerator you'll need to use hipHostMalloc. Changing hipMalloc to hipHostMalloc in the example has it running with no issues.