Problem with MKLShim context and hipfftSetStream

Question

Closed this issue a year ago · 4 comments

Something appears to be wrong with how H4I-MKLShim is setting a new stream for hipfftSetStream. When implementing hipfftSetStream, I used a similar process to what is used for H4I-HipBLAS and H4I-HipSOLVER.

I have one test case that tests hipfftSetStream (tests/hipfft_real_1d_stream). It runs correctly on OLCF Frontier but not on Sunspot. The final error should be on the order of 1.0e-7 but is ~1.4 instead.

The rest of the test cases run correctly on Sunspot.
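The test source is not shown in this thread, so the following is only a CPU stand-in for the kind of round-trip check described above: a forward real-to-complex transform, an inverse transform, a 1/N scaling, and a maximum absolute error against the original input. It uses a naive single-precision DFT rather than hipFFT, and the names `forward_dft`, `inverse_dft_real`, and `round_trip_max_error` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper: naive, unnormalized forward DFT of a real signal.
std::vector<std::complex<float>> forward_dft(const std::vector<float>& x) {
    const std::size_t n = x.size();
    const float two_pi = 6.283185307179586f;
    std::vector<std::complex<float>> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<float> sum(0.0f, 0.0f);
        for (std::size_t j = 0; j < n; ++j) {
            // Reduce k*j modulo n to keep the angle small and accurate.
            const float angle = -two_pi * static_cast<float>((k * j) % n) /
                                static_cast<float>(n);
            sum += x[j] * std::complex<float>(std::cos(angle), std::sin(angle));
        }
        out[k] = sum;
    }
    return out;
}

// Hypothetical helper: naive, unnormalized inverse DFT, keeping the real part.
std::vector<float> inverse_dft_real(const std::vector<std::complex<float>>& X) {
    const std::size_t n = X.size();
    const float two_pi = 6.283185307179586f;
    std::vector<float> out(n);
    for (std::size_t j = 0; j < n; ++j) {
        std::complex<float> sum(0.0f, 0.0f);
        for (std::size_t k = 0; k < n; ++k) {
            const float angle = two_pi * static_cast<float>((k * j) % n) /
                                static_cast<float>(n);
            sum += X[k] * std::complex<float>(std::cos(angle), std::sin(angle));
        }
        out[j] = sum.real();
    }
    return out;
}

// Forward transform, inverse transform, scale by 1/N, then report the
// largest absolute difference from the original input.
float round_trip_max_error(std::size_t n) {
    const float two_pi = 6.283185307179586f;
    std::vector<float> x(n);
    for (std::size_t j = 0; j < n; ++j) {
        x[j] = 0.5f + std::sin(two_pi * static_cast<float>(j) /
                               static_cast<float>(n));
    }
    const std::vector<float> y = inverse_dft_real(forward_dft(x));
    float max_err = 0.0f;
    for (std::size_t j = 0; j < n; ++j) {
        const float diff = std::fabs(y[j] / static_cast<float>(n) - x[j]);
        if (diff > max_err) max_err = diff;
    }
    return max_err;
}
```

For a correct single-precision round trip the returned value stays close to rounding error, so an error of order 1 means the data read back is not the result of the completed transforms.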

Follow the instructions in README.md to build H4I-MKLShim, H4I-HipFFT, and the H4I-HipFFT tests in H4I-HipFFT/tests. Running the tests/hipfft_real_1d_stream case on Sunspot reproduces the error.

Answer 1 · 2023-12-01T03:37:49.000Z

@dsnichols, I looked into the test case. I have a few questions to help me understand the issue better.

What happens if you don't change the stream?
I can see that the test calls hipDeviceSynchronize() before reading the data/result. Ideally, a device synchronize will synchronize all active in-order streams. Are you saying that is not happening?
From your description, it looks like the test passes on one system but fails on another. If that is true, what is the rationale for suspecting that MKLShim has an issue? The shim layer does not do anything system specific.
Since the problem is system specific and I don't have access to a Sunspot system, can you share the chipStar log with me?

Answer 2 · 2023-12-01T19:45:26.000Z

@Sarbojit2019 ,

If the two hipfftSetStream calls (one for plan_r2c and one for plan_c2r) in lines 33 and 34 are commented out, then, if I remember correctly, the case runs correctly. I readily acknowledge that I may be doing something wrong in my function that sets the stream. However, I modeled my function after those used by H4I-HipBLAS and H4I-HipSOLVER.
Some of the hipDeviceSynchronize calls may be "extra" and could possibly be removed as chipStar matures. I had some experiences on the JLSE Iris nodes where hipMalloc, hipMemcpy, and the hipfftExecute*** operations would get out of sync, and the hipDeviceSynchronize calls solved that problem. However, it is necessary to have either the hipDeviceSynchronize or the hipStreamSynchronize calls immediately after the hipfftExecute calls to ensure the forward and backward transforms are finished before progressing to the next operation. I keep them before/after the hipMemcpy calls as an extra fence. I'm confident that the hipDeviceSynchronize call is functioning correctly, but I'm not sure what's happening with the stream. I may not have the correct understanding of how HIP streams relate to SYCL queues. Regardless, I need some help because the extra stream is not working correctly for this case.
Since the test codes work on a ROCm/HIP system (OLCF Frontier) and a CUDA system (OLCF Summit), I suspect the problem is related to MKLShim, because MKLShim is where all the Intel SYCL "magic" occurs. Without a new stream, all the test codes work; with a new stream, this one case fails. So, either I'm doing something wrong in my hipfftSetStream function (which is modeled after the functions in H4I-HipBLAS and H4I-HipSOLVER), or there is something wrong with how the new stream is being created and/or handled in the MKLShim context, or both. Regardless, I need some help.
I'm not sure what a chipStar log is, but if you can tell me how to create the chipStar log, then I'll be happy to share it.

Answer 3 · 2023-12-02T01:00:24.000Z

@dsnichols,
Thanks for your detailed response.

You can go through https://github.com/CHIP-SPV/chipStar/blob/main/docs/Using.md to enable the chipStar trace.
Are you available on Slack? I'm asking so that we can avoid long response times and debug the issue more quickly.

Answer 4 · 2023-12-13T04:52:32.000Z

Worked with @dsnichols offline and resolved the issue successfully.

Root cause: The L0 runtime has a command batching feature that is enabled by default, so a wait on the queue may not give the expected result if the task has not yet been submitted to the queue. An event wait is preferred over a queue wait because it makes sure the event has actually completed. Changing the queue wait to an event wait resolved the issue.
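The root cause can be illustrated with a toy model in standard C++. This is not the Level Zero, SYCL, or chipStar API: `ToyEvent`, `ToyBatchingQueue`, and the two helper functions are hypothetical names, and the model simply assumes that a queue wait covers only commands already submitted while an event wait forces the batch holding that event's command to run.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Toy model only; none of these types belong to Level Zero, SYCL, or chipStar.
struct ToyEvent {
    bool complete = false;
};

class ToyBatchingQueue {
public:
    // Enqueue work. The command is held in a batch and not yet submitted.
    std::shared_ptr<ToyEvent> enqueue(std::function<void()> work) {
        auto event = std::make_shared<ToyEvent>();
        batch_.push_back(Command{std::move(work), event});
        return event;
    }

    // Models a queue wait: it covers only commands already submitted,
    // so commands still held in the batch are left untouched.
    void wait_submitted() {}

    // Models an event wait: assumed to submit and run the pending batch
    // if the event has not completed yet.
    void wait_event(const std::shared_ptr<ToyEvent>& event) {
        if (!event->complete) {
            flush();
        }
    }

private:
    struct Command {
        std::function<void()> work;
        std::shared_ptr<ToyEvent> event;
    };

    // Submit every batched command, run it, and mark its event complete.
    void flush() {
        for (auto& command : batch_) {
            command.work();
            command.event->complete = true;
        }
        batch_.clear();
    }

    std::vector<Command> batch_;
};

// Reads the result after a queue wait; the batched command has not run.
int result_after_queue_wait() {
    ToyBatchingQueue queue;
    int result = 0;
    queue.enqueue([&result] { result = 42; });
    queue.wait_submitted();
    return result;
}

// Reads the result after an event wait; the batched command has run.
int result_after_event_wait() {
    ToyBatchingQueue queue;
    int result = 0;
    auto event = queue.enqueue([&result] { result = 42; });
    queue.wait_event(event);
    return result;
}
```

In this model, `result_after_queue_wait()` returns the stale value because the command never left the batch, while `result_after_event_wait()` returns the computed value. The same pattern would explain a transform result being read back before the transform has run.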

Closing the ticket now.