roastduck/FreeTensor

cublas backend of MatMul does not work with stream parallelism


We should run cuBLAS in an appropriate stream, which further requires creating a different cuBLAS handle for each stream. Since we cache the cuBLAS handle in GPUContext, we should make the cache work across multiple streams.
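A minimal sketch of one possible fix: key the cached handle on the stream, creating a handle lazily per stream and binding it with `cublasSetStream`. The `GPUContext` below is a simplified stand-in for illustration, not FreeTensor's actual class; it assumes the CUDA toolkit and cuBLAS are available.

```cpp
// Sketch only: per-stream cuBLAS handle cache (requires CUDA toolkit; C++17).
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <mutex>
#include <unordered_map>

class GPUContext {
    std::unordered_map<cudaStream_t, cublasHandle_t> handles_;
    std::mutex mtx_; // guard the cache if streams are driven from multiple host threads

public:
    // Return a cuBLAS handle bound to `stream`, creating and caching it on first use.
    cublasHandle_t cublasHandle(cudaStream_t stream) {
        std::lock_guard<std::mutex> lock(mtx_);
        auto it = handles_.find(stream);
        if (it == handles_.end()) {
            cublasHandle_t h;
            cublasCreate(&h);
            cublasSetStream(h, stream); // subsequent calls through h run in this stream
            it = handles_.emplace(stream, h).first;
        }
        return it->second;
    }

    ~GPUContext() {
        for (auto &[stream, h] : handles_)
            cublasDestroy(h);
    }
};
```

With this, a MatMul lowered to cuBLAS would fetch its handle via `ctx->cublasHandle(stream)` instead of a single shared handle, so concurrent MatMuls in different streams no longer serialize on (or race over) one handle's stream binding.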