microsoft/microxcaling

Custom CUDA code vs. Pytorch CPU/GPU

Closed this issue · 1 comment

Thanks for making such a nice library openly available.

I have a couple of questions regarding the following claim in the project's README:
"The custom CUDA code is faster, and in the case of MX more numerically accurate than pytorch GPU."

I understand that the custom CUDA code is faster, but why is it more numerically accurate than PyTorch on GPU?
What about PyTorch on CPU: is it also less accurate than the custom CUDA code?

The PyTorch code performs bit shifts by multiplying or dividing by powers of two. On the GPU, the power 2**e is itself computed in floating point and can be slightly inexact, which leads to inaccuracies when the shift is large. We've observed the following on V100 GPUs:

# Left-shift x by 16 bits (multiply by 2**16)
>>> x = torch.tensor([1.], dtype=torch.float32, device='cuda')
>>> e = torch.tensor([16.], dtype=torch.float32, device='cuda')
>>> x * 2**e
tensor([65535.9961], device='cuda:0')   # should be 65536
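
For comparison, here is the same computation on the CPU (a sketch, not part of the original report), which is expected to give the exact result:

# Same shift on the CPU for comparison
>>> x_cpu = torch.tensor([1.], dtype=torch.float32)
>>> e_cpu = torch.tensor([16.], dtype=torch.float32)
>>> x_cpu * 2**e_cpu
tensor([65536.])   # exact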

The inaccuracy isn't a big deal for actual deep learning, but it would trip some unit tests. So we always treat PyTorch CPU or the custom CUDA code as the golden reference, not PyTorch GPU.
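
As an illustrative sketch (not code from this repo), such a test can compare the GPU result against the CPU golden reference with a small tolerance rather than requiring bit-exact equality:

import torch

# CPU result is treated as the golden reference
x = torch.tensor([1.], dtype=torch.float32)
e = torch.tensor([16.], dtype=torch.float32)
ref = x * 2**e                           # expected to be exactly 65536 on CPU

# GPU result may drift slightly (e.g. 65535.9961 on V100)
out = (x.cuda() * 2**e.cuda()).cpu()

# Tolerance-based comparison passes despite the small drift
assert torch.allclose(out, ref, rtol=1e-6, atol=1e-3)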