/test

Primary Language: CUDA

bash run.sh

using m=10000000 n=64
 CUDA kernel takes 10 ms
 verified 
Input shape torch.Size([10000000, 64])
TORCH max takes 24.503946 ms
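
To put the two timings above in context, the following is a rough back-of-the-envelope sketch in plain Python. It uses only the figures printed in the log. The element size is an assumption (float32, 4 bytes), since the output does not state the dtype, and the helper name `effective_gb_per_s` is made up for illustration. Only the bytes read from the input are counted, because the log does not say how much output is written.

```python
# Figures copied from the run output above.
m, n = 10_000_000, 64
cuda_ms = 10.0          # "CUDA kernel takes 10 ms"
torch_ms = 24.503946    # "TORCH max takes 24.503946 ms"

# ASSUMPTION: float32 input (4 bytes per element); the log does not state the dtype.
bytes_per_element = 4

# Total bytes that must be read once to scan the whole input.
input_bytes = m * n * bytes_per_element

# Ratio of the two reported timings.
speedup = torch_ms / cuda_ms


def effective_gb_per_s(num_bytes, ms):
    """Hypothetical helper: bytes moved per second, expressed in GB/s (1 GB = 1e9 bytes)."""
    return num_bytes / (ms * 1e-3) / 1e9


cuda_bw = effective_gb_per_s(input_bytes, cuda_ms)
torch_bw = effective_gb_per_s(input_bytes, torch_ms)

print(f"input size      : {input_bytes / 1e9:.2f} GB")
print(f"speedup         : {speedup:.2f}x")
print(f"CUDA kernel read: {cuda_bw:.1f} GB/s")
print(f"torch max read  : {torch_bw:.1f} GB/s")
```

Under the float32 assumption, the input is 2.56 GB, so the reported 10 ms corresponds to roughly 256 GB/s of input reads and a speedup of about 2.45x over the torch timing. A different dtype would scale the bandwidth figures proportionally and leave the speedup unchanged.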

Tested on sm70 and sm86 (RTX 4060 Ti)