Issues
[QST] Why hopper-mixed-gemm's Bandwidth Utilization only have ~9% MBU in H100 SXM5?
#1794 opened by ZZBoom - 0
[QST] kInternalError while increasing warp count in older SIMT GEMM kernels.
#1800 opened by Shreya-gaur - 4
[QST] Split-k in hopper gather scatter gemm
#1798 opened by susavlsh10 - 1
[QST] Understanding double buffering in GEMM kernels
#1789 opened by phantaurus - 1
[FEA] print_layout can not print 3D case!
#1778 opened by ziyuhuang123 - 5
[FEA] transpose in epilogue/prologue
#1780 opened by xiaonans - 2
[QST] CUDA driver version and runtime version mis-match
#1788 opened by RuokaiYin - 1
[FEA] gather/scatter on other dims
#1779 opened by xiaonans - 1
[QST] CuTe / Cutlass 1D Convolution
#1758 opened by jeromeku - 0
Which Visual Studio 2022 BuildTools MSVC is the best version for Cuda 11.8 and Cuda 12.4 and so
#1793 opened by FurkanGozukara - 0
[BUG] SM90_U32x4_STSM_N for SM90
#1792 opened by jcao-ai - 1
[BUG] Simple matrix rotation could not compile
#1783 opened by lucifer1004 - 1
[DOC] Misleading comment in example 05_batched_gemm
#1773 opened by lucifer1004 - 0
[QST] Gemm results are different with tile_description?
#1769 opened by hxdtest - 1
[BUG] The results from different print statements are jumbled together and messy.
#1752 opened by ziyuhuang123 - 5
[QST] Is there an example for implementing gemm problem size like [b, m, k] * [k, n] in the folder `examples`?
#1764 opened by hxdtest - 1
[QST] How to compile and run `examples/35_gemm_softmax` ?
#1728 opened by hxdtest - 2
[FEA] CUDA API [cudaGetDriverEntryPointByVersion]
#1755 opened by SunNy820828449 - 2
`02_pytorch_extension_grouped_gemm.ipynb` No kernel configuration found for supported data type and layout combination (<DataType.bf16: 16>
#1757 opened by hxdtest - 0
[QST] Questions about correctness test and layout
#1756 opened by haeunlee99 - 4
[FEA] FP8 Convolution
#1750 opened by MustafaFayez - 0
[QST] Why we have three GEMM in cutlass
#1751 opened by ziyuhuang123 - 2
[QST] What is @ in cute's step?
#1744 opened by ziyuhuang123 - 1
[QST] some confusion about layout
#1746 opened by zhoutianzi666 - 2
[QST] GEMV implementation with CuTe
#1737 opened by DD-DuDa - 2
[QST] Value mismatches between GEMM kernel-fusion outputs and numpy outputs
#1739 opened by phantaurus - 0
[QST] cute's local_tile and step
#1745 opened by ziyuhuang123 - 1
[QST] How to use append?
#1741 opened by ziyuhuang123 - 9
[BUG] Release 3.5.0 build failing on Windows using CUDA 12.6, and VS2022 17.11
#1732 opened by levicki - 2
[QST] What's the difference between: pipeline.producer_commit and pipeline.producer_get_barrier
#1729 opened by ziyuhuang123 - 1
[QST] how to fix the compiling error: static assertion failed with "Vectors implied by the thread map must be divisible by the access type."
#1740 opened by alephchang - 1
[QST] SegFault when performing TiledCopy
#1735 opened by phantaurus - 1
[QST] Why rowMajor for A and B is different?
#1730 opened by ziyuhuang123 - 4
[QST] Why sm90 mma has prologue and mainloop?
#1725 opened by ziyuhuang123