TensorIntrin 'mma_i8i8f16_smooth_a_trans_b_smooth_b' is not registered
huanpengchu opened this issue · 5 comments
I want to use INT8 matmul, and the code and output are as follows:
Code
import bitblas
import torch
bitblas.set_log_level("Debug")
matmul_config = bitblas.MatmulConfig(
    M=16,  # M dimension
    N=2048,  # N dimension
    K=1024,  # K dimension
    A_dtype="int8",  # activation A dtype
    W_dtype="int8",  # weight W dtype
    accum_dtype="float16",  # accumulation dtype
    out_dtype="float32",  # output dtype
    layout="nt",  # matrix layout, "nt" indicates A is non-transposed and W is transposed
    with_bias=False,  # bias
    # configs for weight-only quantization
    group_size=None,  # setting for grouped quantization
    with_scaling=False,  # setting for scaling factor
    with_zeros=False,  # setting for zeros
    zeros_mode=None,  # setting for how to calculate zeros
    fast_decoding=False,  # setting for fast decoding
)
matmul = bitblas.Matmul(config=matmul_config)
input_tensor = torch.randint(-8, 8, (16, 1024), dtype=torch.int8).cuda()  # int8 activation to match A_dtype
weight_tensor = torch.randint(0, 7, (2048, 1024), dtype=torch.int8).cuda()
print(weight_tensor.shape)
transformed_weight = matmul.transform_weight(weight_tensor)
output_tensor = matmul(input_tensor, transformed_weight)
And the error:
2024-08-29 10:43:56 [BitBLAS:INFO]: Auto detected target: nvidia/nvidia-a100
2024-08-29 10:44:20 [BitBLAS:DEBUG]: Apply config {'block': [4, 1, 16], 'thread': [4, 1, 16], 'rstep': []}
[... similar DEBUG "Apply config" entries for the remaining thread-block tilings ...]
2024-08-29 10:44:25 [BitBLAS:INFO]: Evaluation with config {'block': [4, 1, 16], 'thread': [4, 1, 16], 'rstep': []}
2024-08-29 10:44:25 [BitBLAS:INFO]: Time cost of this config: 0.006 ms
[... each of the remaining non-tensor-core configs evaluates successfully at roughly 0.005-0.007 ms ...]
2024-08-29 10:44:28 [BitBLAS:INFO]: Tile Dict: [16, 256] Shared memory exceeds the static capacity, use dynamic shared memory.
2024-08-29 10:44:28 [BitBLAS:DEBUG]: Apply config {'block': [16, 16], 'warp': [16, 16], 'rstep': [512], 'use_tc': True, 'vectorize': {'A_reindex_reindex': 16, 'B_reindex_reindex': 16}, 'pipeline_stage': 2}
2024-08-29 10:44:28 [BitBLAS:DEBUG]: Apply schedule failed: Traceback (most recent call last):
  3: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (tvm::tir::Schedule, tvm::runtime::ObjectRef, tvm::runtime::String, bool)>::AssignTypedLambda<tvm::tir::{lambda(tvm::tir::Schedule, tvm::runtime::ObjectRef, tvm::runtime::String, bool)#14}>(tvm::tir::{lambda(tvm::tir::Schedule, tvm::runtime::ObjectRef, tvm::runtime::String, bool)#14}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  2: tvm::tir::TracedScheduleNode::Tensorize(tvm::tir::LoopRV const&, tvm::runtime::String const&, bool)
  1: tvm::tir::ConcreteScheduleNode::Tensorize(tvm::tir::LoopRV const&, tvm::runtime::String const&, bool)
  0: tvm::tir::TensorIntrin::Get(tvm::runtime::String, bool)
  File "/root/BitBLAS/3rdparty/tvm/src/tir/ir/function.cc", line 151
ValueError: TensorIntrin 'mma_i8i8f16_smooth_a_trans_b_smooth_b' is not registered
[... the same "Apply schedule failed" traceback is raised for every tensor-core config: block [16, 32], [16, 64], [16, 128], and [16, 256] ...]
2024-08-29 10:44:51 [BitBLAS:DEBUG]: Apply config {'block': [16, 16], 'warp': [16, 16], 'rstep': [512], 'use_tc': True, 'vectorize': {'A_reindex_reindex': 16, 'B_reindex_reindex': 16}, 'pipeline_stage': 2} failed, artifact path is None
[... the same "failed, artifact path is None" message is reported for the other four tensor-core configs ...]
Traceback (most recent call last):
  File "/workspace/chuhuanpeng/quant_cogview/pytorch_dit_int8/pytorch_dit_single/temp_bitblas.py", line 74, in <module>
    matmul = bitblas.Matmul(config=matmul_config)
  File "/usr/local/lib/python3.10/dist-packages/bitblas/ops/general_matmul/__init__.py", line 292, in __init__
    self.dispatch_tir(target, from_database, source_format, enable_tuning)
  File "/usr/local/lib/python3.10/dist-packages/bitblas/ops/general_matmul/__init__.py", line 348, in dispatch_tir
    self.hardware_aware_finetune()
  File "/usr/local/lib/python3.10/dist-packages/bitblas/ops/operator.py", line 206, in hardware_aware_finetune
    self.optimized_func = self.apply_fast_tuning(
  File "/usr/local/lib/python3.10/dist-packages/bitblas/ops/operator.py", line 178, in apply_fast_tuning
    self.pass_context = best.config.pass_context
AttributeError: 'NoneType' object has no attribute 'config'
However, the code works when M=1
Hi @huanpengchu, we recommend using int32 as the accumulation dtype :)
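For reference, here is a minimal sketch of your snippet with int32 accumulation (the shapes and config are copied from your example; the int8 input tensor is an assumption to match A_dtype="int8", with illustrative random values):

import bitblas
import torch

matmul_config = bitblas.MatmulConfig(
    M=16,
    N=2048,
    K=1024,
    A_dtype="int8",
    W_dtype="int8",
    accum_dtype="int32",  # recommended accumulation dtype for int8 x int8
    out_dtype="float32",
    layout="nt",
)
matmul = bitblas.Matmul(config=matmul_config)

# int8 activation to match A_dtype (values are illustrative only)
input_tensor = torch.randint(-8, 8, (16, 1024), dtype=torch.int8).cuda()
weight_tensor = torch.randint(0, 7, (2048, 1024), dtype=torch.int8).cuda()
transformed_weight = matmul.transform_weight(weight_tensor)
output_tensor = matmul(input_tensor, transformed_weight)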
And how can I use the FP8 matmul kernel (on an RTX 4090)?
if quant_mode == 'bitblas_fp8':
    dtype = "e4m3_float8"
    weight_type = torch.float8_e4m3fn
elif quant_mode == 'bitblas_int8':
    dtype = "int8"
    weight_type = torch.int8

matmul_config = bitblas.MatmulConfig(
    M=4444,  # M dimension
    N=cout,  # N dimension
    K=cin,  # K dimension
    A_dtype=dtype,  # activation A dtype
    W_dtype=dtype,  # weight W dtype
    accum_dtype="int32" if 'int' in quant_mode else "float32",  # accumulation dtype
    out_dtype="float16",  # output dtype
    layout="nt",  # matrix layout, "nt" indicates A is non-transposed and W is transposed
    with_bias=False,  # bias
    # configs for weight-only quantization
    group_size=None,  # setting for grouped quantization
    with_scaling=False,  # setting for scaling factor
    with_zeros=False,  # setting for zeros
    zeros_mode=None,  # setting for how to calculate zeros
    fast_decoding=False,  # setting for fast decoding
)
matmul = bitblas.Matmul(config=matmul_config)
return matmul, weight_type
but it fails with: ValueError: TensorIntrin 'mma_ldmatrix_e4m3_a_dyn' is not registered
Hi @huanpengchu, thanks for reporting this! Following the changes in PR #133, which switched the default memory scope from shared to shared.dyn, the FP8 Tensor Core path had not been fully tested. As a result, we missed the registration of the FP8 ldmatrix intrinsics for dynamic shared memory, which leads to this error.
I just made a fix; check out PR #162.
You can install the latest update with pip install git+https://github.com/microsoft/BitBLAS.git
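Once the update is installed, an FP8 config like the one you posted should go through tuning on the 4090 (sm_89). A minimal sketch, with the caveat that the N/K shapes below are placeholders and that building the e4m3 inputs by casting float16 tensors to torch.float8_e4m3fn is an assumption on my side:

import bitblas
import torch

matmul_config = bitblas.MatmulConfig(
    M=16,
    N=2048,  # placeholder output features
    K=1024,  # placeholder input features
    A_dtype="e4m3_float8",
    W_dtype="e4m3_float8",
    accum_dtype="float32",
    out_dtype="float16",
    layout="nt",
)
matmul = bitblas.Matmul(config=matmul_config)

# Build e4m3 inputs by casting float16 tensors (illustrative values only)
input_tensor = torch.randn(16, 1024, dtype=torch.float16).cuda().to(torch.float8_e4m3fn)
weight_tensor = torch.randn(2048, 1024, dtype=torch.float16).cuda().to(torch.float8_e4m3fn)
transformed_weight = matmul.transform_weight(weight_tensor)
output_tensor = matmul(input_tensor, transformed_weight)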