microsoft/BitBLAS

TensorIntrin 'mma_i8i8f16_smooth_a_trans_b_smooth_b' is not registered

huanpengchu opened this issue · 5 comments

I want to use INT8 matmul, and my code and its output are as follows:

Code

import bitblas
import torch
bitblas.set_log_level("Debug")
matmul_config = bitblas.MatmulConfig(
    M=16,  # M dimension
    N=2048,  # N dimension
    K=1024,  # K dimension
    A_dtype="int8",  # activation A dtype
    W_dtype="int8",  # weight W dtype
    accum_dtype="float16",  # accumulation dtype
    out_dtype="float32",  # output dtype
    layout="nt",  # matrix layout, "nt" indicates the layout of A is non-transpose and the layout of W is transpose
    with_bias=False,  # bias
    # configs for weight only quantization
    group_size=None,  # setting for grouped quantization
    with_scaling=False,  # setting for scaling factor
    with_zeros=False,  # setting for zeros
    zeros_mode=None,  # setting for how to calculate zeros
    fast_decoding=False,  # setting for fast decoding
)

matmul = bitblas.Matmul(config=matmul_config)


input_tensor = torch.randint(0, 7, (16, 1024), dtype=torch.int8).cuda()  # int8 activations to match A_dtype
weight_tensor = torch.randint(0, 7, (2048, 1024), dtype=torch.int8).cuda()
print(weight_tensor.shape)
transformed_weight = matmul.transform_weight(weight_tensor)

output_tensor = matmul(input_tensor, transformed_weight)

And the error:

2024-08-29 10:43:56 [BitBLAS:INFO]: Auto detected target: nvidia/nvidia-a100
2024-08-29 10:44:20 [BitBLAS:DEBUG]: Apply config {'block': [4, 1, 16], 'thread': [4, 1, 16], 'rstep': []}
2024-08-29 10:44:20 [BitBLAS:DEBUG]: Apply config {'block': [2, 1, 32], 'thread': [2, 1, 32], 'rstep': []}
... (18 more "Apply config" lines for the remaining block/thread shapes) ...
2024-08-29 10:44:25 [BitBLAS:INFO]: Evaluation with config {'block': [4, 1, 16], 'thread': [4, 1, 16], 'rstep': []}
2024-08-29 10:44:25 [BitBLAS:INFO]: Time cost of this config: 0.006 ms
... (evaluations of the remaining 19 configs, all between 0.005 ms and 0.007 ms) ...
2024-08-29 10:44:28 [BitBLAS:INFO]: Tile Dict: [16, 256] Shared memory exceeds the static capacity, use dynamic shared memory.
2024-08-29 10:44:28 [BitBLAS:DEBUG]: Apply config {'block': [16, 16], 'warp': [16, 16], 'rstep': [512], 'use_tc': True, 'vectorize': {'A_reindex_reindex': 16, 'B_reindex_reindex': 16}, 'pipeline_stage': 2}
... (same for blocks [16, 32], [16, 64], [16, 128], [16, 256]) ...
2024-08-29 10:44:28 [BitBLAS:DEBUG]: Apply schedule failed: Traceback (most recent call last):
  3: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (tvm::tir::Schedule, tvm::runtime::ObjectRef, tvm::runtime::String, bool)>::AssignTypedLambda<tvm::tir::{lambda(tvm::tir::Schedule, tvm::runtime::ObjectRef, tvm::runtime::String, bool)#14}>(tvm::tir::{lambda(tvm::tir::Schedule, tvm::runtime::ObjectRef, tvm::runtime::String, bool)#14}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  2: tvm::tir::TracedScheduleNode::Tensorize(tvm::tir::LoopRV const&, tvm::runtime::String const&, bool)
  1: tvm::tir::ConcreteScheduleNode::Tensorize(tvm::tir::LoopRV const&, tvm::runtime::String const&, bool)
  0: tvm::tir::TensorIntrin::Get(tvm::runtime::String, bool)
  File "/root/BitBLAS/3rdparty/tvm/src/tir/ir/function.cc", line 151
ValueError: TensorIntrin 'mma_i8i8f16_smooth_a_trans_b_smooth_b' is not registered
... (the identical traceback is raised for each of the five tensor-core configs) ...
2024-08-29 10:44:51 [BitBLAS:DEBUG]: Apply config {'block': [16, 16], 'warp': [16, 16], 'rstep': [512], 'use_tc': True, 'vectorize': {'A_reindex_reindex': 16, 'B_reindex_reindex': 16}, 'pipeline_stage': 2} failed, artifact path is None
... (the same "failed, artifact path is None" line for the other four tensor-core configs) ...
Traceback (most recent call last):
  File "/workspace/chuhuanpeng/quant_cogview/pytorch_dit_int8/pytorch_dit_single/temp_bitblas.py", line 74, in <module>
    matmul = bitblas.Matmul(config=matmul_config)
  File "/usr/local/lib/python3.10/dist-packages/bitblas/ops/general_matmul/__init__.py", line 292, in __init__
    self.dispatch_tir(target, from_database, source_format, enable_tuning)
  File "/usr/local/lib/python3.10/dist-packages/bitblas/ops/general_matmul/__init__.py", line 348, in dispatch_tir
    self.hardware_aware_finetune()
  File "/usr/local/lib/python3.10/dist-packages/bitblas/ops/operator.py", line 206, in hardware_aware_finetune
    self.optimized_func = self.apply_fast_tuning(
  File "/usr/local/lib/python3.10/dist-packages/bitblas/ops/operator.py", line 178, in apply_fast_tuning
    self.pass_context = best.config.pass_context
AttributeError: 'NoneType' object has no attribute 'config'

However, the same code works when M=1.

hi @huanpengchu, we recommend using int32 as the accum dtype :)
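
For reference, a minimal sketch of the adjusted config; only accum_dtype changes versus the original snippet, and keeping out_dtype="float32" is an assumption carried over from the report:

import bitblas

matmul_config = bitblas.MatmulConfig(
    M=16,
    N=2048,
    K=1024,
    A_dtype="int8",
    W_dtype="int8",
    accum_dtype="int32",  # recommended accumulation dtype for int8 x int8
    out_dtype="float32",
    layout="nt",
)
matmul = bitblas.Matmul(config=matmul_config)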


And how do I use the FP8 matmul kernel (on a 4090)? I tried:

def get_bitblas_matmul(quant_mode, cin, cout):  # enclosing function reconstructed for context; the name is hypothetical
    if quant_mode == 'bitblas_fp8':
        dtype = "e4m3_float8"
        weight_type = torch.float8_e4m3fn
    elif quant_mode == 'bitblas_int8':
        dtype = "int8"
        weight_type = torch.int8
    matmul_config = bitblas.MatmulConfig(
        M=4444,  # M dimension
        N=cout,  # N dimension
        K=cin,  # K dimension
        A_dtype=dtype,  # activation A dtype
        W_dtype=dtype,  # weight W dtype
        accum_dtype="int32" if 'int' in quant_mode else "float32",  # accumulation dtype
        out_dtype="float16",  # output dtype
        layout="nt",  # matrix layout, "nt" indicates the layout of A is non-transpose and the layout of W is transpose
        with_bias=False,  # bias
        # configs for weight only quantization
        group_size=None,  # setting for grouped quantization
        with_scaling=False,  # setting for scaling factor
        with_zeros=False,  # setting for zeros
        zeros_mode=None,  # setting for how to calculate zeros
        fast_decoding=False,  # setting for fast decoding
    )
    matmul = bitblas.Matmul(config=matmul_config)
    return matmul, weight_type

but it fails with ValueError: TensorIntrin 'mma_ldmatrix_e4m3_a_dyn' is not registered

hi @huanpengchu, thanks for reporting this! Following the changes in PR #133, which changed the default memory scope from shared to shared.dyn, the FP8 Tensor Core path has not been fully tested. As a result, we missed the registration of the FP8 ldmatrix operations for dynamic shared memory, which leads to this issue.
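
If you want to confirm whether a given build has the intrinsic, one way is to look it up directly; this is a sketch assuming BitBLAS registers its TIR tensor intrinsics at import time and that the bundled TVM is importable as tvm:

import bitblas  # importing bitblas should register its TIR tensor intrinsics
from tvm.tir import TensorIntrin

# TensorIntrin.get returns None instead of raising when allow_missing=True
intrin = TensorIntrin.get("mma_ldmatrix_e4m3_a_dyn", allow_missing=True)
print("registered" if intrin is not None else "not registered")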

I just made a fix; check out PR #162.

You can install the latest update with pip install git+https://github.com/microsoft/BitBLAS.git
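
With that installed, an FP8 build along the lines of the snippet above should go through; a minimal sketch, where the sm_89/4090 requirement, the torch.float8_e4m3fn casts, and the shapes are illustrative assumptions rather than tested values:

import bitblas
import torch

matmul_config = bitblas.MatmulConfig(
    M=16,
    N=2048,
    K=1024,
    A_dtype="e4m3_float8",  # FP8 activation
    W_dtype="e4m3_float8",  # FP8 weight
    accum_dtype="float32",
    out_dtype="float16",
    layout="nt",
)
matmul = bitblas.Matmul(config=matmul_config)

# torch has no FP8 random init; create in float16 and cast down
input_tensor = torch.rand((16, 1024), dtype=torch.float16).cuda().to(torch.float8_e4m3fn)
weight_tensor = torch.rand((2048, 1024), dtype=torch.float16).cuda().to(torch.float8_e4m3fn)

transformed_weight = matmul.transform_weight(weight_tensor)
output_tensor = matmul(input_tensor, transformed_weight)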