Performance of float16 with fast tuning
Hello,
I tried to run a fast tuning of GEMM with float16:
```python
from bitblas.base.roller.policy import TensorCorePolicy, DefaultPolicy
from bitblas.base.arch import CUDA
from bitblas.base.utils import apply_and_build
from bitblas.gpu.matmul_analysis import get_tensorized_func_and_tags
import tvm
from tvm.script import tir as T

M = 8
N = 152064
K = 3584


@tvm.script.ir_module
class MatmulNT:
    @T.prim_func
    def main(a: T.handle, b: T.handle, c: T.handle):
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        A = T.match_buffer(a, [M, K], dtype="float16")
        B = T.match_buffer(b, [N, K], dtype="float16")
        C = T.match_buffer(c, [M, N], dtype="float16")

        for i, j, k in T.grid(M, N, K):
            with T.block("B"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                with T.init():
                    C[vi, vj] = tvm.tir.const(0, "float16")
                C[vi, vj] = C[vi, vj] + A[vi, vk].astype("float16") * B[
                    vj, vk
                ].astype("float16")


ir_module = MatmulNT
func = ir_module["main"]
target = tvm.target.Target("nvidia/nvidia-a100")
arch = CUDA(target)

# Tune with SIMT CUDA cores by default
policy = DefaultPolicy(func=func, arch=arch)
try:
    tensorized_func, tags = get_tensorized_func_and_tags(func, arch.target)
except Exception:
    tags = None

# Tune with Tensor Cores if possible
if tags:
    policy = TensorCorePolicy(func=tensorized_func, arch=arch, tags=tags)

configs = policy.emit_config(topk=20)
cpresults, best = apply_and_build(func, configs, arch, parallel_build=True)
print("[BitBLAS] The best latency of top 1 is {:.3f} ms".format(cpresults[0].latency))
print("[BitBLAS] The best latency of top 20 is {:.3f} ms".format(best.latency))
```
But I got results that are not as expected:

```
[BitBLAS] The best latency of top 1 is 11.767 ms
[BitBLAS] The best latency of top 20 is 5.987 ms
```
For comparison, I tuned a single-layer model using TVM's Meta Schedule, with the model structure nn.Linear(3584, 152064) and a batch size of 8. Below are the tuning log results:

| ID | Name | FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done |
|---:|------|-----:|-------:|---------------:|-------------:|----------------------:|-------:|------|
| 0 | fused_nn_dense_add | 8721174528 | 1 | 13285.4769 | 656.4442 | 656.4442 | 1535 | |
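For reference, these log numbers are internally consistent with the shape above: the FLOP count equals `2*M*N*K` for the dense plus `M*N` for the bias add, and the GFLOPS figure equals FLOP divided by latency (a quick check added here, not part of the original log):

```python
# Sanity check of the Meta Schedule log above
M, N, K = 8, 152064, 3584
flop = 2 * M * N * K + M * N    # dense + bias add = 8,721,174,528, matching the log
latency_s = 656.4442e-6
print(flop / latency_s / 1e9)   # ~13285 GFLOPS, matching the log
```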
The result is 656 us. I would like to know whether I am using the BitBLAS tuning method incorrectly.
@klxy0304, BitBLAS uses a straightforward rule to determine whether a GEMM shape should utilize the tensor core, as seen here: [matmul_analysis.py#L669-L670](https://github1s.com/microsoft/BitBLAS/blob/main/bitblas/gpu/matmul_analysis.py#L669-L670).
The rule requires each dimension to be larger than 16 (in your case, M is 8). However, you can still enable it by running:

```python
tensorized_func, tags = get_tensorized_func_and_tags(func, arch.target, allow_gemv=True)
```
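For intuition, here is a minimal sketch of that shape check (the names and threshold handling are illustrative, not the exact BitBLAS code):

```python
MINIMAL_TENSORIZE_THRESHOLD = 16  # each GEMM dimension must reach this size

def can_use_tensor_core(m: int, n: int, k: int) -> bool:
    # Tensor-core tiles need a sufficiently large extent on every dimension
    return all(dim >= MINIMAL_TENSORIZE_THRESHOLD for dim in (m, n, k))

print(can_use_tensor_core(8, 152064, 3584))  # False: M = 8 is below the threshold
```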
@LeiWang1999
Thank you for the quick reply. When I enable tensor core utilization, I receive an error from TVM, as below:
```
Traceback (most recent call last):
  File "/root/workspace/tuning_work/bitblas/bitblas_tuning.py", line 47, in <module>
    cpresults, best = apply_and_build(func, configs, arch, parallel_build=True)
  File "/usr/local/lib/python3.10/dist-packages/bitblas/base/utils.py", line 293, in apply_and_build
    return apply_and_build_parallel(
  File "/usr/local/lib/python3.10/dist-packages/bitblas/base/utils.py", line 201, in apply_and_build_parallel
    builder = PopenPoolExecutor(max_workers=max_workers, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/bitblas/3rdparty/tvm/python/tvm/contrib/popen_pool.py", line 360, in __init__
    self._threadpool = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 144, in __init__
    raise ValueError("max_workers must be greater than 0")
ValueError: max_workers must be greater than 0
Exception ignored in: <function PopenPoolExecutor.__del__ at 0x7fd03871a050>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/bitblas/3rdparty/tvm/python/tvm/contrib/popen_pool.py", line 374, in __del__
    self._lock.acquire()
AttributeError: 'PopenPoolExecutor' object has no attribute '_lock'
```
Could you tell me how to solve this?
Looks like an environment-related issue; maybe you could try disabling parallel_build.
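That is, keeping everything else the same, change the final build call (mirroring the one in the script above):

```python
# Build and profile the candidate configs serially instead of in a process pool
cpresults, best = apply_and_build(func, configs, arch, parallel_build=False)
```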
@LeiWang1999, I tried setting `parallel_build=False`, and to rule out the original environment, I started a new Docker container and reinstalled BitBLAS through `pip install bitblas`. But this error still occurs.
@klxy0304, would you mind appending `bitblas.set_log_level("Debug")` before `M = 8`?
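That is, at the top of the script:

```python
import bitblas

bitblas.set_log_level("Debug")  # emit verbose tuning diagnostics

M = 8
N = 152064
K = 3584
```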
@LeiWang1999 Sure. After I appended it, the log is:
```
Traceback (most recent call last):
  File "/ossfs/workspace/bitblas_tune.py", line 48, in <module>
    cpresults, best = apply_and_build(func, configs, arch, parallel_build=False)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/tvm-build-venv/lib/python3.11/site-packages/bitblas/base/utils.py", line 293, in apply_and_build
    return apply_and_build_parallel(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/tvm-build-venv/lib/python3.11/site-packages/bitblas/base/utils.py", line 201, in apply_and_build_parallel
    builder = PopenPoolExecutor(max_workers=max_workers, timeout=timeout)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/tvm-build-venv/lib/python3.11/site-packages/bitblas/3rdparty/tvm/python/tvm/contrib/popen_pool.py", line 360, in __init__
    self._threadpool = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/tvm-build-venv/lib/python3.11/concurrent/futures/thread.py", line 144, in __init__
    raise ValueError("max_workers must be greater than 0")
ValueError: max_workers must be greater than 0
Exception ignored in: <function PopenPoolExecutor.__del__ at 0x7fd1d8a96020>
Traceback (most recent call last):
  File "/opt/conda/envs/tvm-build-venv/lib/python3.11/site-packages/bitblas/3rdparty/tvm/python/tvm/contrib/popen_pool.py", line 374, in __del__
    self._lock.acquire()
    ^^^^^^^^^^
AttributeError: 'PopenPoolExecutor' object has no attribute '_lock'
```
@LeiWang1999 I found that the cause is that the `check_tile_shape_isvalid` check in the `emit_config` interface keeps failing, resulting in `max_workers=0`. As seen here: [default.py#L46-L47](https://github1s.com/microsoft/BitBLAS/blob/main/bitblas/base/roller/policy/default.py#L46-L47).
Is this caused by the definition of MatmulNT?
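In other words, a minimal, illustrative reconstruction of the failure chain (not the exact BitBLAS code): when every candidate tile shape fails validation, no configs survive, the worker count derived from them is 0, and the pool constructor raises exactly the error above.

```python
import os
from concurrent.futures import ThreadPoolExecutor

configs = []  # all candidate tile shapes failed check_tile_shape_isvalid
max_workers = min(len(configs), os.cpu_count())  # pool sized from surviving configs
pool = ThreadPoolExecutor(max_workers=max_workers)
# ValueError: max_workers must be greater than 0
```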
@klxy0304, I tested on my A100, and the issue seems to be that the value of N is too large, which may cause an index computed from N * K to overflow the maximum INT32 value.
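A quick back-of-the-envelope check (added for reference; exactly which expression overflows is an assumption): the element count `N * K` itself still fits in int32, but a 4-byte offset derived from it does not.

```python
INT32_MAX = 2**31 - 1          # 2,147,483,647
N, K = 152064, 3584
print(N * K)                   # 544,997,376 elements: fits in int32
print(N * K * 4 > INT32_MAX)   # True: a 4-byte offset derived from it overflows
```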
We should implement a pass to cast all indices to the int64 datatype when at least one index exceeds the maximum value of a 32-bit integer:
- implement pass: `LegalizeIndexDataType`
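Until such a pass lands, one possible manual workaround (a sketch, assuming that 64-bit buffer shapes and loop extents propagate to all derived indices; this is not BitBLAS's implementation) is to declare MatmulNT with int64 extents:

```python
import tvm
from tvm.script import tir as T

M, N, K = 8, 152064, 3584


@tvm.script.ir_module
class MatmulNTInt64:
    @T.prim_func
    def main(a: T.handle, b: T.handle, c: T.handle):
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        # int64 shapes and extents keep every derived index in 64-bit
        # arithmetic, avoiding int32 overflow for large N * K.
        A = T.match_buffer(a, [T.int64(M), T.int64(K)], dtype="float16")
        B = T.match_buffer(b, [T.int64(N), T.int64(K)], dtype="float16")
        C = T.match_buffer(c, [T.int64(M), T.int64(N)], dtype="float16")
        for i, j, k in T.grid(T.int64(M), T.int64(N), T.int64(K)):
            with T.block("B"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                with T.init():
                    C[vi, vj] = tvm.tir.const(0, "float16")
                C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vj, vk]
```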