microsoft/superbenchmark

V0.7.0 Test Plan

yukirora opened this issue · 0 comments

Test Cases

single-node test

| Machine Type | #Node * #GPU * GPU Type | PyTorch Version | Accelerated Computing Toolkit | Status |
| --- | --- | --- | --- | --- |
| ND A100 v4 | 1 * 8 * A100 40GB SXM | PyTorch 1.8 | CUDA 11.1 | Done |
| NDm A100 v4 | 1 * 8 * A100 80GB SXM | PyTorch 1.8 | CUDA 11.1 | Done |
| Hopper | 1 * 8 * H100 | PyTorch 1.x | CUDA 11.8 | Done |

single-node Micro-benchmark Test

  1. tensorrt-inference
  • Fix Transformers version to avoid TensorRT inference failure (#441)
  2. cublas-function/cudnn-function
  • Support list of custom config strings in cudnn-functions and cublas-functions (#414)
  • Support correctness check in cublas-functions (#450, #452)
  3. mem-bw
  • Add wait time option to resolve mem-bw instability (#438)
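The cublas-functions correctness check (#450, #452) amounts to comparing the device GEMM result against a higher-precision reference with a relative-error tolerance. A minimal NumPy sketch of that idea (names and tolerance are illustrative, not SuperBench's actual implementation):

```python
import numpy as np

def check_gemm(a, b, result, rtol=1e-3):
    """Compare a GEMM result against a float64 reference, elementwise."""
    ref = a.astype(np.float64) @ b.astype(np.float64)
    rel_err = np.abs(result - ref) / np.maximum(np.abs(ref), np.finfo(np.float64).tiny)
    return bool((rel_err <= rtol).all())

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 32))
b = rng.standard_normal((32, 16))
exact = a @ b
print(check_gemm(a, b, exact))         # reference matches itself
print(check_gemm(a, b, exact * 1.01))  # a 1% error exceeds rtol=1e-3
```

In the real benchmark the `result` would come from the cuBLAS kernel under test, with the tolerance chosen to match the compute precision (FP16/TF32/FP64).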

SuperBench Improvement

  • Support non-zero return code (#410, #411, #425)
  • Support log flushing to the result file during runtime (#445)
  • Update sb version to include revision hash and date (#427)
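The non-zero return code support (#410, #411, #425) boils down to propagating per-benchmark failures to the process exit code so CI can detect them. A hypothetical sketch (function and benchmark names are illustrative, not SuperBench's actual API):

```python
def run_benchmarks(benchmarks):
    """Run each benchmark, collect return codes, and report overall success/failure."""
    return_codes = {}
    for name, fn in benchmarks.items():
        try:
            return_codes[name] = fn()
        except Exception:
            return_codes[name] = 1  # treat an unhandled crash as a failure
    # Process exits non-zero if any benchmark reported a failure.
    return 0 if all(code == 0 for code in return_codes.values()) else 1

benchmarks = {'ok-bench': lambda: 0, 'broken-bench': lambda: 1}
print(run_benchmarks(benchmarks))  # 1
```

The caller would pass this value to `sys.exit()` so that schedulers and CI pipelines see the failure.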

Hopper GPU and FP8 related benchmarks

  1. docker building
  • Add CUDA 11.8 Docker image for Nvidia arch90 GPUs (#449)
  2. micro-benchmark
  • Support GEMM-FLOPS for Nvidia arch90 GPUs (#456)
  • Support cuBLASLt FP16 and FP8 GEMM (#451, #455)
  • Debug some cuBLAS and cuDNN kernel crash issues
  3. model-benchmark
  • Support FP8 in BERT model training (#446)
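The FP8 work above (#446, #451, #455) typically targets the E4M3 format (1 sign, 4 exponent, 3 mantissa bits, bias 7). As a sanity check of its dynamic range, the finite values can be enumerated directly; this sketch assumes the common E4M3 "FN" variant, where only the all-ones exponent with all-ones mantissa encodes NaN:

```python
def e4m3_finite_values():
    """Enumerate the non-negative finite values of FP8 E4M3 (bias 7)."""
    values = []
    for e in range(16):
        for m in range(8):
            if e == 15 and m == 7:
                continue  # reserved for NaN in the E4M3 (FN) encoding
            if e == 0:
                values.append((m / 8) * 2.0 ** -6)        # subnormals (and zero)
            else:
                values.append((1 + m / 8) * 2.0 ** (e - 7))
    return values

vals = e4m3_finite_values()
print(max(vals))                       # 448.0, the largest finite E4M3 value
print(min(v for v in vals if v > 0))   # 2**-9, the smallest positive subnormal
```

The narrow range (max 448) is why FP8 training needs per-tensor scaling, which frameworks handle when enabling FP8 in models like BERT.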

New in bug bash

  • [x]
  • [x]

multiple-node test

Test Table

| Machine Type | #Node * #GPU * GPU Type | PyTorch Version | Accelerated Computing Toolkit | Status |
| --- | --- | --- | --- | --- |
| NDm A100 v4 | 32 * 8 * A100 80GB SXM | PyTorch 1.8 | CUDA 11.1 | Done |

distributed Micro-benchmark test

  1. ib-traffic
  • Support pair-wise pattern in IB validation benchmark (#453)
  • Support 'pattern' in 'mpi' mode to run tasks in parallel (#447)
  2. nccl-bw
  • Support topo-aware, all-pair, and K-batch patterns in 'mpi' mode (#437, #458)
  • Support topo-aware, pair-wise, and K-batch patterns in nccl-bw benchmark (#454)
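The pair-wise pattern above pairs every node with every other node while keeping each node busy with at most one peer per round. One standard way to generate such a schedule is the circle (round-robin tournament) method; this is a sketch of the idea, not necessarily SuperBench's implementation:

```python
def pairwise_rounds(nodes):
    """Schedule all node pairs into rounds; each node appears at most once per round."""
    nodes = list(nodes)
    if len(nodes) % 2:
        nodes.append(None)  # bye slot when the node count is odd
    n = len(nodes)
    rounds = []
    for _ in range(n - 1):
        pairs = [(nodes[i], nodes[n - 1 - i]) for i in range(n // 2)
                 if None not in (nodes[i], nodes[n - 1 - i])]
        rounds.append(pairs)
        # Keep the first node fixed and rotate the rest one step (circle method).
        nodes = [nodes[0]] + [nodes[-1]] + nodes[1:-1]
    return rounds

print(pairwise_rounds(range(4)))
```

For N nodes this yields N-1 rounds covering all N*(N-1)/2 pairs, so the benchmark can run each round's pairs in parallel without a node serving two transfers at once.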

New in bug bash

  • [x]
  • [x]