Investigate tuning difference with Full and Exhaustive
pfultz2 opened this issue · 7 comments
So I ran bert with Full, and see these results:
Benchmarking gpu::mlir_op: 458 configs
Fastest solution: 4,128,4,4,64,4,1,1:
Fastest time: 0.404553
Slowest time: 0.414574
Benchmarking gpu::mlir_op: 458 configs
Fastest solution: 64,32,4,64,16,4,1,1:
Fastest time: 0.190852
Slowest time: 0.200733
Benchmarking gpu::mlir_op: 458 configs
Fastest solution: 4,256,8,4,64,8,1,1:
Fastest time: 0.18019
Slowest time: 0.188544
Benchmarking gpu::mlir_op: 458 configs
Fastest solution: 64,64,8,16,64,8,1,1:
Fastest time: 1.07512
Slowest time: 1.0977
Benchmarking gpu::mlir_op: 458 configs
Fastest solution: 64,64,8,64,64,1,1,1:
Fastest time: 0.670955
Slowest time: 0.679869
Benchmarking gpu::mlir_op: 458 configs
Fastest solution: 256,64,8,128,64,1,1,1:
Fastest time: 0.0541399
Slowest time: 0.0620074
Running it with Exhaustive shows these results:
Benchmarking gpu::mlir_op: 30240 configs
Fastest solution: 16,256,2,8,8,4,0,1:
Fastest time: 0.4017
Slowest time: 4.88479
Benchmarking gpu::mlir_op: 30240 configs
Fastest solution: 4,256,1,64,8,4,0,1:
Fastest time: 0.188217
Slowest time: 8.79304
Benchmarking gpu::mlir_op: 30240 configs
Fastest solution: 8,64,2,16,32,8,0,1:
Fastest time: 0.178298
Slowest time: 0.198162
Benchmarking gpu::mlir_op: 30240 configs
Fastest solution: 256,64,4,32,64,1,0,1:
Fastest time: 1.07209
Slowest time: 1.53007
Benchmarking gpu::mlir_op: 30240 configs
Fastest solution: 8,64,1,8,8,8,0,1:
Fastest time: 0.665448
Slowest time: 0.807237
Benchmarking gpu::mlir_op: 30240 configs
Fastest solution: 4,16,1,4,4,1,1,1:
Fastest time: 0.0520503
Slowest time: 0.0751473
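To compare the two runs side by side, the log output can be parsed and the fastest times paired up. This is only a minimal sketch: it assumes the log format shown above and that the problems appear in the same order in both runs. The names `parse_tuning_log` and `compare_runs` are hypothetical helpers, not part of MIGraphX.

```python
import re

def parse_tuning_log(text):
    """Parse 'Benchmarking / Fastest solution / Fastest time / Slowest time' blocks."""
    results = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"Benchmarking (\S+): (\d+) configs", line)
        if m:
            # Start a new record for each benchmarked problem.
            current = {"op": m.group(1), "configs": int(m.group(2))}
            results.append(current)
            continue
        if current is None:
            continue
        if line.startswith("Fastest solution:"):
            # Drop the trailing ':' that the log prints after the config.
            current["solution"] = line.split(":", 1)[1].strip().rstrip(":")
        elif line.startswith("Fastest time:"):
            current["fastest"] = float(line.split(":", 1)[1])
        elif line.startswith("Slowest time:"):
            current["slowest"] = float(line.split(":", 1)[1])
    return results

def compare_runs(full, exhaustive):
    """Pair results by position (assumed same problem order) and compute the gain."""
    rows = []
    for f, e in zip(full, exhaustive):
        gain_pct = 100.0 * (f["fastest"] - e["fastest"]) / f["fastest"]
        rows.append({"full": f["fastest"],
                     "exhaustive": e["fastest"],
                     "gain_pct": gain_pct})
    return rows

if __name__ == "__main__":
    full_log = """Benchmarking gpu::mlir_op: 458 configs
Fastest solution: 4,128,4,4,64,4,1,1:
Fastest time: 0.404553
Slowest time: 0.414574"""
    exhaustive_log = """Benchmarking gpu::mlir_op: 30240 configs
Fastest solution: 16,256,2,8,8,4,0,1:
Fastest time: 0.4017
Slowest time: 4.88479"""
    for row in compare_runs(parse_tuning_log(full_log),
                            parse_tuning_log(exhaustive_log)):
        print(row)
```

Pairing by position is the weak point of this sketch; if the logs also printed the problem description, matching on that would be more robust.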
I added a branch that also prints the problem and solution config for the slowest run. I am running it now, but it might not finish until tomorrow, when I am out. You can run it off of my branch with the following command (assuming the bert onnx file is in the /onnx directory):
MIGRAPHX_MLIR_TUNE_EXHAUSTIVE=1 MIGRAPHX_ENABLE_MLIR=1 ./bin/driver perf /onnx/bert_base_cased_1.onnx --exhaustive-tune --fp16 --fill1 input_ids --input-dim @input_ids 32 384
@pfultz2 Do you expect Full and Exhaustive to produce the same fastest solution?
Since you don't do any pruning of slow configs, yes, it should produce the same result.
However, two runs might produce different results if there are two configs that have similar performance. Either way, the time should be close.
The invariant we want to see here is that, for all tuning configs T and any problem description P, if T is in the exhaustive tuning set and T applied to P compiles, then T is in the full tuning set of P.
I can confirm that this is true for all 6 problem descriptions in the bert_base_cased_1 test case.
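The invariant as stated can be expressed as a simple set check. This is a sketch with hypothetical inputs: `exhaustive_set` and `full_set` are collections of config strings for one problem, and `compiles` is a caller-supplied predicate that says whether a config compiles for that problem. None of these names are actual rocMLIR or MIGraphX APIs.

```python
def invariant_violations(exhaustive_set, full_set, compiles):
    """Return every config T that is in the exhaustive set and compiles
    for the problem, but is missing from the full set."""
    full = set(full_set)
    return [t for t in exhaustive_set if compiles(t) and t not in full]

if __name__ == "__main__":
    # Toy data only; real config sets would come from the tuning driver.
    exhaustive = ["4,16,1,4,4,1,1,1", "8,64,1,8,8,8,0,1", "1,1,1,1,1,1,0,1"]
    full = ["4,16,1,4,4,1,1,1", "8,64,1,8,8,8,0,1"]
    does_compile = lambda t: t != "1,1,1,1,1,1,0,1"  # pretend the last one fails
    print(invariant_violations(exhaustive, full, does_compile))
```

An empty list means the invariant holds for that problem.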
Branch: mlir-test-perf
Need to add tuning keys
@pfultz2 The following are the tuning keys I got on lockhart5.
gfx90a 110 -t f16 -out_datatype f16 -transA false -transB false -g 1 -m 12288 -n 3072 -k 768
gfx90a 110 -t f16 -out_datatype f16 -transA false -transB false -g 1 -m 12288 -n 2304 -k 768
gfx90a 110 -t f16 -out_datatype f16 -transA false -transB false -g 1 -m 12288 -n 768 -k 768
gfx90a 110 -t f16 -out_datatype f16 -transA false -transB false -g 1 -m 32 -n 768 -k 768
gfx90a 110 -t f16 -out_datatype f16 -transA false -transB false -g 32 -m 384 -n 768 -k 3072
gfx90a 110 -t f16 -out_datatype f16 -transA false -transB true -g 384 -m 384 -n 384 -k 64
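For scripting against these keys, each line can be split into its fields. A minimal sketch, assuming the layout seen above: two leading positional fields (the architecture, then a number whose meaning is not stated in this thread) followed by `-flag value` pairs. `parse_tuning_key` is a hypothetical helper.

```python
def parse_tuning_key(key):
    """Split a tuning key into its positional fields and a dict of flags."""
    tokens = key.split()
    arch, second_field = tokens[0], tokens[1]
    rest = tokens[2:]
    flags = {}
    # Flags come in '-name value' pairs.
    for name, value in zip(rest[0::2], rest[1::2]):
        if value.isdigit():
            parsed = int(value)
        elif value in ("true", "false"):
            parsed = (value == "true")
        else:
            parsed = value
        flags[name.lstrip("-")] = parsed
    return {"arch": arch, "second_field": int(second_field), "flags": flags}

if __name__ == "__main__":
    key = ("gfx90a 110 -t f16 -out_datatype f16 -transA false "
           "-transB true -g 384 -m 384 -n 384 -k 64")
    print(parse_tuning_key(key))
```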