Investigate tuning difference with Full and Exhaustive

Question

pfultz2 opened this issue 2 years ago · 7 comments

I ran BERT with Full tuning and see these results:

Benchmarking gpu::mlir_op: 458 configs
Fastest solution: 4,128,4,4,64,4,1,1: 
Fastest time: 0.404553
Slowest time: 0.414574
Benchmarking gpu::mlir_op: 458 configs
Fastest solution: 64,32,4,64,16,4,1,1: 
Fastest time: 0.190852
Slowest time: 0.200733
Benchmarking gpu::mlir_op: 458 configs
Fastest solution: 4,256,8,4,64,8,1,1: 
Fastest time: 0.18019
Slowest time: 0.188544
Benchmarking gpu::mlir_op: 458 configs
Fastest solution: 64,64,8,16,64,8,1,1: 
Fastest time: 1.07512
Slowest time: 1.0977
Benchmarking gpu::mlir_op: 458 configs
Fastest solution: 64,64,8,64,64,1,1,1: 
Fastest time: 0.670955
Slowest time: 0.679869
Benchmarking gpu::mlir_op: 458 configs
Fastest solution: 256,64,8,128,64,1,1,1: 
Fastest time: 0.0541399
Slowest time: 0.0620074

Running it with Exhaustive tuning shows these configs:

Benchmarking gpu::mlir_op: 30240 configs
Fastest solution: 16,256,2,8,8,4,0,1: 
Fastest time: 0.4017
Slowest time: 4.88479
Benchmarking gpu::mlir_op: 30240 configs
Fastest solution: 4,256,1,64,8,4,0,1: 
Fastest time: 0.188217
Slowest time: 8.79304
Benchmarking gpu::mlir_op: 30240 configs
Fastest solution: 8,64,2,16,32,8,0,1: 
Fastest time: 0.178298
Slowest time: 0.198162
Benchmarking gpu::mlir_op: 30240 configs
Fastest solution: 256,64,4,32,64,1,0,1: 
Fastest time: 1.07209
Slowest time: 1.53007
Benchmarking gpu::mlir_op: 30240 configs
Fastest solution: 8,64,1,8,8,8,0,1: 
Fastest time: 0.665448
Slowest time: 0.807237
Benchmarking gpu::mlir_op: 30240 configs
Fastest solution: 4,16,1,4,4,1,1,1: 
Fastest time: 0.0520503
Slowest time: 0.0751473
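For longer runs it can help to reduce this output to one record per benchmarked op. The following is a minimal Python sketch, assuming the log keeps exactly the four-line shape shown above; `parse_tuning_log` is a hypothetical helper for illustration, not part of MIGraphX or rocMLIR.

```python
import re

def parse_tuning_log(text):
    """Group driver output into one record per 'Benchmarking' block."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"Benchmarking (\S+): (\d+) configs", line)
        if m:
            records.append({"op": m.group(1), "configs": int(m.group(2))})
            continue
        if not records:
            continue  # ignore anything printed before the first block
        if line.startswith("Fastest solution:"):
            # Drop the label and the trailing colon printed after the config.
            body = line[len("Fastest solution:"):].strip()
            records[-1]["solution"] = body.rstrip(":").strip()
        elif line.startswith("Fastest time:"):
            records[-1]["fastest"] = float(line.split(":", 1)[1])
        elif line.startswith("Slowest time:"):
            records[-1]["slowest"] = float(line.split(":", 1)[1])
    return records

# First two Exhaustive blocks, copied from the output above.
sample = """\
Benchmarking gpu::mlir_op: 30240 configs
Fastest solution: 16,256,2,8,8,4,0,1:
Fastest time: 0.4017
Slowest time: 4.88479
Benchmarking gpu::mlir_op: 30240 configs
Fastest solution: 4,256,1,64,8,4,0,1:
Fastest time: 0.188217
Slowest time: 8.79304
"""

records = parse_tuning_log(sample)
for r in records:
    # The slowest/fastest ratio shows how wide the searched space is.
    print(r["solution"], r["configs"], round(r["slowest"] / r["fastest"], 1))
```

The slowest-to-fastest ratio is a quick way to see how much wider the Exhaustive search space is than the Full one.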

I added a branch that also prints the problem and solution config for the slowest run. I am running it now, but it might not finish until tomorrow, when I am out. You can run it off of my branch with the following command (assuming the bert onnx file is in the /onnx directory):

MIGRAPHX_MLIR_TUNE_EXHAUSTIVE=1 MIGRAPHX_ENABLE_MLIR=1 ./bin/driver perf /onnx/bert_base_cased_1.onnx --exhaustive-tune --fp16 --fill1 input_ids --input-dim @input_ids 32 384

Answer 1 · 2023-08-30T16:53:48.000Z

@pfultz2 Do you expect Full and Exhaustive to produce the same fastest solution?

Answer 2 · 2023-08-30T17:37:44.000Z

> @pfultz2 Do you expect Full and Exhaustive to produce the same fastest solution?

Since you don't prune slow configs, yes, it should produce the same result.

However, two runs might produce different results if there are two configs that have similar performance. Either way, the time should be close.
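The "times should be close" expectation can be checked against the numbers posted in the issue. This is a rough Python sketch, assuming the i-th block in the Full log and the i-th block in the Exhaustive log describe the same problem (the logs do not print the problem, so this pairing is an assumption); `relative_gaps` is a hypothetical helper name.

```python
# Fastest times copied from the Full and Exhaustive logs in the issue,
# in the order they were printed. Pairing by position is an assumption.
full = [0.404553, 0.190852, 0.18019, 1.07512, 0.670955, 0.0541399]
exhaustive = [0.4017, 0.188217, 0.178298, 1.07209, 0.665448, 0.0520503]

def relative_gaps(full_times, exhaustive_times):
    """Fraction by which Exhaustive's best beats Full's best, per problem."""
    return [(f - e) / f for f, e in zip(full_times, exhaustive_times)]

gaps = relative_gaps(full, exhaustive)
for i, g in enumerate(gaps, start=1):
    print(f"problem {i}: {g:.2%}")
```

Under that pairing, Exhaustive is faster in all six cases and every gap is under 4%, so the open question is less about the timings and more about why the winning configs differ.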

Answer 3 · 2023-08-30T19:18:40.000Z

The invariant we want to see here is that, for all tuning configs T and any problem description P, if T is in the exhaustive tuning set and T applied to P compiles, then T is in the full tuning set of P.
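Read as a set relation, the invariant says that the compilable part of the exhaustive set must be contained in the full set. A minimal Python sketch of such a check follows; `invariant_violations` and the `compiles` predicate are hypothetical stand-ins, and the sets hold toy labels rather than real perf configs.

```python
def invariant_violations(exhaustive_set, full_set, compiles):
    """Configs T with: T in exhaustive, compiles(T), but T not in full."""
    return sorted(t for t in exhaustive_set if compiles(t) and t not in full_set)

# Toy data for one problem P (not real rocMLIR configs).
exhaustive = {"a", "b", "c", "d"}
full = {"a", "b"}
applicable = {"a", "b", "c"}  # configs assumed to compile for P

violations = invariant_violations(exhaustive, full, lambda t: t in applicable)
print(violations)  # ['c']
```

An empty result for every problem means the invariant holds; any returned config is one that Full tuning should have considered but did not.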

Answer 4 · 2023-09-12T15:50:12.000Z

> The invariant we want to see here is that, for all tuning configs T and any problem description P, if T is in the exhaustive tuning set and T applied to P compiles, then T is in the full tuning set of P.

I can confirm that this is true for all 6 problem descriptions in the bert_base_cased_1 test case.

Answer 5 · 2023-09-13T15:52:35.000Z

Branch: mlir-test-perf

Need to add tuning keys

Answer 6 · 2023-09-13T17:16:56.000Z

@pfultz2 The following are the tuning keys I got on lockhart5.

gfx90a 110 -t f16 -out_datatype f16 -transA false -transB false -g 1 -m 12288 -n 3072 -k 768
gfx90a 110 -t f16 -out_datatype f16 -transA false -transB false -g 1 -m 12288 -n 2304 -k 768
gfx90a 110 -t f16 -out_datatype f16 -transA false -transB false -g 1 -m 12288 -n 768 -k 768
gfx90a 110 -t f16 -out_datatype f16 -transA false -transB false -g 1 -m 32 -n 768 -k 768
gfx90a 110 -t f16 -out_datatype f16 -transA false -transB false -g 32 -m 384 -n 768 -k 3072
gfx90a 110 -t f16 -out_datatype f16 -transA false -transB true -g 384 -m 384 -n 384 -k 64
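To compare these keys programmatically, each line can be split into its fields. The sketch below is in Python and assumes the layout seen above: an architecture name, a number (taken here to be the compute-unit count, which is an assumption), then `-flag value` pairs. `parse_tuning_key` is a hypothetical helper, not a rocMLIR API.

```python
def parse_tuning_key(key):
    """Split one tuning key line into a dict of typed fields."""
    tokens = key.split()
    parsed = {"arch": tokens[0], "num_cu": int(tokens[1])}  # num_cu: assumed meaning
    rest = tokens[2:]
    for name, value in zip(rest[0::2], rest[1::2]):
        name = name.lstrip("-")
        if value in ("true", "false"):
            parsed[name] = value == "true"
        elif value.isdigit():
            parsed[name] = int(value)
        else:
            parsed[name] = value  # e.g. data types such as f16
    return parsed

keys = [
    "gfx90a 110 -t f16 -out_datatype f16 -transA false -transB false -g 1 -m 12288 -n 3072 -k 768",
    "gfx90a 110 -t f16 -out_datatype f16 -transA false -transB true -g 384 -m 384 -n 384 -k 64",
]

for k in keys:
    p = parse_tuning_key(k)
    print(p["g"], p["m"], p["n"], p["k"], p["transB"])
```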

Answer 7 · 2023-09-19T14:27:19.000Z

@pfultz2 We have merged the PR; please try with the latest rocMLIR commit hash. Reopen if you observe it again.