CNugteren/CLBlast

New tuning results

CNugteren opened this issue · 135 comments

(See the README for details)

This is the place to post new tuning results. If you compiled with -DTUNERS=ON, ran one of the tuners on your device (or all perhaps?), and feel that these results should be included in the next release of CLBlast, please post them here.

You can do this by attaching the JSON files to this issue (archived in a .ZIP file).
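The packaging step can be sketched in a few lines of Python. This is only an illustration, assuming the tuners wrote their `.json` output into a single directory; the function name `archive_tuning_results` and both paths are hypothetical and not part of CLBlast.

```python
import zipfile
from pathlib import Path


def archive_tuning_results(results_dir, archive_path):
    """Pack every *.json file found directly in results_dir into a ZIP archive.

    Hypothetical helper: returns the sorted list of file names that were stored.
    """
    json_files = sorted(Path(results_dir).glob("*.json"))
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for json_file in json_files:
            # Store only the file name so the archive has a flat layout.
            archive.write(json_file, arcname=json_file.name)
    return [json_file.name for json_file in json_files]
```

Calling `archive_tuning_results("build", "my_gpu.zip")` would then produce a single file that GitHub accepts as an attachment.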

Here are some tuning results from an NVIDIA Titan Black, AMD Radeon HD 7970 and an ARM Mali T-628.

Just to let you know about JSON files, GitHub says "Unfortunately, we don’t support that file type. Choose Files Try again with a PNG, GIF, JPG, DOCX, PPTX, XLSX, TXT, PDF, or ZIP."
Archive.zip

Thanks for the tuning results! However, they seem to have been run with non-default settings (using specific values for alpha and beta). Could you perhaps run them again with the default settings?

By the way, the latest version already includes results for Tahiti (the HD 7970) and the ARM Mali T-628, so perhaps those are superfluous.

(I've updated the post regarding JSON-files and GitHub)

Here are the results for AMD's Pitcairn (R9 270X). I'll also upload the results for Hawaii (R9 290X), but I am getting an error during Xgemm. I'll open another issue for that.
pitcairn.zip

Thanks! The results for Pitcairn are added to the development branch.

Hawaii (AMD R9 290X):
hawaii.zip

And i7 4790k:
i7-4790k.zip

The results for Hawaii will be added. As for the i7 results: the zip archive seems to include only a Makefile?

Sorry, I messed up that zip. As I do not have those files any more, I'll send them when I manage to do that tuning.

@fonghou Thanks! The tuning results are added to the database. They are currently in the development branch but will be automatically included in the next release.

Here are the results for the Intel i5-4210U iGPU:
Device name: 'Intel(R) HD Graphics Haswell Ultrabook GT2 Mobile' (OpenCL 1.2 beignet 1.2 (git-1b076ec))
i5-4210U_GPU.zip

@OursDesCavernes Added, thanks!

gcp commented

GTX 670, GTX 750 (non-Ti), and GTX 1070 tunings attached. One of the GEMV tunings took ages (or hung) on the latter two, but curiously enough not on the (older) first card. Luckily, it looks like GEMV is the last one to be tuned so these are fairly complete anyway.

gtx670.tar.gz
gtx1070.tar.gz
gtx750.tar.gz

@gcp Thanks for running all the tuners on those devices! The results are added to CLBlast, currently in the development branch but they will be automatically included in the next release. Indeed, I saw long compilation times for GEMV kernels on NVIDIA as well - it is the last one to be tuned for exactly this reason. NVIDIA promises to reduce compilation times significantly with CUDA 8.0, so hopefully that also fixes these kernels.

gcp commented

Intel HD530 (desktop Skylake iGPU)
IntelHD530.zip

@gcp Thanks, they are added.

Issue #83 caused a complete re-write of the third GEMV kernel (XgemvFastRot), so I had to throw away the corresponding tuning results. If it's not too much effort, I welcome updated clblast_xgemv_fast_rot_*.json tuning results based on the development branch. The other GEMV tuning results are still valid and included in CLBlast. Thanks!

Intel(R) HD Graphics 5500 BroadWell U-Processor GT2:
hd5500.zip
Intel(R) HD Graphics Haswell Ultrabook GT2 Mobile:
hd4400.zip

@OursDesCavernes Thanks, HD5500 is added and HD4400 is updated.

Intel(R) HD Graphics 4000
intel-hd4000.zip

@yingted Thanks! The tuning results for the IvyBridge GPU are added.

Radeon R9 380 (Tonga) tuning results:
Tobago_TuningResults.zip

To be clear, the device is called Tonga; the name of the zip file is just a spelling mistake.

@MigMuc The results for Tonga are added, thanks!

matze commented

Here are the results for the GTX Titan Black. Unfortunately, I had the same problem as @gcp on the last run. But again, should be fairly complete.

gtx-titan-black.tar.gz

@matze Thanks a lot for your contribution. The tuning results are added.

Hi,
since I'm having problems with attaching files, here are the links for:

Amd Radeon HD6770m (Turks) https://www.dropbox.com/s/wabso93trny8fae/amd%20hd6770m%20%28turks%29.zip?dl=0

Intel Core i7-2670qm
https://www.dropbox.com/s/3as860nlbshmdvo/i7-2670qm.zip?dl=0

from my laptop.
In a few days I will be able to test an MSI Nvidia GTX 970.

Thanks for the tuning data! The results are added to CLBlast, currently in the development branch but they will be automatically included in the next release.

Tuning results for Nvidia GTX 1080
nvidia_gtx_1080.zip

Results for i7-4790k:
i7-4790k.zip

Thanks a lot! Both the GTX 1080 and i7 results are added.

Tuning results for AMD RX480 (with amdgpu driver and amdgpu-pro opencl stack)
amd-rx480.zip

@OursDesCavernes Added, thanks! Nice to see FP16 support from AMD's side as well.

My two cents: tuning results of an AMD Radeon HD 6750M (unfortunately no support for 16 or 64 bits)

AMD Radeon HD 6750M.zip

The HD 6750M results are added, thanks!

csbnw commented

See the attachment for some tuning results on AMD Radeon FuryX (using driver 1800.8).
amd-radeon-furyx.zip

@bramveenboer The Fiji results are added, thanks!

Hi!
Here are the results for an intel i7-920 on linux using Intel's OpenCL Driver dev-util/intel-ocl-sdk-4.4.0.117-r1

Thanks for your work!

i7-920.zip

Thanks, the Core i7-920 tuning data is added to CLBlast!

I have got results for the Radeon R9 390. There were already Hawaii results there, but the direct gemm kernel was missing. Overall, this improved things for me.

Hawaii.zip

(Note: is it possible to get a tuning for GemmBatched? I think it makes a difference whether a single matrix or a set of matrices is computed. Also, I have the feeling that gemm is tuned too aggressively towards m=n=k=1024: performance drops by 30-40% at m=n=k=2048 and stays there for larger matrices.)

@Ulfgard I tuned those Hawaii results on an R9 290X. In my case, it would be impossible for performance to drop 30%-40% for larger matrices, since I get (if memory serves me well) around 3.7 TFLOPS on 8192x8192, with the theoretical limit being 5.4 TFLOPS. If such a drop happened, it would mean Cedric programmed gemm to run at the maximum theoretical performance, disregarding memory access, which seems impossible to me.

OTOH, maybe the drop in performance is simply because these cards are identified as the same (hawaii), but have some internal hardware difference that influences the optimal settings?

Hi,
It is impossible to mix up the tunings, because you have to remove the old tuning to be able to add the new one in the database script. Otherwise it will fail. While I agree that the kernel itself gives okay performance according to the tuner, for some reason, the whole gemm call seems to die after exceeding some matrix size. I did some benchmarking of the whole procedure to see real world performance.

The numbers reported are the wall-clock times between enqueuing several trials of the gemm routine and clFinish (disregarding the first trial for possible kernel setup, of course). Thus they are a lower bound on performance. The numbers are roughly in line with the timings reported by clGetEventProfilingInfo of the supplied event to gemm, but this does not necessarily make sense because I do not know which kernel this actually measures.

(columns are row/column major for A/B, and column C indicates whether C is row/column major. m=n=k=size. Numbers are GFlops)

size  C  A/B: r/r       c/r       r/c       c/c
256   r       781.738   794.147   781.738   781.738
256   c       806.956   820.184   820.184   806.956
512   r       1005      1005      1005      440.789
512   c       1168.6    1116.67   1142.05   1142.05
1024  r       2080      2080      2080      712.329
1024  c       2363.64   2363.64   2363.64   2363.64   // this number fits quite closely with what the tuner reports
2048  r       1523.81   1523.81   1523.81   1523.81   // something here dies.
2048  c       1523.81   1523.81   1488.37   1523.81
4096  r       1523.81   1542.17   1542.17   1580.25
4096  c       1542.17   1542.17   1560.98   1560.98

Beforehand, i.e. with your tuning, the larger matrices were another 50% worse. So even if the gemm kernel is okay, maybe one of the other kernels is at fault here.

For completeness: the same results with the timings returned by clGetEventProfilingInfo for the event passed to the gemm routine (modulo possible errors because I quickly hacked this together):

size  C  A/B: r/r       c/r       r/c       c/c
256   r       608.416   606.492   607.536   607.19
256   c       622.785   621.137   620.953   619.315
512   r       24963.3   25338     24816.4   25607.1   // an indication that this measures the wrong thing?
512   c       813.057   811.845   813.214   812.822
1024  r       62673.7   63890.4   63529.4   63143.3   // an indication that this measures the wrong thing?
1024  c       2559.24   2560.99   2557.5    2557.88
2048  r       1487.71   1492.89   1526.47   1530.32
2048  c       1488.77   1490.06   1524.83   1537.17
4096  r       1518.29   1522.16   1557.49   1567.47
4096  c       1524.63   1528.65   1563.42   1567.73
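The suspicion about the 512 and 1024 row-major readings can be checked mechanically: any reading above the device's theoretical peak cannot be a real GEMM throughput. Below is a minimal sketch, assuming a ceiling of 5400 GFLOPS borrowed from the 5.4 TFLOPS figure quoted in this thread for the R9 290X (the R9 390's actual peak may differ). The function name is hypothetical and the readings are the r/r column from the table above.

```python
def implausible_readings(gflops_by_case, peak_gflops):
    """Return the cases whose measured GFLOPS exceed the theoretical peak."""
    return {case: value for case, value in gflops_by_case.items()
            if value > peak_gflops}


# (size, layout of C) -> GFLOPS, taken from the r/r column of the profiling table.
readings = {
    (512, "r"): 24963.3,
    (512, "c"): 813.057,
    (1024, "r"): 62673.7,
    (1024, "c"): 2559.24,
}

# ASSUMPTION: 5400 GFLOPS is used as the ceiling; it is the R9 290X figure
# quoted in the thread, not a measured or official value for the R9 390.
suspicious = implausible_readings(readings, peak_gflops=5400.0)
```

With these inputs only the two row-major cases are flagged, which supports the guess that the event passed to gemm times something other than the full computation in those cases.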

I was also talking about wall clock time in my (Clojure on the JVM) program, not ClTune results. 8192x8192 sgemm runs in 293 milliseconds on R9 290X (5.4 TFLOPS max).

GTX 1080 (8.2 TFLOPS) runs in 220 ms, which makes the numbers pretty consistent in my case.
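To relate these wall-clock times to the TFLOPS figures, here is a small worked sketch. It assumes the usual convention of counting 2·m·n·k floating-point operations per GEMM call; the function name `gemm_gflops` is hypothetical.

```python
def gemm_gflops(m, n, k, seconds):
    """GFLOPS of one GEMM call, counting 2*m*n*k floating-point operations."""
    return 2.0 * m * n * k / seconds / 1e9


# Wall-clock times for 8192x8192 sgemm as quoted in the thread.
r9_290x = gemm_gflops(8192, 8192, 8192, 0.293)
gtx_1080 = gemm_gflops(8192, 8192, 8192, 0.220)

# Fraction of the quoted theoretical peaks (5.4 and 8.2 TFLOPS).
r9_290x_efficiency = r9_290x / 5400.0
gtx_1080_efficiency = gtx_1080 / 8200.0
```

This gives roughly 3750 GFLOPS for the R9 290X and roughly 5000 GFLOPS for the GTX 1080, i.e. about 69% and 61% of the quoted peaks.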

@blueberry @Ulfgard I've opened issue #169 to have a more detailed discussion on the future of the tuner in CLBlast.

I'll add your tuning data soon to the database, thanks.

Here are my results on Ubuntu 16.04 with the Intel CPU driver:
i7-6770hq.zip
Tuned for 1.0.1 release.
Impressive tool! Let me know if I included the wrong files.

Thanks @theoden8. It took a bit longer than normal since I was in the middle of some database changes, but the results are now added!

Here are the tuning results for a i5-4570 and a GTX580
GTX580.zip
i5-4570.zip

Thanks @fzimmermann89, they are both added.

Some more results. Note that beignet (which I used) is 10-20% slower than Intel NEO.

Intel(R) HD Graphics 6000 BroadWell U-Processor GT3.zip

Thank you for your great work! Here are some tuning results for NVidia GeForce GTX 1070 Ti.

GeForce_GTX_1070_Ti.zip

Here are some tuning results using POCL (1.2-pre/master) on an Intel i5-4590S. The other tuners segfaulted (#293).
i5_4590S_POCL.zip

A little late, but I've added the HD Graphics 6000, GTX 1070 Ti, and i5-4590S results. Thanks all!

Here are some tuning results from Intel Xeon E5-2630 v3 and v4, as well as Nvidia Tesla P100 PCI-E 16 GB.
CLBlast_tuners.zip

Tuning results from Hikey 970 with a Mali-G72 GPU
Do not use these results: when I run with them, calling Gemm with a size greater than 8 causes an error in the library.
Mali-G72.zip

I tuned CLBlast on an FT-2000plus CPU (2.3 GHz, 64 cores), which is an ARMv8-based many-core CPU.
tuned-FT-2000Plus-CPU.tar.gz

Sorry I had overlooked this issue for a while. I've just added tuning results for:

  • Intel Xeon E5-2630 v3
  • Intel Xeon E5-2630 v4
  • NVIDIA Tesla P100

I've not added the results for the ARMv8 machine, since it shows the CPU as device '0x662' from vendor '0x70' in PoCL, perhaps that is not so meaningful. If anyone else is interested they can always take the results from here.

Thanks all for sharing!

csbnw commented

I ran tuning using CLBlast 1.5.0 on a NVIDIA Titan RTX (using driver 415.125): titanrtx-415.125.tar.gz

Results for AMD Radeon RX Vega
Radeon RX Vega.zip

Thanks for sharing the tuning results! I've just added both the RX Vega and also the Titan RTX (sorry I forgot about it) to CLBlast.

AMD RX 6800 XT (Navi21): amd_rx_6800_xt.tar.gz

My latest results, for an RX6500XT (Windows 11, 22.3 driver; performance on Linux may be a bit better) and a Qualcomm Adreno 540 on an SD835 phone.

I got several compilation error messages on Adreno and Android. The return value -6 means out of host memory; I'll look into the memory management to find some clue.
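For reference, a few of the standard OpenCL status codes can be looked up as follows. The numeric values are those defined in the OpenCL headers; the table is deliberately partial and the helper name `describe_opencl_error` is hypothetical.

```python
# Partial table of OpenCL status codes as defined in CL/cl.h.
OPENCL_ERRORS = {
    0: "CL_SUCCESS",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -11: "CL_BUILD_PROGRAM_FAILURE",
}


def describe_opencl_error(code):
    """Translate a numeric OpenCL status code into its symbolic name."""
    return OPENCL_ERRORS.get(code, f"unknown OpenCL error ({code})")
```

Here `describe_opencl_error(-6)` yields `CL_OUT_OF_HOST_MEMORY`, matching the interpretation above.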

RX6500Adreno540.tar.gz

Intel(R) FPGA Emulation Device.
Intel_FPGA_Emulation_Device.zip

Some MacBook-Pros are equipped with an AMD Radeon Pro 450 Compute Engine
AMD_Radeon_Pro_450_Compute_Engine.zip

gspr commented

Attached are tuning results from two devices I don't think have been submitted yet (please correct me if mistaken):

  • NVIDIA GeForce RTX 2080 Ti
  • NVIDIA GeForce RTX 3090

tuning-results.tar.gz

AMD RX 5700XT tuning results:
5700XT_tuning.tar.gz

Intel(R) UHD Graphics 770 tuning results:
Intel(R) UHD Graphics 770.zip

AMD Radeon RX 6600 XT tuning results:
AMD Radeon RX 6600 XT.zip

AMD Radeon RX 6700 XT tuning results:

AMD.Radeon.RX6700.XT.tar.gz

Intel UHD 620 tuning results (the CPU is a i7-8565U) on linux using the intel opencl package.

reesults_intel_uhd620.tar.gz

AMD Radeon 680M on linux with rocm opencl driver. (The CPU is a Ryzen 7 Pro 6850U)
results_radeon_680M.tar.gz

@CNugteren
The ROCm thread has a reply from an AMD employee. Could you please go and answer?

> The ROCm thread has a reply from an AMD employee. Could you please go and answer?

I think you are referring to ROCm/ROCm#2161, right? I think the AMD person is just pointing you to the existence of ROCm BLAS as an alternative to CLBlast. Since ROCm didn't exist at CLBlast creation time, I do not have a clear view of its strengths/weaknesses. So I think it is up to you (or other people) to react to that thread, not me.

In any case, let's keep this thread for tuning results.

I'll add the recently contributed results soon, thanks everyone 👍

AMD Radeon RX580 2048SP.zip
I tried my best to use the latest AMD driver, but the driver fails in 4/4 cases when running clblast_tuner_xgemm.exe -precision 6464 on Windows. The rest of them are fine.
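The `-precision` values that appear in this thread are numeric codes. As far as I understand CLBlast's documentation, they map to data types as in the sketch below; the helper `tuner_command` is hypothetical and only builds the argument list without running anything.

```python
# Meaning of the -precision codes, as I understand CLBlast's documentation.
PRECISIONS = {
    16: "half",
    32: "single",
    64: "double",
    3232: "complex single",
    6464: "complex double",
}


def tuner_command(tuner, precision):
    """Build the argument list for one tuner run (hypothetical helper)."""
    if precision not in PRECISIONS:
        raise ValueError(f"unsupported precision code: {precision}")
    return [tuner, "-precision", str(precision)]


# The failing case reported above: complex double-precision GEMM tuning.
failing = tuner_command("clblast_tuner_xgemm.exe", 6464)
```

The resulting list could be passed to `subprocess.run` to launch a single tuner, which makes it easy to rerun or skip only the precision that crashes the driver.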

AMD Ryzen 5700G APU.zip
clblast_tuner_xgemm.exe -precision 3232 could not produce all results due to a driver freeze.

By the way, I have reported the problem I encountered to the AMD community. Dipak there has always been very helpful. https://community.amd.com/t5/opencl/driver-freezing-and-produce-wrong-results-while-using-clblast/m-p/609755#M40354

AMD RX5700.zip
RX5700 (not RX5700xt) has no driver issue.

RTX4090.zip
Interestingly, it does not support FP16. I paid someone the price of a cup of coffee for access to the 4090 machine.

AMD 6900xt.zip
Another cup of coffee.

To any Windows user: please join this effort to make CLBlast better. If you have a single GPU, all you need to do is download the following bin file, which is based on CLBlast version 1.6, and double-click the "all.bat" file. It will run all the cases. Depending on your PC, it may take an hour or two. During this process, please do not play games or do anything else that relies heavily on the GPU. After all cases have finished, please compress all the .json files in the folder into a zip file, rename it to your GPU name, and upload it here. I firmly believe that CLBlast will benefit humanity in terms of scientific research, medical care and even creating new jobs.
bin (2).zip

Thanks for all your efforts! I've added the results from @CaptainSifff and @tangjinchuan in #483.

Note that CLBlast doesn't necessarily require tuning on each device: it computes sensible defaults based on other tuning data for similar devices. E.g. if tuned for an AMD Radeon 5700 and a 5900 XT it will probably get 99% of the performance on a 5800 XT as well.
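One plausible way to derive such a default is sketched below. This is not CLBlast's actual code: it is an illustration that assumes each device reports a time per parameter configuration and picks the configuration with the best average performance relative to each device's own optimum. All device names, configuration names and timings are made up.

```python
def pick_default(results_by_device):
    """Pick the configuration with the best mean relative performance.

    results_by_device maps device -> {configuration: time in ms}. Only
    configurations measured on every device are considered.
    """
    if not results_by_device:
        raise ValueError("no tuning data given")
    per_device = list(results_by_device.values())
    common = set.intersection(*(set(results) for results in per_device))
    if not common:
        raise ValueError("no configuration was measured on every device")

    def score(config):
        # 1.0 means "as fast as that device's own best configuration".
        return sum(min(r.values()) / r[config] for r in per_device) / len(per_device)

    return max(sorted(common), key=score)


# Made-up timings for two hypothetical, similar devices.
example = {
    "device_a": {"WGS=64": 1.0, "WGS=128": 1.2, "WGS=256": 2.0},
    "device_b": {"WGS=64": 1.5, "WGS=128": 1.1, "WGS=256": 1.0},
}
default_config = pick_default(example)
```

In this toy data neither device's own optimum wins; the compromise `WGS=128` does, because it is close to the best on both devices. That is the kind of behaviour that would let an untuned, similar device get most of the performance.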

> I've added the results from @CaptainSifff and @tangjinchuan in #483.

Did you leave out these Radeon 6700 XT results on purpose? #1 (comment)

> Note that CLBlast doesn't necessarily require tuning on each device: it computes sensible defaults based on other tuning data for similar devices. E.g. if tuned for an AMD Radeon 5700 and a 5900 XT it will probably get 99% of the performance on a 5800 XT as well.

I know there is already a 6600XT and 6800XT, but my 6700XT got 15-20% higher gemm performance after tuning.

My apologies, I missed yours. So many new tuning results submitted these last weeks. I'll add it soon 👍

radeon vii.zip
Radeon VII, although I am not sure if these .jsons are all you need...

Sorry for the spam, but I couldn't help myself.

What an absolutely incredible effort and help you have been providing, @tangjinchuan ! I applaud your unprecedented output and I thank you for your amazing, positive, and altruistic contributions!

Oh, and while at it, thank you @CNugteren for making this program in the first place. 😄

Glory to both of you! 🥇 🎆

Dear @mikkovedru ,
Thank you very much for your kind words. My students and I are very happy to contribute to the open-source community and to make this world a better place. I would like to thank @CNugteren, you, and many others for making this happen.
By the way, for anyone interested in big models, there is a project called llama.cpp (came out 19 May last month) which used CLBlast to speed up the prompts and reached performance comparable to cuBLAS in some test cases. It is a C++ based project, and now, thanks to CLBlast, we can also have very good token performance on non-CUDA GPUs.
NVIDIA_GeForce_RTX_2080_with_Max-Q_Design_李傲_1910020001.zip
NVIDIA_GeForce_MX150_็ฎ€ๅ‘้กบ_1917000242.zip

Imagination Technologies GPUs - PowerVR B-Series BXE-4-32
These results are not 100% complete, as clblast_tuner_xgemm -precision 32, clblast_tuner_xgemm -precision 3232 and
clblast_tuner_routine_xtrsv -precision 16 only partially ran before core dumping.
tuned.zip

AMD Firepro W8100

I hope this will eventually make Llama.cpp faster 🚀 Note that 6464 xgemm froze midway and did not complete fully.

fireprow8100.zip

AMD RX Vega 10 iGPU
I know there is already a tuning available for an RX Vega, but this is the integrated version, and while the tuning numbers aren't wildly different, every little bit helps. I did skip a few of the tunings because they took a very long time, sometimes up to ten minutes per iteration, but I think I got most of it.
vega10.zip