open-power/snap

HLS_memcopy bandwidth is only 268.918 MiB/sec. Why?

liwei008ren opened this issue · 14 comments

We used Vivado 2018.1 to build HLS_memcopy. After flashing the bin file to the FPGA and running ./actions/hls_memcopy/tests/test_0x10141000_throughput.sh -C0 -dLONG:
The bandwidth is only 268.723 MiB/sec.

root@master:/usr/liwei/FPGA/snap# ./actions/hls_memcopy/tests/test_0x10141000_throughput.sh -C0 -dLONG:
Starting : ./actions/hls_memcopy/tests/test_0x10141000_throughput.sh
SNAP_ROOT : /usr/liwei/FPGA/snap
ACTION_ROOT : /usr/liwei/FPGA/snap/actions/hls_memcopy
Get CARD VERSION
SNAP Card Id: 0 Name: ADKU3. NVME disabled, 8192 MB DRAM available. (Align: 64 Min_DMA: 1)
SNAP FPGA Release: v1.5.1 Distance: 80 GIT: 0x2e41c6c
SNAP FPGA Build (Y/M/D): 2021/04/26 Time (H:M): 05:21
SNAP FPGA CIR Master: 1 My ID: 0
SNAP FPGA Up Time: 6037 sec
SNAP FPGA Exploration already done (MSAT: 1 MAID: 1)

Short | Action Type | Level | Action Name
------+--------------+-----------+------------
0 0x10141000 0x00000000 IBM HLS Memcopy

[00000000] 010501502e41c6ca
[00000008] 0000202104260521

Creating a 33445532 bytes file ...takes a minute or so ...
Doing snap_memcopy benchmarking with 33445532 bytes transfers ...
Read from Host Memory to FPGA ... ok
Write from FPGA to Host Memory ... ok

READ/WRITE Performance Results
memcopy of 33554432 bytes took 126939 usec @ 264.335 MiB/sec (from HOST_DRAM to FPGA_BRAM)
memcopy of 33445532 bytes took 124456 usec @ 268.734 MiB/sec (from FPGA_BRAM to HOST_DRAM)

ok
Test OK

Hello,
with the LONG option you should get a 512 MB file, but I see only a 32 MB one.
As per :

if [ "$duration" = "LONG" ]; then

Maybe that explains the wrong computation.
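For illustration, here is a hypothetical sketch of the size selection the test script is expected to perform around that check ("duration" and "size" are illustrative names, not necessarily the script's actual variables):

```shell
#!/bin/sh
# Hedged sketch: how the -dLONG option is presumed to select the
# transfer size. Variable names are illustrative only.
duration="LONG"                        # as passed via -dLONG

if [ "$duration" = "LONG" ]; then
    size=$((512 * 1024 * 1024))        # 512 MB test file for a LONG run
else
    size=$((32 * 1024 * 1024))         # 32 MB file for a short run
fi
echo "$size"
```

Note the string comparison is exact, so anything extra appended to the -d value (even a stray character) would silently fall through to the short 32 MB case.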
Please find my log as a reference.
Best regards

castella@antipode:~/snap$ ./actions/hls_memcopy/tests/test_0x10141000_throughput.sh -C1 -dLONG
Starting :    ./actions/hls_memcopy/tests/test_0x10141000_throughput.sh
SNAP_ROOT :   /home/capiteam/castella/snap
ACTION_ROOT : /home/capiteam/castella/snap/actions/hls_memcopy
Get CARD VERSION
SNAP Card Id: 0 Name: ADKU3. NVME disabled, 8192 MB DRAM available. (Align: 64 Min_DMA: 1)
SNAP FPGA Release: v1.5.1 Distance: 80 GIT: 0x2e41c6c
SNAP FPGA Build (Y/M/D): 2021/04/26 Time (H:M): 11:01
SNAP FPGA CIR Master: 1 My ID: 0
SNAP FPGA Up Time: 289 sec
SNAP FPGA Exploration already done (MSAT: 1 MAID: 1)

   Short |  Action Type |   Level   | Action Name
   ------+--------------+-----------+------------
     0     0x10141000     0x00000023  IBM HLS Memcopy

[00000000] 010501502e41c6ca
[00000008] 0000202104261101

Creating a 536870912 bytes file ...takes a minute or so ...
Doing snap_memcopy benchmarking with 536870912 bytes transfers ...
Read from Host Memory to FPGA ... ok
Write from FPGA to Host Memory ... ok
Read from Card DDR Memory to FPGA ... ok
Write from FPGA to Card DDR Memory ... ok

READ/WRITE Performance Results
memcopy of 536870912 bytes took 160769 usec @ 3339.393 MiB/sec (from HOST_DRAM to FPGA_BRAM)
memcopy of 536870912 bytes took 162063 usec @ 3312.730 MiB/sec (from FPGA_BRAM to HOST_DRAM)
memcopy of 536870912 bytes took 52025 usec @ 10319.479 MiB/sec (from CARD_DRAM to FPGA_BRAM)
memcopy of 536870912 bytes took 55958 usec @ 9594.176 MiB/sec (from FPGA_BRAM to CARD_DRAM)

ok
Test OK

Thank you very much! But I ran with -dLONG and the bandwidth is still 272.461 MiB/sec; I don't know why. Can you give me your flash bin file? I suspect my FPGA file is wrong.
root@master:/dev# cd /usr/liwei/FPGA/snap/
root@master:/usr/liwei/FPGA/snap# ./actions/hls_memcopy/tests/test_0x10141000_throughput.sh -C0 -dLONG
Starting : ./actions/hls_memcopy/tests/test_0x10141000_throughput.sh
SNAP_ROOT : /usr/liwei/FPGA/snap
ACTION_ROOT : /usr/liwei/FPGA/snap/actions/hls_memcopy
Get CARD VERSION
SNAP Card Id: 0 Name: ADKU3. NVME disabled, 8192 MB DRAM available. (Align: 64 Min_DMA: 1)
SNAP FPGA Release: v1.5.1 Distance: 80 GIT: 0x2e41c6c
SNAP FPGA Build (Y/M/D): 2021/04/26 Time (H:M): 05:21
SNAP FPGA CIR Master: 1 My ID: 0
SNAP FPGA Up Time: 8423 sec
0 Max AT: 1 Found AT: 0x10141000 --> Assign Short AT: 0
0 0x10141000 0x00000023 IBM HLS Memcopy

[00000000] 010501502e41c6ca
[00000008] 0000202104260521

Creating a 536870912 bytes file ...takes a minute or so ...
Doing snap_memcopy benchmarking with 536870912 bytes transfers ...
Read from Host Memory to FPGA ... ok
Write from FPGA to Host Memory ... ok

READ/WRITE Performance Results
memcopy of 536870912 bytes took 1970452 usec @ 272.461 MiB/sec (from HOST_DRAM to FPGA_BRAM)
memcopy of 536870912 bytes took 2000320 usec @ 268.393 MiB/sec (from FPGA_BRAM to HOST_DRAM)

ok
Test OK
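As a side check, the rates printed in these logs appear consistent with bytes divided by microseconds (i.e. decimal MB/s, despite the MiB/sec label); this is an observation from the numbers, not a statement about the actual SNAP source:

```shell
#!/bin/sh
# Cross-check of the reported throughput: bytes / usec reproduces the
# figure printed in the LONG-run log above.
bytes=536870912   # transfer size from the log
usec=1970452      # elapsed time from the log
awk -v b="$bytes" -v u="$usec" 'BEGIN { printf "%.3f MiB/sec\n", b / u }'
```

Running this prints 272.461 MiB/sec, matching the HOST_DRAM to FPGA_BRAM line.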

I removed the DDR on the FPGA and commented out the Card DDR Memory to FPGA test code. Does that matter here?

The project runs in this environment:
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.1 LTS
Release: 16.04
Codename: xenial

root@master:/usr/liwei/FPGA/snap# uname -a
Linux master 4.4.0-154-generic #181-Ubuntu SMP Tue Jun 25 05:29:49 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux

What is the maximum bandwidth for both reading and writing?

Hello,
can you get and try this ADKU3 hls_memcopy binary that I used myself: https://ibm.box.com/s/755s1dv6truyecl64968nh06jlg1bfj7
Could you also generate yours with the default settings to make sure the reference passes, and then compare with your modifications?
Thanks for the trials.
Concerning bandwidth, CAPI1 was not that performant: around 3 to 4 GB/s back and forth was average. Nothing to compare with CAPI2 (around 14 GB/s useful rate over 16 GB/s links) or even OpenCAPI results (21-22 over 25 GB/s).
Thanks

Another thought: can you check why, with the LONG option, you get only a 32 MB file, when you should automatically get a 512 MB one? I think the issue is in the test script itself. Maybe run the test steps by hand?

Thank you very much, I will try again.

Many thanks!

The ADKU3 has x16 PCIe, so why is the bandwidth only 3.3 GB/s? The theoretical bandwidth should be 16 GB/s, or at least 7 GB/s.

Hi,
From the ADKU3 spec: the ADM-PCIE-KU3 is capable of PCIe Gen 1/2/3 with 1/2/4/8/16 lanes (where 16 lanes require two bifurcated 8-lane interfaces), and bifurcation has not been implemented in SNAP for CAPI1.
Check doc#p8-capi10-snap-fpga-supported-boards
So the maximum would be 8 GB/s raw, about 7 GB/s useful. And as previously mentioned, there was a performance limitation with the first implementation of CAPI.
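A back-of-the-envelope check of that limit, assuming standard PCIe Gen3 signaling (8 GT/s per lane, 128b/130b line encoding):

```shell
#!/bin/sh
# PCIe Gen3 x8 raw bandwidth estimate: the KU3's effective CAPI1 width
# without bifurcation. 8 GT/s per lane, 128b/130b encoding efficiency.
awk 'BEGIN {
    gtps  = 8          # Gen3: 8 GT/s per lane (1 bit per transfer)
    lanes = 8          # x8; reaching x16 would need bifurcation
    enc   = 128 / 130  # 128b/130b line-code efficiency
    printf "%.2f GB/s\n", gtps * lanes * enc / 8   # bits -> bytes
}'
```

This prints 7.88 GB/s, i.e. just under 8 GB/s raw before protocol overhead, which is why roughly 7 GB/s is the realistic useful ceiling.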
You can find the expected performance figures in this document: UG_SNAP_hls_memcopy_v23.pdf
Thanks

We are currently working on a cloud solution with OpenCAPI and eventually CAPI2.

Hello, does IBM have a cloud solution service, and what is the website?

Thank you very much. We want to rent a POWER server with OpenCAPI service. Is there a recommended website?

Hi,
you should find your nearest local representative for this; this issue tracker is not aimed at dealing with commercial matters.
Thanks