Xilinx/xup_vitis_network_example

Server crashes when loading the benchmark bitfile

trashcrash opened this issue · 4 comments

I have two Alveo U280 cards on the same host, directly connected to each other.
I built the design using make all DEVICE=xilinx_u280_xdma_201920_3 INTERFACE=3 DESIGN=benchmark
Here's what I was trying to run:

from vnx_utils import *
import pynq
xclbin = '../benchmark.intf3.xilinx_u280_xdma_201920_3/vnx_benchmark_if3.xclbin'
ol_w0 = pynq.Overlay(xclbin,device=pynq.Device.devices[0])
ol_w1 = pynq.Overlay(xclbin,device=pynq.Device.devices[1])

The system crashes and reboots automatically when ol_w0 = pynq.Overlay(xclbin,device=pynq.Device.devices[0]) is executed. I also confirmed the devices do exist:

for i in range(len(pynq.Device.devices)):
    print("{}) {}".format(i, pynq.Device.devices[i].name))

Output:

0) xilinx_u280_xdma_201920_3
1) xilinx_u280_xdma_201920_3

I tried twice with the same bitfile, then recompiled the design to get a new bitfile. Every attempt led to a system reboot.

I found another thread that likely concerns this problem (https://support.xilinx.com/s/question/0D52E00006hpJNLSA2/system-crashes-when-pcie-bit-is-burned-on-fpga?language=en_US). Could it be that the boards on PCIe are disconnected somehow? I ran the basic designs without any problem. Any idea why the crash happens? Thanks!


System information
Manufacturer: Supermicro
Product Name: X9DRG-QF
Version: 0123456789

OS version
LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.9.2009 (Core)
Release: 7.9.2009
Codename: Core

XRT version
XRT Build Version: 2.6.655
Build Version Branch: 2020.1

pynq version
PYNQ version 2.7.0

Hi,

> PCIe are disconnected somehow?

No, this should not happen. The link you referenced is for Vivado designs, not Vitis.

Can you try to grab the dmesg logs of the previous boot?

Can you try to program the Alveo cards in a different order?

Can you try to program the Alveo cards with xbutil?

The XRT version seems quite old as well.

This problem does not seem to be related to VNx or PYNQ, but I'll try to provide some help.

The dmesg log is too long, so I uploaded it to Google Drive:
https://drive.google.com/file/d/1zzh4L7YiTF7bIBLyzLN-Q2S45vplo2A_/view?usp=sharing

Programming the cards in a different order gives the same crash.

Using xbutil program -d 1 -p vnx_benchmark_if3.xclbin still crashes the system.
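For anyone repeating this test, here is a minimal sketch of a wrapper that programs one card at a time with the same xbutil syntax as above, so it is clear which card was being loaded when the host went down. The helper names build_program_cmd and program_cards are hypothetical, and the flags are only the ones used in this thread (newer XRT releases may expect a different syntax). It defaults to a dry run that just prints the commands.

```python
import shutil
import subprocess


def build_program_cmd(card_index, xclbin_path):
    """Build the xbutil command used in this thread for a single card."""
    return ["xbutil", "program", "-d", str(card_index), "-p", xclbin_path]


def program_cards(card_indices, xclbin_path, dry_run=True):
    """Program cards one at a time, in the given order.

    With dry_run=True (the default) the commands are only printed,
    so nothing is sent to the hardware.
    """
    commands = []
    for index in card_indices:
        cmd = build_program_cmd(index, xclbin_path)
        commands.append(cmd)
        # flush so the line is visible even if the host resets right after
        print(" ".join(cmd), flush=True)
        if not dry_run and shutil.which("xbutil") is not None:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Card 1 first, then card 0, to test a different programming order.
    program_cards([1, 0], "vnx_benchmark_if3.xclbin")
```

Passing dry_run=False runs the commands for real, which on the affected server is expected to trigger the reboot.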

You are right, it doesn't seem to be related to your work. I'll close the issue after your response. Thanks again for the good work you've achieved :)

The dmesg log you uploaded is from the current boot.
I am interested in the dmesg messages from when the system crashes. You can try something like this: https://unix.stackexchange.com/a/345978
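One common way to get kernel messages from the boot that crashed is journalctl, sketched below. This may not be exactly what the linked answer describes, and it assumes the systemd journal is persistent (for example, that /var/log/journal exists). Otherwise the previous boot's messages are lost on reboot. The function names are hypothetical.

```python
import shutil
import subprocess


def previous_boot_kernel_log_cmd(boots_back=1):
    """Build a journalctl command for kernel messages of an earlier boot.

    -k selects kernel messages (what dmesg shows); -b -1 selects the
    boot before the current one, -b -2 the one before that, and so on.
    """
    return ["journalctl", "-k", "-b", "-{}".format(boots_back), "--no-pager"]


def fetch_previous_boot_kernel_log(boots_back=1):
    """Return the log text, or None if journalctl is missing or fails."""
    if shutil.which("journalctl") is None:
        return None
    result = subprocess.run(
        previous_boot_kernel_log_cmd(boots_back),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout if result.returncode == 0 else None


if __name__ == "__main__":
    log = fetch_previous_boot_kernel_log()
    if log is None:
        print("No previous-boot log available (is the journal persistent?)")
    else:
        # The last lines before the reset are usually the interesting ones.
        print("\n".join(log.splitlines()[-50:]))
```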

A few things you could try:

  • Generate the benchmark design for only one interface
  • Update XRT to at least 2.11
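As a quick check of the second point, the sketch below compares the reported XRT build version (2.6.655) against the suggested minimum of 2.11. The helper names are hypothetical. The comparison is done on integer tuples because a plain string comparison would wrongly rank "2.6" above "2.11".

```python
def parse_version(text):
    """Turn a dotted version string such as '2.6.655' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    print(meets_minimum("2.6.655", "2.11"))  # False: 6 < 11
    print("2.6.655" >= "2.11")               # True: misleading string comparison
```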

Could this be a power problem?

With the design built for only one interface, the server doesn't crash. It seems the server simply can't handle two interfaces, power-wise or otherwise.