======================================================= AI Engine With PL Example ======================================================= The AI engine with PL example demonstrates how to use AI engine for scalar computation, and use PL for data movement. In this example, to run the matrix multiplication on AI engine, we use standard matrix multiplication algorithm. The user can change the matrix size and the number of cores utilized at compile-time. The expected matrix size must be a multiple of 50 (number of cores used) with the minimum and maximum value as 100x100 and 800x800 respectively. Please note that this example is intended to be a proof of concept only. There can be other ways of implementation, which can leverage more of the AIE resource and hence can result in better performance figures. Introduction ============ Consider two matrices A and B, the product of the two, i.e. AxB, is a linear combination of the the columns of A by matrix B. This means that the elements in a row (i) of A are multiplied with the elements in a column of B (j) and are summed up to give the corresponding single element in the matrix AxB at i, j. This means that if A is an n x m matrix and B is an m x p matrix, then the corresponding product AxB would have dimensions n x p. Note how the number of columns of A equals the number of rows in B to make the matrix multiplication possible. Implementation ============== Data movement ------------- The application uses the PLIO attribute to make external memory-mapped connections to or from global memory. These connections are created between AIE kernel and the logical global memory port of the hardware platform design via AXI-Multichannel Direct Memory Access IP in the fabric. In this design, the buffer descriptors are programmed in the AXI-MCDMA IP to initiate AIE to DDR read and write transactions in the PS program. The burst-length of the memory-mapped transaction is 64-bit,and AXI-MCDMAs use physical memory addressing read/write data from global memory. +-------------------------------------------------------------------------------------------------------------+ | +----------------------------------------+ +-----------------------------------------+ | | | | | | | | | | | | | | | +-----------+ | +--------------+ | +-----------+ | | | | +-->------------| | | | | |-----------| | | | | | |------------<----->+ AIE Core +<----->------------| | | | | +--------------+ | |-----------| | | Module | | |---------------+ +--------------+ | | | | | +--+ |-----------| | | | | |-----------| | | | | | | | | Tile DMA | |-----------| | +--------------+ | |-----------| +-> | Tile DMA | | | | | | | |-----------| | | |-----------| | | | | | | +------+-------+ |-----------| | | |-----------| +-------+------+ | | | | ^ |-----------| | | |-----------| | | | | | | +-----------+ | | +-----------+ | | | | | | Data Memory | | Data Memory | | | | | | | | | | | | | | | | | | | | | | Memory Module | | Memory Module | | | | +----------------------------------------+ +-----------------------------------------+ | | | AIE Array | | +-------------------------------------------------------------------------------------------------------------+ | | +-------------------------------------------------------------------------------------------------------------+ | | | | | +------------------------+ AIE Shim Row +---------------------------+ | | | | | +-------------------------------------------------------------------------------------------------------------+ | | +--------------------------------------------------------------------------------------------------------------+ | | | | | | | | | | | | | +----------------------------------------------------------------+ | | | | | | | | | | v | | | | +-----------+-----------+ +-----------+-----------+ | | | | | | | | | | | | | MM2S Data Mover | | S2MM Data Mover | | | | | | | | | | | | | +----------+------------+ +------------+----------+ | | | | ^ | | | | | | AXI-MCDMA IP | | | | +----------------------------------------------------------------+ | | | | | | | | | | | | | | | Programmable Logic (PL) | | +--------------------------------------------------------------------------------------------------------------+ | | | | | v +----------------------------------+----------------------------------+---------------------------------------+ | | | Network-on-Chip (NoC) | | | +------------------------------------------------------+------------------------------------------------------+ ^ | v +--------------------------------------+ |--------------------------------------| |--------------------------------------| |--------------------------------------| |--------------------------------------| |--------------------------------------| |--------------------------------------| |--------------------------------------| |--------------------------------------| +--------------------------------------+ DDR Data slicing ------------ To compute matrix multiplication on AIE, matrix A is sliced horizontally and distributed equally among all the core utilized through the AXI-Stream network. Matrix B is transposed and feed to the first core in the design element by element. The first core shares the input matrix B with the next core through the AXI-Stream connection. As the output received from the cores is in z-order fashion, hence a re-ordering of the output matrix is expected. +----------------+ +---------------+ +---------------+ | | | | | | | | | | | | | | | | | | | A | X | B | = | C | | | | | | | | | | | | | | | | | | | +----------------+ +---------------+ +---------------+ +----------------------------------------+ +-----------------------------------------+ | | | | A0 <----------------------------------------> | | +----------------------------------------+ | <-----------------------------------> | +------+ | | | | +-----> | C1 | A1 | | | | | | +----------------------------------------+ | | +------+ | | | T | A2 | | | B | +----------------------------------------+ | | | | | | A3 | | | | +----------------------------------------+ | | | | | | A4 | | | | +----------------------------------------+ | | | | | | A5 | | | | +----------------------------------------+ | | | | | | A6 | | | | +----------------------------------------+ | | | | | | A7 | | | | +----------------------------------------+ +-----------------------------------------+ Compilation Flow ================ +----------------+ +------------------+ | | | | | xgemm.cpp | | <kernels>.cc | | | | | +--+---------+---+ +-------+----------+ | | | | | | | v v | +---+----------------------+---+ | | | | | Vitis | | | | | +---------+-----------------+--+ | | | | v v | +---------+---------+ +--+----------------+ | | | | | | | aie_control.cpp | | AIE ELF files | | | | | | | +---+-----------+---+ +---------+---------+ | | | | v v v | +--------+---------+---+ +--+-------------+ | | | | | | | GCC Cross-Compiler | | Vitis CDO | | | | | Generator | | +----------+-----------+ | | | | +-------+--------+ | | | | | v | | +-------+--------+ | | | | | | | AIE CDO Init | | | | | | | +----------+-----+ | | | | | v v | +----+----------+--------+ | | | | | Vitis Bootgen | | | | | +----------------+-------+ | | v v +----------+--------------------+ +---------+-----------------------+ | | | | | aie-matrix-multiplication | | PDI containing AIE CDO and ELFs | | (Application Binary) | | | | | | | +-------------------------------+ +---------------------------------+ There are 2 sets of external interfaces for AI Engine configuration - AI Engine configuration Direct call to AI Engine driver APIs CDO parsing APIs - ELF loading Direct call to AI Engine driver to load ELF file The high-level tool, Vitis, can generate outputs in those 2 formats. Also, the user can manually implement the application using direct calls and compile the ELF using the low-level compiler. The generated aie_control.cpp is cross-compiled to run on the target. The compiled application loads the generated ELF to the corresponding tile through AI Engine load ELF API. The AI Engine configuration is done by calling AI Engine driver APIs directly or pass CDO object through CDO parser library. The CDO parser is an external component, and the AI Engine software uses the CDO APIs. Sample Work Directory Structure =============================== components/yocto/workspace/sources/aie-matrix-multiplication/ ├── aie-matrix-multiplication - Cross-Compiled application binary ├── aie-matrix-multiplication.bif - Binary input file ├── aie-matrix-multiplication.pdi - AIE boot PDI ├── aie-pdi-gen.sh - script to generate BIF and PDI ├── clean-aie-work.sh - removes log files from Work directory ├── graph_aie_constraints.aiecst - PL constraint file ├── README ├── src │ ├── kernels │ │ ├── config.h - user defined parameters │ │ ├── one_input.cc - first AIE kernel that reads data from PL stream │ │ ├── one_output.cc - last AIE kernel that sends data to PL stream │ │ └── two_inputs.cc - subsequent kernels that circulated data and output │ ├── kernels.h - kernels declaration │ ├── Makefile.Linux - makefile to build the Linux graph control application. │ ├── xgemm.cpp - main application runs on the APU │ └── xgemm.h - dataflow graph definition └── Work ├── aie │ ├── 0_0 │ │ └── Release │ │ └── 0_0 - AIE ELF. Naming convention is <COLUMN_NUMBER>_<ROW_NUMBER> - 1 │ ├── 0_1 │ │ └── Release │ │ └── 0_1 │ ├── 0_2 │ │ └── Release │ │ └── 0_2 │ ├── 0_3 │ │ └── Release │ │ └── 0_3 . . . . . . . . │ ├── 9_5 │ │ └── Release │ │ └── 9_5 │ ├── 9_6 │ │ └── Release │ │ └── 9_6 │ └── 9_7 │ └── Release │ └── 9_7 └── ps ├── cdo │ └── aie_cdo_init.bin └── c_rts └── aie_control.cpp Run-time Execution ================== At run-time, Linux application binary calls (UIO based) AI Engine userspace driver and (tool generated) run time library, libcardano_api.so. AIE userspace drivers abstract the libmetal and remoteproc APIs to handle runtime configuration along with ELF loading. +--------------------------+ | | | Application | | binary | | | +------------+-------------+ | | +-----------------------------v-------------------------------+ +-----------------------------+ +---------------------------+ | | | | | AIE User Space Driver | | | | | | | | +-----------------+ | | libcardano_api.a | | | IO | | | | | | | | | | | +-----------------+ | | | +-----------------------------+ +---------------------------+ | +-------------------------------------------------------------+ v +-----------+------------+ | Libmetal | +------------------------+ | +-------------------------------------------------------------+ | +-----------v------------+ | Linux UIO | +------------------------+ | +-----v-----+ |Xilinx AIE | | UIO | +-----------+ The libmetal provides the IO abstraction to the application, which allows the application to be platform independent, ex standalone and Linux. So all the IO in the driver is routed through libmetal, and libmetal can handle platform specific part. The Linux UIO subsystem allows to run IO from Linux userspace with a small kernel module, including the MMIO and interrupt handling. The UIO interfaces are based on the Linux sysfs, and the AI Engine driver stack utilizes this UIO subsystem through libmetal, to enable platform-independent AI Engine software stack. So the major part of the AIE specific code resides in the userspace as a library, which can be reused between other software platforms such as baremetal. Build Application Using PetaLinux ================================== By default, the AIE matrix multiplication application is enabled. To enable/disble, run: ``` petalinux-config -c rootfs [*] user packages ---> [*] aie-matrix-multiplication ``` To rebuild the project run, petalinux-build. The generated FIT image will be in images/linux/image.ub. Sample Output ============= Follow PetaLinux boot process to launch the Linux on the target. After you see the Linux login prompt, you can log in with user "root" and password "root". The AIE ELFs are installed in the `/lib/firmware/aie` directory. We will need to go to `/lib/firmware/aie` to run the application. ``` root@xilinx-vck190-2020_1:~# cd /lib/firmware/aie root@xilinx-vck190-2020_1:/lib/firmware/aie# aie-matrix-multiplication Initializing AIE driver... metal: info: Registered shmem provider linux_shm. metal: info: Registered shmem provider ion.reserved. metal: info: Registered shmem provider ion.ion_system_contig_heap. metal: info: Registered shmem provider ion.ion_system_heap. metal: info: device xilinx-aiengine in use by driver uio_dmem_genirq metal: info: metal_uio_dev_open: No IRQ for device f70a0000.aie-npi. Initializing Cardano API... Matrix size(int32): 800x800 PLIO MCDMA> allocated matrix A at 0x7f95c29000 (phy addr: 0x65900000) PLIO MCDMA> allocated matrix B at 0x7f95e9a000 (phy addr: 0x65b71000) PLIO MCDMA> allocated matrix B transpose at 0x7f9610b000 (phy addr: 0x65de2000) PLIO MCDMA> allocated matrix C at 0x7f9637c000 (phy addr: 0x66053000) PLIO MCDMA> allocated AIE result at 0x7f965ed000 (phy addr: 0x662c4000) PLIO MCDMA> allocated APU result at 0x7f9685e000 (phy addr: 0x66535000) PLIO MCDMA> allocated MM2S BD chain #0 at 0x7f96acf000 (phy addr: 0x667a6000) PLIO MCDMA> allocated S2MM BD chain #0 at 0x7f96c91000 (phy addr: 0x66968000) PLIO MCDMA> allocated MM2S BD chain #1 at 0x7f96b07400 (phy addr: 0x667de400) PLIO MCDMA> allocated S2MM BD chain #1 at 0x7f96c97400 (phy addr: 0x6696e400) PLIO MCDMA> allocated MM2S BD chain #2 at 0x7f96b3f800 (phy addr: 0x66816800) PLIO MCDMA> allocated S2MM BD chain #2 at 0x7f96c9d800 (phy addr: 0x66974800) PLIO MCDMA> allocated MM2S BD chain #3 at 0x7f96b77c00 (phy addr: 0x6684ec00) PLIO MCDMA> allocated S2MM BD chain #3 at 0x7f96ca3c00 (phy addr: 0x6697ac00) PLIO MCDMA> allocated MM2S BD chain #4 at 0x7f96bb0000 (phy addr: 0x66887000) PLIO MCDMA> allocated S2MM BD chain #4 at 0x7f96caa000 (phy addr: 0x66981000) PLIO MCDMA> allocated MM2S BD chain #5 at 0x7f96be8400 (phy addr: 0x668bf400) PLIO MCDMA> allocated S2MM BD chain #5 at 0x7f96cb0400 (phy addr: 0x66987400) PLIO MCDMA> allocated MM2S BD chain #6 at 0x7f96c20800 (phy addr: 0x668f7800) PLIO MCDMA> allocated S2MM BD chain #6 at 0x7f96cb6800 (phy addr: 0x6698d800) PLIO MCDMA> allocated MM2S BD chain #7 at 0x7f96c58c00 (phy addr: 0x6692fc00) PLIO MCDMA> allocated S2MM BD chain #7 at 0x7f96cbcc00 (phy addr: 0x66993c00) PLIO MCDMA> init_dmas: 0xa4000000, page size: 0x1000 PLIO MCDMA> init_dmas: 0xa4010000, page size: 0x1000 PLIO MCDMA> init_dmas: 0xa4020000, page size: 0x1000 PLIO MCDMA> init_dmas: 0xa4030000, page size: 0x1000 PLIO MCDMA> init_dmas: 0xa4040000, page size: 0x1000 PLIO MCDMA> init_dmas: 0xa4050000, page size: 0x1000 PLIO MCDMA> init_dmas: 0xa4060000, page size: 0x1000 PLIO MCDMA> init_dmas: 0xa4070000, page size: 0x1000 Resetting AIE array... Initializing graph my_graph... Configuring PL-Interface for graph my_graph... Loading elfs of graph my_graph... Resetting cores of graph my_graph... Configuring DMAs of graph my_graph... Set test-iterations for the core(s) of graph my_graph Disabling core(s) of graph my_graph ... Enabling core(s) of graph my_graph ... Waiting for core(s) of graph my_graph to finish execution ... core(s) are done executing... Success! ``` Customizing and Rebuilding ========================== Following are prerequisites if the user wants to make any changes in software, * xgemm repo in Yocto workspace and aiecompiler for any software-related changes. * PetaLinux. For cross-compiling the main appplication, sysroot is required.To get the Linux sysroot, go to the PetaLinux project directory, run the following command: ``` $ petalinux-build --sdk $ petalinux-package --sysroot ``` The sysroot will be generated in `images/linux/sdk/sysroots/aarch64-xilinx-linux/` directory. Now, to pull the xgemm repository using PetaLinux, set the Yocto build tool as devtool: ``` $ petalinux-config Yocto Settings ---> Build tool ---> (*) devtool $ petalinux-build -c aie-matrix-multiplication -x modify ``` The xgemm source files can be found in `components/yocto/workspace/sources/aie-matrix-multiplication/` As mentioned earlier, the user can change the number of AIE cores utilized for matrix multiplication. However, since the data memory immediately available to the core is limited, reducing the number of cores utilized reduces the maximum matrix size supported by the application. Within the config.h header file NUM_HW_ROWS and NUM_HW_COLS macro can be set to change the number of cores utilized. The maximum cores available are 400. With the changes made in the application, care must be taken by the user that the newly generated configuration is supported by the underlying hardware design. To rebuild, go to the meta-user demo recipe files directory: project-spec/meta-user-recipes-apps/aie-matrix-multiplication/files. set the CARDANO_ROOT and then call AI engine compiler to build: ``` export CARDANO_ROOT=<Root_To_Installed_Cardano_which_is_under_Vitis> source $CARDANO_ROOT/scripts/cardano_env.sh aiecompiler --target=hw --constraints=graph_aie_constraints.aiecst --analyze-kernels=true --dataflow src/xgemm.cpp ``` After the AIE application is customized and compiled, user can run `clean-aie-work.sh` to clean the Work directory to remove unncessary files, only leave the AIE kernels and the aie_control.cpp file. The Linux AIE graph .cpp file is in the `src/` directory. After you build the AIE application, if you need to build the Linux control application. You can use the `Makefile.Linux` in the `aie-matrix-multiplication/files/src` directory to build the Linux AIE graph control application. You will need to specify CARDANO_ROOT and the Linux sysroot. Use Makefile.Linux to rebuild the Linux AIE graph application: ``` $ make -f Makefile.Linux \ SYSROOT=<plnx_proj>/images/linux/sdk/sysroots/aarch64-xilinx-linux/ \ CARDANO_ROOT=<cardano_root> ``` The generated Linux binary will be `aie-matrix-multiplication`. The user can then rebuild petalinux. NOTE: * No hardware change is supported in this version of Vivado for this design. PDI Generation ========================= Vivado can generate a boot PDI includesd PS, PL and NoC configuration with "Generate bitstream" icon from Vivado GUI, but it will not includes the AIE configuration and AIE ELFs. In order to includes the AIE configuration and ELFs into the boot PDI, user will need to update the BIF of the boot PDI generated by Vivado. Vivado generate the boot PDI and the BIF in the hardware project's "xilinx-vck190-2020.1.runs/impl_1/", e.g: * project_1_wrapper.pdi * project_1_wrapper.pdi.bif User will need to generate the AIE configuration CDO using aiecompiler first, and update the bif to includes the AIE CDO and ELFs. To generate the AIE CDO, user can go to the cardano generated AIE Work/ directory which is generated while compiling the AIE application: ``` cd Work/ps/cdo/ ``` Source cardano settings, ``` export CARDANO_ROOT=<Installed_Vitis>/cardano/ source $CARDANO_ROOT/scripts/cardano_env.sh ``` Run the ./generateAIEConfig from this directory. It will generate aie_cdo.bin file. And then please add the following configuration partitions to the boot PDI BIF file generated by Vivado to includes the AIE configuration and ELFs: ``` partition { id = 0x10 type = cdo file = <AIE_APP>/Work/ps/cdo/aie_cdo_init.bin } partition { id = 0x12 core = aie file = <AIE_APP>/Work/aie/0_0/Release/0_0 } ... partition { id = 0x12 core = aie file = <AIE_APP>/Work/aie/<C>_<R>/Release/<C>_<R> } ``` And then use bootgen to generate the PDI, you can source vitis settings to use bootgen. E.g. ``` bootgen -w -arch versal -image <BIF> -o <PDI> ``` The "aie-matrix-multiplication/aie-pdi-gen.sh" gives an example on how to generate a boot PDI to include AIE configuration and ELFs or a partial PDI which only contains AIE configuration and ELFs. The `aie-matrix-multiplication/aie-matrix-multiplication.bif` gives an example of a boot PDI BIF to contains AIE. References ========== [1] Vivado User Guide - for hardware related design. [2] Vitis User Guide - for AIE application. [3] Versal TRM - General subsystem overview.