/plnx-aie-examples

AI Examples for Petalinux builds

Primary LanguageC++OtherNOASSERTION

=======================================================
AI Engine With PL Example
=======================================================

The AI engine with PL example demonstrates how to use AI engine for scalar
computation, and use PL for data movement. In this example, to run the matrix
multiplication on AI engine, we use standard matrix multiplication algorithm.
The user can change the matrix size and the number of cores utilized at
compile-time. The expected matrix size must be a multiple of 50 (number of
cores used) with the minimum and maximum value as 100x100 and 800x800
respectively. Please note that this example is intended to be a proof of
concept only. There can be other ways of implementation, which can leverage
more of the AIE resource and hence can result in better performance figures.

Introduction
============

Consider two matrices A and B, the product of the two, i.e. AxB, is a linear
combination of the the columns of A by matrix B. This means that the elements
in a row (i) of A are multiplied with the elements in a column of B (j) and are
summed up to give the corresponding single element in the matrix AxB at i, j.
This means that if A is an n x m matrix and B is an m x p matrix, then the
corresponding product AxB would have dimensions n x p. Note how the number of
columns of A equals the number of rows in B to make the matrix multiplication
possible.

Implementation
==============


Data movement
-------------

The application uses the PLIO attribute to make external memory-mapped
connections to or from global memory. These connections are created between
AIE kernel and the logical global memory port of the hardware platform design
via AXI-Multichannel Direct Memory Access IP in the fabric. In this design,
the buffer descriptors are programmed in the AXI-MCDMA IP to initiate AIE to DDR
read and write transactions in the PS program. The burst-length of the
memory-mapped transaction is 64-bit,and AXI-MCDMAs use physical memory
addressing read/write data from global memory.

+-------------------------------------------------------------------------------------------------------------+
| +----------------------------------------+                      +-----------------------------------------+ |
| |                                        |                      |                                         | |
| |                                        |                      |                                         | |
| |                        +-----------+   |   +--------------+   |   +-----------+                         | |
| |                    +-->------------|   |   |              |   |   |-----------|                         | |
| |                    |   |------------<----->+   AIE Core   +<----->------------|                         | |
| |  +--------------+  |   |-----------|   |   |    Module    |   |   |---------------+   +--------------+  | |
| |  |              +--+   |-----------|   |   |              |   |   |-----------|   |   |              |  | |
| |  |   Tile DMA   |      |-----------|   |   +--------------+   |   |-----------|   +-> |   Tile DMA   |  | |
| |  |              |      |-----------|   |                      |   |-----------|       |              |  | |
| |  +------+-------+      |-----------|   |                      |   |-----------|       +-------+------+  | |
| |         ^              |-----------|   |                      |   |-----------|               |         | |
| |         |              +-----------+   |                      |   +-----------+               |         | |
| |         |               Data Memory    |                      |    Data Memory                |         | |
| |         |                              |                      |                               |         | |
| |         |                              |                      |                               |         | |
| |         |    Memory Module             |                      |               Memory Module   |         | |
| +----------------------------------------+                      +-----------------------------------------+ |
|           |                                      AIE Array                                      |           |
+-------------------------------------------------------------------------------------------------------------+
            |                                                                                     |
+-------------------------------------------------------------------------------------------------------------+
|           |                                                                                     |           |
|           +------------------------+            AIE Shim Row        +---------------------------+           |
|                                    |                                |                                       |
+-------------------------------------------------------------------------------------------------------------+
                                     |                                |
+--------------------------------------------------------------------------------------------------------------+
|                                    |                                |                                        |
|                                    |                                |                                        |
|                                    |                                |                                        |
|                    +----------------------------------------------------------------+                        |
|                    |               |                                |               |                        |
|                    |               |                                v               |                        |
|                    |   +-----------+-----------+        +-----------+-----------+   |                        |
|                    |   |                       |        |                       |   |                        |
|                    |   |    MM2S Data Mover    |        |    S2MM Data Mover    |   |                        |
|                    |   |                       |        |                       |   |                        |
|                    |   +----------+------------+        +------------+----------+   |                        |
|                    |              ^                                  |              |                        |
|                    |              |           AXI-MCDMA IP           |              |                        |
|                    +----------------------------------------------------------------+                        |
|                                   |                                  |                                       |
|                                   |                                  |                                       |
|                                   |                                  |                                       |
|                                   |      Programmable Logic (PL)     |                                       |
+--------------------------------------------------------------------------------------------------------------+
                                    |                                  |
                                    |                                  |
                                    |                                  v
 +----------------------------------+----------------------------------+---------------------------------------+
 |                                                                                                             |
 |                                                Network-on-Chip (NoC)                                        |
 |                                                                                                             |
 +------------------------------------------------------+------------------------------------------------------+
                                                        ^
                                                        |
                                                        v
                                    +--------------------------------------+
                                    |--------------------------------------|
                                    |--------------------------------------|
                                    |--------------------------------------|
                                    |--------------------------------------|
                                    |--------------------------------------|
                                    |--------------------------------------|
                                    |--------------------------------------|
                                    |--------------------------------------|
                                    +--------------------------------------+

                                                       DDR


Data slicing
------------

To compute matrix multiplication on AIE, matrix A is sliced horizontally and
distributed equally among all the core utilized through the AXI-Stream network.
Matrix B is transposed and feed to the first core in the design element by
element. The first core shares the input matrix B with the next core through the
AXI-Stream connection. As the output received from the cores is in z-order 
fashion, hence a re-ordering of the output matrix is expected.

+----------------+            +---------------+              +---------------+
|                |            |               |              |               |
|                |            |               |              |               |
|                |            |               |              |               |
|       A        |      X     |       B       |      =       |       C       |
|                |            |               |              |               |
|                |            |               |              |               |
|                |            |               |              |               |
+----------------+            +---------------+              +---------------+


         +----------------------------------------+            +-----------------------------------------+
         |                                        |            |                                         |
      A0 <---------------------------------------->            |                                         |
         +----------------------------------------+            |  <----------------------------------->  |                +------+
         |                                        |            |                                         |    +----->     |  C1  |
      A1 |                                        |            |                                         |                |      |
         +----------------------------------------+            |                                         |                +------+
         |                                        |            |                    T                    |
      A2 |                                        |            |                   B                     |
         +----------------------------------------+            |                                         |
         |                                        |            |                                         |
      A3 |                                        |            |                                         |
         +----------------------------------------+            |                                         |
         |                                        |            |                                         |
      A4 |                                        |            |                                         |
         +----------------------------------------+            |                                         |
         |                                        |            |                                         |
      A5 |                                        |            |                                         |
         +----------------------------------------+            |                                         |
         |                                        |            |                                         |
      A6 |                                        |            |                                         |
         +----------------------------------------+            |                                         |
         |                                        |            |                                         |
      A7 |                                        |            |                                         |
         +----------------------------------------+            +-----------------------------------------+



Compilation Flow
================

      +----------------+          +------------------+
      |                |          |                  |
      |    xgemm.cpp   |          |   <kernels>.cc   |
      |                |          |                  |
      +--+---------+---+          +-------+----------+
         |         |                      |
         |         |                      |
         |         v                      v
         |     +---+----------------------+---+
         |     |                              |
         |     |            Vitis             |
         |     |                              |
         |     +---------+-----------------+--+
         |               |                 |
         |               v                 v
         |     +---------+---------+    +--+----------------+
         |     |                   |    |                   |
         |     |  aie_control.cpp  |    |   AIE ELF files   |
         |     |                   |    |                   |
         |     +---+-----------+---+    +---------+---------+
         |         |           |                  |
         v         v           v                  |
+--------+---------+---+    +--+-------------+    |
|                      |    |                |    |
|  GCC Cross-Compiler  |    |   Vitis CDO    |    |
|                      |    |   Generator    |    |
+----------+-----------+    |                |    |
           |                +-------+--------+    |
           |                        |             |
           |                        v             |
           |                +-------+--------+    |
           |                |                |    |
           |                |  AIE CDO Init  |    |
           |                |                |    |
           |                +----------+-----+    |
           |                           |          |
           |                           v          v
           |                      +----+----------+--------+
           |                      |                        |
           |                      |     Vitis Bootgen      |
           |                      |                        |
           |                      +----------------+-------+
           |                                       |
           v                                       v
+----------+--------------------+        +---------+-----------------------+
|                               |        |                                 |
|   aie-matrix-multiplication   |        | PDI containing AIE CDO and ELFs |
|   (Application Binary)        |        |                                 |
|                               |        |                                 |
+-------------------------------+        +---------------------------------+

There are 2 sets of external interfaces for AI Engine configuration
- AI Engine configuration
	Direct call to AI Engine driver APIs
	CDO parsing APIs
- ELF loading
	Direct call to AI Engine driver to load ELF file

The high-level tool, Vitis, can generate outputs in those 2 formats. Also, the
user can manually implement the application using direct calls and compile the
ELF using the low-level compiler.

The generated aie_control.cpp is cross-compiled to run on the target. The
compiled application loads the generated ELF to the corresponding tile through
AI Engine load ELF API. The AI Engine configuration is done by calling AI
Engine driver APIs directly or pass CDO object through CDO parser library. The
CDO parser is an external component, and the AI Engine software uses the CDO
APIs.

Sample Work Directory Structure
===============================

components/yocto/workspace/sources/aie-matrix-multiplication/
├── aie-matrix-multiplication - Cross-Compiled application binary
├── aie-matrix-multiplication.bif - Binary input file
├── aie-matrix-multiplication.pdi - AIE boot PDI
├── aie-pdi-gen.sh - script to generate BIF and PDI
├── clean-aie-work.sh - removes log files from Work directory
├── graph_aie_constraints.aiecst - PL constraint file
├── README
├── src
│   ├── kernels
│   │   ├── config.h - user defined parameters
│   │   ├── one_input.cc - first AIE kernel that reads data from PL stream
│   │   ├── one_output.cc - last AIE kernel that sends data to PL stream
│   │   └── two_inputs.cc - subsequent kernels that circulated data and output
│   ├── kernels.h - kernels declaration
│   ├── Makefile.Linux - makefile to build the Linux graph control application.
│   ├── xgemm.cpp - main application runs on the APU
│   └── xgemm.h - dataflow graph definition
└── Work
    ├── aie
    │   ├── 0_0
    │   │   └── Release
    │   │       └── 0_0 - AIE ELF. Naming convention is <COLUMN_NUMBER>_<ROW_NUMBER> - 1
    │   ├── 0_1
    │   │   └── Release
    │   │       └── 0_1
    │   ├── 0_2
    │   │   └── Release
    │   │       └── 0_2
    │   ├── 0_3
    │   │   └── Release
    │   │       └── 0_3
    .   .
    .   .
    .   .
    .   .
    │   ├── 9_5
    │   │   └── Release
    │   │       └── 9_5
    │   ├── 9_6
    │   │   └── Release
    │   │       └── 9_6
    │   └── 9_7
    │       └── Release
    │           └── 9_7
    └── ps
		├── cdo
		│   └── aie_cdo_init.bin
		└── c_rts
			└── aie_control.cpp



Run-time Execution
==================

At run-time, Linux application binary calls (UIO based) AI Engine userspace
driver and (tool generated) run time library, libcardano_api.so. AIE userspace
drivers abstract the libmetal and remoteproc APIs to handle runtime
configuration along with ELF loading.


                 +--------------------------+
                 |                          |
                 |       Application        |
                 |         binary           |
                 |                          |
                 +------------+-------------+
                              |
                              |
+-----------------------------v-------------------------------+

+-----------------------------+   +---------------------------+
|                             |   |                           |
|   AIE User Space Driver     |   |                           |
|                             |   |                           |
|    +-----------------+      |   |    libcardano_api.a       |
|    |       IO        |      |   |                           |
|    |                 |      |   |                           |
|    +-----------------+      |   |                           |
+-----------------------------+   +---------------------------+
               |
+-------------------------------------------------------------+
               v
   +-----------+------------+
   |       Libmetal         |
   +------------------------+
               |
+-------------------------------------------------------------+
               |
   +-----------v------------+
   |       Linux UIO        |
   +------------------------+
         |
   +-----v-----+
   |Xilinx AIE |
   |    UIO    |
   +-----------+

The libmetal provides the IO abstraction to the application, which allows the
application to be platform independent, ex standalone and Linux. So all the IO
in the driver is routed through libmetal, and libmetal can handle platform
specific part.

The Linux UIO subsystem allows to run IO from Linux userspace with a small
kernel module, including the MMIO and interrupt handling. The UIO interfaces are
based on the Linux sysfs, and the AI Engine driver stack utilizes this UIO
subsystem through libmetal, to enable platform-independent AI Engine software
stack. So the major part of the AIE specific code resides in the userspace as a
library, which can be reused between other software platforms such as baremetal.

Build Application Using PetaLinux
==================================

By default, the AIE matrix multiplication application is enabled. To
enable/disble, run:

```
petalinux-config -c rootfs

    [*] user packages --->
        [*] aie-matrix-multiplication
```
To rebuild the project run,

petalinux-build.

The generated FIT image will be in images/linux/image.ub.

Sample Output
=============

Follow PetaLinux boot process to launch the Linux on the target.
After you see the Linux login prompt, you can log in with user "root" and
password "root".

The AIE ELFs are installed in the `/lib/firmware/aie` directory.
We will need to go to `/lib/firmware/aie` to run the application.

```
root@xilinx-vck190-2020_1:~# cd /lib/firmware/aie
root@xilinx-vck190-2020_1:/lib/firmware/aie# aie-matrix-multiplication 
Initializing AIE driver...
metal: info:      Registered shmem provider linux_shm.
metal: info:      Registered shmem provider ion.reserved.
metal: info:      Registered shmem provider ion.ion_system_contig_heap.
metal: info:      Registered shmem provider ion.ion_system_heap.
metal: info:      device xilinx-aiengine in use by driver uio_dmem_genirq
metal: info:      metal_uio_dev_open: No IRQ for device f70a0000.aie-npi.
Initializing Cardano API...
Matrix size(int32): 800x800
PLIO MCDMA> allocated matrix A at 0x7f95c29000 (phy addr: 0x65900000)
PLIO MCDMA> allocated matrix B at 0x7f95e9a000 (phy addr: 0x65b71000)
PLIO MCDMA> allocated matrix B transpose at 0x7f9610b000 (phy addr: 0x65de2000)
PLIO MCDMA> allocated matrix C at 0x7f9637c000 (phy addr: 0x66053000)
PLIO MCDMA> allocated AIE result at 0x7f965ed000 (phy addr: 0x662c4000)
PLIO MCDMA> allocated APU result at 0x7f9685e000 (phy addr: 0x66535000)
PLIO MCDMA> allocated MM2S BD chain #0 at 0x7f96acf000 (phy addr: 0x667a6000)
PLIO MCDMA> allocated S2MM BD chain #0 at 0x7f96c91000 (phy addr: 0x66968000)
PLIO MCDMA> allocated MM2S BD chain #1 at 0x7f96b07400 (phy addr: 0x667de400)
PLIO MCDMA> allocated S2MM BD chain #1 at 0x7f96c97400 (phy addr: 0x6696e400)
PLIO MCDMA> allocated MM2S BD chain #2 at 0x7f96b3f800 (phy addr: 0x66816800)
PLIO MCDMA> allocated S2MM BD chain #2 at 0x7f96c9d800 (phy addr: 0x66974800)
PLIO MCDMA> allocated MM2S BD chain #3 at 0x7f96b77c00 (phy addr: 0x6684ec00)
PLIO MCDMA> allocated S2MM BD chain #3 at 0x7f96ca3c00 (phy addr: 0x6697ac00)
PLIO MCDMA> allocated MM2S BD chain #4 at 0x7f96bb0000 (phy addr: 0x66887000)
PLIO MCDMA> allocated S2MM BD chain #4 at 0x7f96caa000 (phy addr: 0x66981000)
PLIO MCDMA> allocated MM2S BD chain #5 at 0x7f96be8400 (phy addr: 0x668bf400)
PLIO MCDMA> allocated S2MM BD chain #5 at 0x7f96cb0400 (phy addr: 0x66987400)
PLIO MCDMA> allocated MM2S BD chain #6 at 0x7f96c20800 (phy addr: 0x668f7800)
PLIO MCDMA> allocated S2MM BD chain #6 at 0x7f96cb6800 (phy addr: 0x6698d800)
PLIO MCDMA> allocated MM2S BD chain #7 at 0x7f96c58c00 (phy addr: 0x6692fc00)
PLIO MCDMA> allocated S2MM BD chain #7 at 0x7f96cbcc00 (phy addr: 0x66993c00)
PLIO MCDMA> init_dmas: 0xa4000000, page size: 0x1000
PLIO MCDMA> init_dmas: 0xa4010000, page size: 0x1000
PLIO MCDMA> init_dmas: 0xa4020000, page size: 0x1000
PLIO MCDMA> init_dmas: 0xa4030000, page size: 0x1000
PLIO MCDMA> init_dmas: 0xa4040000, page size: 0x1000
PLIO MCDMA> init_dmas: 0xa4050000, page size: 0x1000
PLIO MCDMA> init_dmas: 0xa4060000, page size: 0x1000
PLIO MCDMA> init_dmas: 0xa4070000, page size: 0x1000
Resetting AIE array...
Initializing graph my_graph...
Configuring PL-Interface for graph my_graph...
Loading elfs of graph my_graph...
Resetting cores of graph my_graph...
Configuring DMAs of graph my_graph...
Set test-iterations for the core(s) of graph my_graph
Disabling core(s) of graph my_graph ...
Enabling core(s) of graph my_graph ...
Waiting for core(s) of graph my_graph to finish execution ...
core(s) are done executing...
Success!
```

Customizing and Rebuilding
==========================
Following are prerequisites if the user wants to make any changes in
software,
  * xgemm repo in Yocto workspace and aiecompiler for any software-related
    changes.
  * PetaLinux.

For cross-compiling the main appplication, sysroot is required.To get the Linux
sysroot, go to the PetaLinux project directory, run the following command:

```
  $ petalinux-build --sdk
  $ petalinux-package --sysroot
```

The sysroot will be generated in
`images/linux/sdk/sysroots/aarch64-xilinx-linux/` directory.

Now, to pull the xgemm repository using PetaLinux, set the Yocto build tool as
devtool:

```
  $ petalinux-config
    Yocto Settings --->
    		Build tool --->
			(*) devtool
  $ petalinux-build -c aie-matrix-multiplication -x modify
```

The xgemm source files can be found in 
`components/yocto/workspace/sources/aie-matrix-multiplication/`

As mentioned earlier, the user can change the number of AIE cores utilized for
matrix multiplication. However, since the data memory immediately available to
the core is limited, reducing the number of cores utilized reduces the maximum
matrix size supported by the application. Within the config.h header file
NUM_HW_ROWS and NUM_HW_COLS macro can be set to change the number of cores
utilized. The maximum cores available are 400. With the changes made in the
application, care must be taken by the user that the newly generated
configuration is supported by the underlying hardware design.

To rebuild, go to the meta-user demo recipe files directory:
project-spec/meta-user-recipes-apps/aie-matrix-multiplication/files.
set the CARDANO_ROOT and then call AI engine compiler to build:
```
  export CARDANO_ROOT=<Root_To_Installed_Cardano_which_is_under_Vitis>
  source $CARDANO_ROOT/scripts/cardano_env.sh
  aiecompiler	--target=hw --constraints=graph_aie_constraints.aiecst
		--analyze-kernels=true --dataflow src/xgemm.cpp 

```
After the AIE application is customized and compiled, user can run
`clean-aie-work.sh` to clean the Work directory to remove unncessary files,
only leave the AIE kernels and the aie_control.cpp file.

The Linux AIE graph .cpp file is in the `src/` directory. After you build the
AIE application, if you need to build the Linux control application.

You can use the `Makefile.Linux` in the `aie-matrix-multiplication/files/src`
directory to build the Linux AIE graph control application.
You will need to specify CARDANO_ROOT and the Linux sysroot.

Use Makefile.Linux to rebuild the Linux AIE graph application:
```
  $ make -f Makefile.Linux \
    SYSROOT=<plnx_proj>/images/linux/sdk/sysroots/aarch64-xilinx-linux/ \
    CARDANO_ROOT=<cardano_root>
```

The generated Linux binary will be `aie-matrix-multiplication`.

The user can then rebuild petalinux.

NOTE:
 * No hardware change is supported in this version of Vivado for this design.

PDI Generation
=========================
Vivado can generate a boot PDI includesd PS, PL and NoC configuration with
"Generate bitstream" icon from Vivado GUI, but it will not includes the AIE
configuration and AIE ELFs.

In order to includes the AIE configuration and ELFs into the boot PDI, user
will need to update the BIF of the boot PDI generated by Vivado.
Vivado generate the boot PDI and the BIF in the hardware project's
"xilinx-vck190-2020.1.runs/impl_1/", e.g: 
 * project_1_wrapper.pdi
 * project_1_wrapper.pdi.bif

User will need to generate the AIE configuration CDO using aiecompiler first,
and update the bif to includes the AIE CDO and ELFs.

To generate the AIE CDO, user can go to the cardano generated AIE Work/
directory which is generated while compiling the AIE application:
```
  cd Work/ps/cdo/
```
Source cardano settings,
```
  export CARDANO_ROOT=<Installed_Vitis>/cardano/
  source $CARDANO_ROOT/scripts/cardano_env.sh
```
Run the ./generateAIEConfig from this directory. It will generate aie_cdo.bin
file.
And then please add the following configuration partitions to the boot PDI
BIF file generated by Vivado to includes the AIE configuration and ELFs:
```
 partition
  {
   id = 0x10
   type = cdo
   file = <AIE_APP>/Work/ps/cdo/aie_cdo_init.bin
  }
  partition
  {
   id = 0x12
   core = aie
   file = <AIE_APP>/Work/aie/0_0/Release/0_0
  }
  ...
  partition
  {
   id = 0x12
   core = aie
   file = <AIE_APP>/Work/aie/<C>_<R>/Release/<C>_<R>
  }

```

And then use bootgen to generate the PDI, you can source vitis settings to use
bootgen. E.g.
```
  bootgen -w -arch versal -image <BIF> -o <PDI>
```

The "aie-matrix-multiplication/aie-pdi-gen.sh" gives an example on how to
generate a boot PDI to include AIE configuration and ELFs or a partial PDI which
only contains AIE configuration and ELFs.
The `aie-matrix-multiplication/aie-matrix-multiplication.bif` gives an example
of a boot PDI BIF to contains AIE.

References
==========
[1] Vivado User Guide - for hardware related design.
[2] Vitis User Guide - for AIE application.
[3] Versal TRM - General subsystem overview.