packet-fifo-hps-fpga

Exchange data packets between HPS and FPGA on Intel Cyclone SoC

This is code to support the exchange of byte-based packets between CλaSH code
running on an FPGA and Linux running on the HPS of an Intel Cyclone SoC.
Communication is based on the Intel FPGA Avalon FIFO Memory Core, so it is
usable in any setting where such Avalon interfaces can be instantiated.

The layout of the repository is as follows:

quartus/
	A Quartus project to instantiate a working setup on a Terasic
	DE10-Standard Development Kit.
src/clash/
	The CλaSH code to support packet exchange, with some example programs.
src/hps/
	The usermode Linux code to support packet exchange with an example
	program and a testing program.

Apart from these things, you will also need a Linux kernel driver that is
available at <https://github.com/DigitalBrains1/altera-fifo-kmod>.

Note that throughput is currently low. The likely culprit is the programmed
I/O data transfer on the HPS side; transfers should be handled by a DMA
controller to achieve good throughput. This project is a first step towards a
performant implementation.


=========================
Using the Quartus project
=========================

The Quartus project is based on the "Golden Hardware Reference Design" for the
Terasic DE10-Standard dev kit, suitably adapted. The CλaSH code is not the
top-level design. Instead, the top-level, DE10_Standard_GHRD.v, expects a
module named packet_fifo with the expected ports. Look at the section about
CλaSH to learn more about generating the correct Verilog module. All that is
needed in Quartus is to add the Verilog file produced by CλaSH to the project.

Open quartus/soc_system.qsys with Quartus to launch the Intel Platform
Designer (QSys) to look at and change the configuration of the project.


=============================
Using the usermode Linux code
=============================

Note on licensing
-----------------

Most files in src/hps/ are dual-licensed under BSD and GPL. You can choose
whether you wish to use the BSD license or the GPL license, either GPL v2 or,
at your option, any later version. HOWEVER, the files uio_helper.{c,h} are GPL
only. This means the combined work can only be distributed under the GPL. The
BSD license only applies to the individual files which declare themselves
BSD-licensed. Linking uio_helper in makes it GPL-only.

Updating overlay.dts
--------------------

The first step is producing a devicetree overlay file that tells the kernel
about the attached Intel FPGA Avalon FIFO Memory Cores. A devicetree overlay
file matching the Quartus project is included at src/hps/overlay.dts, but if
QSys is used to change the project, it will have to be reconstructed.

Quartus can produce devicetrees for the systems designed in QSys, but they do
not seem to lead to a bootable Linux system. Instead, it seems one is supposed
to use this as a basis for writing the correct devicetree. The approach chosen
is that of devicetree overlays, which modify and extend an existing
devicetree.

Note that all devicetree-related commands need to be run from inside an Intel
FPGA SoC Embedded Command Shell. Depending on how it is installed, this might
be the following:

$ ~/intelFPGA/18.1/embedded/embedded_command_shell.sh

Quartus can generate a new devicetree as follows:

$ cd quartus
$ sopc2dts -i soc_system.sopcinfo --force-altr -o temp.dts

The resultant temp.dts is a full devicetree. We need only a portion of it.
Open your favorite editor, and select the section between

	hps_0_bridges: bridge@0xc0000000 {

		[...]

	}; //end bridge@0xc0000000 (hps_0_bridges)

including those lines themselves. Open src/hps/overlay.dts and edit it to
include this new section as follows:


	/dts-v1/ /plugin/;

	/ {
		compatible = "altr,socfpga-cyclone5";

		fragment@0 {
			target-path = "/soc";
			__overlay__ {
				#address-cells = <1>;
				#size-cells = <1>;

				hps_0_bridges: bridge@0xc0000000 {

					[...]

				}; //end bridge@0xc0000000 (hps_0_bridges)
			};
		};

	[... file continues, read on ...]

So to put it differently: the current src/hps/overlay.dts contains a section
called hps_0_bridges: bridge@0xc0000000 and this is the bit that we want to
work with. If we change something in QSys, sopc2dts can regenerate this as
part of a larger file, and we extract just this bit from that file and update
our src/hps/overlay.dts.

We need to change one more thing. Every node that generates interrupts has the
following property:

	interrupt-parent = <&hps_0_arm_gic_0>;

but the interrupt controller in the devicetree we are making an overlay for is
not called hps_0_arm_gic_0, it is simply called intc. There are two solutions.
If one of the devicetree parents already specified an interrupt-parent set to
intc, the interrupt-parent entries in the overlay can just be removed.[1] If
this is not the case, change all occurrences of that interrupt-parent property
to:

	interrupt-parent = <&intc>;

In the overlay provided in this project, the interrupt-parent property is
removed because /soc already provides the correct value. Of course, if you are
using a different devicetree as the basis, the interrupt controller might have
a different name.

Now we come to the second fragment in src/hps/overlay.dts. A devicetree allows
us to give short descriptive names, aliases, to nodes in the tree. The only
allowed characters in aliases are alphanumerics and the dash. Linux does not
enforce this character set restriction (it is in the Devicetree
specification), but does require that the alias ends in a decimal number. Any
aliases that do not end in a number are effectively ignored by Linux.
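
The two rules above can be checked mechanically. The following is a minimal
sketch, not part of this repository: alias_usable() is a hypothetical helper
that applies the character set as stated above (alphanumerics and the dash)
together with the Linux requirement that the name ends in a decimal digit.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of this repository.
 * Returns 1 if `name` consists only of alphanumerics and '-' and ends in
 * a decimal digit, 0 otherwise. */
int alias_usable(const char *name)
{
	size_t len = strlen(name);
	size_t i;

	if (len == 0)
		return 0;
	for (i = 0; i < len; i++) {
		unsigned char c = (unsigned char)name[i];

		if (!isalnum(c) && c != '-')
			return 0;
	}
	/* Linux needs a trailing number to derive the alias id. */
	return isdigit((unsigned char)name[len - 1]) != 0;
}
```

With this helper, "fifo-h2f0" is accepted, while "fifo-h2f" (no trailing
number) and "fifo_h2f0" (underscore) are rejected.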

We use aliases to conveniently find the FIFOs we are interested in. The
example programs in src/hps expect the aliases fifo-h2f0 and fifo-f2h0. h2f
carries packets from the HPS (Linux) to the FPGA (CλaSH). f2h carries packets
from FPGA to HPS. The /aliases node in the devicetree maps these aliases to
the full paths of the node they refer to. For the Quartus project included,
that boils down to the rest of overlay.dts looking as follows:

		fragment@1 {
			target-path="/aliases";
			__overlay__ {
				fifo-h2f0 = "/soc/bridge@0xc0000000/fifo@0x000001000";
				fifo-f2h0 = "/soc/bridge@0xc0000000/fifo@0x000002000";
			};
		};
	};


Applying the overlay to your system
-----------------------------------

Note that all devicetree-related commands need to be run from inside an Intel
FPGA SoC Embedded Command Shell. Depending on how it is installed, this might
be the following:

$ ~/intelFPGA/18.1/embedded/embedded_command_shell.sh

Building the overlay can be done through

$ cd src/hps
$ make overlay.dtb

Applying it can be done in multiple ways. This is one:

- Take the SD card containing your Linux install and insert it in your
  computer.
- Mount the partition that contains zImage and socfpga.dtb.

Use the following command to directly apply the overlay to socfpga.dtb:

$ fdtoverlay -o socfpga.new.dtb -i socfpga.dtb overlay.dtb

and copy this as the new socfpga.dtb back to the SD card.

If you use this method, you might find that U-Boot will not reserve enough
room for the now larger socfpga.dtb, and booting fails with "Bad Linux ARM
zImage magic!" because the larger file overwrote the start of the kernel
image. This can usually be corrected easily in U-Boot:

	U-Boot SPL 2013.01.01 (Jan 09 2017 - 17:37:20)
	BOARD : Altera SOCFPGA Cyclone V Board
	[...]

	U-Boot 2013.01.01 (Jan 09 2017 - 19:43:57)

	CPU   : Altera SOCFPGA Platform
	BOARD : Altera SOCFPGA Cyclone V Board
	[...]

	Hit any key to stop autoboot:  0
	SOCFPGA_CYCLONE5 # fatls mmc 0:1
	  4988952   zimage~
	      212   u-boot.scr
	  5071776   zimage
	  4694312   zimage.upstream
	    31483   socfpga.dtb~
	    34624   socfpga.dtb

	6 file(s), 0 dir(s)

	SOCFPGA_CYCLONE5 # printenv fdtaddr loadaddr
	fdtaddr=0x00000100
	loadaddr=0x8000
	SOCFPGA_CYCLONE5 # setenv loadaddr 0xc000
	SOCFPGA_CYCLONE5 # saveenv
	Saving Environment to MMC...
	Writing to MMC(0)... Timeout on data busy
	Timeout on data busy
	done
	SOCFPGA_CYCLONE5 # boot
	[...]

While we could actually move the loadaddr less (the last byte of the fdt is at
address 0x883f), I have moved it to 0xc000 so it will still fit even with
larger overlays.
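
The arithmetic behind this can be written down as a small check. This is a
sketch under the assumption, taken from the U-Boot session above, that the
devicetree blob is loaded at fdtaddr and the kernel image at loadaddr;
fdt_fits_below() is a hypothetical helper name.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if a devicetree blob of `fdt_size` bytes
 * loaded at `fdtaddr` ends at or before `loadaddr`, 0 if the two would
 * overlap. */
int fdt_fits_below(uint32_t fdtaddr, uint32_t fdt_size, uint32_t loadaddr)
{
	/* Address one past the last byte of the blob. */
	uint32_t fdt_end = fdtaddr + fdt_size;

	return fdt_end <= loadaddr;
}
```

For the 34624-byte socfpga.dtb in the listing, 0x100 + 34624 = 0x8840, so the
blob no longer fits below the original loadaddr of 0x8000 but fits easily
below 0xc000. The 31483-byte file in the same listing ends at 0x7bfb and
would have fit below 0x8000.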

The "Timeout on data busy" message seems to be harmless; the environment is
correctly stored on the SD card.

Using the example program
-------------------------

The included program is fpgaeth. It will open a Universal TUN/TAP device (as a
TAP device), and any packets arriving on that network interface are sent to
CλaSH. Any packets received from CλaSH are sent out to the TAP device. You
could use the regular Linux bridging tools to bridge the TAP to the Ethernet
connection of the DE10-Standard.

Invoking

$ make

will not only compile fpgaeth and test-fifo, but also rsync them to the
DE10-Standard over the network. Adapt the Makefile if your setup is different.

On the Linux system on the DE10-Standard, first load the altera_fifo kernel
module mentioned earlier on (see the kernel module documentation), and then:

$ ./fpgaeth tapfpga

Where tapfpga is the name of the TAP device to open. If this is a pre-existing
interface, it binds to that. Otherwise, an interface with that name is
automatically created. The following command creates an unbound, non-ephemeral
interface:

# ip tuntap add tapfpga mode tap

To bridge packets between the FPGA and the onboard Ethernet, either configure
this using your favorite tools, or directly use the low-level command line
tools as follows. A new enough BusyBox might support all the needed
functionality in its ip applet. Otherwise, install the iproute2 package, for
instance using "opkg install iproute2". Making sure a working version of these
tools is available is beyond the scope of this document.

# ip link add en-br type bridge
    <adds an Ethernet bridge which will bridge packets between all slave
    interfaces>
# ip link set en-br address 52:54:00:47:52:19
    <(optional) You might want a static MAC for your network interface>
# ip link set en-br up
    <Bring the bridge up>
# ip link set eth0 master en-br
    <Onboard ethernet added to bridge>
# ip link set eth0 up
    <Bring the slave up>
# ip link set tapfpga master en-br
    <TAP device added to bridge>
# ip link set tapfpga up
    <Bring the slave up>

Writing your own programs
-------------------------

Use the API provided by packet_fifo.h. In addition to the documentation there,
the example program fpgaeth shows much of how to use that API.

Because fpgaeth only exits when it receives a terminating signal, it does not
show the usual freeing of resources when exiting. Simply call close_rdfifo()
or close_wrfifo() when you are done with the FIFO to release all related
resources.

Interrupt-driven reading and writing of FIFOs is done through the Linux
kernel's Userspace I/O (UIO) subsystem.

Suspending execution and waiting for an interrupt is done by either a blocking
read() on `rdfifo_ctx->uio_fd` / `wrfifo_ctx->uio_fd` or a select() on that
fd. Note that `rdfifo_ctx` is the context that was created by init_rdfifo(),
and `wrfifo_ctx` the context created by init_wrfifo(). Reads from a UIO device
*must* always be 4 bytes long. Examples:

Blocking read():

	int32_t dummy;
	read(ctx->uio_fd, &dummy, 4);

With select():

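	fd_set read_fds;
	int fdmax;
	int32_t dummy;

	FD_ZERO(&read_fds);
	FD_SET(ctx->uio_fd, &read_fds);
	/* Highest fd in all sets, plus one. If you watch more fds, add
	 * them to the sets and take the maximum over all of them. */
	fdmax = ctx->uio_fd + 1;
	if (select(fdmax, &read_fds, NULL, NULL, NULL) > 0
	    && FD_ISSET(ctx->uio_fd, &read_fds)) {
		/* Consume the interrupt; the count itself is not needed. */
		read(ctx->uio_fd, &dummy, 4);
		...
	}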

The dummy value read is actually the number of interrupts that were caught so
far. In this application, it is not important and can be discarded. It does
need to be read from the device file for every interrupt that is handled.
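
The examples above ignore the return value of read(). A slightly more
defensive variant is sketched here; uio_wait_irq() is a hypothetical helper
that is not part of packet_fifo.h. It retries when a signal interrupts the
call and treats any read that is not exactly 4 bytes long as an error.

```c
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: block until the UIO device behind `uio_fd` reports
 * an interrupt. If `count` is non-NULL, the interrupt count is stored
 * there. Returns 0 on success, -1 on error or on a read that is not
 * exactly 4 bytes long. */
int uio_wait_irq(int uio_fd, int32_t *count)
{
	int32_t n;
	ssize_t r;

	do {
		r = read(uio_fd, &n, sizeof n);	/* always 4 bytes */
	} while (r < 0 && errno == EINTR);	/* retry after a signal */

	if (r != (ssize_t)sizeof n)
		return -1;
	if (count)
		*count = n;
	return 0;
}
```

A caller that does not care about the count can pass NULL, for instance
uio_wait_irq(ctx->uio_fd, NULL).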

Having covered /how/ to wait for an interrupt, it is critical to the correct
operation to know /when/ to wait for an interrupt. For FIFO reads, as long as
fifo_read() does not return the status code FIFO_NEED_MORE, do *not* read() or
select() `ctx->uio_fd`! This includes the very first time the FIFO is read.
*Only* when fifo_read() returns FIFO_NEED_MORE can the program be suspended by
a blocking read() or select(). There is a small chance that there is no new
data in the FIFO on resumption, so this should not be assumed to be the case.
Once resumed, keep calling fifo_read() until it once again returns
FIFO_NEED_MORE and only then switch to read() or select() again.
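
The control flow described above can be sketched as follows. To keep the
example self-contained, it runs against a small simulated FIFO rather than
the real API: sim_read() stands in for fifo_read(), NEED_MORE for
FIFO_NEED_MORE, and sim_wait_irq() for a blocking read() on `uio_fd`. All of
these names are hypothetical; see packet_fifo.h for the actual signatures.

```c
enum status { GOT_PACKET, NEED_MORE };	/* NEED_MORE ~ FIFO_NEED_MORE */

/* Simulated FIFO, for illustration only. */
struct sim_fifo {
	int pending;		/* packets readable right now */
	const int *arrivals;	/* packets delivered by each interrupt */
	int n_arrivals;
	int next_arrival;
	int waits;		/* how often the caller blocked */
};

/* Stand-in for fifo_read(). */
static enum status sim_read(struct sim_fifo *f)
{
	if (f->pending == 0)
		return NEED_MORE;
	f->pending--;
	return GOT_PACKET;
}

/* Stand-in for a blocking read() on uio_fd. Returns 0 once the simulation
 * has no interrupts left, so the example cannot hang. */
static int sim_wait_irq(struct sim_fifo *f)
{
	if (f->next_arrival >= f->n_arrivals)
		return 0;
	f->waits++;
	f->pending += f->arrivals[f->next_arrival++];
	return 1;
}

/* Receive up to `wanted` packets; returns the number received. */
int receive_packets(struct sim_fifo *f, int wanted)
{
	int got = 0;

	while (got < wanted) {
		if (sim_read(f) == GOT_PACKET) {
			got++;		/* keep reading, do not wait */
			continue;
		}
		/* NEED_MORE: only now is it correct to block. */
		if (!sim_wait_irq(f))
			break;
	}
	return got;
}
```

An interrupt that delivers zero packets models the case where there is no
new data on resumption: the loop simply sees NEED_MORE again and waits once
more. Note that the loop never waits before its first read attempt.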

For waiting for an interrupt signaling available space in the fifo_write()
case, the process is similar. Only when fifo_write() indicates a short write
(fewer bytes written than requested by `len`) should the `uio_fd` be used to
block. And there is a small chance there is still no write space available on
resumption.
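
The write side can be sketched in the same way. Again this uses a simulated
FIFO so that it is self-contained: sim_write() stands in for fifo_write()
and is assumed to return the number of bytes it accepted, and
sim_wait_space() stands in for blocking on `uio_fd`. These names and the
return convention are assumptions for illustration; how fifo_write()
actually reports a short write is documented in packet_fifo.h.

```c
#include <stddef.h>

/* Simulated write FIFO, for illustration only. */
struct sim_wrfifo {
	size_t space;	/* free bytes right now */
	size_t refill;	/* bytes freed per interrupt, must be non-zero */
	int waits;	/* how often the caller blocked */
};

/* Stand-in for fifo_write(): accepts at most f->space bytes and returns
 * the number of bytes accepted. */
static size_t sim_write(struct sim_wrfifo *f, const unsigned char *buf,
			size_t len)
{
	size_t n = len < f->space ? len : f->space;

	(void)buf;	/* the simulation does not store the data */
	f->space -= n;
	return n;
}

/* Stand-in for blocking on uio_fd until space is available. */
static void sim_wait_space(struct sim_wrfifo *f)
{
	f->waits++;
	f->space += f->refill;
}

/* Write all `len` bytes, blocking only after a short write. */
size_t send_all(struct sim_wrfifo *f, const unsigned char *buf, size_t len)
{
	size_t done = 0;

	while (done < len) {
		size_t want = len - done;
		size_t n = sim_write(f, buf + done, want);

		done += n;
		if (n < want)
			sim_wait_space(f);	/* short write: may block */
	}
	return done;
}
```

If a wakeup brings no new space, sim_write() accepts zero bytes, which is
again a short write, so the loop waits once more without any special
handling.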

The Linux kernel documentation has a UIO HOWTO which has more information on
using the subsystem from userspace. It is included in the kernel source code;
Documentation/driver-api/uio-howto.rst as of this writing, or [2] on the web.

Backpressure
------------

When backpressure is "On" in the Intel FPGA Avalon FIFO Memory Core as it is
in this project, a read from the data register of an empty FIFO or a write to
the data register of a full FIFO will *stall the bus*. On the CλaSH side, this
is by design and will neatly backpropagate. On the HPS side, however, the
whole CPU hangs until the register access successfully completes. The operating
system is not designed to deal with this, and long stalls will cause serious
issues. The solution is to only do a data register access when a read of the
FIFO level register indicates that it will succeed immediately.
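
As an illustration of this rule, a guarded read could look like the sketch
below. fifo_try_read_word() is a hypothetical helper, and `level_reg` and
`data_reg` are assumed to point at the memory-mapped fill level and data
registers of a read FIFO; the actual register offsets are not reproduced
here.

```c
#include <stdint.h>

/* Hypothetical helper. Reads one word from the FIFO only if the fill
 * level shows that the access will complete immediately.
 * Returns 1 and stores the word in *out if one was available, else 0. */
int fifo_try_read_word(volatile uint32_t *level_reg,
		       volatile uint32_t *data_reg, uint32_t *out)
{
	if (*level_reg == 0)
		return 0;	/* empty: a data access would stall the bus */
	*out = *data_reg;
	return 1;
}
```

The same pattern applies to writes: compare the fill level against the FIFO
depth first, and only touch the data register when there is room.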


====================
Using the CλaSH code
====================

As indicated above in the section about the Quartus project, the actual top
entity of the FPGA design is a file in the Quartus project. That file
instantiates a module which is the top-level module generated by CλaSH.
To create the ICMPEcho networking proof of concept, you should generate
Verilog as follows:

$ clash --verilog ICMPEcho.hs

and then add all the generated files (except any .manifest files) to the
Quartus project. The actual top entity, DE10_Standard_GHRD.v, will instantiate
the packet_fifo module as well as the soc_system module that was created using
Intel Platform Designer (QSys), plus some more details and wire them all
together.

The top entity produced by CλaSH is expected to have the following form:

{-# ANN f
    (Synthesize
        { t_name   = "packet_fifo"
        , t_inputs =
              [ PortName "clk"
              , PortName "rst_n"
              , PortName "fpga_debounced_buttons"
              , avalonMasterExtInputNames "fifo_f2h_in_mm_"
              , avalonMasterExtInputNames "fifo_h2f_out_mm_"
              ]
        , t_output =
              PortProduct ""
                [ PortName "fpga_led_internal"
                , avalonMasterExtOutputNames "fifo_f2h_in_mm_"
                , avalonMasterExtOutputNames "fifo_h2f_out_mm_"
                ]
        }) #-}
f
    :: Clock System
    -> "rst_n" ::: Signal System Bool
    -- ^ Active-low asynchronous reset
    -> "fpga_debounced_buttons" ::: Signal System (BitVector 3)
    -> "fifo_f2h_in" ::: PacketWriterExtInput System
    -> "fifo_h2f_out" ::: PacketReaderExtInput System
    -> ( "fpga_led_internal" ::: Signal System (Unsigned 10)
       , "fifo_f2h_in" ::: PacketWriterExtOutput System
       , "fifo_h2f_out" ::: PacketReaderExtOutput System
       )

(The name annotations with (:::) are purely for documentation purposes and do
not affect HDL generation in this design.)

"rst_n" is connected to KEY0 on the dev board (debounced), and
"fpga_debounced_buttons" is connected to KEY1 through KEY3. The HPS reset
buttons on the board (WARM_RST and HPS_RST) do not reset the FPGA part. The
code in this repository does not actually use KEY1 through KEY3, but they are
included so you can quickly wire them up during debugging.

The same goes for "fpga_led_internal", which is connected to LEDR0 through
LEDR9 on the board, but not currently used.

"fifo_f2h_in" denotes the inputs and outputs connected to the FIFO named
"fifo-f2h" in the Linux code. The "_in" refers to the fact that this is the
input of the FIFO, not to the direction of individual signals. Similarly,
"fifo_h2f_out" refers to the "fifo-h2f" FIFO.

Along with the ICMPEcho proof-of-concept, there are some more entities with
the same structure in the Test namespace. In particular, Test.Avalon.Common,
Test.Avalon.PacketEcho.Common and Test.Avalon.PacketEcho.Plain could serve as
a starting point for copy-pasting to write new entities that have this same
shape and can thus work with the exact same Quartus project.

[1] https://www.kernel.org/doc/Documentation/devicetree/bindings/interrupt-controller/interrupts.txt
	"This [interrupt-parent] property is inherited, so it may be specified
	in an interrupt client node or in any of its parent nodes."

[2] https://www.kernel.org/doc/html/latest/driver-api/uio-howto.html