skvadrik/re2c

Add a HLS backend.

Opened this issue · 5 comments

Moving here the summary of an email thread started by Radhen Hendarmawan.

HLS specification: https://www.xilinx.com/support/documentation/sw_manuals/xilinx2018_3/ug902-vivado-high-level-synthesis.pdf

The main limitation is the absence of goto statement. We can, perhaps, workaround it by storing an explicit state variable and generating the program in the form of a loop over a giant switch on the state variable that contains all DFA states as its sub-cases. So, for example, in case of [a]* [b]* { return 0; } instead of the usual generated code:

yy1:
        ++YYCURSOR;
yy0:
        yych = *YYCURSOR;
        switch (yych) {
        case 'a':       goto yy1;
        case 'b':       goto yy4;
        default:        goto yy3;
        }
yy3:
        { return 0; }
yy4:
        yych = *++YYCURSOR;
        switch (yych) {
        case 'b':       goto yy4;
        default:        goto yy3;
        }

We could generate something like:

for (...) {
    switch (state) {
        case 0:
            yych = *YYCURSOR;
            switch (yych) {
                case 'a': state = 1; break;
                case 'b': state = 4; break;
                default:  state = 3; break;
            }
            break;
        case 1:
            ++YYCURSOR;
            state = 0;
            break;
        case 3:
            { return 0; }
            break;
        case 4:
            yych = *++YYCURSOR;
            switch (yych) {
                case 'b': state = 3; break;
                default:  state = 3; break;
            }
            break;
    }
}

Where to get HLS toolchain: https://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/vivado-design-tools/archive.html

Hello,
I don't think it is a good idea to add a hardware backend. For FPGAs and ASIC it is better to use NFAs than DFAs that re2c produces. There is a lot of literature about it (example ref).

Here is an example for implementing it in VHDL. (it probably has mistakes)

process(clock)
begin
	if rising_edge(clock) then
		....
		if (state_1 or state_4 or state_5)  and input = "01001011" then
			state_13 <= '1';
		end if
		if (state_2 and state_5) and input = "10011011" then
			state_14 <= '1';
		end if
		.....
	end if 
end process;

A more complete VHDL example

library IEEE;
use IEEE.std_logic_1164.all;

entity NFA is
	generic(
		N: positive := 8
	);
	port(
		input: in std_logic_vector (N-1 downto 0);
		match: out std_logic;
		clock: in std_logic
	);
end NFA;

architecture impl of NFA is
	signal state: std_logic_vector(20 downto 0);
begin

	process(clock)
	begin
		if rising_edge(clock) then
			....
			state(13) <= (state(1) or state(4) or state(5)) when input = "01001011" else '0';
			state(14) <= (state(2) or state(5)) when input = "10011011" else '0';
			match <= (state(3) or state(4)) when input = "00011101" else '0';
		end if;
	end process;
end impl;

If you can reach a state with more than 2 inputs,then it can be split into 2 states or use more complicated logic for the state.

Hi @algorithm314 , thanks for your input and for the example. I remember reading that DFA end up too large for FPGA. I don't plan on implementing HLS backend any time soon (I'd rather add support for a general-purpose language like Rust), but I thought the workaround for the absence of goto in the target language is interesting and can be reused in other cases. And also I didn't want to loose track of the email thread with Radhen.

@skvadrik it's not so much that the DFA couldn't fit on an FPGA (it all depends on what automaton on what particular device at what throughput target) but that it would not be able to utilize the massive parallelism the device has to offer. An NFA is a more natural fit to the FPGA that could represent the states as actual circuitry, firing synchronous to a clock. If using a DFA, where throughput in software depends more clearly on the # of transitions you can take per second, the FPGA will not be really able to beat a multi-core CPU (it has lower clock frequencies and there's communication overhead). Actually, in terms of raw performance it's pretty hard to beat a multi-socket server with an FPGA.

For an example of FPGA-based regex performance take a look at our work from way back for an idea: https://zistvan.github.io/doc/regex-fccm2016.pdf, we compare to RE2 (on a single thread) the paper, and even though the FPGA beats that, it's hard to beat a full socket CPU on all expressions if you would use several parallel instances of RE2. A much fresher paper, and with wider applicability, appeared this year at ASPLOS http://www.cs.virginia.edu/~gr5yf/Reza_files/papers/FlexAmata_ASPLOS2020.pdf which I think also answers some of the "compiling the expression down to hardware peculiarities".
Most importantly, even though C-based HLS (e.g., from Xilinx) looks in its syntax like C, it's execution model is fundamentally different. If anything, thinking of how you'd implement something in a functional language, will be closer to what will work well in HLS. For this reason, there is no Goto, because the code is not translated to a series of instructions but rather to a spatial register representation.

Using an FPGA for regex, nonetheless, can be beneficial if one targets 1) low power or embedded use-cases because they have much better energy efficiency than CPUs while offering similar performance and 2) it's possible to implement regex in a way that provides constant, guaranteed, processing rates regardless of the expression that is being implemented.

(sorry for the "uninvited" comment :) )

@zistvan Thanks for a very interesting and useful comment (and the links)!