DarkRISCV
Opensource RISC-V implemented from scratch in one night!
Table of Contents
- Introduction
- Project Background
- Directory Description
- "src" Directory
- "sim" Directory
- "rtl" Directory
- "board" Directory
- Implementation Notes
- Development Tools
- Development Boards
- Acknowledgments
- References
Introduction
Developed in a magic night of 19 Aug, 2018 between 2am and 8am, the DarkRISCV softcore started as an proof of concept for the opensource RISC-V instruction set. The general concept was based in my other early 16-bit RISC processors and composed by a simplified two stage pipeline working with a two phase clock, where a instruction is fetch from a instruction memory in the first clock and then the instruction is decoded/executed in the second clock. The pipeline is overlapped without interlocks, in a way that the DarkRISCV can reach the performance of one clock per instruction most of time, except by a taken branch, where one clock is lost in the pipeline flush. Of course, in order to perform read operations in blockrams in a single clock, a two-phase clock is required, in a way that no wait states are required. As result, the code is very compact, with around three hundred lines of obfuscated but beautiful Verilog code. After lots of exciting sleepless nights of work and the help of lots of colleagues, the DarkRISCV reached a very good quality result, in a way that the code compiled by the standard GCC for RV32I worked fine.
Nowadays, a three stage pipeline working with a single clock phase is also available, resulting in a better distribution between the decode and execute stages. In this case the instruction is fetch in the first clock from a blockram, decoded in the second clock and executed in the third clock. As long the load instruction cannot load the data from a blockram in a single clock, the external logic inserts one extra clock in this case. Also, there are two extra clocks in order to flush the pipeline in the case of taken branches. The impcat of the pipeline flush depends of the compiler optimizations, but according to the lastest measurements, the 3-stage pipeline version can reach a instruction per clock (IPC) of 0.7, smaller than the measured IPC of 0.85 in the case of the 2-stage pipeline version. Anyway, with the 3-stage pipeline and some other expensive optimizations, the DarkRISCV can reach 100MHz in a low-cost Spartan-6, which results in more performance when compared with the 2-stage pipeline version, which is supported as reference and with smaller clocks (typically 50MHz).
Although the code is small and crude when compared with other RISC-V implementations, the DarkRISCV has lots of impressive features:
- implements most of the RISC-V RV32I instruction set (missing csr*, e* and fence*)
- works up to 100MHz (spartan-6) and sustain 1 clock per instruction most of time
- flexible harvard architecture (easy to integrate a cache controller)
- works fine in a real xilinx and lattice FPGAs
- works fine with gcc 9.0.0 for RISC-V (no patches required!)
- uses only around 1000-1500 LUTs (depending of enabled features)
- no interlock between pipeline stages
- BSD license: can be used anywhere with no restrictions!
Some extra features are planned for the furure or under development:
- interrupt controller (under tests)
- cache controller (under tests)
- gpio and timer (under tests)
- sdram controller w/ data scrambler
- branch predictor (under tests)
- ethernet controller (GbE)
- multi-threading (coarse-grained MT)
- multi-processing (SMP)
- network on chip (NoC)
- rv32e support (less registers, more threads)
- rv64i support (not so easy as appears...)
- 16x16-bit MAC instruction (under tests!)
- big-endian support
- user/supervisor modes
- debug support
And much other features!
Feel free to make suggestions and good hacking! o/
Project Background
The main motivation for the DarkRISCV was create a migration path for some projects around the 680x0/Coldfire family.
Although there are lots of 680x0 cores available, they are designed around different concepts and requirements, in a way that I found no much options regarding my requirements (more than 50MIPS with around 1000LUTs). The best option at this moment, the TG68, requires at least 2400LUTs (by removing the MUL/DIV instructions), and works up to 40MHz in a Spartan-6. As addition, the TG68 core requires at least 2 clock per instruction, which means a peak performance of 20MIPS. As long the 680x0 instruction is too complex, this result is really not bad at all and, at this moment, probably the best opensource option to replace the 68000.
Anyway, it does not match with the my requirements regarding space and performance. As part of the investigation, I tested other cores, but I found no much options as good as the TG68 and I even started design a risclized-68000 core, in order to try find a solution. Unfortunately, I found no much ways to reduce the space and increase the performance, in a way that I started investigate about non-680x0 cores. After lots of tests with different cores, I found the picorv32 core and the all the ecosystem around the RISC-V. The picorv32 is a very nice project and can peak up to 150MHz in a low-cost Spartan-6. Although most instructions requires 3 or 4 clocks per instruction, the picorv32 resembles the 68020 in some ways, but running at 150MHz and providing a peak performance of 50MIPS, which is very impressive.
Although the picorv32 is a very good option to directly replace the 680x0 family, it is not powerful enough to replace some Coldfire processors (more than 75MIPS). As long I had some good experience with experimental 16-bit RISC cores for DSP-like applications, I started code the DarkRISCV only to check the level of complexity and compare with my risclized-68000. For my surprise, in the first night I mapped almost all instructions of the rv32i specification and the DarkRISCV started to execute the first instructions correctly at 75MHz and with one clock per instruction, which not only resembles a fast and nice 68040 and can beat some Coldfires! wow! :)
After the success of the first nigth of work, I started to work in order to fix small details in the hardware and software implementation.
Directory Description
Although the DarkRISCV is only a small processor core, a small eco-system is required in order to test the core, including RISCV compatible software, support for simulations and support for peripherals, in a way that the processor core produces observable results. Each element is stored with similar elements in directories, in a way that the top level has the following organization:
- README.md: the top level README file (points to this document)
- LICENSE: unlimited freedom! o/
- Makefile: the show start here!
- src: the source code for the test firmware (boot.c, main.c etc in C language)
- rtl: the source code for the DarkRISCV core and the support logic (Verilog)
- sim: the source code for the simulation to test the rtl files (currently via icarus)
- board: support and examples for different boards (currently via Xilinx ISE)
- tmp: empty, but the ISE will create lots of files here)
The top level Makefile is responsible to build everything, but it must be edited first, in a way that the user at least must select the compiler path and the target board.
By default, the top level Makefile uses:
CROSS = riscv32-embedded-elf
CCPATH = /usr/local/share/gcc-$(CROSS)/bin/
ICARUS = /usr/local/bin/iverilog
BOARD = avnet_microboard_lx9
Tust update the configuration according to your system configuration, type make and hope everything is in the correct location! You probably will need fix some paths and set some others in the PATH environment variable, but it will eventually work.
And, when everything is correctly configured, the result will be something like this:
# make
make -C src all CROSS=riscv32-embedded-elf CCPATH=/usr/local/share/gcc-riscv32-embedded-elf/bin/ ARCH=rv32e HARVARD=1
make[1]: Entering directory `/home/marcelo/Documents/Verilog/darkriscv/v38/src'
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-gcc -Wall -I./include -Os -march=rv32e -mabi=ilp32e -D__RISCV__ -DBUILD="\"Sat, 30 May 2020 00:55:20 -0300\"" -DARCH="\"rv32e\"" -S boot.c -o boot.s
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-as -march=rv32e -c boot.s -o boot.o
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-gcc -Wall -I./include -Os -march=rv32e -mabi=ilp32e -D__RISCV__ -DBUILD="\"Sat, 30 May 2020 00:55:20 -0300\"" -DARCH="\"rv32e\"" -S stdio.c -o stdio.s
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-as -march=rv32e -c stdio.s -o stdio.o
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-gcc -Wall -I./include -Os -march=rv32e -mabi=ilp32e -D__RISCV__ -DBUILD="\"Sat, 30 May 2020 00:55:21 -0300\"" -DARCH="\"rv32e\"" -S main.c -o main.s
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-as -march=rv32e -c main.s -o main.o
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-gcc -Wall -I./include -Os -march=rv32e -mabi=ilp32e -D__RISCV__ -DBUILD="\"Sat, 30 May 2020 00:55:21 -0300\"" -DARCH="\"rv32e\"" -S io.c -o io.s
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-as -march=rv32e -c io.s -o io.o
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-gcc -Wall -I./include -Os -march=rv32e -mabi=ilp32e -D__RISCV__ -DBUILD="\"Sat, 30 May 2020 00:55:21 -0300\"" -DARCH="\"rv32e\"" -S banner.c -o banner.s
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-as -march=rv32e -c banner.s -o banner.o
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-cpp -P -DHARVARD=1 darksocv.ld.src darksocv.ld
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-ld -Tdarksocv.ld -Map=darksocv.map -m elf32lriscv boot.o stdio.o main.o io.o banner.o -o darksocv.o
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-ld: warning: section `.data' type changed to PROGBITS
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-objdump -d darksocv.o > darksocv.lst
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-objcopy -O binary darksocv.o darksocv.text --only-section .text*
hexdump -ve '1/4 "%08x\n"' darksocv.text > darksocv.rom.mem
#xxd -p -c 4 -g 4 darksocv.o > darksocv.rom.mem
rm darksocv.text
wc -l darksocv.rom.mem
1016 darksocv.rom.mem
echo rom ok.
rom ok.
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-objcopy -O binary darksocv.o darksocv.data --only-section .*data*
hexdump -ve '1/4 "%08x\n"' darksocv.data > darksocv.ram.mem
#xxd -p -c 4 -g 4 darksocv.o > darksocv.ram.mem
rm darksocv.data
wc -l darksocv.ram.mem
317 darksocv.ram.mem
echo ram ok.
ram ok.
echo sources ok.
sources ok.
make[1]: Leaving directory `/home/marcelo/Documents/Verilog/darkriscv/v38/src'
make -C sim all ICARUS=/usr/local/bin/iverilog HARVARD=1
make[1]: Entering directory `/home/marcelo/Documents/Verilog/darkriscv/v38/sim'
/usr/local/bin/iverilog -I ../rtl -o darksocv darksimv.v ../rtl/darksocv.v ../rtl/darkuart.v ../rtl/darkriscv.v
./darksocv
WARNING: ../rtl/darksocv.v:280: $readmemh(../src/darksocv.rom.mem): Not enough words in the file for the requested range [0:1023].
WARNING: ../rtl/darksocv.v:281: $readmemh(../src/darksocv.ram.mem): Not enough words in the file for the requested range [0:1023].
VCD info: dumpfile darksocv.vcd opened for output.
reset (startup)
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
vvvvvvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrr vvvvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrr vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrrrr vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrrrr vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrrrr vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrr vvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrr vvvvvvvvvvvvvvvvvvvvvv
rr vvvvvvvvvvvvvvvvvvvvvv
rr vvvvvvvvvvvvvvvvvvvvvvvv rr
rrrr vvvvvvvvvvvvvvvvvvvvvvvvvv rrrr
rrrrrr vvvvvvvvvvvvvvvvvvvvvv rrrrrr
rrrrrrrr vvvvvvvvvvvvvvvvvv rrrrrrrr
rrrrrrrrrr vvvvvvvvvvvvvv rrrrrrrrrr
rrrrrrrrrrrr vvvvvvvvvv rrrrrrrrrrrr
rrrrrrrrrrrrrr vvvvvv rrrrrrrrrrrrrr
rrrrrrrrrrrrrrrr vv rrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrr rrrrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrrrr rrrrrrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrrrrrr rrrrrrrrrrrrrrrrrrrrrr
INSTRUCTION SETS WANT TO BE FREE
boot0: text@0 data@4096 stack@8192
board: simulation only (id=0)
build: darkriscv fw build Sat, 30 May 2020 00:55:21 -0300
core0: darkriscv@100.0MHz with rv32e+MT+MAC
uart0: 115200 bps (div=868)
timr0: periodic timer=1000000Hz (io.timer=99)
Welcome to DarkRISCV!
> no UART input, finishing simulation...
echo simulation ok.
simulation ok.
make[1]: Leaving directory `/home/marcelo/Documents/Verilog/darkriscv/v38/sim'
make -C boards all BOARD=piswords_rs485_lx9 HARVARD=1
make[1]: Entering directory `/home/marcelo/Documents/Verilog/darkriscv/v38/boards'
cd ../tmp && xst -intstyle ise -ifn ../boards/piswords_rs485_lx9/darksocv.xst -ofn ../tmp/darksocv.syr
Reading design: ../boards/piswords_rs485_lx9/darksocv.prj
*** lots of weird FPGA related messages here ***
cd ../tmp && bitgen -intstyle ise -f ../boards/avnet_microboard_lx9/darksocv.ut ../tmp/darksocv.ncd
echo done.
done.
Which means that the software compiled and liked correctly, the simulation worked correctly and the FPGA build produced a image that can be loaded in your FPGA board with a make install (case you has a FPGA board and, of course, you have a JTAG support script in the board directory).
Case the FPGA is correctly programmed and the UART is attached to a terminal emulator, the FPGA will be configured with the DarkRISCV, which will run the test software and produce the following result:
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
vvvvvvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrr vvvvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrr vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrrrr vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrrrr vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrrrr vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrr vvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrr vvvvvvvvvvvvvvvvvvvvvv
rr vvvvvvvvvvvvvvvvvvvvvv
rr vvvvvvvvvvvvvvvvvvvvvvvv rr
rrrr vvvvvvvvvvvvvvvvvvvvvvvvvv rrrr
rrrrrr vvvvvvvvvvvvvvvvvvvvvv rrrrrr
rrrrrrrr vvvvvvvvvvvvvvvvvv rrrrrrrr
rrrrrrrrrr vvvvvvvvvvvvvv rrrrrrrrrr
rrrrrrrrrrrr vvvvvvvvvv rrrrrrrrrrrr
rrrrrrrrrrrrrr vvvvvv rrrrrrrrrrrrrr
rrrrrrrrrrrrrrrr vv rrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrr rrrrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrrrr rrrrrrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrrrrrr rrrrrrrrrrrrrrrrrrrrrr
INSTRUCTION SETS WANT TO BE FREE
boot0: text@0 data@4096 stack@8192
board: piswords rs485 lx9 (id=6)
build: darkriscv fw build Fri, 29 May 2020 23:56:39 -0300
core0: darkriscv@100.0MHz with rv32e+MT+MAC
uart0: 115200 bps (div=868)
timr0: periodic timer=1000000Hz (io.timer=99)
Welcome to DarkRISCV!
>
The beautiful ASCII RISCV logo was produced by Andrew Waterman! [6]
As long as the build works, it is possible start make changes, but my recommendation when working with soft processors is not work in the hardware and software at the same time! This means that is better freeze the hardware and work only with the software or freeze the software and work only with the hardware. It is perfectly possible make your research in both, but not at the same time, otherwise you find the DarkRISCV in a non-working state after software and hardware changes and you will not be sure where the problem is.
"src" Directory
The src directory contains the source code for the test firmware, which includes the boot code, the main process and auxiliary libraries. The code is compiled via gcc in a way that some auxiliary files are produced, for example:
- boot.c: the original C code for the boot process
- boot.s: the assembler version of the C code, generated automatically by the gcc
- boot.o: the compiled version of the C code, generated automatically by the gcc
When all .o files are produced, the result is linked in a darksocv.o ELF file, which is used to produce the darksocv.bin file, which is converted to hexadecimal and separated in ROM and RAM files (which are loaded by the Verilog code in the blockRAMs). The linker also produces a darksocv.lst with a complete list of the code generated and the darsocv.map, which shows the map of all functions and variables in the produced code.
The firmware concept is very simple:
- boot.c contains the boot code
- main.c contains the main application code (shell)
- banner.c contains the riscv banner
- stdio.c contains a small version of stdio
- io.c contains the IO interfaces
Extra code can be easily added in the compilation by editing the src/Makefile.
For example, in order to add a lempel-ziv code lz.c, it is necessary make the Makefile knows that we need the lz.s and lz.o:
OBJS = boot.o stdio.o main.o io.o banner.o lz.o
ASMS = boot.s stdio.s main.s io.s banner.s lz.s
SRCS = boot.c stdio.c main.c io.c banner.c lz.c
And add a "lz" command in the main.c, in a way that is possible call the function via the prompt. Alternatively, it is possible entirely replace the provided firmware and use your own firmware.
"sim" Directory
The simulation, in the other hand will show some waveforms and is possible check the DarkRISCV operation when running the example code.
The main simulation tool for DarkRISCV is the iSIM from Xilinx ISE 14.7, but the Icarus simulator is also supported via the Makefile in the sim directory (the changes regarding Icarus are active when the symbol ICARUS is detected). I also included a workaround for ModelSim, as pointed by our friend HYF (the changes regarding ModelSim are active when the symbol MODEL_TECH is detected).
The simulation runs the same firmware as in the real FPGA, but in order to improve the simulation performance, the UART code is not simulated, since the 115200 bps requires lots dead simulation time.
"rtl" Directory
The RTL directory contains the DarkRISCV core and some auxiliary files, such as the DarkSoCV (a small system-on-chip with ROM, RAM and IO), the DarkUART (a small UART for debug) and the configuration file, where is possible enable and disable some features that are described in the Implementation Notes section.
"board" Directory
The current supported boards are:
- avnet_microboard_lx9
- lattice_brevia2_xp2
- piswords_rs485_lx9
- qmtech_sdram_lx16
- qmtech_spartan7_s15
- xilinx_ac701_a200
The organization is self-explained, w/ the vender, board and FPGA model in the name of the directory. Each board directory contains the project files to be open in the Xilinx ISE 14.x, as well Makefiles to build the FPGA image regarding that board model. Although a ucf file is provided in order to generate a complete build with a UART and some LEDs, the FPGA is NOT fully wired in any particular configuration and you must add the pins that you will use in your FPGA board.
Anyway, although not wired, the build always gives you a good estimation about the FPGA utilization and about the timing (because the UART output ensures that the complete processor must be synthesized).
As long there are much supported boards, there is no way to test all boards everytime, which means that sometimes the changes regarding one board may affect other board in a wrong way.
Implementation Notes*
[*This section is kept for reference, but the description may not match exactly with the current code]
Since my target is the ultra-low-cost Xilinx Spartan-6 family of FPGAs, the project is currently based in the Xilinx ISE 14.7 for Linux, which is the latest available ISE version. However, there is no explicit reference for Xilinx elements and all logic is inferred directly from Verilog, which means that the project is easily portable to other FPGA families and easily portable to other environments, as can be observed in the case of Lattice XP2 support. Anyway, keep in mind that certain Verilog structures may not work well in some FPGAs.
In the last update I included a way to test the firmware in the x86 host, which helps as lot, since is possible interact with the firmware and fix quickly some obvious bugs. Of course, the x86 code does not run the boot.c code, since makes no sense (?) run the RISCV boot code in the x86.
Anyway, as main recomendation when working with softcores try never work in the hardware and in the software at the same time! Start with the minimum software configuration possible and freeze the software. When implementing new software updates, use the minium hardware configuration possible and freeze the hardware.
The RV32I specification itself is really impressive and easy to implement (see [1], page 16). Of course, there are some drawbacks, such as the funny little-endian bus (opposed to the network oriented big-endian bus found in the 680x0 family), but after some empirical tests it is easy to make work.
The funny information here is that, after lots of research regarding add support for big-endian in the DarkRISCV, I found no way to make the GCC generate the code and data correctly.
Another drawback in the specification is the lacking of delayed branches. Although i understand that they are bad from the conceptual point of view, they are good trick in order to extract more performance. As reference, the lack of delayed branches or branch predictor in the DarkRISCV may reduce between 20 and 30% the performance, in a way that the real measured performance may be between 1.25 and 1.66 clocks per instruction.
Although the branch prediction is not complex to implement, I found the experimental multi-threading support far more interesting, as long enable use the idle time in the branches to swap the processor thread. Anyway, I will try debug the branch prediction code in order to improve the single-thread performance.
The core supports 2 or 3-state pipelines and, although the main logic is almost the same, there are huge difference in how they works. Just for reference, the following section reflects the historic evolution of the core and may not reflect the current core code.
The original 2-stage pipeline design has a small problem concerning the ROM and RAM timing, in a way that, in order to pre-fetch and execute the instruction in two clocks and keep the pre-fetch continously working at the rate of 1 instruction per clock (and the same in the execution), the ROM and RAM must respond before the next clock. This means that the memories must be combinational or, at least, use a 2-phase clock.
The first solution for the 2-stage pipeline version with a 2-phase clock is the default solution and makes the DarkRISCV work as a pseudo 4-stage pipeline:
- 1/2 stage for instruction pre-fetch (rom)
- 1/2 stage for static instruction decode (core)
- 1/2 stage for address generation, register read and data read/write (ram)
- 1/2 stage for data write (register write)
From the processor point of view, there are only 2 stages and from the memory point of view, there are also 2 stages. But they are in different clock phases. In normal conditions, this is not recommended because decreases the performance by a 2x factor, but in the case of DarkRISCV the performance is always limited by the combinational logic regarding the instruction execution.
The second solution with a 2-stage pipeline is use combinational logic in order to provide the needed results before the next clock edge, in a way that is possible use a single phase clock. This solution is composed by a instruction and data caches, in a way that when the operand is stored in a small LUT-based combinational cache, the processor can perform the memory operation with no extra wait states. However, when the operand is not stored in the cache, extra wait-states are inserted in order to fetch the operand from a blockram or extenal memory. According to some preliminary tests, the instruction cache w/ 64 direct mapped instructions can reach a hit ratio of 91%. The data cache performance, although is not so good (with a hit ratio of only 68%), will be a requirement in order to access external memory and reduce the impact of slow SDRAMs and FLASHes.
Unfortunately, the instruction and data caches are not working anymore for the 2-stage pipeline version and only the instruction cache is working in the 3-stage pipeline. The problem is probably regarding the HLT signal and/or a problem regarding the write byte enable in the cache memory.
Both the use of the cache and a 2-phase clock does not perform well. By this way, a 3-stage pipeline version is provided, in order to use a single clock phase with blockrams.
The concept in this case is separate the pre-fetch and decode, in a way that the pre-fetch can be done entirely in the blockram side for the instruction bus. The decode, in a different stage, provides extra performance and the execute stage works with one clock almost all the time, except when the load instruction is executed. In this case, the external memory logic inserts one wait-state. The write operation, however, is executed in a single clock.
The solution with wait-states can be used in the 2-stage pipeline version, but decreases the performance too much. Case is possible run all versions with the same, clock, the theorical performance in clocks per instruction CPI), number of clocks to flush the pipeline in the taken branch (FLUSH) and memory wait-states (WSMEM) will be:
- 2-stage pipe w/ 2-phase clock: CPI=1, FLUSH=1, WSMEM=0: real CPI=~1.25
- 3-stage pipe w/ 1-phase clock: CPI=1, FLUSH=2, WSMEM=1: real CPI=˜1.66
- 2-stage pipe w/ 1-phase clock: CPI=2, FLUSH=1, WSMEM=1, real CPI=~2.00
Empiracally, the impact of the FLUSH in the 2-stage pipeline is around 20% and in the 3-stage pipeline is 30%. The real impact depends of the code itself, of course... In the case of the impact of the wait-states in the memory access regarding the load instruction, the impact ranges between 5 and 10%, again, depending of the code.
However, the clock in the case of the 3-stage pipeline is far better than the 2-stage pipeline, in special because the better distribuition of the logic between the decode and execute stages.
Currently, the most expensive path in the Spartan-6 is the address bus for the data side of the core (connected to RAM and peripherals). The problem regards to the fact that the following actions must be done in a single clock:
- generate the DADDR[31:0] = REG[SPTR][31:0]+EXTSIG(IMM[11:0])
- generate the BE[3:0] according to the operand size and DADDR[1:0]
In the case of read operation, the DATAI path includes also a small mux in order to separate RAM and peripheral buses, as well separate the diferent peripherals, which means that the path increases as long the number of peripherals and the complexity increases.
Of course, the best performance setup uses a 3-state pipeline and a single-clock phase (posedge) in the entire logic, in a way that the 2-stage pipeline and dual-clock phase will be kept only for reference.
The only disadvantage of the 3-state pipeline is one extra wait-state in the load operation and the longer pipeline flush of two clocks in the taken branches.
Just for reference, I registered some details regarding the performance measurements:
The current firmware example runs in the 3-stage pipeline version clocked at 100MHz runs at a verified performance of 62 MIPS. The theorical 100MIPS performance is not reached 5% due to the extra wait-state in the load instruction and 32% due to pipeline flushes after taken branches. The 2-stage pipeline version, in the other side, runs at a verified performance of 79MIPS with the same clock. The only loss regards to 20% due to pipeline flushes after a taken branch.
Of course, the impact of the pipeline flush depends also from the software and, as long the software is currently optimized for size. When compiled with the -O2 instead of -Os, the performance increase to 68MIPS in the 3-state pipeline and the loss changed to 6% for load and 25% for the pipeline flush. The -O3 option resulted in 67MIPS and the best result was the -O1 option, which produced 70MIPS in the 3-stage version and 85MIPS in the 2-stage version.
By this way, case the performance is a requirement, the src/Makefile must be changed in order to use the -O1 optimization instead of the -Os default.
And although the 2-stage version is 15% faster than the 3-stage version, the 3-stage version can reach better clocks and, by this way, will provide better performance.
Regarding the pipeline flush, it is required after a taken branch, as long the RISCV does not supports delayed branches. The solution for this problem is implement a branch cache (branch predictor), in a way that the core populates a cache with the last branches and can predict the future branches. In some inicial tests, the branch prediction with a 4 elements entry appers to reach a hit ratio of 60%.
Another possibility is use the flush time to other tasks, for example handle interrupts. As long the interrupt handling and, in a general way, threading requires flush the current pipelines in order to change context, by this way, match the interrupt/threading with the pipeline flush makes some sense!
With the option THREADING is possible test this feature.
The implementation is in very early stages of development and does not handle correctly the initial SP and PC. Anyway, it works and enables the main() code stop in a gets() while the interrupt handling changes the GPIO at a rate of more than 1 million interrupts per second without affecting the execution and with little impact in the performance! :)
The interrupt support can be expanded to a more complete threading support, but requires some tricks in the hardware and in the software, in order to populate the different threads with the correct SP and PC.
The interrupt handling use a concept around threading and, with some extra effort, it is probably possible support 4, 8 or event 16 threads. The drawback in this case is that the register bank increses in size, which explain why the rv32e is an interesting option for threading: with half the number of registers is possible store two more threads in the core.
Currently, the time to switch the context in the darkricv is two clocks in the 3-stage pipeline, which match with the pipeline flush itself. At 100MHz, the maximum empirical number of context switches per second is around 2.94 million.
NOTE: the interrupt controller is currently working only with the -Os flag in the gcc!
About the new MAC instruction, it is implemented in a very preliminary way with the OPCDE 7'b1111111 (this works, but it is a very bad decision!). I am checking about the possibility to use the p.mac instruction, but at this time the instruction is hand encoded in the mac() function available in the stdio.c (i.e. the darkriscv libc). The details about include new instructions and make it work with GCC can be found in the reference [5].
The preliminary tests pointed, as expected, that the performance decreases to 90MHz and although it was possible run at 100MHz with a non-zero timing score and reach a peak performance of 100MMAC/s, the small 32-bit accumulator saturates too fast and requries extra tricks in order to avoid overflows.
The mul operation uses two 16-bit integers and the result is added with a separate 32-bit register, which works as accumulator. As long the operation is always signed and the signal always use the MSB bit, this means that the 15x15 mul produces a 30 bit result which is added to a 31-bit value, which means that the overflow is reached after only two MAC operations.
In order to avoid overflows, it is possible shift the input operands. For example, in the case of G711 w/ u-law encoding, the effective resolution is 14 bits (13 bits for integer and 1 bit for signal), which means that a 13x13 bit mul will be used and a 26-bit result produced to be added in a 31-bit integer, enough to run 32xMAC operations before overflow (in this case, when the ACC reach a negative value):
# awk 'BEGIN { ACC=2**31-1; A=2**13-1; B=-A; for(i=0;ACC>=0;i++) print i,A,B,A*B,ACC+=A*B }'
0 8191 -8191 -67092481 2080391166
1 8191 -8191 -67092481 2013298685
2 8191 -8191 -67092481 1946206204
...
30 8191 -8191 -67092481 67616736
31 8191 -8191 -67092481 524255
32 8191 -8191 -67092481 -66568226
Is this theory correct? I am not sure, but looks good! :)
As complement, I included in the stdio.c the support for the GCC functions regarding the native *, / and % (mul, div and mod) operations with 32-bit signed and unsigned integers, which means true 32x32 bit operations producing 32-bit results. The code was derived from an old 68000-related project (as most of code in the stdio.c) and, although is not so faster, I guess it is working. As long the MAC instruction is better defined in the syntax and features, I think is possible optimize the mul/div/mod in order to try use it and increase the performance.
Here some additional performance results (synthesis only, 3-stage version) for other Xilinx devices available in the ISE for speed grade 2:
- Spartan-6: 100MHz (measured 70MIPS w/ gcc -O1)
- Artix-7: 178MHz
- Kintex-7: 225MHz
For speed grade 3:
- Spartan-6: 117MHz
- Artix-7: 202MHz
- Kintex-7: 266MHz
The Kintex-7 can reach, theorically 186MIPS w/ gcc -O1.
This performance is reached w/o the MAC and THREADING activated. Thanks to the RV32E option, the synthesis for the Spartan-3E is now possible with resulting in 95% of LUT occupation in the case of the low-cost 100E model and 70MHz clock (synthesis only and speed grade 5):
- Spartan-3E: 70MHz
For the 2-stage version and speed grade 2, we have less impact from the pipeline flush (20%), no impact in the load and some impact in the clock due to the use of a 2-phase clock:
- Spartan-6: 56MHz (measured 47MIPS w/ -O1)
About the compiler performance, from boot until the prompt, tested w/ the 3-stage pipeline core at 100MHz and no interrupts, rom and ram measured in 32-bit words:
- gcc w/ -O3: t=289us rom=876 ram=211
- gcc w/ -O2: t=291us rom=799 ram=211
- gcc w/ -O1: t=324us rom=660 ram=211
- gcc w/ -O0: t=569us rom=886 ram=211
- gcc w/ -Os: t=398us rom=555 ram=211
Due to reduced ROM space in the FPGA, the -Os is the default option.
In another hand, regarding the support for Vivado, it is possible convert the Artix-7 (Xilinx AC701 available in the ise/boards directory) project to Vivado and make some interesting tests. The only problem in the conversion is that the UCF file is not converted, which means that a new XDC file with the pin description must be created.
The Vivado is very slow compared to ISE and needs lots of time to synthesise and inform a minimal feedback about the performance... but after some weeks waiting, and lots of empirical calculations, I get some numbers for speed grade 2 devices:
- Artix7: 147MHz
- Spartan-7: 146MHz
And one number for speed grade 3 devices:
- Kintex-7: 221MHz
Although Vivado is far slow and shows pessimistic numbers for the same FPGAs when compared with ISE, I guess Vivado is more realistic and, at least, it supports the new Spartan-7, which shows very good numbers (almost the same as the Artix-7!).
That values are only for reference. The real values depends of some options in the core, such as the number of pipeline stages, who the memories are connected, etc. Basically, the best clock is reached by the 3-stage pipeline version (up to 100MHz in a Spartan-6), but it requires at lease 1 wait state in the load instruction and 2 extra clocks in the taken branches in order to flush the pipeline. The 2-state pipeline requires no extra wait states and only 1 extra clock in the taken branches, but runs with less performance (56MHz).
Well, my conclusion after some years of research is that the branch prediction solve lots of problems regarding the performance.
Development Tools
About the gcc compiler, I am working with the experimental gcc 9.0.0 for RISC-V. No patches or updates are required for the DarkRISCV other than the -march=rv32i. Although the fence*, e* and crg* instructions are not implemented, the gcc appears to not use of that instructions and they are not available in the core.
Although is possible use the compiler set available in the oficial RISC-V site, our colleagues from lowRISC project pointed a more clever way to build the toolchain:
Basically:
git clone --depth=1 git://gcc.gnu.org/git/gcc.git gcc
git clone --depth=1 git://sourceware.org/git/binutils-gdb.git
git clone --depth=1 git://sourceware.org/git/newlib-cygwin.git
mkdir combined
cd combined
ln -s ../newlib-cygwin/* .
ln -sf ../binutils-gdb/* .
ln -sf ../gcc/* .
mkdir build
cd build
../configure --target=riscv32-unknown-elf --enable-languages=c --disable-shared --disable-threads --disable-multilib --disable-gdb --disable-libssp --with-newlib --with-arch=rv32ima --with-abi=ilp32 --prefix=/usr/local/share/gcc-riscv32-unknown-elf
make -j4
make
make install
export PATH=$PATH:/usr/local/share/gcc-riscv32-unknown-elf/bin/
riscv32-unknown-elf-gcc -v
and everything will magically work! (:
Case you have no succcess to build the compiler, have no interest to change the firmware or is just curious about the darkriscv running in a FPGA, the project includes the compiled ROM and RAM, in a way that is possible examine all derived objects, sources and correlated files generated by the compiler without need compile anything.
Finally, as long the DarkRISCV is not yet fully tested, sometimes is a very good idea compare the code execution with another stable reference!
In this case, I am working with the project picorv32:
https://github.com/cliffordwolf/picorv32
When I have some time, I will try create a more well organized support in order to easily test both the DarkRISCV and picorv32 in the same cache, memory and IO sub-systems, in order to make possible select the core according to the desired features, for example, use the DarkRISCV for more performance or picorv32 for more features.
About the software, the most complex issue is make the memory design match with the linker layout. Of course, it is a gcc issue and it is not even a problem, in fact, is the way that the software guys works when linking the code and data!
In the most simplified version, directly connected to blockRAMs, the DarkRISCV is a pure harvard architecture processor and will requires the separation between the instruction and data blocks!
When the cache controller is activated, the cache controller provides separate memories for instruction and data, but provides a interface for a more conventional von neumann memory architecture.
In both cases, a proper designed linker script (darksocv.ld) probably solves the problem!
The current memory map in the linker script is the follow:
- 0x00000000: 4KB ROM
- 0x00001000: 4KB RAM
Also, the linker maps the IO in the following positions:
- 0x80000000: UART status
- 0x80000004: UART xmit/recv buffer
- 0x80000008: LED buffer
The RAM memory contains the .data area, the .bss area (after the .data and initialized with zero), the .rodada and the stack area at the end of RAM.
Although the RISCV is defined as little-endian, appears to be easy change the configuration in the GCC. In this case, it is supposed that the all variables are stored in the big-endian format. Of course, the change requires a similar change in the core itself, which is not so complex, as long it affects only the load and store instructions. In the future, I will try test a big-endian version of GCC and darkriscv, in order to evaluate possible performance enhancements in the case of network oriented applications! :)
Finally, the last update regarding the software included new option to build a x86 version in order to help the development by testing exactly the same firmware in the x86.
In a preliminary way, it is possible build the gcc for RV32E with the folllowing configuration:
git clone --depth=1 git://gcc.gnu.org/git/gcc.git gcc
git clone --depth=1 git://sourceware.org/git/binutils-gdb.git
git clone --depth=1 git://sourceware.org/git/newlib-cygwin.git
mkdir combined
cd combined
ln -s ../newlib-cygwin/* .
ln -sf ../binutils-gdb/* .
ln -sf ../gcc/* .
mkdir build
cd build
../configure --target=riscv32-embedded-elf --enable-languages=c --disable-shared --disable-threads --disable-multilib --disable-gdb --disable-libssp --with-newlib --with-arch-rv32e --with-abi=ilp32e --prefix=/usr/local/share/gcc-riscv32-embedded-elf
make -j4
make
make install
export PATH=$PATH:/usr/local/share/gcc-riscv32-embedded-elf/bin/
riscv32-embedded-elf-gcc -v
Currently, I found no easy way to make the GCC build big-endian code for RISCV. Instead, the easy way is make the endian switch directly in the IO device or in the memory region.
As long is not so easy build the GCC in some machines, I left in a public share the source and the pre-compiled binary set of GCC tools for RV32E:
https://drive.google.com/drive/folders/1GYkqDg5JBVeocUIG2ljguNUNX0TZ-ic6?usp=sharing
As far as i remember it was compiled in a Slackware Linux or something like, anyway, it worked fine in the Windows 10 w/ WSL and in other linux-like environments.
Development Boards
Currently, the following boards are supported:
- Avnet Microboard LX9: equipped with a Xilinx Spartan-6 LX9 running at 100MHz
- XilinX AC701 A200: equipped with a Xilinx Artix-7 A200 running at 90MHz
- QMTech SDRAM LX16: equipped with a Xilinx Spartan-6 LX16 running at 100MHz
- QMTech NORAM S15: equipped with a Xilinx Spartan-7 S15 running at 100MHz
- Lattice Brevia2 XP2: equipped with a Lattice XP2-6 running at 50MHz
- Piswords RS485 LX9: equipped with a Xilinx Spartan-6 LX9 running at 100MHz
The speeds are related to available clocks in the boards and different clocks may be generated by programming a clock generator. The Spartan-6 is found in most boards and the core runs fine at ~100MHz, regardless the frequency of the main oscillator (typically 50MHz).
All Xilinx based boards typically supports a 115200 bps UART for console, some LEDs for debug and on-chip 4KB ROM and 4KB RAM (as well the RESET button to restart the core and the DEBUG signals for an oscilloscope).
In the case of QMTECH boards, that does not include the JTAG neither the UART/USB port, and external USB/UART converter and a low-cost JTAG adapter can solve the problem easily!
The Lattice Brevia is clocked by the on-board 50MHz oscillator, with the UART operating at 115200bps and the LED and DEBUG ports wired to the on- board LEDs.
In the software side, a small shell is available with some basic commands:
- clear: clear display
- dump : dumps an area of the RAM
- led : change the LED register (which turns on/off the LEDs)
- timer : change the timer prescaler, which affects the interrupt rate
- gpio : change the GPIO register (which changes the DEBUG lines)
The proposal of the shell is provide some basic test features which can provide a go/non-go status about the current hardware status.
Useful memory areas:
- 4096: the start of RAM (data)
- 4608: the start of RAM (data)
- 5120: empty area
- 5632: empty area
- 6144: empty area
- 6656: empty area
- 7168: empty area
- 7680: the end of RAM (stack)
As long the DarkRISCV uses separate instruction and data buses, it is not possible dump the ROM area. However, this limitation is not present when the option HARVARD is activated, as long the core is constructed in a way that the ROM bus is conected to one bus from a dual-ported memory and the RAM bus is connected to a different bus from the same dual-ported memory. From the DarkRISCV point of view, they are fully separated and independent buses, but in reality they area in the same memory area, which makes possible the data bus change the area where the code is stored. With this feature, it will be possible in the future create loadable codes from the FLASH memory! :)
Acknowledgments
Special thanks to my old colleagues from the Verilog/VHDL/IT area:
- Paulo Matias (jedi master and verilog/bluespec/riscv guru)
- Paulo Bernard (co-worker and verilog guru)
- Evandro Hauenstein (co-worker and git guru)
- Lucas Mendes (technology guru)
- Marcelo Toledo (technology guru)
- Fabiano Silos (technology guru)
Also, special thanks to the "friends of darkriscv" that found the project in the internet and contributed in any way to make it better:
- Guilherme Barile (technology guru and first guy to post anything about the darkriscv! [2]).
- Alasdair Allan (technology guru, posted an article about the darkriscv [3])
- Gareth Halfacree (technology guru, posted an article about the DarkRISCV [4])
- Ivan Vasilev (ported DarkRISCV for Lattice Brevia XP2!)
- timdudu from github (fix in the LDATA and found a bug in the BCC instruction)
- hyf6661669 from github (lots of contributions, including the fixes regarding the AUIPC and S{B,W,L} instructions, ModelSIM simulation, the memory byte select used by store/load instructions and much more!)
- All other colleagues from github that contributed with fixes, corrections and suggestions.
Finally, thanks to all people who directly and indirectly contributed to this project, including the company I work for and all colleagues that tested the DarkRISCV.
References
[1] https://www.amazon.com/RISC-V-Reader-Open-Architecture-Atlas/dp/099924910X
[2] https://news.ycombinator.com/item?id=17852876
[3] https://blog.hackster.io/the-rise-of-the-dark-risc-v-ddb49764f392
[4] https://abopen.com/news/darkriscv-an-overnight-bsd-licensed-risc-v-implementation/
[5] http://quasilyte.dev/blog/post/riscv32-custom-instruction-and-its-simulation/
[6] https://github.com/riscv/riscv-pk/blob/master/bbl/riscv_logo.txt