CVA6 2023-2024 Hackathon

SUPAERO TEAM

Best improvement: 33.4% on clock cycles.

This repo contains the work done for the CVA6 2023-2024 Hackathon.

This CVA6 version is designed to accelerate MNIST execution through hardware.

The ./sw folder contains example applications, including a simple MNIST forward propagation for one digit.

How to use it

Note on building & flashing the bitstream

To build the core and flash it onto an FPGA, see the original README further down this document. It explains everything you need to know to build the bitstream and flash it onto an FPGA.

Take advantage of the design

BEFORE BUILDING THE BITSTREAM: be mindful of your layer sizes and of the maximum kernel size, because kernels are loaded into a tightly coupled coprocessor that uses SRAM. This SRAM should be as small as possible, so:

  • The values handled by your CNN should not exceed 8 bits (0-255).
  • You HAVE to specify the max kernel size in ./core/cvxif_example/cvxif_example_coprocessor.sv under the "Nb_of_regs" parameter.
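
As a rough illustration of these two constraints, the C sketch below checks on the software side that a kernel fits in the coprocessor registers and that every value fits in 8 bits. `NB_OF_REGS` is a hypothetical constant meant to mirror whatever you set for "Nb_of_regs"; the value 25, the helper names, and the assumption that one kernel value occupies one register are illustrative only and are not taken from the design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the "Nb_of_regs" parameter; 25 (a 5x5 kernel) is
 * only an example value, not the one used in this repository. */
#define NB_OF_REGS ((size_t)25)

/* Assumption: one kernel value is stored per coprocessor register. */
bool kernel_fits(size_t rows, size_t cols)
{
    /* Equivalent to rows * cols <= NB_OF_REGS, without overflow. */
    return rows != 0 && cols != 0 && rows <= NB_OF_REGS / cols;
}

/* Returns true when every value lies in the 8-bit range 0..255. */
bool values_fit_u8(const int *values, size_t count)
{
    assert(values != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        if (values[i] < 0 || values[i] > 255) {
            return false;
        }
    }
    return true;
}
```

With these example settings, `kernel_fits(5, 5)` holds while `kernel_fits(6, 5)` does not, which is the kind of check worth running before committing to a register count in hardware.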

Then, to take full advantage of this design, which is based on a home-made TPU (Tensor Processing Unit), you need to tell the compiler a few specific things using inline assembly:

  • First: load data into CVXIF using the LBC and LBCU instructions
  • Then: launch the MAC instruction (this reads the tensor operation result from CV-X-IF and clears all of its registers)
  • Finally: to push performance further, you can add load checks to avoid reloading a kernel that is already loaded
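
The sequence above can be modelled in plain C to check results on a host machine. The sketch below is a software reference model only: it does not emit the LBC, LBCU, or MAC instructions, whose encodings and inline-assembly operands are defined by the custom toolchain and are not reproduced here. The type and function names, the register count, the pairing of one kernel value with one input value, the idea that the kernel stays resident while input data is cleared by MAC, and the kernel-identifier load check are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical register count; it should mirror "Nb_of_regs". */
#define MODEL_NB_OF_REGS 25

typedef struct {
    uint8_t  kernel[MODEL_NB_OF_REGS]; /* assumed to persist across MACs */
    uint8_t  data[MODEL_NB_OF_REGS];   /* cleared by every MAC           */
    size_t   kernel_len;
    size_t   data_len;
    int      kernel_id;                /* -1 while no kernel is loaded   */
    unsigned kernel_loads;             /* number of real kernel loads    */
} tpu_model;

void tpu_reset(tpu_model *t)
{
    memset(t, 0, sizeof *t);
    t->kernel_id = -1;
}

/* Stands in for the kernel load step; skips the copy when the kernel
 * identified by `id` is already resident (the "load check"). */
void tpu_load_kernel(tpu_model *t, int id, const uint8_t *k, size_t len)
{
    assert(id >= 0 && k != NULL && len <= MODEL_NB_OF_REGS);
    if (t->kernel_id == id) {
        return;
    }
    memcpy(t->kernel, k, len);
    t->kernel_len = len;
    t->kernel_id = id;
    t->kernel_loads++;
}

/* Stands in for loading one 8-bit input value. */
void tpu_load_data(tpu_model *t, uint8_t value)
{
    assert(t->data_len < MODEL_NB_OF_REGS);
    t->data[t->data_len++] = value;
}

/* Stands in for the MAC step: returns the dot product of the kernel and
 * the loaded inputs, then clears the input registers. */
uint32_t tpu_mac(tpu_model *t)
{
    uint32_t acc = 0;
    size_t n = t->data_len < t->kernel_len ? t->data_len : t->kernel_len;
    for (size_t i = 0; i < n; ++i) {
        acc += (uint32_t)t->kernel[i] * t->data[i];
    }
    memset(t->data, 0, sizeof t->data);
    t->data_len = 0;
    return acc;
}
```

In this model, loading the kernel {1, 2, 3} and the inputs 4, 5, 6 and then calling `tpu_mac` returns 32. Requesting the same kernel identifier again leaves `kernel_loads` at 1, which is the saving the load check is meant to provide.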

In this example, we modified the MNIST program (./sw/app/mnist/NetworkPropagate.c).

More information is available in ./REPORT (see the PDF article).

Building the software binaries

RISC-V binaries are built using GCC and binutils. Make sure to use the right toolchain by building a Docker container as described below.

The custom binutils sources are included in this repo if you wish to build it yourself (./util/gcc-toolchain-builder/src/binutils-gdb/).

Once the binaries are built with the right toolchain, you can inspect the disassembly to check whether the ... instructions were successfully included.

ORIGINAL README : Getting started

To get more familiar with CVA6 architecture, a partial documentation is available:

https://cva6.readthedocs.io/en/latest/

Check out the repository and initialize all submodules:

$ git clone https://github.com/ThalesGroup/cva6-softcore-contest.git
$ cd cva6-softcore-contest
$ git submodule update --init --recursive

Do not forget to check all the details of the contest in Annonce RISC-V contest 2023-2024 v2.pdf.

This repository contains the files needed for the 2021-2022 contest focusing on energy efficiency. The 2020-2021 contest focusing on the performance can be retrieved in this repository under the cv32a6_contest_2020 GitHub tag.

Prerequisites

Questa tool

Questa Prime version 10.7 must be used to measure power during the simulations. Other simulation tools and versions will receive no support from the organization team.

Vitis/Vivado setting up

For the contest, the CVA6 processor will be implemented on the Zybo Z7-20 board from Digilent. This board integrates a Zynq 7000 FPGA from Xilinx. This requires the Vitis 2020.1 environment from Xilinx to be installed.

Furthermore, Digilent provides board files for each development board.

These files ease the creation of new projects with automated configuration of several complicated components such as Zynq Processing System and memory interfaces.

All guidelines to install Vitis 2020.1 and the Zybo Z7-20 board files are given at https://reference.digilentinc.com/reference/programmable-logic/guides/installation.

Check that your Linux distribution is supported by the Vitis 2020.1 environment.

Hardware

If you have not yet done so, start provisioning the following:

| Reference | URL | List price | Remark |
|---|---|---|---|
| Zybo Z7-20 | https://store.digilentinc.com/zybo-z7-zynq-7000-arm-fpga-soc-development-board/ | $299.00 | Zybo Z7-10 is too small for CVA6. |
| Pmod USBUART | https://store.digilentinc.com/pmod-usbuart-usb-to-uart-interface/ | $9.99 | Used for the console output |
| JTAG-HS2 Programming Cable | https://store.digilentinc.com/jtag-hs2-programming-cable/ | $59.00 | |
| Connectors | https://store.digilentinc.com/pmod-cable-kit-2x6-pin-and-2x6-pin-to-dual-6-pin-pmod-splitter-cable/ | $5.99 | At least a 6-pin connector Pmod is necessary; other references may offer it. |

FPGA platform

An FPGA platform running CV32A6 (CVA6 in 32b flavor) has been implemented on the Zybo Z7-20.

This platform includes a CV32A6 processor, a JTAG interface to run and debug software applications and a UART interface to display strings on hyperterminal.

The steps to run the RIPE application on CV32A6 FPGA platform are described below.

The JTAG-HS2 programming cable is originally intended for programming Xilinx FPGAs (bitstream loading) from a host PC.

In our case, we use this cable to program software applications on the CV32A6 instantiated in the FPGA through a PMOD connector.

Get the Zybo ready

  1. First, make sure the Digilent JTAG-HS2 debug adapter is properly connected to the PMOD JE connector and that the USBUART adapter is properly connected to the PMOD JB connector of the Zybo Z7-20 board.

  2. Generate the bitstream of the FPGA platform:

$ make cva6_fpga
  3. When the bitstream is generated, switch on the Zybo board and run:
$ make program_cva6_fpga

When the bitstream is loaded, the green DONE LED lights up.

  4. Open a hyperterminal configured on /dev/ttyUSB0 at 115200-8-N-1

Now, the hardware is ready and the hyperterminal is connected to the UART output of the FPGA. We can now start the software.

Get started with software environment

Building the docker image

Install Docker on the workstation.

A sw-docker docker container is used to ease the installation of RISC-V tools including the toolchain and OpenOCD.

  1. The sw-docker image can be built using the following command:
docker build -f Dockerfile --build-arg UID=$(id -u) --build-arg GID=$(id -g) -t sw-docker:v1 .

Using the docker image

The sw-docker Docker container includes the entire RISC-V compilation toolchain as well as the OpenOCD tool.

  1. To compile the software applications in sw/app, start the Docker container with the following command:
docker run -ti --privileged -v `realpath sw`:/workdir sw-docker:v1

The sw directory is mounted in the Docker container.

Once in the sw-docker Docker container, you are in the default directory /workdir which corresponds to the sw directory in the host OS.

user@[CONTAINER ID]:/workdir$ ll
total 24
drwxrwxr-x  5 user user 4096 Nov 23 10:57 ./
drwxr-xr-x  1 root root 4096 Nov 24 09:09 ../
-rw-rw-r--  1 user user 2620 Nov 23 10:57 README.md
drwxrwxr-x 18 user user 4096 Nov 23 10:59 app/
drwxrwxr-x  5 user user 4096 Nov 23 10:57 bsp/
drwxrwxr-x  2 user user 4096 Nov 23 10:57 utils/
  2. To compile the mnist application, run the following commands:
user@[CONTAINER ID]:/workdir$ cd app
user@[CONTAINER ID]:/workdir/app$ make mnist

At the end of the compilation, the mnist.riscv executable file should have been created.

  3. Then, in the Docker container, launch OpenOCD in the background:
user@[CONTAINER ID]:/workdir/app$ openocd -f openocd_digilent_hs2.cfg &
[1] 90
user@[CONTAINER ID]:/workdir/app$ Open On-Chip Debugger 0.11.0-dirty (2023-11-23-09:23)
Licensed under GNU GPL v2
For bug reports, read
    http://openocd.org/doc/doxygen/bugs.html
DEPRECATED! use 'adapter driver' not 'interface'
DEPRECATED! use 'adapter speed' not 'adapter_khz'
Info : auto-selecting first available session transport "jtag". To override use 'transport select <transport>'.
Info : clock speed 1000 kHz
Info : JTAG tap: riscv.cpu tap/device found: 0x249511c3 (mfg: 0x0e1 (Wintec Industries), part: 0x4951, ver: 0x2)
Info : datacount=2 progbufsize=8
Info : Examined RISC-V core; found 1 harts
Info :  hart 0: XLEN=32, misa=0x40141101
Info : starting gdb server for riscv.cpu on 3333
Info : Listening on port 3333 for gdb connections
Ready for Remote Connections
Info : Listening on port 6666 for tcl connections
Info : Listening on port 4444 for telnet connections
  4. In the Docker container (same terminal), launch gdb as follows:
user@[CONTAINER ID]:/workdir/app$ riscv-none-elf-gdb mnist.riscv
GNU gdb (GDB) 14.0.50.20230114-git
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "--host=x86_64-pc-linux-gnu --target=riscv-none-elf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from mnist.riscv...
(gdb)
  5. In gdb, connect to OpenOCD as follows:
(gdb) target remote :3333
Remote debugging using :3333
Info : accepting 'gdb' connection on tcp/3333
Warn : Prefer GDB command "target extended-remote 3333" instead of "target remote 3333"
0x00010ec4 in ?? ()
(gdb)
  6. In gdb, load mnist.riscv onto the CV32A6 FPGA platform with the load command:
(gdb) load
Loading section .vectors, size 0x80 lma 0x80000000
Loading section .init, size 0x60 lma 0x80000080
Loading section .text, size 0xe518 lma 0x800000e0
Loading section .rodata, size 0x11c2c lma 0x8000e5f8
Loading section .eh_frame, size 0x3c lma 0x80020224
Loading section .data, size 0x91c lma 0x80020260
Loading section .sdata, size 0x60 lma 0x80020b80
Start address 0x80000080, load size 134108
Transfer rate: 57 KB/sec, 9579 bytes/write.
(gdb) 
  7. Finally, in gdb, run the mnist application with the c command:
(gdb) c
Continuing.
(gdb) 
  8. On the hyperterminal configured on /dev/ttyUSB0 at 115200-8-N-1, you should see:
Expected  = 4
Predicted = 4
Result : 1/1
credence: 82
image env0003: 1731593 instructions
image env0003: 2353693 cycles

This result is obtained just after loading the FPGA bitstream. When MNIST is rerun, the system is no longer in its initial state. For instance, the cache is preloaded.
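
The two counters printed above can be combined into figures that are easier to compare between runs. The sketch below is a minimal example: the function names are assumptions, and the relative cycle reduction is only one common way to express an improvement between a baseline run and an optimized run.

```c
#include <assert.h>
#include <stdint.h>

/* Average number of clock cycles spent per counted instruction. */
double cycles_per_instruction(uint64_t cycles, uint64_t instructions)
{
    assert(instructions != 0);
    return (double)cycles / (double)instructions;
}

/* Relative cycle reduction, in percent, of an optimized run compared with
 * a baseline run of the same program. */
double cycle_reduction_percent(uint64_t baseline_cycles,
                               uint64_t optimized_cycles)
{
    assert(baseline_cycles != 0);
    return 100.0 * ((double)baseline_cycles - (double)optimized_cycles)
                 / (double)baseline_cycles;
}
```

With the values printed above, `cycles_per_instruction(2353693, 1731593)` is roughly 1.36.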

Simulation get started

Once the development environment is set up, you can run a simulation. Several software applications are available in the sw/app directory, including benchmark applications such as Dhrystone and CoreMark as well as other test applications.

To simulate a software application on CVA6 processor, run the following command:

$ make sim APP=<application to run>

For instance, to run the mnist application, run:

$ make sim APP=mnist

This command:

  • Compiles the CVA6 architecture and testbench with the Questa Sim tool.
  • Compiles the software application to be run on CVA6 with the RISC-V toolchain.
  • Runs the simulation.

The Questa tool opens with a waveform window. Some signals are displayed; you are free to add as many signals as you want.

In addition, all printf output from the software application is displayed in the transcript window of Questa Sim and saved to the uart file in the root directory.

Simulation may take a long time, so be patient.

Simulation is programmed to run 10000000 cycles but the result is displayed before the end of simulation.

For the mnist application, the result is displayed at the end of the simulation as follows:

Expected  = 4
Predicted = 4
Result : 1/1
credence: 82
image env0003: 1731593 instructions
image env0003: 2316653 cycles

The CVA6 software environment is detailed in the sw/app directory.

Synthesis and place and route get started

You can perform synthesis and place and route of the CVA6 architecture.

Initially, synthesis and place and route are carried out in "out of context" mode, which means that the CVA6 architecture is synthesized in the FPGA fabric without considering the external I/O constraints.

This gives an estimate of the logic resources used by CVA6 in the FPGA fabric as well as the maximum frequency of the CVA6 architecture. Both are major metrics for a computing architecture.

Command to run synthesis and place & route in "out of context" mode:

$ make cva6_ooc CLK_PERIOD_NS=<period of the architecture in ns>

For example, to clock the architecture at 50 MHz, run:

$ make cva6_ooc CLK_PERIOD_NS=20
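
CLK_PERIOD_NS is the clock period in nanoseconds, that is, 1000 divided by the target frequency in MHz. A small helper illustrating the conversion (the function name is an assumption, not part of the build flow):

```c
#include <assert.h>

/* Converts a target clock frequency in MHz to a period in nanoseconds. */
double clk_period_ns(double freq_mhz)
{
    assert(freq_mhz > 0.0);
    return 1000.0 / freq_mhz;
}
```

For a 50 MHz target, `clk_period_ns(50.0)` gives 20.0, which matches the value passed in the command above.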

By default, synthesis is performed in batch mode; however, it is possible to run this command with the Vivado GUI:

$ make cva6_ooc CLK_PERIOD_NS=20 BATCH_MODE=0

This command generates synthesis and place and route reports in fpga/reports_cva6_ooc_synth and fpga/reports_cva6_ooc_impl.