- Table of contents
- Project Brief
- Personal Information
- Individual Contributions
- Repo Structure
- Single Cycle Design
- Pipeline Design
- Data Memory Cache Design
- Tests
Note: The
rtl
andtest
folders for all versions can be found in relevant branches, as explained in Repo Structure section
This project aims to build a reduced RISC-V RV32I Processor that is able to execute two programs, as specified in the project brief. Tests are taken to verify the processor design and F1 Algorithm.
Two additional stretch goals are specified for advanced implementation:
- Pipelined RV32I design.
- Data Memory Cache.
Through joint efforts, Group 17 has successfully implemented the F1 algorithm in assembly language, verified the program and RV32I design, implemented a pipelined RV32I, and included data memory cache.
Name | CID | Github | Link to Personal Statement | |
---|---|---|---|---|
Tingxu Chen | 2221097 | ccchloe510 | [email protected] | Personal Statement |
Yiyao Zhou | 2213516 | TaylorZYY | [email protected] | Personal Statement |
Haocheng Fan | 2262158 | franfafdaf | [email protected] | Personal Statement |
Guanxi Lu | 2233412 | lgxi24 | [email protected] | Personal Statement |
Version | Section | File Name | Tingxu Chen | Yiyao Zhou | Haocheng Fan | Guanxi Lu |
---|---|---|---|---|---|---|
SingleCycle_Ref | PC | pc_top.sv | X | P | ||
Instruction Memory | instr_mem.sv | X | ||||
Extend | extend.sv | P | X | |||
Register File | reg_file.sv | X | P | P | ||
Control Unit | ALU_decode.sv | p | X | |||
control_unit.sv | P | X | ||||
main_decode.sv | P | X | ||||
PCSrc_Decode | P | X | ||||
Data Memory | DataMemory.sv | X | P | P | ||
DataMux.sv | X | P | ||||
ALU | alu_top.sv | X | ||||
Testbench and Shell Script | doit.sh | X | ||||
top_tb.cpp | X | |||||
Debug and Testing | top.sv | P | X | |||
debug |
X | X | ||||
SingleCycle_F1 | Assembly Language | F1.s | U | U | U | X |
F1.mem | X | |||||
Testbench and Shell Script | top_tb.cpp | P | X | |||
Debug and Testing | debug |
X | X | |||
Pipelined_NOP_F1 | F1.s | U | U | |||
F1.mem | U | U | ||||
Pipeline_NOP_Ref | pdf.s | U | U | |||
pdf.hex | U | U | ||||
Pipelined_Ref | Pipelining | Stage1.sv | X | |||
Stage2.sv | X | |||||
Stage3.sv | X | |||||
Stage4.sv | U | X | ||||
Stage.sv | X | |||||
Hazard Unit | HazardUnit.sv | U | U | X | ||
Debug and Testing | debug |
X | ||||
Pipelined_F1 | Debug and Testing | debug |
X | |||
Cache_Unused_Direct&4way | Direct Mapped Cache | direct_mapped_cache.sv | U | U | ||
4-Way Associative Cache | cache.sv | U | U | |||
Cache_Ref_Direct_Mapped | direct_cache_memory.sv | X | ||||
debug |
X | |||||
Cache_Ref_2Way | 2-Way Block Size 1 Associative Cache | Cache.sv | X | |||
CacheMux.sv | X | |||||
debug |
X | |||||
Administrative | Repo Master | Repo Master |
P | X | ||
Joint Statement | Joint Statement |
P | X |
Note:
X
= Full Participation
P
= Partial Participation (modification or debug)
U
= Unused Version
Note: We would like to acknowledge that, due to inadvertent errors in version control, the commit history on the
top
branch may not accurately represent the contributions of the group members. Notably, some files from thetop
branch, which were subsequently merged into theSingleCycle_F1
,SingleCycle_Ref
,Pipelined_F1
, and other branches, were copied instead of being merged from the individual branches (FHC
,LGX
,ZYY
,CTX
). As a result, in the event of any conflicts, please refer to the table mentioned above for accurate attribution of work.
The group decided to control the version in the following way:
- Each member completes their own section in their own branch.
- One member merges the individual work together, creating
top.sv
for testing. - Members test and debug the
Single Cycle
version on thetop
branch. - Once the
Single Cycle
versions are completed, they are separately stored in a new branch. - Members continue developing the
Pipelined
andCache
versions on thetop
branch. - Once the
Pipelined
andCache
versions are completed, they are separately stored in a new branch. - The final version and
test
folders are completed.
The repo structure can be viewed in the picture below.
Note: Only the test
folder in SingleCycle_F1
and Pipelined F1
branches contains F1 Program.
This policy leads to both advantages and drawbacks:
-
During the individual work stage, all the group members work on separate branches, ensuring their work is not disrupted.
-
After all sections are merged, only one final version exists, and all members debug the same version.
-
Individual contributions may become unclear, with contributions possibly being attributed incorrectly to different team members due to the commit history, especially when all individual branches are merged. This issue may be mitigated with increased familiarity with Git commands.
-
Only one or two members can debug at a time, as simultaneous debugging by more members might cause version conflicts.
Lab4 requires group members to complete a reduced RV32I design, to execute a program that consists only two instructions, ADDI
and BNE
.
main:
addi t1, zero, 0xff # load t1 with 255
addi a0, zero, 0x0 # a0 is used for output
mloop:
addi a1, zero, 0x0 # a1 is the counter , init to 0
iloop:
addi a0, a1, 0x0 # load a0 with a1
addi a1, a1, 0x1 # increment a1
bne a1, t1, iloop # if a1 == 255, jump to iloop
bne t1, zero, mloop # else always branch to mloop
The design, as illustrated in the diagram below, adheres closely to the Project's framework. However, the ALU block is notably simplified, performing only addition (+
) and subtraction (-
) operations. This simplification results in a significantly reduced control logic.
Reduced RV32I CPU, Cited from Project Brief
In the project, the increase in the number of instructions to be implemented necessitates a more intricate control logic. This complexity arises from the introduction of additional control signals and the expansion in the bit number of some signals.
Moreover, a generalized logic must be established to accommodate various types of instructions, including those within the same category. For instance, the JALR
and ADDI
instructions are both of I-type, yet they exhibit significant functional differences. Consequently, it is imperative to maintain simplicity in the design (avoiding excessive use of multiplexers or additional blocks) while ensuring the accurate implementation of the desired instructions.
Additionally, the project incorporates a Data Memory
block to execute instructions like LW
and SW
. Following this modification, a memory map, as displayed below, is utilized as a reference for the design. This memory map delineates the allocation of memory for the Instruction Memory and Data Memory, and all designs are expected to adhere strictly to this layout.
Memory Map, Cited From Project Brief
The Project Brief provides a design example, as shown in the diagram below, which introduces several new control signals.
Sample RV32I CPU, Cited from Project Brief
However, it was discovered in practice that this structure is insufficient for implementing all the required signals. For example, the JALR
instruction necessitates the program counter (PC
) to become RS1 + ImmExt
, a functionality not supported by the current design. Furthermore, specific blocks like the Control Unit
and Data Memory
require refined designs, either to simplify the structure or to accommodate particular instructions, which will be elucidated in subsequent sections.
In light of these findings, we modified the design (as depicted in the diagram below) to provide an appropriate structure for executing all the instructions in our program. Detailed explanation of specific blocks will be provided in the sections below.
Single Cycle Design Overview
Among all the RV32I instructions, 18 are implemented, covering all 6 types of instructions. The List of Instructions is shown in the table below.
No | Type | Instruction | No | Type | Instruction |
---|---|---|---|---|---|
1 | R-type | ADD | 12 | S-type | SW |
2 | SUB | 13 | SB | ||
3 | AND | 14 | B-type | BEQ | |
4 | XOR | 15 | BNE | ||
5 | I-type | ADDI | 16 | U-type | LUI |
6 | SLLI | 17 | AUIPC | ||
7 | SRLI | 18 | J-Type | JAL | |
8 | ANDI | ||||
9 | LW | ||||
10 | LBU | ||||
11 | JALR |
The program counter (PC
) determines which instruction will be executed in each cycle. In this design, the Instruction Memory starts from BFC00000
and extends to BFC00FFF
, indicating that the initial value of PC (set to 0
in Lab4) should be BFC00000
.
Typically, the PC increments by 4 due to the nature of byte addressing. In our CPU, each memory location holds 1 byte of data/instruction (equivalent to 8 bits). Given that the instruction and data widths in the RV32I CPU are both 32 bits, occupying 4 memory locations, the PC must increase by 4 to execute the next instruction.
However, there are special cases in our design where the PC is relocated to a different value, as outlined in the table below.
Instruction | Operation on PC |
---|---|
BEQ | PC = PC + Imm12 |
BNE | PC = PC + Imm12 |
JAL | PC = PC + Imm20 |
JALR | PC = RS1 + Imm12 |
What's more, the rst
signal sets PC
to the start of the program. Once rst
is asserted, the value of PC
becomes BFC00000
Operations on the PC can be classified easily, allowing for straightforward implementation in our design.
- A MUX determines if a
Jump
orBranch
operation is required. - For
Branch
instructions, the operation is alwaysPC = PC + Imm12
. - For
Jump
instructions, the CPU needs to determine whether to addPC
orRS1
toImmExt
, with the value ofImmExt
varying based on the instruction type.
Program Counter
Instruction memory stores the executable instructions. As defined by the memory map, Instruction Memory spans 12 bits (from BFC00000
to BFC00FFF
). Consequently, the address length (A_length
) is set to 12 bits. Additionally, since the data follows byte-addressing, the data length (D_length
) is set to 8 bits.
In this design, the Instruction Memory comprises a memory array with a size of RD(Instr)
).
It is important to note that RISC-V is a byte-addressing processor employing a little-endian format, where the least significant byte is stored in the lower address. For instance, if the machine code is F1F2F3F4
, it is stored as follows:
Location: 00 01 02 03
Data: F4 F3 F2 F1
Therefore, the assembly of RD
from the memory array is executed using the following SystemVerilog assignment:
assign RD = {rom_array[A+3], rom_array[A+2], rom_array[A+1], rom_array[A]};
The Extend Unit is responsible for fetching parts of the instruction and extending them according to the Instruction Set Architecture (ISA). Among the six types of instructions in RV32I, the I-type, S-type, B-type, U-type, and J-type require extension.
Cited from RV32I ISA
There are two key points to note. Firstly, all immediates in RV32I are sign-extended. Secondly, the union of bits covered by the immediates is 31:7
; therefore, fetching these instruction bits would be sufficient.
Our group has designed a mapping for the ImmSrc
with corresponding extension operations, as demonstrated in the table below. Given that five types of instructions require extension, three bits of ImmSrc
are necessary.
ImmSrc[2:0] | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
---|---|---|---|---|---|---|---|---|
Type | I-type | S-type | B-type | J-type | U-type | Unoccupied | Unoccupied | Unoccupied |
Extension Operation | {{20{Instr[31]}}, Instr[31:20]} |
{{20{Instr[31]}}, Instr[31:25], Instr[11:7]} |
{{20{Instr[31]}}, Instr[7], Instr[30:25], Instr[11:8], 1'b0} |
{{12{Instr[31]}}, Instr[19:12], Instr[20], Instr[30:21], 1'b0} |
{Instr[31:12], {12{0}}} |
In the Extend
Unit, mapping the decoded ImmSrc
to the appropriate extension operations is straightforward.
The Control Unit is pivotal in determining the control signals based on instructions. It fetches three segments, Opcode
, funct3
, and func75
from Instr
, and a distinct signal Zero
from the ALU
. These four output signals collectively decide the control signals, determining the operations of other blocks in the processor.
The table below presents the control signals corresponding to the 18 instructions implemented in the RV32I CPU design. Each instruction type is distinguished by different colors. Among which, blue columns represent input signals, while grey columns (PCSrc
and ALUControl[2:0]
) indicate signals requiring two-layer decoding.
Control Signal Table
The Control Unit is formed of three decoders:
- The
Main Decoder
utilizesOpcode[5]
andfunct3
fromInstr
, decoding most signals and providing inputs for other decoders. - The
PCSrc Decoder
decodes thePCSrc
signal using theBranch
andJump
signals from theMain Decoder
, along with theZero
signal from theALU
. - The
ALU Decoder
decodes theALUControl[2:0]
signal. It receivesALUOp[1:0]
from theMain Decoder
, as well asOpcode[5]
,funct3
, andfunc75
fromInstr
.
The diagram below illustrates the hierarchical relationship among these three decoder units.
Single Cycle Processor Control Unit
"Don't Care" signals are uniquely processed since Verilator does not support them. In practical implementation, all "Don't Care" signals are set to 0. This does not impact the overall result, as these signals aren't expected to occur in any valid situations.
The ALUSrcA
, ALUSrcB
, ALUOp
, and ALUControl
signals are categorized as ALU signals.
ALUSrcA
and ALUSrcB
determine the values for SrcA
and SrcB
, the input signals to the ALU
. The potential inputs for SrcA
are PC
and RD1
, and for SrcB
, they are RD2
or ImmExt
. The decoding logic is as follows:
- R-type instructions use
RD1
andRD2
respectively forSrcA
andSrcB
. - I-type instructions use
RD1
andImmExt
. - S-type instructions use
RD1
andImmExt
. - B-type instructions, which compare two register values, use
RD1
andRD2
. - U-type instructions are more complex:
- For the
LUI
instruction, which loads the immediate,SrcA
is unused, andSrcB
selectsImmExt
. - The
AUIPC
instruction uses bothPC
andImmExt
.
- For the
- J-type instructions do not use
SrcA
orSrcB
, rendering bothALUSrcA
andALUSrcB
as "Don't Care"s.
Main Decoder
provides an initial logic to generate ALUOp
, varying for each instruction type. The association is shown in the following table, where I-type and S-type share the same ALUOp
due to similar ALU usage in Load
and Store
instructions.
Instruction Type | I&S | B | R | U | J |
---|---|---|---|---|---|
ALUOp[1:0] | 00 | 01 | 10 | 11 | XX |
The ALU Decoder
then deciphers the ALUControl
signal, with each value representing a specific operation. A special case is 110
, designated for "Select SrcB" and aligning with the LUI
instruction's sole selection of the sign-extended immediate.
ALUControl[2:0] | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
---|---|---|---|---|---|---|---|---|
ALU Operation | Add | Subtract | AND | OR | Shift Right | XOR | Select SrcB | Shift Left |
ResultSrc
, MemWrite
, RegWrite
, LdSrc
, and StSrc
are categorized as Register and Memory Signals.
The RV32I design offers three alternatives for the Result
signal written to the Register File:
- For instructions that use ALU,
ResultSrc
is00
, directly passing the ALU result to the register. - For
Load
instructions that fetch from memory,ResultSrc
is01
, writing data from memory to the register. - For
JAL
andJALR
instructions, whereRD = PC + 4
,ResultSrc
is10
, andPCPlus4
is written to the register.
MemWrite
and RegWrite
signals follow straightforward decode logic:
MemWrite
is set only forStore
instructions; otherwise, it is 0.RegWrite
is 0 for S-type and B-type instructions, which do not write to the register. When relevant,LdSrc
andStSrc
act asSELECT
bits of the MUX. The decoding logic for the twoLoad
andStore
instructions used is simple.
LdSrc
operates solely for Load
instructions, while StSrc
is specific to Store
instructions. In other scenarios, their values are considered "Don't Care"s:
LdSrc
equals 1 forLBU
instructions, otherwise indicatingLW
.StSrc
equals 1 forSB
instructions, otherwise indicatingSW
.
Branch
, Jump
, PCSrc
, and JalSrc
are categorized as Jump Signals.
Branch
and Jump
detect whether an instruction is a Jump or Branch type, setting their values to 1 accordingly.
The PCSrc
signal indicates if the PC
is to be redirected to a specific value. When Jump
is 1, so is PCSrc
. For B-type Instructions, the Zero
signal is also considered: PC
is redirected for BEQ
only if Zero
is True, and for BNE
, the opposite applies.
JalSrc
is specific to Branch
and Jump
instructions:
- Its value is 1 for
BEQ
,BNE
, andJAL
, indicatingPC
added toImmExt
. - For
JALR
,JalSrc
is 0, whereRS1
is added toImmExt
.
ImmSrc
is the sole Extend signal, dictating the extension methodology for the immediate part of the instruction. ImmSrc
varies across different instruction types, and its relationship with each type is detailed in the Extend Unit Section, and the table is also attached below. The table below outlines how ImmSrc
is assigned based on the instruction type:
ImmSrc[2:0] | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
---|---|---|---|---|---|---|---|---|
Type | I-type | S-type | B-type | J-type | U-type | Unoccupied | Unoccupied | Unoccupied |
Extension Operation | {{20{Instr[31]}}, Instr[31:20]} |
{{20{Instr[31]}}, Instr[31:25], Instr[11:7]} |
{{20{Instr[31]}}, Instr[7], Instr[30:25], Instr[11:8], 1'b0} |
{{12{Instr[31]}}, Instr[19:12], Instr[20], Instr[30:21], 1'b0} |
{Instr[31:12], {12{0}}} |
Data Memory facilitates the execution of Load
and Store
instructions. For Load
instructions, data from memory written to registers, whereas for Store
instructions, data from registers is written to memory.
The diagram below depicts the structure of Load
and Store
instructions. Load
instructions, which are I-type, read data from the memory location RS1 + Imm[11:0]
and write it to RD
, with funct3
differentiating various Load
types. Conversely, Store
instructions are S-type, where data from the register RS2
is written to the memory location RS1 + Imm[11:0]
, and here too, funct3
serves to distinguish among different Store
types.
Load & Store Structure, Cited from RV32I ISA
As indicated in the memory map, Data Memory spans from 00000000
to 0001FFFF
, necessitating an address length of 17 bits. Adhering to byte-addressing, the data length (DATA_WIDTH
) is also set to 8 bits. Consequently, Data Memory comprises a memory array of
For the standard LW
(load word) instruction, Data Memory transfers data from memory to register using a little-endian format. The reverse process is applied for the SW
(store word) instruction. However, the assembly language also requires the implementation of LBU
(load byte unsigned) and SB
(store byte). LBU
loads the least significant 8 bits, extended to 32 bits without sign extension, into the register. SB
alters only the specific memory block addressed by RS1 + Imm[11:0]
with the least significant 8 bits of RS2
, rather than the entire word.
To facilitate this, two additional control signals, LdSrc
and StSrc
, are introduced (as detailed in the Control Unit section). These figure out the exact Load
or Store
instruction being executed, ensuring the correct corresponding operation.
Further, the Single Cycle Data Memory is conceptualized as a primary memory with two auxiliary functional blocks. For Store
instructions, data passes through an additional Store
block before entering the Memory
block. Conversely, for Load
instructions, data traverses an extra Load
block before being written to registers.
The ALU is responsible for executing arithmetic and logic operations. In our design, it processes two inputs, SrcA
and SrcB
, to produce the output ALUResult
.
A MUX is responsible for selecting the appropriate values for SrcA and SrcB respectively. The possible inputs for SrcA
are PC
and RD1
, while for SrcB
, they are RD2
or ImmExt
. The control logic governing these selections is detailed in the preceding section.
Within the ALU, the operations are determined by the ALUControl[2:0]
signal, as depicted in the table below. The implementation of these operations follows a straightforward logic.
ALUControl[2:0] | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
---|---|---|---|---|---|---|---|---|
ALU Operation | Add | Subtract | AND | OR | Shift Right | XOR | Select SrcB | Shift Left |
The top-level design of the system serves as an integrative platform, combining all submodules and establishing the necessary bus connections, thereby exemplifying the concept of hierarchical design. The top-level design encompasses the following modules:
pc_top instr_mem regfile alu_top
control_unit extend DataMemory DataMux
Concurrently, assignment statements are utilized to map portions of the bus to specific logic signals. These mappings are highlighted in red in the accompanying diagram.
A C++ testbench is added to test the sesign. In this section we mainly discuss the testbench for Reference Program. The corresponding testbench incorporates three top-level signals: clk
, a0
, rst
.
In the testbench for the Reference Program, both clk
and rst
are initialized to a value of 1. Subsequently, at each clock cycle, rst
is set to 0, and the value of a0
is dumped to a VCD file. A boolean variable, plot
, is used to monitor the status of a0
. A non-zero value of a0
denotes the successful completion of the build process, thus plot
will enable the plotting on Vbuddy. Furthermore, the program is designed to terminate after a specific number of cycles, which varies depending on the version (either 960 or 1920 cycles).
The testbench for the F1 program differs from that of the Reference Program, details of which are elaborated in the following section.
The shell script employs Verilator to convert Verilog code into C++, which also includes the C++ testbench.
The roles of the shell script are outlined as follows:
- Clearing out data from previous simulations.
- Use Verilator to convert Verilog into C++.
- Build the C++ project.
- Executing the simulation file.
The reference program is designed to build and display the probability distribution function (PDF) for four signals, employing an algorithm structured as Initialisation - Build - Display.
- During the Initialisation stage, each
a1
value (representing a "bin") is reset, preparing the bins for data accumulation. - The Build stage involves iterating through the data array. Here, counts are added to respective bins based on the data value, e.g., for a data value of
EE
, the count in theEE
bin is incremented. This stage terminates when the count in any bin reachesmax_count
or upon completion of the iteration. - The Display stage sees the program iterating through the PDF array, loading count numbers into the
a0
register, and generating the corresponding output.
# PDF program
.text
.equ base_pdf, 0x100
.equ base_data, 0x10000
.equ max_count, 200
main:
JAL ra, init # jump to init, ra and save position to ra
JAL ra, build
forever:
JAL ra, display
J forever
init: # function to initialise PDF buffer memory
LI a1, 0x100 # loop_count a1 = 256
_loop1: # repeat
ADDI a1, a1, -1 # decrement a1
SB zero, base_pdf(a1) # mem[base_pdf+a1) = 0
BNE a1, zero, _loop1 # until a1 = 0
RET
build: # function to build prob dist func (pdf)
LI a1, base_data # a1 = base address of data array
LI a2, 0 # a2 = offset into of data array
LI a3, base_pdf # a3 = base address of pdf array
LI a4, max_count # a4 = maximum count to terminate
_loop2: # repeat
ADD a5, a1, a2 # a5 = data base address + offset
LBU t0, 0(a5) # t0 = data value
ADD a6, t0, a3 # a6 = index into pdf array
LBU t1, 0(a6) # t1 = current bin count
ADDI t1, t1, 1 # increment bin count
SB t1, 0(a6) # update bin count
ADDI a2, a2, 1 # point to next data in array
BNE t1, a4, _loop2 # until bin count reaches max
RET
display: # function send PDF array value to a0 for display
LI a1, 0 # a1 = offset into pdf array
LI a2, 255 # a2 = max index of pdf array
_loop3: # repeat
LBU a0, base_pdf(a1) # a0 = mem[base_pdf+a1)
addi a1, a1, 1 # incr
BNE a1, a2, _loop3 # until end of pdf array
RET
In the F1 program, a similar algorithm is adopted, structured as Initialisation - Generation Light Up - Countdown. The detailed process is as follows:
- In the
Init
section, thresholds are established, including the fixed countdown value. - The
rst
section is responsible for resetting the variables. - During the
mainloop
loop, a 7-bit LFSR (Linear Feedback Shift Register), initialized at 0x7F, is generated. - In the
light_up
loop, the output registera0
incrementally increases from0000_0000
to0000_0001
, and ultimately to1111_1111
. Following each increment, the program enters thelightdelay
loop to perform a fixed-number countdown. - The
lightdelay
andfinal_random
loops conduct countdowns based on a fixed number and a pre-generated random number, respectively. - Upon completion of the countdown in
final_random
, the program returns to therst
section to reinitialize all values.
The algorithm's flow is visually represented in the diagram below, with straight lines illustrating unconditional jumps, and red curves indicating conditional branches.
Having tested the Single Cycle design on Reference program, the design is modified to accomodate the F1 program as explained above. There are a few points that differs F1 design from Ref Design.
The Reference Program uses load
and store
instructions such as LW
and SB
. However, the F1 design omits these instructions, resulting in no data being written to or read from the Data Memory.
This design choice is linked to the method of pseudo-random number generation in the F1 program. Before the trigger
signal is asserted, the program iterates through a loop generating random numbers. Once trigger
is asserted, a random number is generated.
One alternative could involve pre-generating all random numbers and storing them in the data memory. Subsequently, the program could map the PC
value to a specific location in the data memory to read the random number. However, PC
would be a constant value at the point where a random number is requested to be loaded. In this case, this approach would transform the random number into a constant, as a fixed one in random number series would exist for a fixed PC
.
Another strategy might be to generate a new random number each time PC
changes, making the random number a by-product of the varying PC
. However, this could significantly increase the program's complexity, an unnecessary complication.
For the data memory of the reference program, the diagram below illustrates the memory organization. The data_array
consists of pre-generated values from memory files like sine.mem
, while the pdf_array
represents the accumulation in each "bin." Separate memory segments are allocated for data_array
and pdf_array
.
Cited from Reference Program
A significant difference between the F1 Program and Reference Program is the trigger
signal. As described in the Project Brief,
The trigger signal is used to initiate the F1 light sequence in the RISC-V.
RV32I F1 Overview, Cited from Project Brief
In the F1.s
assembly program, the t0
register (x5
) is used to represent the Trigger
signal. At the beginning of the mainloop
, its value is compared with the pre-defined s1
register to determine if the trigger
signal has been activated, as depicted in the program extract below.
# F1.s
init:
addi s1, zero, 0x1 /* trigger destination set as 1 */
...
rst:
...
addi t0, zero, 0x0 /* t0 is trigger */
mainloop:
beq t0, s1, light_up /* trigger? */
...
In physical testing, the push-button switch on the Vbuddy functions as the trigger
. This is configured in the testbench with the line
top->trigger = vbdFlag();
Additionally, the t0
in reg_file.sv
is isolated for the representation of trigger signal. Therefore, an extra line is incorporated into reg_file.sv
to facilitate this:
assign register[5] = trigger;
The modified reg_file.sv
is illustrated as the diagram below.
As mentioned earlier, the F1 Program employs a distinct testbench from that of the Reference Program. In this testbench, the 32-bit output a0
is transformed into an 8-bit data_out
by masking the top 24 bits. Subsequently, the vbdBar()
function is invoked to display data_out
on the neopixel, which is implemented using the following code snippet
uint32_t value_32bit = top->a0; // Display F1 Light, toggle neopixel
uint8_t data_out = static_cast<uint8_t>(value_32bit & 0xFF); // Masking to get the lowest 8 bits
vbdBar(data_out & 0xFF);
Pipelining is a highly useful and widely applied technique in processor design. Through pipelining, performance can be greatly enhanced. Our pipelined processor design divides the single-cycle processor into five stages:
- F: Fetch Instruction
- D: Decode
- E: Execute ALU
- M: Memory Read and Write
- W: Write Register
It's worth mentioning that instructions will not necessarily use all of these five stages. Pipelining executes different stages for different instructions in the same cycle. For example, after the first instruction completes the F stage, the second instruction executes its F stage at the same time as the first instruction executes its D stage.
Although this technique seems simple, it also raises problems: data hazards and control hazards can exist. A straightforward way to deal with hazards is to add NOP
instructions between the "useful" ones, which can reduce performance. A more advanced method is to introduce a Hazard Unit to implement Forwarding, Stall, and Flush.
The following diagram (modified based on H&H Digital Design and Computer Architecture, RISC-V Edition: RISC-V Edition) illustrates the Pipelined Processor design with a Hazard Unit, as implemented in our project.
The pipelined version is applied to both the F1 Program and the Reference Program.
Forwarding is necessary when the result of an instruction is used by subsequent instructions before it has been written to the register. Consider the following program snippet as an example, where the value of s8
needs to be forwarded:
add s8, s4, s5
sub s2, s8, s3
or s9, s6, s8
and s7, s8, t2
Forwarding can be effectively implemented by integrating MUXes before the ALU to select its operands. In practice, the implementation of this logic can be structured as follows:
//forwarding
if (((Rs1E == RdM) & RegWriteM) & (Rs1E != 0)) ForwardAE = 2'b10; // Forward from Memory stage
else if (((Rs1E == RdW) & RegWriteW) & (Rs1E != 0)) ForwardAE = 2'b01; // Forward from Writeback stage
else ForwardAE = 2'b00; // No forwarding (use RF output)
A Stall is necessary when a subsequent instruction requires data from the destination register of a previous instruction. For instance, in the program below, s7
needs to be stalled. This means that the second line must wait for the first line to finish in order to receive the correct data:
lw s7, 40(s5)
and s8, s7, t3
or t2, s6, s7
sub s3, s7, s2
To support stalls, enable inputs (EN) are added to the Fetch and Decode pipeline registers, along with a synchronous reset/clear (CLR) input for the Execute pipeline register. If data needs to be cleared, FlushE
is used to flush the data. In practice, the implementation of this logic can be structured as follows:
//stall
assign lwStall = ResultSrcE0 & ((Rs1D == RdE)|(Rs2D == RdE));
assign StallF = lwStall;
assign StallD = lwStall;
Flush operations are also useful for addressing control hazards, which occur when a pipelined processor is uncertain about the next instruction to fetch due to an unresolved branch decision. Initially, a prediction is made regarding whether the branch will be taken. If this prediction turns out to be incorrect, the results from the incorrectly predicted path must be discarded, a process known as incurring a branch misprediction penalty.
In practice, the implementation of this logic can be structured as follows:
//flush
assign FlushD = PCSrcE;
assign FlushE = lwStall | PCSrcE;
In CPU architecture, caches play a crucial role in enabling quick read/write operations, thereby enhancing performance. This is achieved by exploiting two key principles: temporal locality and spatial locality. Temporal locality is based on the assumption that if data is used once, it's likely to be accessed again soon. Conversely, spatial locality posits that when data is accessed, adjacent data is likely to be needed in the near future.
Our project has developed two distinct cache designs: Direct Mapped Cache and 2-Way Associative Cache. Notably, both designs utilize a single word as the block size, focusing primarily on leveraging temporal locality.
Further, as discussed earlier, our F1 program didn't use Data Memory. In this sense, only the Reference Program version using cache is developed.
-
Capacity: 8 words
-
Block Size: 1 word
-
Number of Sets: 8
-
Number of Ways: 1
-
Tag_WIDTH
: 29 bits -
Set_WIDTH
: 3 bits
-
Read Policy
- Hit: Data is directly read from the cache.
- Miss: Data is fetched from memory and then loaded into the cache.
-
Write Policy
Since a write-through policy is adopted, data is written simultaneously to both the cache and the memory. This approach simplifies implementation but may lead to reduced performance under high write volumes.
-
Overview
The cache is integrated with the memory system. Since "memory passes data to cache", implementing cache and memory as separate blocks would complicate the configuration where data can be transferred bidirectionally. The design is illustrated in the diagram below.
-
Hit Logic
A hit occurs when the
inputTag
matches the cache'stag
and thevalid
bit is set toTRUE
. The logic can be expressed as:
assign hit = ((tag[inputSet] == inputTag) && (valid[inputSet]));
-
Write Logic
Write operations realize to
SW
andSB
instructions. In our write-through approach, data is concurrently written to both the cache and memory. This method ensures data consistency between cache and memory but might impact performance under heavy write operations. The design updates thetag
andHit
signals simultaneously during write operations, and for Hit cases, the singals remain their previous values. -
Read Logic
Read operations realize
LW
andLBU
instructions. They are implemented using a state machine:- In the event of a Hit, the next state is
from_cache
. - In the event of a Miss, the next state is
from_memory
.
For Miss cases, in addition to updating the
tag
andHit
, the data is passed from memory to cache.Special attention is given to
LBU
andSW
instructions, which involve partial word read/write operations through byte addressing.Further, Unlike standard word-aligned addressing, RV32I employs byte addressing. Hence, our design adjusts the bit layout for
Set
andTag
to accommodate this difference. In the given example, the last two bits of the address are used for byte offset, allowing only00
. However, in our design, it allows for00
,01
,10
, or11
as possible values.Consequently, we allocate the last three bits to
Set
and the preceding 29 bits toTag
. - In the event of a Hit, the next state is
- Capacity: 16 words
- Block Size: 1 word
- Number of Sets: 8
- Number of Ways: 2
Tag_WIDTH
: 29 bitsSet_WIDTH
: 3 bits
-
Read Policy
- Hit: Data is directly read from the cache.
- Miss: Data is fetched from memory and then loaded into the cache.
-
Write Policy
Adopting a write-through policy, data gets written concurrently to both the cache and the memory. While this method eases implementation, it can potentially diminish performance, especially with a high volume of write operations.
-
Overview
The 2-way cache is designed with a connection to the data memory and a cache multiplexer. This connection serves to fetch data in the event of a miss during a read operation. Within the cache, each block is equipped with
tag
andvalid
components to determine whether the block's content is a hit. When a hit occurs (withhit
set to 1), the multiplexer selects the cache's output as the memory stage output. Conversely, when there's a cache miss (hit logic is 0), the output selection opts for the data memory's output, which is then loaded into the cache block. This configuration ensures efficient data retrieval and storage within the memory hierarchy of the system. -
Hit Logic
A hit occurs when the
inputTag
matches the cache'stag
and thevalid
bit is set toTRUE
. The logic can be expressed as:
if ((valid[inputSet][way]) && (tag[inputSet][way] == inputTag)) begin // find the block
-
Least Recently Used
The
lru
(Least Recently Used) logic determines which cache way has been more recently accessed. If way 1 is identified as more recently used (indicated bylru
=1), this suggests a higher probability of it being accessed again soon, due to temporal locality. Consequently, subsequent write or read operations will target the content in the alternative way, maintaining the sametag
andset
, but differing in the way used.
if (LdSrcM && hit) begin
lru[inputSet] <= !lru[inputSet]; // rest the least used
end
-
Write Logic
Write operations are executed for and
SB
instructions. In our write-through scheme, data is simultaneously written to both the cache and memory. This approach maintains data consistency between cache and memory, though it may affect performance during intensive write operations. The design involves updating thetag
andHit
signals concurrently during these write operations. In instances of a cache hit, these signals retain their previous values. -
Read Logic
When a hit occurs, the output is sourced from the cache memory. Conversely, in the event of a miss, data is loaded to the sepcific block in cache from data memory, and the output for the current cycle is derived from this data memory.
- Configure Vbuddy
- For Windows user, type
ls /dev/ttyUSB*
- For Mac user, type
ls /dev/tty.u*
- Copy the name of the USB device
- Go to
rtl/vbuddy.cfg
, paste the name, don't forget the line breaks
- Go to terminal, type
source doit.sh
- A
.vcd
file will be generated while Vbuddy demonstrating the test results
This program aims to generate the probability distribution function (PDF) for four distinct signals. The theoretical PDFs for these signals are depicted as follows:
sine.mem
: a sine wave signaltriangle.mem
: a triangular wave signalgaussian.mem
: a noise signal with a Gaussian distributionnoisy.mem
: a noisy sine wave signal
Sine |
Triangle |
Gaussian |
Noisy |
To change the displayed signal, please go to Data Memory and modify the file name on the line
initial $readmemh("sine.mem", data_array, 17'h10000);
Reference programme test results:
Sine |
Triangle |
Gaussian |
Noisy |
This program is designed to simulate the F1 light mechanism, as detailed in the previous section:
- Initially, all 8 lights illuminate simultaneously at a unit step (assumed as 0x8 in our program).
- Once all 8 lights are lit, the program initiates a random number countdown (we using a 7-bit random number, which may take longer than expected).
- Upon completion of the countdown, all lights turn off synchronously.
Attached below are the test results for SingleCycle_F1
. For the results (video & .vcd
file) of other branches, refer to the test
folders.
Video
SingleCycle_F1.mp4
Waveform Explained