In this part of our compiler writing journey, I've ported the compiler over to the ARM CPU on the Raspberry Pi 4.
I should preface this section by saying that, while I know MIPS assembly language quite well, I only knew a bit of x86-32 assembly language when I started this journey, and nothing about x86-64 nor ARM assembly language.
What I've been doing along the way is compiling example C programs down to an assembler with various C compilers to see what sort of assembly language they produce. That's what I've done here to write the ARM output for this compiler.
Firstly, ARM is a RISC CPU and x86-64 is a CISC CPU. There are fewer addressing modes on the ARM when compared to the x86-64. There are also other interesting constraints that occur when generating ARM assembly code. So I will start with the major differences, and leave the main similarities to later.
ARM has heaps more registers than x86-64. That said, I'm sticking with four
registers to allocate: r4
,r5
, r6
and r7
. We will see that r0
and
r3
get used for other things below.
On x86-64, we only have to declare a global variable with a line like:
.comm i,4,4 # int variable
.comm j,1,1 # char variable
and, later, we can load and store to these variables easily:
movb %r8b, j(%rip) # Store to j
movl %r8d, i(%rip) # Store to i
movzbl i(%rip), %r8 # Load from i
movzbq j(%rip), %r8 # Load from j
With ARM, we have to manually allocate space for all global variables in our program postamble:
.comm i,4,4
.comm j,1,1
...
.L2:
.word i
.word j
To access these, we need to load a register with the address of each variable, and load a second register from that address:
ldr r3, .L2+0
ldr r4, [r3] # Load i
ldr r3, .L2+4
ldr r4, [r3] # Load j
Stores to variables are similar:
mov r4, #20
ldr r3, .L2+4
strb r4, [r3] # i= 20
mov r4, #10
ldr r3, .L2+0
str r4, [r3] # j= 10
There is now this code in cgpostamble()
to generate the table of .words:
// Print out the global variables
fprintf(Outfile, ".L2:\n");
for (int i = 0; i < Globs; i++) {
if (Gsym[i].stype == S_VARIABLE)
fprintf(Outfile, "\t.word %s\n", Gsym[i].name);
}
This also means that we need to determine the offset from .L2
for each
global variable. Following the KISS principle, I manually calculate the
offset each time I want to load r3
with the address of a variable.
Yes, I should calculate each offset once and store it somewhere; later!
// Determine the offset of a variable from the .L2
// label. Yes, this is inefficient code.
static void set_var_offset(int id) {
int offset = 0;
// Walk the symbol table up to id.
// Find S_VARIABLEs and add on 4 until
// we get to our variable
for (int i = 0; i < id; i++) {
if (Gsym[i].stype == S_VARIABLE)
offset += 4;
}
// Load r3 with this offset
fprintf(Outfile, "\tldr\tr3, .L2+%d\n", offset);
}
The size of an integer literal in a load instruction is limited to 11 bits
and I think this is a signed value. Thus, we can't put large integer literals
into a single instruction. That answer is to store the literal values in
memory, like variables. So I keep a list of previously-used literal
values. In the postamble, I output them following the .L3
label. And, like
variables, I walk this list to determine the offset of any literal from
the .L3
label:
// We have to store large integer literal values in memory.
// Keep a list of them which will be output in the postamble
#define MAXINTS 1024
int Intlist[MAXINTS];
static int Intslot = 0;
// Determine the offset of a large integer
// literal from the .L3 label. If the integer
// isn't in the list, add it.
static void set_int_offset(int val) {
int offset = -1;
// See if it is already there
for (int i = 0; i < Intslot; i++) {
if (Intlist[i] == val) {
offset = 4 * i;
break;
}
}
// Not in the list, so add it
if (offset == -1) {
offset = 4 * Intslot;
if (Intslot == MAXINTS)
fatal("Out of int slots in set_int_offset()");
Intlist[Intslot++] = val;
}
// Load r3 with this offset
fprintf(Outfile, "\tldr\tr3, .L3+%d\n", offset);
}
I'm going to give you the function preamble, but I am not completely
sure what each instruction does. Here it is for int main(int x)
:
.text
.globl main
.type main, %function
main: push {fp, lr} # Save the frame and stack pointers
add fp, sp, #4 # Add sp+4 to the stack pointer
sub sp, sp, #8 # Lower the stack pointer by 8
str r0, [fp, #-8] # Save the argument as a local var?
and here's the function postamble to return a single value:
sub sp, fp, #4 # ???
pop {fp, pc} # Pop the frame and stack pointers
With the x86-64 there's an instruction to set a register to 0 or 1
based on the comparison being true, e.g. sete
, but then we have to
zero-fill the rest of the register with movzbq
. With the ARM, we run
two separate instructions which set a register to a value if the condition
we want is true or false, e.g.
moveq r4, #1 # Set r4 to 1 if values were equal
movne r4, #0 # Set r4 to 0 if values were not equal
I think that's all the major differences out of the road. So below
is a comparison of the cgXXX()
operation, any specific type for that
operation, and an example x86-64 and ARM instruction sequence to
perform it.
Operation(type) | x86-64 Version | ARM Version |
---|---|---|
cgloadint() | movq $12, %r8 | mov r4, #13 |
cgloadglob(char) | movzbq foo(%rip), %r8 | ldr r3, .L2+#4 |
ldr r4, [r3] | ||
cgloadglob(int) | movzbl foo(%rip), %r8 | ldr r3, .L2+#4 |
ldr r4, [r3] | ||
cgloadglob(long) | movq foo(%rip), %r8 | ldr r3, .L2+#4 |
ldr r4, [r3] | ||
int cgadd() | addq %r8, %r9 | add r4, r4, r5 |
int cgsub() | subq %r8, %r9 | sub r4, r4, r5 |
int cgmul() | imulq %r8, %r9 | mul r4, r4, r5 |
int cgdiv() | movq %r8,%rax | mov r0, r4 |
cqo | mov r1, r5 | |
idivq %r8 | bl __aeabi_idiv | |
movq %rax,%r8 | mov r4, r0 | |
cgprintint() | movq %r8, %rdi | mov r0, r4 |
call printint | bl printint | |
nop | ||
cgcall() | movq %r8, %rdi | mov r0, r4 |
call foo | bl foo | |
movq %rax, %r8 | mov r4, r0 | |
cgstorglob(char) | movb %r8, foo(%rip) | ldr r3, .L2+#4 |
strb r4, [r3] | ||
cgstorglob(int) | movl %r8, foo(%rip) | ldr r3, .L2+#4 |
str r4, [r3] | ||
cgstorglob(long) | movq %r8, foo(%rip) | ldr r3, .L2+#4 |
str r4, [r3] | ||
cgcompare_and_set() | cmpq %r8, %r9 | cmp r4, r5 |
sete %r8 | moveq r4, #1 | |
movzbq %r8, %r8 | movne r4, #1 | |
cgcompare_and_jump() | cmpq %r8, %r9 | cmp r4, r5 |
je L2 | beq L2 | |
cgreturn(char) | movzbl %r8, %eax | mov r0, r4 |
jmp L2 | b L2 | |
cgreturn(int) | movl %r8, %eax | mov r0, r4 |
jmp L2 | b L2 | |
cgreturn(long) | movq %r8, %rax | mov r0, r4 |
jmp L2 | b L2 |
If you copy the compiler from this part of the journey to a Raspberry Pi 3 or 4, you should be able to do:
$ make armtest
cc -o comp1arm -g -Wall cg_arm.c decl.c expr.c gen.c main.c misc.c
scan.c stmt.c sym.c tree.c types.c
cp comp1arm comp1
(cd tests; chmod +x runtests; ./runtests)
input01: OK
input02: OK
input03: OK
input04: OK
input05: OK
input06: OK
input07: OK
input08: OK
input09: OK
input10: OK
input11: OK
input12: OK
input13: OK
input14: OK
$ make armtest14
./comp1 tests/input14
cc -o out out.s lib/printint.c
./out
10
20
30
It did take me a bit of head scratching to get the ARM version of
the code generator cg_arm.c
to correctly compile all of the test
inputs. It was mostly straight-forward, I just wasn't familiar with
the architecture and instruction set.
It should be relatively easy to port the compiler to a platform with
3 or 4 registers, 2 or so data sizes and a stack (and stack frames).
As we go forward, I'll try to keep both cg.c
and cg_arm.c
functionally in sync.
In the next part of our compiler writing journey, we will add the char
pointer to the language, as well as the '*' and '&' unary operators.