- Compiler II: Code Generation
- Unit 5.1: Code Generation
- Unit 5.2: Handling Variables
- Unit 5.3: Handling Expressions
- Unit 5.4: Handling Flow of Control
- Unit 5.5: Handling Objects: Low-Level Aspects
- Unit 5.6: Handling Objects: Construction
- Unit 5.7: Handling Objects: Manipulation
- Unit 5.8: Handling Arrays
- Unit 5.9: Standard Mapping Over the Virtual Machine
- Unit 5.10: Completing the Compiler: Proposed Implementation
- Unit 5.11: Project 11
- when you set out to develop something complicated like a compiler , it always helps to try to simplify matters as mush as possible.
- So I'm going to make some simplification observations.
- the first one is , each class file is compiled separately.
- Each Jack class file, just like each Java class file, is a separate and standalone compilation unit.
- So, the overall task of compiling a multi-class program has been reduced to the more sort of limited task of compiling one class at a time.
- So that's my first observation.
- the second one is based on going into the class itself.
- First of all, the class consists of two main modules.
- class declaration : class / field / static ...
- we have some preamble code that defines the class in some class level variables.
- zero or more subroutine called declarations
- class declaration : class / field / static ...
- Now, the simplifying assumption is such that compilation can once again be viewed as two separate processes when you go down to the class level.
- First, you compile the class level code that we see here on the top.
- And then you compile each subroutine one at a time.
- These two compilation tasks are relatively separate and standalone.
- And, therefore, once again, the overall and rather formidable past of writing a compiler for multi-class program has been reduced to compiling one subroutine at a time.
- First of all, the class consists of two main modules.
- the first one is , each class file is compiled separately.
- Compilation challenges
- Handling procedural code
- variables
- expressions
- flow of control
- Handling objects
- Handling arrays
- Handling procedural code
- The challenge: expressing the above semantics in the VM language
- example source code
sum = x * ( 1+rate )
- translate to VM code (pseudo)
push x
push 1
push rate
push +
push *
pop sum
-
for now , let us focus only on the variables
-
we have sum, x, and rate
-
remember that the VM langauge does not have symbolic variables.
- it only has things like local , argument, this , that, and so on.
-
So in order to resolve this pseudo VM code into final executable VM code, I have to map these symbolic variables on what we called the virtual memory segments.
-
In order to generate actual VM code, we must know (among other things):
- Whether each variable is a field,static,local,or argument
- Whether each variable is the first,second,third... variable of its kind
-
VM code (actual)
- (making some arbitrary assumptions about the variable )
push argument 2
push constant 1
push static 0
add
call Math.multiply 2
pop local 3
- We have to handle:
- class-level variables
- field
- static
- subroutine-level variables
- argument
- local
- class-level variables
- Every one of these variables has some properties:
- name (identifier)
- type (int, char, boolean, class name)
- kind (field, static, local, argument)
- scope (class level subroutine level)
- Variable properties
- Needed for code generation
- Can be managed efficiently using a symbol table
class Point {
field int x,y;
static int pointCount;
...
}
name | type | kind | # |
---|---|---|---|
x | int | field | 0 |
y | int | field | 1 |
pointCount | int | static | 0 |
class Point {
...
method int distance( Point other ) {
var int dex, dy;
...
}
}
- note
distance
is a class method, so it will always contain athis
variable
name | type | kind | # |
---|---|---|---|
this | Point | argument | 0 (argument 0 in every method) |
other | Point | argument | 1 |
dx | int | local | 0 |
dy | int | local | 1 |
-
Class-level symbol table:
- can be reset each time we start compiling a new class
- in Jack, or Java, or C# , classes are standalone compilation units
- can be reset each time we start compiling a new class
-
Subroutine-level symbol table:
- can be reset each time we start compiling a new subroutine
-
It turns out that when you compiling anything in Jack , you always have to maintain just 2 symbol tables.
- the class symbol talbe, and the current subroutine symbol table.
-
Handling variable declarations:
- field/static/var type varName;
- Add the variable and its properties to the symbol table
- parameter list
- field/static/var type varName;
-
Handling variable usage:
- example :
let dx=x-other.getx();
- look up the variable in the subroutine-level symbol table; if not found, look it up in the class-level symbol table.
- if not found, throw an error.
- example :
-
so expression
let y=y + dy
will be translated into VM code :
push this 1 // y
push local 1 // dy
add
pop this 1 // y
- not in Jack, but often in other language like Jave, C ...
- Some languages feature unlimited nested scoping
- Can be managed using a linked list of symbol tables.
- Variable lookup:
- start in the 1st table
- if not found , look up the next table, and so on.
- prefix
* a + b c
- functional
- infix
a * (b+c)
- human oriented , most source code are infix.
- postfix
a b c + *
- stack oriented
- this postfix notation is intimately related to our stack machine, because our stack language is also postfix.
- our target language , the VM language , is postfix. so the compiler has to translate from infix to postfix.
- source code -> parse tree ( The XML file we output in project 10 is a parse tree )
- go throuth every node in this parse tree in a certain order , and then you generate the stack machine code.
- 生成 parse tree 的代价是很大的,实践中一般不会这么做。
- The following algorithm is going to generate the VM code on the fly without having to create the whole parse tree in the process.
- we have 5 statements
- let, do, return are trivial statements to translate
- translating
while
andif
are far more challenging.
if (expression)
statement1
else
statement2
- before I generate code for this statement, I would like to rewrite it using a flow chart.
- the first thing to do is to take the expression and negate it.
- why to do this negation ? Because once you negate the expression in such a way , code generation becomes far simpler and tighter.
- where are these labels (L1,L2) come from ?
- the compiler generates these labels
while (expression)
statements
- A program typically contains multiple
if
andwhile
statements.- we have to make sure that the compiler generates unique labels
- local, argument
- represent local and argument variables
- located on the stack
- Implementation
- Base addresses: LCL and ARG
- Managed by the VM implementation
- this, that
- represent object and array data
- localted on the heap
- you may well have numerous objects and quite a few arrays.
- Implementation
- Base address: THIS and THAT
- Set using
pointer 0 (this)
,pointer 1 (that)
- Managed by VM code.
- suppose we wish to access RAM words 8000, 8001, 8002, ...
VM code (commands) | VM implementation (resulting effect) |
---|---|
push 8000 | sets THIS to 8000 |
pop pointer | |
push/pop this 0 | accessing RAM[8000] |
push/pop this 1 | accessing RAM[8001] |
push/pop this i | accessing RAM[8000+i] |
- Object data is accessed via the
this
segment - Array data is accessed via the
that
segment - Before we use these segments, we must first anchor them using pointer.
- source code
var Point p1,p2;
var int d ;
...
let p1 = Point.new(2,3);
let p2 = Point.new(5,7);
- compiled VM(pseudo) code
// var Point p1,p2;
// var int d ;
// The compiler updates the subroutine's symbol table
// No code is generated
// let p1 = Point.new(2,3);
// Subroutine call:
push 2
push 3
call Point.new
// Contract: the caller assumes that the constructor's code
(i) arranges a memory block to store the new object , and
(ii) returns its base address to the caller.
pop p1 // p1 = base address of the new object
// let p2 = Point.new(5,7);
// similar
- subroutine symbol table
name | type | kind | # |
---|---|---|---|
p1 | Point | local | 0 |
p2 | Point | local | 1 |
d | int | local | 2 |
- During compile-time, the compiler maps p1 on local 0 , p2 on local 1, and d on local 2 ?
- During run-time, the execution of the constructos's code effects the creation of the objects themselves. on the heap.
- So, object construction is a two-stage affair that happens both during compile time and during runtime.
- A constructor typically does 2 things:
- Arranges the creation of a new object
- Initializes the new object to some initial state
- Therefore , the constructor's code typically needs access to the object's fields.
- How to access the object's fields ?
- The constructor's code can access the object's data using the this segment
- But first , the constructor's code must anchor the this segment on the object's data , using pointer
class Point {
field int x,y ;
static int pointCount ;
...
/** Constructs a new point */
constructor Point new(int ax, int ay) {
let x = ax ;
let y = ay ;
let pointCount = pointCount +1;
return this ;
}
}
- class-level symbol table
name | type | kind | # |
---|---|---|---|
x | int | field | 0 |
y | int | field | 1 |
pointCount | int | static | 0 |
The compiler knows that the constructor has to create some space on the RAM for the newly constructed object.
So the first question is how does the compiler know how much space is needed ?
Well, the compiler can consult the class-level symbol table and realize that objects of this particular class, Point, require 2 words only , x and y.
So the 2nd question is how do we find such a free memory block on the RAM ?
Well , at this stage the operating system comes to the rescure -- alloc
function.
- compiled code
// constructor Point new(int ax, int ay)
// The compiler creates the subroutine's symbol table,
// no code genrates
// The compiler figures out the size of an object, and call
// Memory.alloc(n).
// This OS method finds a memory block of the required size,
// and return its base address.
push 2
call Memory.alloc 1
pop pointer 0 // anchors this at the base address
// let x = ax; let y = ay;
push argument 0
pop this 0
push argument 1
pop this 1
// let pointCount = pointCount + 1 ;
push static 0
push 1
add
pop static 0
// return this
push pointer 0
return // returns the base address of the new object
- constructor subroutine symbol table
name | type | kind | # |
---|---|---|---|
ax | int | arg | 0 |
ay | int | arg | 1 |
- Manipulation:
- Compiling
obj.methodCall()
- Compiling methods
- Compiling
// OO Style:
... obj.foo(x1,x2,...) ...
- The target machine language is procedural
- Therefore , the compiler must rewrite the OO method calls in a procedural style
// Procedural style:
... foo(obj,x1,x2,...) ...
// obj.foo(x1,x2,...)
// Pushes the object on which the method
// is called to operate(implicit parameter) ,
// then pushes the method arguments,
// then calls the method for its effect
push obj
push x1
push x2
push ...
call foo
// take care of return value
- Methods are designed to operate on the current object(this):
- Therefore , each method's code needs access to the object's fields
- How to access the object's fields:
- The method's code can access the object's i-th field by accessing
this i
- But first , the method's code must anchor the
this
segment on the object's data, using pointer
- The method's code can access the object's i-th field by accessing
// method int distance(Pointer other)
// var int dx , dy
// The compiler constructs the method's
// symbol table. No code is generated.
// Next, it generates code that associates the
// this memory segment with the object on
// which the method is called to operate.
push argument 0
pop pointer 0 // THIS = argument 0
// let dx = x - other.gets()
push this 0 // x
push argument 1 // other
call Point.getx 1
sub
pop local 0 // dx
// let dy = y - other.gety()
// similar, code omitted
// return Math.sqrt( (dx * dx) + (dy * dy) )
push local 0
push local 0
call Math.multiply 2
push local 1
push local 1
call Math.multiply 2
add
call Math.sqrt 1
return
...
// Methods must return a value
push constant 0
return
- for the caller ...
// do p1.print()
push p1
call Point.print
// the caller of a void method
// must dump the returned value
pop temp 0
...
- Recap
- Each compiled method must return a value
- By convention, void methods return a dummy value
- Callers of void methods are responsible for removing the returned value from the stack.
var Array arr;
- it will alloc a variable in local segment ( stack )
- generates no code, only effects the symbol table
let arr = Array.new(n);
- alloc Array in heap
- from the caller's prespective , handled exactly loke object construction
- 2 "portable" virtual memory segments that can be aligned to diferent RAM addresses
· | this | that |
---|---|---|
VM use: | current object | current array |
pointer ( base address ) | THIS | THAT |
how 2 set: | pop pointer 0 |
pop pointer 1 |
// RAM[8056] = 17
push 8056
pop pointer 1 // THAT = 8056
push 17
pop that 0 // THAT[0] = 17
// arr[2] = 17
push arr // base address
push 2 // offset
add
pop pointer 1
push 17
pop that 0
- note we only use the entry 0 for that , why we not generate following simple code instead ?
push arr
pop pointer 1
push 17
pop that 2
- because
- The simple code works only for constant offsets (indices) , it cannot be used when the source is statement, say, “let arr[x] = y” , since the compiler donot its actual value when generating code.
// solution: arr[expression1] = expression2
push arr
push expression1
add
pop pointer 1
push expression2
pop that 0
- Unfortunately, the above VM code does not work when handle such kind of statement :
a[i] = b[j]
// solution: a[i] = b[j]
push a
push i
add
push b
push j
add
pop pointer 1
push that 0 // b[j] -> stack
pop temp 0 // stack b[j] -> temp 0
pop pointer 1
push temp 0
pop that 0
- General solution for generting array access code
// arr[expression1] = expression2
push arr
VM code for computing and pushing the value of expression1
add // top stack value = RAM address of arr[expression1]
VM code for computing and pushing the value of expression2
pop temp 0 // temp 0 = the value of expression2
// top stack value = RAM address of arr[expression1]
pop pointer 1 // that pointer to access where to store
push temp 0
// store , since its array, it use `that` segment
pop that 0
- If needed ,the evaluation of expression2 can set and use pointer 1 and that 0 safely.
- What about
a[a[i]] = a[b[a[b[j]]]]
?- No problem.
- Specifies how to map the constructs of the high-level language on the constructs of the virtual machine.
- Each file filename.jack is compiled into the file fileName.vm
- Each subroutine subName in fileNname.jack is compiled into a VM function fileName.subName
- A Jack constructor function with k arguments is compiled into a VM function that operates on k arguments
- A Jack method with k arguments is compiled into a VM function that operates on k+1 arguments.
- The local variables
- of a Jack subroutine are mapped on the virtual segment local
- The argument variables
- of a Jack subroutine are mapped on the virtual segment argument
- The static variables
- of a .jack class file are mapped on the virtual memory segment static of the compiled .vm class file.
- The field variables
- of the current object are accessed as follows:
- assumption: pointer 0 has been set to the this object.
- some1 has to do it for us : the agent that does it is the generated code of the current subroutine.
- the i-th field of this object is mapped on
this i
- assumption: pointer 0 has been set to the this object.
- of the current object are accessed as follows:
- Accessing array entries:
- Access to any array entry
arr[i]
is realized as follows:- first set
pointer 1
to the entry's address (arr + i) - access the entry by accessing
this 0
- first set
- Access to any array entry
- When compiling a Jack method:
- the compiled VM code must set the base of the
this
sgement toargument 0
- the compiled VM code must set the base of the
- When compiling a Jack constructor:
- the compiled VM code must allocate a memory block for the new object, and then set the base of segment
this
to the new object's base address. - the compiled VM code must return the object's base address to the caller.
- the compiled VM code must allocate a memory block for the new object, and then set the base of segment
- When compiling a void function of a void method
- The compiled VM code must return the value constant 0
- Calling a subroutine call
subName(arg1, arg2, ...)
- The caller (a VM function) must push the arguments onto the stack, and then call the subroutine.
- Calling the called subroutine is a method,
- The caller must first push a reference to the object on which the method is supposed to operates;
- next, the caller must push arg1, arg2, ... , and then call the method.
- If the called subroutine is void
- when compiling the Jack statment
do subName
, following the call the caller must pop(and ignore) the returned value.
- when compiling the Jack statment
- null is mapped on the constant 0
- false is mapped on the constant 0
- true is mapped on the constant -1
- -1 can be obtained using
push 1
followed byneg
- -1 can be obtained using
- The basic Jack OS is implemented as a set of compiled VM class files:
- Math.vm, Memory.vm, Screen.vm, Output.vm, Keyboard.vm, String.vm, Array.vm, Sys.vm
- All the OS class files must reside in the same directory as the VM files generated by the compiler
- Any VM function can call any OS VM function for its effect.
- Multiplication is handled using the OS function Math.multiply()
- Division is handled using the OS function Math.divide()
- String constants are created using the OS constructor
String.new(length)
- String assigments like
x="cc...c"
are handled using a series of calls toString.appendChar(c)
- Object construction requires allocating space for the new object using the OS function Memory.alloc(size)
- Object recycling is handled using the OS function Memory.deAlloc(object).
-
The scope of static and field variable is the class in which they are defined
-
The scope of local and argument variables is the subroutine in which they are defined.
-
We give each variable a running index within its scope and kind
-
The index starts at 0, increments by 1 each time a new symbol is added to the table, and is reset to 0 when starting a new scope
name | type | kind | # |
---|---|---|---|
this | Point | argument | 0 |
... | ... | ... | ... |
- Implementation notes
- The symbol table abstraction can be implemented using 2 seprate hash tables :
- one for the class scope
- and one for the subroutine scope
- When we start compiling a new subroutine, the latter hash table can be reset
- When compiling an error-free Jack code, each symbol NOT found in the the symbol table can be assumed to be either a subroutine name or a class name.
- The symbol table abstraction can be implemented using 2 seprate hash tables :
- Extend the handling of identifiers
- output the identifier's category:
- var, argument, static, field, class, subroutine
- if the identifier's category is
var, argument, static field
, output also the running index assigned to this variable in the symbol table - output whether the identifier is being defined, or being used.
- output the identifier's category:
// Main.jack
class Main {
function void main() {
do Output.printInt( 1+(2*3) ) ;
return ;
}
}
// Main.vm
function Main.main 0
push constant 1
push constant 2
push constant 3
call Math.mulitply 2
add
call Output.printInt 1
pop temp 0 // instruction 1
push constant // instruction 2
return
- Inspect the VM code, focusing on the 2 instructions highlighted in red.
- instruction 1
- every subroutine must return a value.
- the return value of Output.printInt serves no purpose whatever, so we simply dump it.
- instruction 2
- the method
main
also need return a value
- the method
- instruction 1