Compiler II: Code Generation

Compiler II: Code Generation

Unit 5.1: Code Generation

when you set out to develop something complicated like a compiler , it always helps to try to simplify matters as mush as possible.
So I'm going to make some simplification observations.
- the first one is , each class file is compiled separately.
  - Each Jack class file, just like each Java class file, is a separate and standalone compilation unit.
  - So, the overall task of compiling a multi-class program has been reduced to the more sort of limited task of compiling one class at a time.
  - So that's my first observation.
- the second one is based on going into the class itself.
  - First of all, the class consists of two main modules.
    - class declaration : class / field / static ...
      - we have some preamble code that defines the class in some class level variables.
    - zero or more subroutine called declarations
  - Now, the simplifying assumption is such that compilation can once again be viewed as two separate processes when you go down to the class level.
    - First, you compile the class level code that we see here on the top.
    - And then you compile each subroutine one at a time.
    - These two compilation tasks are relatively separate and standalone.
    - And, therefore, once again, the overall and rather formidable past of writing a compiler for multi-class program has been reduced to compiling one subroutine at a time.

Compilation challenges
- Handling procedural code
  - variables
  - expressions
  - flow of control
- Handling objects
- Handling arrays
The challenge: expressing the above semantics in the VM language

Unit 5.2: Handling Variables

example source code

sum = x * ( 1+rate )

translate to VM code (pseudo)

push x
push 1
push rate
push +
push *
pop sum

for now , let us focus only on the variables
we have sum, x, and rate
remember that the VM langauge does not have symbolic variables.
- it only has things like local , argument, this , that, and so on.
So in order to resolve this pseudo VM code into final executable VM code, I have to map these symbolic variables on what we called the virtual memory segments.
In order to generate actual VM code, we must know (among other things):
- Whether each variable is a field,static,local,or argument
- Whether each variable is the first,second,third... variable of its kind
VM code (actual)
- (making some arbitrary assumptions about the variable )

push argument 2
push constant 1
push static 0
add
call Math.multiply 2
pop local 3

Variables

We have to handle:
- class-level variables
  - field
  - static
- subroutine-level variables
  - argument
  - local
Every one of these variables has some properties:
- name (identifier)
- type (int, char, boolean, class name)
- kind (field, static, local, argument)
- scope (class level subroutine level)
Variable properties
- Needed for code generation
- Can be managed efficiently using a symbol table

Symbol tables

class Point {
    field int x,y;
    static int pointCount;    
    ...
}

name	type	kind	#
x	int	field	0
y	int	field	1
pointCount	int	static	0

class Point {
    ...
    method int distance( Point other ) {
        var int dex, dy; 
        ...    
    }    
}

note distance is a class method, so it will always contain a this variable

name	type	kind	#
this	Point	argument	0 (argument 0 in every method)
other	Point	argument	1
dx	int	local	0
dy	int	local	1

Class-level symbol table:
- can be reset each time we start compiling a new class
  - in Jack, or Java, or C# , classes are standalone compilation units
Subroutine-level symbol table:
- can be reset each time we start compiling a new subroutine
It turns out that when you compiling anything in Jack , you always have to maintain just 2 symbol tables.
- the class symbol talbe, and the current subroutine symbol table.
Handling variable declarations:
- field/static/var type varName;
  - Add the variable and its properties to the symbol table
- parameter list
Handling variable usage:
- example : let dx=x-other.getx();
- look up the variable in the subroutine-level symbol table; if not found, look it up in the class-level symbol table.
  - if not found, throw an error.
so expression let y=y + dy will be translated into VM code :

push this 1  // y
push local 1  // dy
add 
pop this 1   // y

Handling nested scoping

not in Jack, but often in other language like Jave, C ...
Some languages feature unlimited nested scoping
Can be managed using a linked list of symbol tables.

Variable lookup:
- start in the 1st table
- if not found , look up the next table, and so on.

Unit 5.3: Handling Expressions

Parse tree

prefix
- * a + b c
- functional
infix
- a * (b+c)
- human oriented , most source code are infix.
postfix
- a b c + *
- stack oriented
- this postfix notation is intimately related to our stack machine, because our stack language is also postfix.
- our target language , the VM language , is postfix. so the compiler has to translate from infix to postfix.

Generating code for expressions: a two-stage approach

source code -> parse tree ( The XML file we output in project 10 is a parse tree )
go throuth every node in this parse tree in a certain order , and then you generate the stack machine code.

生成 parse tree 的代价是很大的，实践中一般不会这么做。

Generating code for expressions: a one-stage approach

The following algorithm is going to generate the VM code on the fly without having to create the whole parse tree in the process.

Unit 5.4: Handling Flow of Control

we have 5 statements
- let, do, return are trivial statements to translate
- translating while and if are far more challenging.

Compiling if statements

if (expression) 
    statement1
else
    statement2

before I generate code for this statement, I would like to rewrite it using a flow chart.
- the first thing to do is to take the expression and negate it.
- why to do this negation ? Because once you negate the expression in such a way , code generation becomes far simpler and tighter.

where are these labels (L1,L2) come from ?
- the compiler generates these labels

Compiling while statements

while (expression)
    statements

Some (minor) complications

A program typically contains multiple if and while statements.
- we have to make sure that the compiler generates unique labels

Unit 5.5: Handling Objects: Low-Level Aspects

Handling local and argument variables

local, argument
- represent local and argument variables
- located on the stack
Implementation
- Base addresses: LCL and ARG
- Managed by the VM implementation

Handling object and array data

this, that
- represent object and array data
- localted on the heap
- you may well have numerous objects and quite a few arrays.
Implementation
- Base address: THIS and THAT
- Set using pointer 0 (this) , pointer 1 (that)
- Managed by VM code.

Accessing RAM data

suppose we wish to access RAM words 8000, 8001, 8002, ...

VM code (commands)	VM implementation (resulting effect)
push 8000	sets THIS to 8000
pop pointer
push/pop this 0	accessing RAM[8000]
push/pop this 1	accessing RAM[8001]
push/pop this i	accessing RAM[8000+i]

Recap

Object data is accessed via the this segment
Array data is accessed via the that segment
Before we use these segments, we must first anchor them using pointer.

Unit 5.6: Handling Objects: Construction

The caller's side: compiling new

source code

var Point p1,p2;
var int d ;
...
let p1 = Point.new(2,3);
let p2 = Point.new(5,7);

compiled VM(pseudo) code

// var Point p1,p2;
// var int d ;
// The compiler updates the subroutine's symbol table 
// No code is generated

// let p1 = Point.new(2,3);
// Subroutine call:
push 2
push 3
call Point.new
// Contract: the caller assumes that the constructor's code
(i) arranges a memory block to store the new object , and 
(ii) returns its base address to the caller.
pop p1  // p1 = base address of the new object 

// let p2 = Point.new(5,7);
// similar

subroutine symbol table

name	type	kind	#
p1	Point	local	0
p2	Point	local	1
d	int	local	2

Resulting impact

During compile-time, the compiler maps p1 on local 0 , p2 on local 1, and d on local 2 ?
During run-time, the execution of the constructos's code effects the creation of the objects themselves. on the heap.
So, object construction is a two-stage affair that happens both during compile time and during runtime.

Object construction: the big picture

A constructor typically does 2 things:
- Arranges the creation of a new object
- Initializes the new object to some initial state
  - Therefore , the constructor's code typically needs access to the object's fields.
How to access the object's fields ?
- The constructor's code can access the object's data using the this segment
- But first , the constructor's code must anchor the this segment on the object's data , using pointer

Compiling constructors

class Point {
    field int x,y ;
    static int pointCount ;
    ...
    /** Constructs a new point  */
    constructor Point new(int ax, int ay) {
        let x = ax ;
        let y = ay ;
        let pointCount = pointCount +1;
        return this ;    
    }    
}

class-level symbol table

name	type	kind	#
x	int	field	0
y	int	field	1
pointCount	int	static	0

The compiler knows that the constructor has to create some space on the RAM for the newly constructed object.

So the first question is how does the compiler know how much space is needed ?

Well, the compiler can consult the class-level symbol table and realize that objects of this particular class, Point, require 2 words only , x and y.

So the 2nd question is how do we find such a free memory block on the RAM ?

Well , at this stage the operating system comes to the rescure -- alloc function.

compiled code

// constructor Point new(int ax, int ay) 
// The compiler creates the subroutine's symbol table, 
// no code genrates

// The compiler figures out the size of an object, and call
// Memory.alloc(n).
// This OS method finds a memory block of the required size, 
// and return its base address.
push 2
call Memory.alloc 1
pop pointer 0  // anchors this at the base address

// let x = ax; let y = ay;
push argument 0
pop this 0
push argument 1
pop this 1

// let pointCount = pointCount + 1 ;
push static 0
push 1 
add 
pop static 0

// return this
push pointer 0
return // returns the base address of the new object

constructor subroutine symbol table

name	type	kind	#
ax	int	arg	0
ay	int	arg	1

Unit 5.7: Handling Objects: Manipulation

Manipulation:
- Compiling obj.methodCall()
- Compiling methods

Compiling method calls

// OO Style:
... obj.foo(x1,x2,...) ...

The target machine language is procedural
Therefore , the compiler must rewrite the OO method calls in a procedural style

// Procedural style:
... foo(obj,x1,x2,...) ...

Compiling method calls: the general technique

// obj.foo(x1,x2,...)
// Pushes the object on which the method 
// is called to operate(implicit parameter) ,
// then pushes the method arguments, 
// then calls the method for its effect
push obj
push x1
push x2
push ...
call foo
// take care of return value

Compiling methods

Methods are designed to operate on the current object(this):
- Therefore , each method's code needs access to the object's fields
How to access the object's fields:
- The method's code can access the object's i-th field by accessing this i
- But first , the method's code must anchor the this segment on the object's data, using pointer

// method int distance(Pointer other)
// var int dx , dy
// The compiler constructs the method's 
// symbol table. No code is generated.
// Next, it generates code that associates the 
// this memory segment with the object on 
// which the method is called to operate.
push argument 0
pop pointer 0  // THIS = argument 0

// let dx = x - other.gets()
push this 0   // x
push argument 1   // other
call Point.getx 1
sub
pop local 0   // dx

// let dy = y - other.gety()
// similar, code omitted

// return Math.sqrt( (dx * dx) + (dy * dy) )
push local 0
push local 0
call Math.multiply 2
push local 1
push local 1
call Math.multiply 2
add
call Math.sqrt 1
return

Compiling void methods

...
// Methods must return a value
push constant 0
return

for the caller ...

// do p1.print()
push p1
call Point.print
// the caller of a void method
// must dump the returned value
pop temp 0
...

Recap
- Each compiled method must return a value
- By convention, void methods return a dummy value
- Callers of void methods are responsible for removing the returned value from the stack.

Unit 5.8: Handling Arrays

Array construction

var Array arr;
- it will alloc a variable in local segment ( stack )
- generates no code, only effects the symbol table
let arr = Array.new(n);
- alloc Array in heap
- from the caller's prespective , handled exactly loke object construction

this and that ( reminder )

2 "portable" virtual memory segments that can be aligned to diferent RAM addresses

·	this	that
VM use:	current object	current array
pointer ( base address )	THIS	THAT
how 2 set:	`pop pointer 0`	`pop pointer 1`

Example : RAM access using that

// RAM[8056] = 17
push 8056
pop pointer 1    // THAT = 8056
push 17
pop that 0    // THAT[0] = 17

Array access

// arr[2] = 17
push arr  // base address
push 2    // offset
add 
pop pointer 1
push 17
pop that 0

note we only use the entry 0 for that , why we not generate following simple code instead ?

push arr
pop pointer 1
push 17
pop that 2

because
- The simple code works only for constant offsets (indices) , it cannot be used when the source is statement, say, “let arr[x] = y” , since the compiler donot its actual value when generating code.

// solution: arr[expression1] = expression2
push arr
push expression1
add
pop pointer 1
push expression2
pop that 0

Unfortunately, the above VM code does not work when handle such kind of statement : a[i] = b[j]

// solution: a[i] = b[j]
push a 
push i
add

push b
push j
add
pop pointer 1
push that 0    // b[j] -> stack

pop temp 0     // stack b[j] -> temp 0

pop pointer 1
push temp 0  
pop that 0

General solution for generting array access code

// arr[expression1] = expression2   
push arr

VM code for computing and pushing the value of expression1
add    // top stack value = RAM address of  arr[expression1]


VM code for computing and pushing the value of expression2

pop temp 0  // temp 0 = the value of expression2
            // top stack value = RAM address of  arr[expression1]

pop pointer 1  // that pointer to access where to store
push temp 0

// store , since its array, it use `that` segment
pop that 0

If needed ,the evaluation of expression2 can set and use pointer 1 and that 0 safely.
What about a[a[i]] = a[b[a[b[j]]]] ?
- No problem.

Unit 5.9: Standard Mapping Over the Virtual Machine

Specifies how to map the constructs of the high-level language on the constructs of the virtual machine.

Files and subroutine mapping

Each file filename.jack is compiled into the file fileName.vm
Each subroutine subName in fileNname.jack is compiled into a VM function fileName.subName
A Jack constructor function with k arguments is compiled into a VM function that operates on k arguments
A Jack method with k arguments is compiled into a VM function that operates on k+1 arguments.

Variables mapping

The local variables
- of a Jack subroutine are mapped on the virtual segment local
The argument variables
- of a Jack subroutine are mapped on the virtual segment argument
The static variables
- of a .jack class file are mapped on the virtual memory segment static of the compiled .vm class file.
The field variables
- of the current object are accessed as follows:
  - assumption: pointer 0 has been set to the this object.
    - some1 has to do it for us : the agent that does it is the generated code of the current subroutine.
    - the i-th field of this object is mapped on this i

Array mapping

Accessing array entries:
- Access to any array entry arr[i] is realized as follows:
  - first set pointer 1 to the entry's address (arr + i)
  - access the entry by accessing this 0

Compiling subroutines

When compiling a Jack method:
- the compiled VM code must set the base of the this sgement to argument 0
When compiling a Jack constructor:
- the compiled VM code must allocate a memory block for the new object, and then set the base of segment this to the new object's base address.
- the compiled VM code must return the object's base address to the caller.
When compiling a void function of a void method
- The compiled VM code must return the value constant 0

Compiling subroutine calls

Calling a subroutine call subName(arg1, arg2, ...)
- The caller (a VM function) must push the arguments onto the stack, and then call the subroutine.
Calling the called subroutine is a method,
- The caller must first push a reference to the object on which the method is supposed to operates;
- next, the caller must push arg1, arg2, ... , and then call the method.
If the called subroutine is void
- when compiling the Jack statment do subName , following the call the caller must pop(and ignore) the returned value.

Compiling constants

null is mapped on the constant 0
false is mapped on the constant 0
true is mapped on the constant -1
- -1 can be obtained using push 1 followed by neg

OS classes and subroutines

The basic Jack OS is implemented as a set of compiled VM class files:
- Math.vm, Memory.vm, Screen.vm, Output.vm, Keyboard.vm, String.vm, Array.vm, Sys.vm
All the OS class files must reside in the same directory as the VM files generated by the compiler
Any VM function can call any OS VM function for its effect.

Special OS services

Multiplication is handled using the OS function Math.multiply()
Division is handled using the OS function Math.divide()
String constants are created using the OS constructor String.new(length)
String assigments like x="cc...c" are handled using a series of calls to String.appendChar(c)
Object construction requires allocating space for the new object using the OS function Memory.alloc(size)
Object recycling is handled using the OS function Memory.deAlloc(object).

Unit 5.10: Completing the Compiler: Proposed Implementation

Symbol table

The scope of static and field variable is the class in which they are defined
The scope of local and argument variables is the subroutine in which they are defined.
We give each variable a running index within its scope and kind
The index starts at 0, increments by 1 each time a new symbol is added to the table, and is reset to 0 when starting a new scope

name	type	kind	#
this	Point	argument	0
...	...	...	...

Implementation notes
- The symbol table abstraction can be implemented using 2 seprate hash tables :
  - one for the class scope
  - and one for the subroutine scope
- When we start compiling a new subroutine, the latter hash table can be reset
- When compiling an error-free Jack code, each symbol NOT found in the the symbol table can be assumed to be either a subroutine name or a class name.

VMWriter

Unit 5.11: Project 11

Symbol table

Extend the handling of identifiers
- output the identifier's category:
  - var, argument, static, field, class, subroutine
- if the identifier's category is var, argument, static field , output also the running index assigned to this variable in the symbol table
- output whether the identifier is being defined, or being used.

Quiz

// Main.jack
class  Main {
    function void main() {
        do Output.printInt( 1+(2*3) ) ;
        return ;    
    }    
}

// Main.vm
function Main.main 0
push constant 1
push constant 2
push constant 3
call Math.mulitply 2
add
call Output.printInt 1
pop temp 0     // instruction 1
push constant  // instruction 2
return

Inspect the VM code, focusing on the 2 instructions highlighted in red.
- instruction 1
  - every subroutine must return a value.
  - the return value of Output.printInt serves no purpose whatever, so we simply dump it.
- instruction 2
  - the method main also need return a value

Files

Nand2TetrisII_5.md

Latest commit

History

Nand2TetrisII_5.md

File metadata and controls

Compiler II: Code Generation

Unit 5.1: Code Generation

Unit 5.2: Handling Variables

Variables

Symbol tables

Handling nested scoping

Unit 5.3: Handling Expressions

Parse tree

Generating code for expressions: a two-stage approach

Generating code for expressions: a one-stage approach

Unit 5.4: Handling Flow of Control

Compiling if statements

Compiling while statements

Some (minor) complications

Unit 5.5: Handling Objects: Low-Level Aspects

Handling local and argument variables

Handling object and array data

Accessing RAM data

Recap

Unit 5.6: Handling Objects: Construction

The caller's side: compiling new

Resulting impact

Object construction: the big picture

Compiling constructors

Unit 5.7: Handling Objects: Manipulation

Compiling method calls

Compiling method calls: the general technique

Compiling methods

Compiling void methods

Unit 5.8: Handling Arrays

Array construction

this and that ( reminder )

Example : RAM access using that

Array access

Unit 5.9: Standard Mapping Over the Virtual Machine

Files and subroutine mapping

Variables mapping

Array mapping

Compiling subroutines

Compiling subroutine calls

Compiling constants

OS classes and subroutines

Special OS services

Unit 5.10: Completing the Compiler: Proposed Implementation

Symbol table

VMWriter

Unit 5.11: Project 11

Symbol table

Quiz