MAKE SURE YOU DO THE PRELAB FIRST!
Big picture: by the end of this lab you will have a very simple round-robin threading package for "cooperative" (i.e., non-preemptive) threads.
The main operations (see rpi-thread.[ch]
):
rpi_fork(code, arg)
: to create a new thread, put it on therunq
.rpi_yield()
: yield control to another thread.rpi_exit(int)
: kills the current thread.rpi_thread_start()
: starts the threading system for the first time. Returns when there are threads left to run.
Recall we split threads into two types:
-
Non-preemptive threads run until they explicitly yield the processor.
-
Pre-emptive threads can be interrupted at any time (e.g., if their time slice expires, or a higher-priority thread becomes runnable, etc).
The trade-offs:
-
Cooperative threads make preserving large invariants easy: by default, all code is a critical section, only broken up when you yield [Atul et al]. Their downside is that if they don't yield "often enough" they add large, difficult-to-debug latencies.
-
Pre-emptive threads let you to eliminate the need to trust that the threading code yields control in a well-behaved way, which is why multi-user OSes (Linux and MacOS) preempt user processes. However, they make it much more difficult to write correct code. (E.g., would your gpio implementation work if it was called by multiple pre-emptive threads? Uart?)
Generally, at a minimum, each thread has its own stack --- which has to be "large enough" so that it doesn't get overrun --- and the thread package has to provide an assembly routine to "context switch" from one thread to another. Context switching works by:
- Saving all registers from the first thread onto its stack (or as we do, into the thread structure itself).
- Loading all registers for the second from where they were saved.
- Changing the program counter value to resume the new thread.
The main magic you'll do today is writing this context-switching code. Context-switching doesn't require much work --- less than 15 instructions using special ARM instructions, but mistakes can be extremely hard to find. So we'll break it down into several smaller pieces.
- You have a working
kmalloc
implementation. - You pass the tests in the 2-threads` directory.
While all this is exciting, a major sad is that a single bug can lead to your code jumping off into hyperspace. This is hard to debug. So before you write a bunch of code:
- Try to make it into small, simple, testable pieces.
- Print all sorts of stuff so you can sanity check! (e.g., the value of the stack pointer, the value of the register you just loaded). Don't be afraid to call C code from assembly to do so.
The single biggest obstacle I've seen people make when writing assembly is that when they get stuck, they wind up staring passively at instructions for a long time, trying to discern what it does or why it is behaving "weird."
-
Instead of treating the code passively, be aggressive: your computer cannot fight back, and if you ask questions in the form of code it must answer. When you get unsure of something, write small assembly routines to clarify. For example, can't figure out why
LDMIA
isn't working? Write a routine that uses it to store a single register. Can't figure out if the stack grows up or down? take addresses of local variables in the callstack print them (or look in the*.list
files). -
Being active will speed up things, and give you a way to ferret out the answer to corner case questions you have.
You should have read the ARM assembly chapter in the PRELAB
(docs/subroutines.hohl-arm-asm.pdf
); this is
just a cliff-notes version.
Context-switching may sound mysterious, but mechanically it is simple: we
just need to need to save all the registers in use. The ARM has sixteen
general-purpose registers r0
---r15
which can be saved and restored
using standard load and store instructions. If you look through the
.list
files in any of the pi program directories, you'll see lots of
examples of saving and loading registers.
Even easier, because we are doing non-preemptive threads we don't have to save all the registers, just the "callee saved" ones. The ARM, like most modern architectures, splits registers in into two categories (Q: why?):
-
"Callee-saved": must be saved by a procedure before any use and then restored before returning to the calling routine.
-
"Caller-saved": can be used as scratch registers by any procedure without saving/restoring them first. The downside is that if a routine needs the contents of a caller-saved register preserved across a function call it must save the register before the call and restore it afterwards.
As we mentioned, a non-preemptive thread only context switches if it
explicitly calls a thread yield procedure of some kind (in our case,
rpi_yield()
). As a result we only need to save the "callee-saved"
registers, since any live caller-saved registers must be already saved
by the compiler.
The ARM general-purpose registers:
-
r0
---r3
: caller-saved argument registers. -
r4 --- r11
: callee-saved. -
r12
may also be caller-saved, depending on the compiler, but I can't find a definitive statement forarm-none-eabi-gcc
, so we will treat it, and all the rest as callee-saved: i.e., they must be saved before we can use their contents since the caller assumes their values will be preserved across calls. -
r13
: stack pointer (sp
). Today it doesn't matter, but in the general case where you save registers onto the sack, you will be using the stack pointersp
. Be careful when you save this register. (How to tell if the stack grows up or down?) -
r14
: link register (lr
), holds the address of the instruction after the call instruction that called the current routine (in our caserpi_yield
). Since any subsequent call will overwrite this register we need to save it during context switching. -
r15
: program counter (pc
). Writing a valueval
topc
causes in control to immediately jump toval
. (Irrespective of whetheraddr
points to valid code or not.) From above: movinglr
topc
will return from a call. Oddly, we do not have to save this value!
So, to summarize, context-switching must save:
r4 --- r12, r13, r14
. Note: if you useLDM
andSTM
instructions, they cannot have the stack pointer in the register list (it can be used as the base).
For reference, the ARM procedure call ABI document:
The link here has a more informal rundown of the above.
You can find a bunch of of information in the ARM Manual:
I've tried to break down the lab into isolated pieces so you can test
each one. The first is to get a general feel for the rpi_thread
interfaces by getting "run-to-completion" (RTC) threads working: these
execute to completion without blocking.
RTC threads are an important special case that cover a surprising number of cases. Since they don't block, you can run them as deferred function calls. They should be significantly faster than normal threads since:
- you don't need to context switch into them.
- you don't need to context switch out of them.
- you won't have to fight bugs from assembly coding. Most interrupt handlers will fall into this category.
What to change:
rpi_exit
just return immediately (not correct in general, but ok for our test).rpi_yield
should just return immediately.rpi_fork
create a thread and put it in the run-queue.rpi_thread_start
remove threads in FIFO order, run each function, free the thread block.
With these changes, 1-test-thread
should complete successfully.
Our next step is to start writing assembly code and build up the context-switching code in pieces.
To get started, write a simple assembly routine store_one_asm(p,u)
in test-asm.S
that stores u
into *p
. When you run 2-test
the first test should work (the second will fail until the next part).
Next, we build a routine that correctly saves callee-saved registers.
If you look in test-asm.S
at check_callee_save
:
- Write the code to load a known constant into each register --- with the exception of the stack pointer --- that you will save so you can check their saved values.
- Write the code to store each callee-saved register into the given block of memory in ascending order register number.
- The code then
calls
print_and_die
with a pointer to the block as the first argument.
When you run the test, each printed register other than the stack pointer
should match its expected value (the location for r4
should hold
4
etc).
Note that the bulk register save and restore instructions store the smallest registers so that the smallest is at the lowest address; also, don't include the stack pointer in the register list! Useful ARM doc or this.
Don't be afraid to go through the ARM manual (docs/armv6.pdf
) or the
lectures we posted in 6-threads/docs
.
Now that you can save registers, it makes sense to restore them.
Using the starter code in test-asm.S
build a context switch routine cswitch(uint_t *old, uint_t *new)
that:
- Saves the callee-saved registers into the block pointed to by
old
(as above). - Loads the callee-saved registers previously saved into the block
pointed to by
new
. - Jumps back to the address in
lr
.
Testing: 3-test-cswitch
has two parts:
- Checks that inserting calls that save and restore from the same
block (e.g., calling
cswitch(p,p)
) does not change the behavior of code. - Check that reloading from a previously saved block the code to jump backwards and re-execute as expected. (How does this test work?)
Now we make your RTC threads work with context switching. Make the following changes:
-
rpi_start_thread
will context switch into the first thread it removes from the run queue. Doing so will require creating a dummy thread so that when there are no more runnable threads,rpi_cswitch
will transfer control back and we can return back to the main program. (If you run into problems, try first just rebooting when there are no runnable threads.) -
rpi_fork
should setup the thread's register save area so that an initial context switch will work. (Discussed more below.) -
rpi_exit
if it can dequeue a runnable thread, context switch into it. Otherwise resume the initial start thread created in step 1. -
Move your
cswitch
code intorpi_cswitch
inthread-asm.S
Make sure you verify that your
rpi_cswitch
code works when you pass the current thread as both arguments --- behaviorally this is a no-op since your code should save and restore the state, and resume right after the call. -
Change
rpi_yield
so that it works as expected.
Given that you have context-switching, the main tricky thing is figuring
out how to setup a newly created thread so that when you run context
switching on it, the right thing will happen (i.e., it will invoke to
code(arg)
).
-
The standard way to do this is by manually storing values onto the thread's stack (sometimes called "brain-surgery") so that when its state is loaded via
rpi_cswitch
control will jump to a trampoline routine (written in assembly) withcode
witharg
in known registers The trampoline will then branch-and-link to thecode
witharg
inr0
. -
The use of a trampoline lets us handle the problem of missing
rpi_exit
calls. The problem: If the thread code does not callrpi_exit
explicitly but instead returns, the value in thelr
register that it jumps to will be nonsense. Our hack: have our trampoline that calls the thread code simply callrpi_exit
if the intial call tocode
returns. -
To help debug problems: you can initially have the trampoline code you write (
rpi_init_trampoline
) initially just call out to C code to print out its values so you can sanity check that they make sense.
Checking:
- When you run
3-test
it should work and printSUCCESS
. - You should also be able to run
1-test
now that your yield and exit are not broken. - Finally you should be able to run two LEDs in
4-test-yield
.
Congratulations! Now you have a simple, but working thread implementation and understand the most tricky part of the code (context-switching) at a level most CS people do not.
There's a lot more you can do. We will be doing a bunch of this later in the class:
-
Save state onto the stack itself.
-
Make versions of the
libpi
delay
routines that use your function rather than busy wait. Make sure to check if threads are enabled inrpi_yield()
! -
Have a
sleep_until_us(us)
to put the current thread to sleep with a somewhat hard-deadline that it is woken up inusec
micro-seconds. Keep track of your cumulative error to see how well you're scheduling is doing. -
Tune your threads package to be fast. In particular, you can avoid context switching when a thread is run for the first time (there is no context to load) and you don't thave to save state when it exits (it's done). You can also try tuning the interrupt handling code, since it is at least 2x slower than optimal. (I'm embarrassed, but time = short. Next year!)
-
See how many PWM threads you can run in
4-yield-test
.