Course taught by Dr. Ada Gavrilovska and offered by Georgia Tech on Udacity.
- Processes and process management
- Threads and concurrency
- Resource management:
- Scheduling
- Memory management
- OS services for communication and I/O
- OS support for distributed services
- System software for data center and cloud environments
No notes from this lesson
No notes from this lesson
- Topics to be covered in this lesson:
- What is an OS (operating system)?
- What are key components of an OS?
- Design and implementation considerations of OSs
- An OS is a special piece of software that abstracts and arbitrates the use of a computer system
- An OS is like a toy shop manager in that an OS:
- Directs operational resources
- Enforces working policies
- Mitigates difficulty of complex tasks
- By definition, an OS is a layer of systems software that:
- Directly has privileged access to the underlying hardware
- Hides the hardware complexity
- Manages hardware on behalf of one or more applications according to some predefined policies
- In addition, it ensures that applications are isolated and protected from one another
- Abstractions:
- Process, thread, file, socket, memory page
- Mechanisms
- Create, schedule, open, write, allocate
- Policies
- Least recently used (LRU), earliest deadline first (EDF)
- Separation of mechanism from policy:
- Implement flexible mechanisms to support many policies
- Optimize for common case:
- Where will the OS be used?
- What will the user want to execute on that machine?
- What are the workload requirements?
- Generally, applications operate in unprivileged mode (user level) while operating systems operate in privileged mode (kernel level)
- Kernel level software is able to access hardware directly
- User-kernel switch is supported by hardware
- Applications will need to perform user-kernel transitions, which are supported by hardware; each transition involves a number of instructions and switches locality
- Switching locality will affect the hardware cache (transitions are costly)
- Hardware will set traps on illegal instructions or memory accesses requiring special privilege
- Pros:
- Everything included
- Inlining, compile-time optimizations
- Cons:
- Customization, portability, manageability
- Memory footprint
- Performance
- Pros:
- Maintainability
- Smaller footprint
- Less resource needs
- Cons:
- Indirection can impact performance
- Maintenance can still be an issue
- Pros:
- Size
- Verifiability
- Cons:
- Portability
- Complexity of software development
- Cost of user/kernel crossing
- Topics to be covered in this lesson:
- What is a process?
- How are processes represented by OS's?
- How are multiple concurrent processes managed by OS's?
- What is a process?
- A process is like an order of toys in that a process:
- Has a state of execution (e.g., program counter, stack)
- Has parts and a temporary holding area (e.g., data and register state occupy space in memory)
- May require special hardware (e.g., I/O devices)
- OS manages hardware on behalf of applications
- Application - program on disk, flash memory (static entity)
- Process - state of a program when executing loaded in memory (active entity)
- A process encapsulates all data of a running application
- Every single element of the process state has to be uniquely identified by its address (OS abstraction used to encapsulate the process state is an address space)
- Some types of state include:
- Text and data (static state when process first loads)
- Heap - dynamically created during execution
- Stack - grows and shrinks during execution (LIFO ordering)
- Address space - in memory representation of a process
- Page tables - mapping of virtual to physical addresses
- Physical address - locations in physical memory
- Parts of virtual address space may not be allocated
- May not be enough physical memory for all state
- Solution: the operating system dynamically decides which portion of which address space will be present where in physical memory
- How does the OS know what a process is doing?
- The program counter allows the OS to know where a process currently is in the instruction sequence
- The program counter is maintained on a CPU register while the process is executing
- There also exists a stack pointer which points to the top of the stack (useful for LIFO operations)
- To maintain all of the above, the OS maintains a PCB (process control block)
- What is a PCB?
- A PCB (process control block) is a data structure that the OS maintains for every one of the processes that it manages
- A PCB is created when process is created
- Certain fields are updated when process state changes
- Other fields (e.g., the program counter) change too frequently to track in the PCB; the CPU keeps them in registers and the OS saves them into the PCB on a context switch
- Context switch - switching the CPU from the context of one process to the context of another
- Context switching is expensive!
- Direct costs - number of cycles for load and store instructions
- Indirect costs - COLD cache! Cache misses!
- Ultimately, we want to limit how frequently context switching is done!
- Processes can be running or idle
- Process states can be: new, ready, running, waiting, or terminated
- Two mechanisms for process creation:
- Fork:
- Copies the parent PCB into new child PCB
- Child continues execution at instruction after fork
- Exec:
- Replace child image
- Load new program and start from first instruction
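- Below is a minimal sketch of the fork/exec pattern in C, assuming a POSIX system (the `ls` target and its arguments are purely illustrative):

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void) {
    pid_t pid = fork();              /* copy the parent PCB into a new child PCB */
    if (pid < 0) {
        perror("fork");
        exit(EXIT_FAILURE);
    } else if (pid == 0) {
        /* child: continues at the instruction after fork(), then replaces
           its image; "ls -l" is just an illustrative program to exec */
        execlp("ls", "ls", "-l", (char *)NULL);
        perror("execlp");            /* only reached if exec fails */
        exit(EXIT_FAILURE);
    }
    waitpid(pid, NULL, 0);           /* parent waits for the child */
    return 0;
}
```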
- A CPU Scheduler determines which one of the currently ready processes will be dispatched to the CPU to start running, and how long it should run for
- In general, the OS must be efficient:
- Preempt - interrupt and save current context
- Schedule - run scheduler to choose next process
- Dispatch - dispatch process to switch into its context
- Useful CPU work can be determined by the following:
total processing time / total time
- In general, scheduling time should be considered overhead; we want most of the CPU time to be spent doing useful work
- Time-slice - time allocated to a process on the CPU
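- Below is a small worked example of the useful-work formula above, using purely hypothetical numbers (a 2 ms time-slice and 0.1 ms of scheduling/context-switch overhead per slice):

```c
#include <stdio.h>

int main(void) {
    /* hypothetical numbers purely for illustration */
    double timeslice_ms = 2.0;
    double overhead_ms  = 0.1;   /* preempt + schedule + dispatch per slice */
    double useful = timeslice_ms / (timeslice_ms + overhead_ms);
    printf("Useful CPU work = %.1f%%\n", useful * 100.0);  /* ~95.2%% */
    return 0;
}
```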
- An OS must provide mechanisms to allow processes to interact with one another
- IPC mechanisms:
- Help transfer data/info between address spaces
- Maintain protection and isolation
- Provide flexibility and performance
- Message-passing IPC:
- OS provides a communication channel, like a shared buffer
- Processes write (send)/read (receive) messages to/from channel
- Pros: OS manages
- Cons: overheads
- Shared Memory IPC:
- OS establishes a shared channel and maps it into each process address space
- Processes directly read/write from this memory
- Pros: OS is out of the way!
- Cons: may need to re-implement code
- Topics to be covered in this lesson:
- What are threads?
- How do threads differ from processes?
- What data structures are used to implement and manage threads?
- What is a thread?
- A thread is like a worker in a toy shop in that a thread:
- Is an active entity (e.g., executing unit of a process)
- Works simultaneously with others (e.g., many threads executing)
- Requires coordination (e.g., sharing of I/O devices, CPUs, memory, etc.)
- A single-threaded process is represented by its address space and its execution context (which the OS maintains in a PCB)
- Threads represent multiple, independent execution contexts
- Threads are part of the same virtual address space: all threads share the virtual-to-physical address mappings as well as the code, data, and files
- Key differences:
- However, threads will potentially execute different instructions, access different portions of the address space, operate on different portions of the input, and differ in other ways
- Each thread will need to have a different program counter, stack pointer, stack, thread-specific registers
- Implication: for each thread we must have separate data structures to represent this per-thread information; consequently, the PCB of a multi-threaded process is more complex than that of a single-threaded process
- Threads can implement parallelization which can process the input much faster than if only a single thread on a single CPU had to process say, an entire matrix for example
- Threads may execute completely different portions of the program
- Threads can also utilize specialization, which takes advantage of the hot cache on the CPU where each specialized thread runs
- A multi-threaded application is more memory efficient and has lower memory requirements than its multi-process alternative
- Additionally, a multi-threaded application incurs lower overheads for inter-thread communication than the corresponding inter-process alternatives
- Thread data structure - identify threads, keep track of resource usage, etc.
- Mechanisms to create and manage threads
- Mechanisms to safely coordinate among threads running concurrently in the same address space
- Processes:
- Operate within their own address space
- The OS and hardware make sure that no access from one address space is allowed to be performed on memory that belongs to another
- Threads:
- Share the same virtual-to-physical address mappings
- Can access the same data at the same time (concurrency issue)
- To address concurrency issues we use synchronization mechanisms:
- Mutual exclusion:
- Exclusive access granted to only one thread at a time
- Mutex (mutual exclusion object) - a program object that is created so that multiple program threads can take turns sharing the same resource
- Waiting on other threads:
- Specific condition before proceeding
- Condition variable - a container of threads that are waiting for a certain condition
- Waking up other threads from wait state
- There are three main steps in thread creation process:
- Thread type:
- Thread data structure
- Fork (proc, args):
- Create a thread
- Not UNIX fork
- Join (thread):
- Terminate a thread
- Mutex - a lock that should be used whenever accessing data or state that's shared among threads
- When a thread locks a mutex (also termed acquiring the mutex) it has exclusive access to a resource until the thread decides to unlock the mutex
- A mutex has the following information:
- Is the mutex locked?
- Which thread owns the mutex?
- Which threads are blocked?
- Critical section - portion of the code protected by the mutex
- Condition variables can be used in conjunction with mutexes to control the behavior of concurrent threads
- In the consumer/producer example in lecture, there is a condition where both the consumer and the producer check whether the list is (or is not) full before moving forward
- We handle this waiting with a condition variable: the Wait() call releases the mutex so that producers can finish filling up the list, and re-acquires the mutex once the Wait() statement returns
- A condition variable API consists of the following:
- Condition type
- Wait (mutex, condition):
- Mutex is automatically released and re-acquired on wait
- Signal (condition):
- Notify one thread waiting on condition
- Broadcast (condition):
- Notify all waiting threads
- Keep track of mutex/condition variables used with a resource:
- e.g., `mutex_type m1; // mutex for file1`
- Check that you are always (and correctly) using lock and unlock:
- e.g., did you forget to lock/unlock? (the compiler will not flag this)
- Use a single mutex to access a single resource!
- We do not want reads and writes to happen concurrently!
- Check that you are signaling correct condition
- Check that you are not using signal when broadcast is needed:
- Signal - only one thread will proceed, remaining threads will continue to wait, possibly indefinitely!
- Ask yourself: do you need priority guarantees?
- Thread execution order not controlled by signals to condition variables!
- Other pitfalls include:
- Spurious wake-ups
- Deadlocks
- Spurious wake-ups occur when threads are woken up (e.g., by a broadcast) even though they cannot proceed; cycles are wasted context switching them onto the CPU and then back onto the wait queue
- If you signal/broadcast before unlocking the mutex, none of the awakened threads can acquire the lock yet
- Solution: broadcast/signal after the mutex is unlocked; however, this only works in some cases (e.g., the write-to-file example)
- Deadlocks occur when two or more competing threads are waiting on each other to complete but none of them ever do
- Solution: a good general solution is to maintain lock order, e.g., first m_a then m_b
- Kernel level:
- Kernel level threads imply that the OS itself is multi-threaded
- Kernel threads are managed by kernel level components like the kernel level scheduler (the OS scheduler will decide how kernel level threads will be mapped onto the physical CPUs and which one of the threads will execute)
- User level:
- The processes are multi-threaded
- For a user level thread to execute it must be associated with a kernel level thread and the OS level scheduler must schedule that kernel level thread onto a CPU
- What is the relationship between a kernel level thread and a user level thread?
- One-to-one model:
- Pros:
- OS sees/understands threads, synchronization, blocking, etc.
- Cons:
- Must go to OS for all operations (may be expensive)
- OS may have limits on policies, thread number
- Portability
- Many-to-one model:
- Pros:
- Totally portable, does not depend on OS limits and policies
- Cons:
- OS has no insights into application needs
- OS may block entire process if one user level thread blocks on I/O
- Many-to-many model:
- Pros:
- Can be best of both worlds
- Can have bound or unbound threads
- Cons:
- Requires coordination between user and kernel level thread managers
- Process scope:
- User level library manages threads within a single process
- System scope:
- System-wide thread management by OS level thread managers (e.g., CPU scheduler)
- Boss-workers:
- Boss: assigns work to workers
- Worker: performs entire tasks
- Scenario 1: boss assigns work by directly signaling specific worker
- Pros:
- Workers don't need to synchronize
- Cons:
- Boss must track what each worker is doing
- Throughput will go down!
- Scenario 2: boss assigns work in producer/consumer queue
- Pros:
- Boss does not need to know details about workers
- Cons:
- Queue synchronization
- Scenario 3: worker pool (static or dynamic)
- Pros:
- Simplicity
- Cons:
- Thread pool management
- Locality
- Boss-workers variants:
- All workers created equal versus workers specialized for certain tasks
- Pros:
- Better locality
- Quality of service management
- Cons:
- Load balancing
- Pipeline pattern:
- Threads assigned one subtask in the system
- Entire tasks are pipeline of threads
- Multiple tasks concurrently in the system, in different pipeline stages
- Throughput is limited by the longest (weakest-link) stage in the pipeline
- Pipeline stages can be managed via thread pool
- The best way to pass work is through a shared-buffer based communication between stages
- Pros:
- Specialization and locality
- Cons:
- Balancing and synchronization overheads
- Layered pattern:
- Each layer group of related subtasks
- End-to-end task must pass up and down through all layers
- Pros:
- Specialization
- Less fine-grained than pipeline
- Cons:
- Not suitable for all applications
- Synchronization
- Topics to be covered in this lesson:
- What are PThreads?
- POSIX Threads
- What is POSIX?
- POSIX stands for Portable Operating System Interface
- What are PThreads?
- POSIX Threads:
- POSIX versions of Birrell's API
- Specifies syntax and semantics of the operations
- PThread creation is similar to thread abstraction proposed by Birrell:
| Birrell's Mechanisms | PThreads |
| --- | --- |
| Thread | `pthread_t` (type of thread) |
| Fork() | `pthread_create()` |
| Join() | `pthread_join()` |
- `pthread_attr_t`:
- Specified in `pthread_create`
- Defines features of the new thread
- Has default behavior with NULL in `pthread_create`
- Detaching PThreads:
- Mechanism not considered by Birrell
- Default: joinable threads
- Parent thread creates children threads and joins them at a later time
- The parent thread should not terminate until the children threads have completed their executions and have been joined via the explicit join operation
- If the parent thread exits early, children threads can become zombies
- Detached threads:
- There is a possibility for children threads to be detached from the parent
- Once detached, threads cannot join
- If a parent exits, children threads are free to continue their execution
- Parent and children are equivalent to one another
- Ensure you include the PThreads header file, `pthread.h`, in any file that contains PThreads code, otherwise the program will not compile
- Compile the source with `-lpthread` or `-pthread`
- Check the return values of common functions
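- Below is a minimal sketch of PThread creation and joining, assuming a POSIX system (the thread function and counts are illustrative):

```c
/* compile with: gcc example.c -pthread (or -lpthread) */
#include <stdio.h>
#include <pthread.h>

void *hello(void *arg) {
    int id = *(int *)arg;
    printf("hello from thread %d\n", id);
    return NULL;
}

int main(void) {
    pthread_t tids[4];
    int ids[4];

    for (int i = 0; i < 4; i++) {
        ids[i] = i;
        /* NULL attributes => default behavior (joinable thread) */
        if (pthread_create(&tids[i], NULL, hello, &ids[i]) != 0) {
            fprintf(stderr, "pthread_create failed\n");
            return 1;
        }
    }
    for (int i = 0; i < 4; i++)
        pthread_join(tids[i], NULL);  /* check return values in real code */
    return 0;
}
```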
- PThread mutexes were designed to solve mutual exclusion problems among concurrent threads
- Below is a comparison of Birrell's mechanisms and PThreads for mutexes:
| Birrell's Mechanisms | PThreads |
| --- | --- |
| Mutex | `pthread_mutex_t` (mutex type) |
| Lock() (to lock) | `pthread_mutex_lock()` (explicit lock) |
| Lock() (also to unlock) | `pthread_mutex_unlock()` (explicit unlock) |
- Mutex safety tips:
- Shared data should always be accessed through a single mutex!
- Mutex scope must be visible to all!
- Globally order locks
- For all threads, lock mutexes in order
- Always unlock a mutex
- Always unlock the correct mutex
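- Below is a minimal sketch that applies the mutex safety tips above: one globally visible mutex protects one shared counter (names and iteration counts are illustrative):

```c
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t counter_m = PTHREAD_MUTEX_INITIALIZER; /* one mutex per resource */
static long counter = 0;                                       /* shared data */

void *increment(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&counter_m);    /* always lock ... */
        counter++;                         /* ... critical section ... */
        pthread_mutex_unlock(&counter_m);  /* ... and always unlock the same mutex */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);    /* 200000 with proper locking */
    return 0;
}
```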
- Below is a comparison of Birrell's mechanisms and PThreads for condition variables:
| Birrell's Mechanisms | PThreads |
| --- | --- |
| Condition | `pthread_cond_t` (type of condition variable) |
| Wait() | `pthread_cond_wait()` |
| Signal() | `pthread_cond_signal()` |
| Broadcast() | `pthread_cond_broadcast()` |
- There are also other condition variable operations, such as `pthread_cond_init()` and `pthread_cond_destroy()`
- Condition variable safety tips:
- Do not forget to notify waiting threads!
- Predicate change => signal/broadcast correct condition variable
- When in doubt broadcast
- However, broadcast too often will result in performance loss
- You do not need to hold the mutex to signal/broadcast (it may be better to wait until the mutex is released before signaling/broadcasting)
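- Below is a minimal sketch of the wait/signal pattern above for a producer/consumer list, assuming POSIX threads (the capacity and item counts are hypothetical):

```c
#include <pthread.h>
#include <stdio.h>

#define CAPACITY 10

static pthread_mutex_t m        = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  has_item = PTHREAD_COND_INITIALIZER;
static int count = 0;                  /* items currently in the "list" */

void producer_put(void) {
    pthread_mutex_lock(&m);
    while (count == CAPACITY)              /* re-check the predicate: spurious wake-ups */
        pthread_cond_wait(&not_full, &m);  /* releases m while waiting, re-acquires on wake */
    count++;                               /* ... add an item ... */
    pthread_cond_signal(&has_item);        /* notify one waiting consumer */
    pthread_mutex_unlock(&m);
}

void consumer_get(void) {
    pthread_mutex_lock(&m);
    while (count == 0)
        pthread_cond_wait(&has_item, &m);
    count--;                               /* ... remove an item ... */
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&m);
}

static void *prod(void *arg) { for (int i = 0; i < 20; i++) producer_put(); return NULL; }
static void *cons(void *arg) { for (int i = 0; i < 20; i++) consumer_get(); return NULL; }

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, prod, NULL);
    pthread_create(&c, NULL, cons, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    printf("done, count = %d\n", count);
    return 0;
}
```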
No notes from this lesson
- Topics to be covered in this lesson:
- Kernel vs user-level threads
- Threads and interrupts
- Threads and signal handling
- User-level library:
- Provides the thread abstraction, scheduling, sync
- OS kernel:
- Maintains the thread abstraction, scheduling, sync
- ULT (user-level thread):
- User-level thread ID
- User-level refs
- Thread stack
- PCB:
- Virtual address mapping
- KLT (kernel-level thread):
- Stack
- Registers
- When running multiple processes:
- We need copies of the ULT, PCB, and KLT structures for each process
- We need relationships among the ULT, PCB, and KLT structures to identify the address space within which each thread executes
- For a system with multiple CPUs we will need to have another data structure to represent the CPU as well as a relationship between the KLTs and the CPU
- When the kernel is multi-threaded:
- We can have multiple kernel-level threads supporting a single user-level process
- When the kernel needs to context switch among KLTs that belong to different processes, it can quickly determine that they point to different PCBs
- When two KLTs belong to the same address space:
- Information in the PCB is split into hard and light process state
- Hard process state - relevant for all of the ULTs that execute within that process
- Light process state - relevant for a subset of the ULTs that are currently associated with a particular KLT
- Single PCB:
- Large contiguous data structure
- Private for each entity
- Saved and restored on each context switch
- Update for any changes
- Multiple data structures:
- Smaller data structures
- Easier to share
- On context switch only save and restore what needs to change
- User-level library need only update portion of the state
- In general, pivoting to multiple data structures improves scalability, overheads, performance, and flexibility
- Modern OS adopt multiple data structures for organizing information about their execution contexts
- Problem:
- User-level library does not know what is happening in the kernel
- Kernel does not know what is happening in the user-level library
- Solution:
- System calls and special signals allow kernel and ULT to interact and coordinate (as shown in Solaris 2.0 demo in lecture)
- Problem:
- Visibility of state and decisions between kernel and user-level library
- User-level library sees:
- ULTs
- Available KLTs
- Kernel sees:
- KLTs
- CPUs
- Kernel-level scheduler
- Invisible to kernel:
- Mutex variable
- Wait queues
- Additionally, in the many-to-many model, the kernel cannot see:
- User-level scheduling decisions
- Change ULT-KLT mapping
- One way to address visibility issue is to use one-to-one models
- How/when does the user-level library run?
- The process jumps to the user-level library scheduler when:
- ULTs explicitly yield
- Timer set by user-level library expires
- ULTs call library functions like lock/unlock
- Blocked threads become runnable
- User-level library scheduler:
- Runs on ULT operations
- Runs on signals from timer
- Problem:
- Have ULTs split running on multiple CPUs, how to get CPUs to communicate?
- Solution:
- On the kernel level, need to send signal to other KLT on the other CPU to run library code locally
- Problem:
- Have ULTs split running on multiple CPUs, how to get CPUs to synchronize?
- Solution:
- Use adaptive mutexes:
- If critical section short do not block, spin instead!
- For long critical sections we resort to default blocking behavior
- Destroying threads:
- Instead of destroying, we should reuse threads
- When a thread exits:
- Put on death row
- Periodically destroyed by reaper thread
- Otherwise thread structures/stacks are reused (performance gains!)
- Interrupts:
- Events generated externally by components other than the CPU (I/O devices, timers, other CPUs)
- Determined based on the physical platform
- Appear asynchronously
- Signals:
- Events triggered by the CPU and software running on it
- Determined based on the operating system
- Appear synchronously or asynchronously
- Interrupts and Signals:
- Have a unique ID depending on the hardware or OS
- Can be masked and disabled/suspended via corresponding mask
- Per-CPU interrupt mask, per-process signal mask
- If enabled, trigger corresponding handler
- Interrupt handler set for entire system by OS
- Signal handlers set on per process basis, by process
- Recall that interrupts are generated externally
- When a device sends an interrupt to the CPU it is basically sending a signal through the interconnect that connects the device to the CPU complex
- For most modern devices, an MSI (Message Signaled Interrupt) message can be carried on the same interconnect that connects the devices to the CPU complex
- Based on the MSI message, the interrupt can be uniquely identified and looked up in an interrupt handler table
- Recall that signals are generated internally
- If a thread decides to access illegal memory, a signal (`SIGSEGV`) will be generated by the OS
- The OS maintains a signal handler table for every process in the system
- The OS specifies default actions for handling signals, but a process may also specify how a particular signal should be handled (overriding the default)
- Handlers/actions (default actions):
- Terminate
- Ignore
- Terminate and core dump
- Stop or continue
- Synchronous:
- `SIGSEGV` (access to protected memory)
- `SIGFPE` (divide by zero)
- `SIGKILL` (kill, id) can be directed to a specific thread
- Asynchronous:
- `SIGKILL` (kill)
- `SIGALRM` (alarm)
- Problem:
- Interrupts and signals are handled on the thread stack which can cause handler code to deadlock
- Solution:
- Control interruptions by handler code (user interrupt/signal masks)
- A mask is a sequence of bits where each bit corresponds to a specific interrupt or signal; the value of the bit, zero or one, indicates whether that specific interrupt or signal is disabled or enabled
- Interrupt masks are per CPU:
- If mask disables interrupt, hardware interrupt routing mechanism will not deliver interrupt to CPU
- Signal masks are per execution context (ULT on top of KLT); if the mask disables a signal, the kernel sees the mask and will not interrupt the corresponding thread
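- Below is a minimal sketch of a per-process signal handler plus a per-thread signal mask, assuming a POSIX system (`SIGUSR1` is just an illustrative signal):

```c
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

static void handler(int sig) {
    (void)sig;   /* real handlers should only do async-signal-safe work */
}

int main(void) {
    /* the signal handler table is per process: install a handler for SIGUSR1 */
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sigaction(SIGUSR1, &sa, NULL);

    /* the signal mask is per execution context: block SIGUSR1 in this thread */
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGUSR1);
    pthread_sigmask(SIG_BLOCK, &mask, NULL);

    raise(SIGUSR1);                            /* stays pending while masked */
    pthread_sigmask(SIG_UNBLOCK, &mask, NULL); /* pending signal delivered here */
    printf("SIGUSR1 handled after unblocking\n");
    return 0;
}
```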
- Interrupts can be directed to any CPU that has them enabled
- May set interrupt on just a single core
- Avoids overheads and perturbations on all other cores
- One-shot signals:
- n signals pending == 1 signal pending: the handler will run at least once
- Must be explicitly re-enabled
- Real-time signals:
- If n signals raised, then handler is called n times
- Problem:
- Deadlocks can happen for signal handling routines
- Solution:
- As mentioned in the SunOS paper, we can allow interrupts to become full-fledged threads every time interrupts are performing blocking operations
- However, dynamic thread creation is expensive!
- Dynamic decision:
- If handler doesn't lock, execute on interrupted thread's stack
- If handler can block, turn into real thread
- Optimization:
- Pre-create and pre-initialize thread structures for interrupt routines
- Interrupts as threads can be handled in two ways (see diagram from lecture):
- Top half:
- Fast, non-blocking
- Min amount of processing
- Bottom half:
- Arbitrary
- Complexity
- Bottom line:
- To permit arbitrary functionality to be incorporated into the interrupt-handling operations, the handling routine must be executed by another thread where synchronization and blocking is a possibility
- Overall cost:
- Overhead of 40 SPARC instructions per interrupt
- Saving of 12 instructions per mutex
- No changes in interrupt mask, level, etc.
- Fewer interrupts than mutex lock/unlock operations
- To summarize, optimize for the common case!
- Main execution abstraction: task
- KLT
- Single-threaded process: one task
- Multi-threaded process: many tasks
- Task creation - use the `clone()` system call
- Linux threads model:
- NPTL (Native POSIX Threads Library) - one-to-one model:
- Kernel sees each ULT info
- Kernel traps are cheaper
- More resources: memory, large range of IDs, etc.
- Older LinuxThreads - many-to-many model:
- Similar issues to those described in Solaris papers
- Topics to be covered in this lesson:
- Performance comparisons:
- Multi-process vs multi-threaded vs event-driven
- Event-driven architectures
- "Flash: An Efficient and Portable Web Server" vs Apache
- Consider the Boss-worker vs Pipeline example as discussed in lesson 5 (see lecture for specific initial conditions):
- We care about two metrics, execution time and average time to complete order
- The Boss-worker model has an execution time greater than that of the Pipeline model (undesirable)
- However, the Boss-worker model has an average time to complete order less than that of the Pipeline model (desirable)
- Which model is better?
- It really depends on the metrics!
- Threads are useful because of:
- Parallelization: speed up
- Specialization: hot cache!
- Efficiency: lower memory requirement and cheaper synchronization
- Threads hide latency of I/O operations (single CPU)
- Now consider what is useful...
- For a matrix multiply application: execution time
- For a web service application:
- Number of client requests/time
- Response time
- For hardware: higher utilization (e.g., CPU)
- Again, it depends on the metrics!
- Metrics exist for OS and for toy shops (some examples below):
- Throughput:
- Process completion rate
- Response time:
- Average time to respond to input (e.g., mouse click)
- Utilization:
- Percentage of CPU
- A metric is a measurable and/or quantifiable property (e.g., execution time) of the system we are interested in (a software implementation of a problem) that can be used to evaluate the system's behavior (e.g., its improvement compared to other implementations)
- What are some performance metrics computer scientists typically care about?
- Previously covered metrics:
- Execution time
- Throughput
- Request rate
- CPU utilization
- Other metrics one might care about:
- Wait time
- Platform efficiency
- Performance/cost
- Performance/power
- Percentage of SLA violations
- Client-perceived performance
- Aggregate performance
- Average resource usage
- Performance metrics are a measurable quantity obtained from:
- Experiments with real software deployment, real machines, real workloads
- Toy experiments representative of realistic settings
- Simulation: test-bed
- Depends on metrics!
- Depends on workload!
- Bottom line: it depends!
- Consider how to best provide concurrency (see lecture for simple web server example):
- Multi-process web server:
- Pros: simple programming
- Cons:
- Many processes mean high memory usage, costly context switches, and shared state that is hard/costly to maintain (tricky port setup)
- Multi-threaded web server:
- Pros:
- Shared address space
- Shared state
- Cheap context switch
- Cons:
- Not simple implementation
- Requires synchronization
- Underlying support for threads
- An event-driven model contains the following elements (see lecture for diagram):
- Event handlers:
- Accept connection
- Read request
- Send header
- Read file/send data
- Dispatch loop
- Events:
- Receipt of request
- Completion of send
- Completion of disk read
- An event driven model has a single address space, single process, and single thread of control!
- The dispatcher acts as a state machine for external events: it calls the corresponding handler, i.e., jumps to the handler code
- The handler:
- Runs to completion
- If it needs to block:
- Initiate blocking operation and pass control to dispatch loop
- If the event-driven model operates on a single thread, how to achieve concurrency?
- Single thread switches among processing of different requests
- Multi-process and multi-threaded:
- One request per execution context (process/thread)
- Event-driven:
- Many requests interleaved in an execution context
- Why does this work?
- On one CPU, threads hide latency:
- If `t_idle > 2 * t_context_switch`, context switch to hide latency
- If `t_idle == 0`, context switching just wastes cycles that could have been used for request processing
- Event-driven:
- Process request until wait necessary then switch to another request
- Multiple CPUs:
- Multiple event-driven processes
- How does this work?
- Event is equal to input on FD (file descriptor)
- Which file descriptor?
select()
poll()
epoll()
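- Below is a minimal sketch of an event-driven dispatch loop built on `select()`, assuming a POSIX system (the two pipes stand in for event sources such as sockets or disk completions):

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/select.h>

int main(void) {
    int a[2], b[2];
    pipe(a);                         /* hypothetical event sources */
    pipe(b);
    write(a[1], "x", 1);             /* pretend an event arrived on source a */

    for (int handled = 0; handled < 1; ) {
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(a[0], &readfds);
        FD_SET(b[0], &readfds);
        int maxfd = (a[0] > b[0] ? a[0] : b[0]) + 1;

        /* dispatch loop: block until some FD has input (an "event") */
        if (select(maxfd, &readfds, NULL, NULL, NULL) < 0)
            break;

        char buf[1];
        if (FD_ISSET(a[0], &readfds)) {   /* jump to the handler for source a */
            read(a[0], buf, 1);
            printf("handled event on source a\n");
            handled++;
        }
        if (FD_ISSET(b[0], &readfds)) {   /* handler for source b */
            read(b[0], buf, 1);
            printf("handled event on source b\n");
            handled++;
        }
    }
    return 0;
}
```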
- Benefits of event-driven model:
- Single address space and single flow of control
- Smaller memory requirement and no context switching
- No synchronization
- Problem: for the event-driven model, a blocking request/handler will block the entire process
- Solution:
- Use asynchronous I/O operations:
- Process/thread makes system call
- OS obtains all relevant info from the stack and either learns where to return results, or tells the caller where to get results later
- Process/thread can continue
- However, an asynchronous system call requires support from kernel (e.g., threads) and/or device
- In general, asynchronous system calls fit nicely with the event-driven model!
- Another problem: what if async calls are not available?
- Use helpers:
- Designated for blocking I/O operations only
- Pipe/socket based communication with event dispatcher
- Helper blocks! But main event loop (and process) will not!
- At the time, threading support was not widely available, so an AMPED (Asymmetric Multi-Process Event-Driven) solution was created, similar to the helper approach mentioned above
- With the addition of multi-threading capabilities, the multi-threaded event-driven model discussed previously became known as the AMTED (Asymmetric Multi-Threaded Event-Driven) solution
- In summary, helper threads/processes:
- Pros:
- Resolves portability limitations of basic event-driven model
- Smaller footprint than regular worker thread
- Cons:
- Applicability to certain classes of applications
- Event routing on multi-CPU systems
- Flash: event-driven web server:
- An event-driven web server (AMPED) with asymmetric helper processes
- Helpers used for disk reads
- Pipes used for communication with dispatcher
- Helper reads file in memory (via memory map)
- Dispatcher checks (via mincore) if pages are in memory to decide local handler or helper
- In general, the Flash web server can offer big savings
- Flash: additional optimizations:
- Application-level caching (data and computation)
- Alignment for DMA (direct memory access)
- Use of DMA with scatter-gather, vector I/O operations
- Back in the day these optimizations would be novel, now they are fairly common
- An Apache web server (diagram available in lecture slides) consists of the following elements:
- Core - basic server skeleton
- Modules - per functionality
- Flow of control - similar to event-driven model
- However, an Apache web server differs in:
- Combination of MP and MT:
- Each process is equivalent to boss/worker with dynamic thread pool
- Number of processes can also be dynamically adjusted
- To set up performance comparisons consider the following:
- First, define the comparison points:
- What systems are you comparing?
- Second, define inputs:
- What workloads will be used?
- Third, define metrics:
- How will you measure performance?
- When data is in cache:
- SPED (single-process event-driven) >> AMPED Flash:
- Unnecessary test for memory presence
- SPED and AMPED Flash >> MT/MP:
- Sync and context switching overhead
- Disk-bound workload:
- AMPED Flash >> SPED:
- Blocks because no async I/O
- AMPED Flash >> MT/MP:
- More memory efficient and less context switching
- Design relevant experiments: experiments that allow you to make statements about a solution that others believe in and care about
- Purpose of relevant experiments (e.g., web server experiment):
- Clients: response time
- Operations: throughput
- Possible goals:
- Improve both response time and throughput
- Improve response time
- Improve response time even at the cost of throughput
- Maintain response time as the request rate increases
- Goals: metrics and configuration of experiments
- Rule of thumb for picking metrics:
- Standard metrics appeal to a broader audience
- Metrics answering the "Who? What? Why?" questions:
- Client performance: response time, timed-out request, etc.
- Operator costs: throughput, costs, etc.
- Picking the right configuration space:
- System resources:
- Hardware and software
- Workload:
- Web server: request rate, number of concurrent requests, file size, access pattern
- Now pick!:
- Choose a subset of configuration parameters
- Pick ranges for each variable factor
- Pick relevant workload
- Include the best/worst case scenarios
- Are you comparing apples to apples?:
- Pick useful combination of factors, many just reiterate the same point
- What about the competition and baseline?:
- Compare system to:
- State-of-the-art
- Most common practice
- Ideal best/worst case scenario
- If you have designed the experiments you should consider:
- Running test cases n times
- Compute metrics
- Represent results
- Additionally, do not forget about making conclusions!
No notes from this lesson
- Topics to be covered in this lesson:
- Scheduling mechanisms, algorithms and data structures
- Linux O(1) and CFS schedulers
- Scheduling on multi-CPU platforms
- Like an OS scheduler, a toy shop manager schedules work:
- Assign task immediately:
- Scheduling is simple (first-come first-serve)
- Assign simple tasks first:
- Maximize throughput (shortest job first)
- Assign complex tasks first:
- Maximize utilization of CPU devices, memory, etc.
- The CPU scheduler:
- Decides how and when processes (and their threads) access shared CPUs
- Schedules tasks running user-level processes/threads as well as KLTs
- Chooses one of ready tasks to run on CPU when:
- CPU becomes idle
- New task becomes ready
- Time-slice expired timeout
- Thread is dispatched on CPU
- Scheduling is equivalent to choosing a task from ready queue:
- Which task should be selected?
- Scheduling policy/algorithm
- How is this done?
- Depends on the run-queue data structure (the run-queue and the scheduling algorithm are tightly coupled)
- Initial assumptions:
- Group of tasks/jobs
- Known execution times
- No preemption
- Single CPU
- Metrics:
- Throughput
- Average job completion time
- Average job wait time
- CPU utilization
- First-come first-serve (FCFS):
- Schedules tasks in order of arrival
- Run-queue is the same as queue (FIFO)
- Shortest job first (SJF):
- Schedules tasks in order of their execution time
- Run-queue is the same as ordered queue or tree
- SJF + Preemption:
- T2 arrives first and starts running
- When shorter tasks arrive, T2 should be preempted
- Heuristics based on history: job running time
- How long did a task run last time?
- How long did a task run last n times?
- Priority scheduling:
- Tasks have different priority levels
- Run highest priority tasks next (preemption)
- Run-queue is equal to per priority queues, tree ordered based on priority, etc.
- Low priority tasks stuck in a run-queue (starvation)
- Priority aging: `priority = f(actual priority, time spent in run queue)`
- Eventually task will run (prevents starvation!)
- Assume SJF (see lecture for table and graph):
- Priority: T1, T2, T3
- Order of execution: T2, T3, T1 (priorities inverted)
- Solution:
- Temporarily boost the priority of the mutex owner
- Lower it again on release
- Pick up the first task from the queue (like FCFS)
- Task may yield, to wait on I/O (unlike FCFS)
- Round robin with priorities:
- Include preemption
- Round robin with interleaving:
- Time-slicing
- Time-slice - maximum amount of uninterrupted time given to a task (time quantum)
- Task may run less than time-slice time:
- Has to wait for I/O, synchronization, etc. (will be placed on a queue)
- Higher priority task becomes runnable
- Using time-slices tasks are interleaved (time-sharing the CPU):
- CPU bound tasks (preempted after time-slice)
- Pros:
- Short tasks finish sooner
- More responsive
- Lengthy I/O operations initiated sooner
- Cons:
- Overheads (interrupt, schedule, context switch)
- How long should a time-slice be?
- CPU bound tasks prefer longer time-slices:
- Limits context switching overheads
- Keeps CPU utilization and throughput high
- I/O bound tasks prefer shorter time-slices:
- I/O bound tasks can issue I/O operations earlier
- Keeps CPU and device utilization high
- Better user-perceived performance
- If we want I/O and CPU bound tasks to have different time-slice values, then either:
- Same run-queue, check type, etc.
- Two different structures
- One solution: use a multi-queue data structure with separate internal queues
- The first queue (smallest time-slice) is for the most I/O intensive tasks (highest priority)
- The second queue (medium time-slice) is for tasks with a mix of I/O and CPU processing
- The third queue and beyond (largest time-slices) are for CPU intensive tasks (lowest priority)
- Pros:
- Time-slicing benefits provided for I/O bound tasks
- Time-slicing overheads avoided for CPU bound tasks
- Handling different time-slice values:
- Tasks enter top-most queue
- If task yields voluntarily keep task at this level
- If task uses up entire time-slice push down to lower level
- Task in lower queue gets priority boost when releasing CPU due to I/O waits
- In summary, an MLFQ (multi-level feedback queue) is not simply a set of priority queues; it has a feedback mechanism and offers different treatment of threads at each level
- The Linux O(1) scheduler has several unique characteristics:
- The name O(1) means it takes constant time to select/add task, regardless of task count
- Preemptive, priority-based:
- Real time (0-99)
- Time-sharing (100-139)
- User processes:
- Default 120
- Nice value (-20 to 19)
- Time-slice value for the Linux O(1) scheduler:
- Depends on priority
- Smallest for low priority
- Highest for high priority
- Feedback for the Linux O(1) scheduler:
- Sleep time: waiting/idling
- Longer sleep: interactive
- Smaller sleep: compute-intensive
- Run-queue for Linux O(1) scheduler: two arrays of tasks...
- Active:
- Used to pick next task to run
- Constant time to add/select
- Tasks remain in queue in active array until time-slice expires
- Expired:
- Inactive list
- When no more tasks in active array (swap active and expired)
- Problems with Linux O(1) scheduler:
- Performance of interactive tasks is not satisfactory
- Lacks fairness during task prioritization
- Solution: Linux CFS (Completely Fair Scheduler)
- CFS is the default scheduler since Linux 2.6.23
- Run-queue is based on a red-black tree:
- Ordered by `vruntime`, where `vruntime` is the time spent running on the CPU
- CFS scheduling works as follows:
- Always pick left-most node
- Periodically adjust `vruntime`
- Compare to the left-most `vruntime`:
- If smaller, continue running
- If larger, preempt and place appropriately in the tree
- `vruntime` progress rate depends on priority and niceness:
- Rate faster for low-priority
- Rate slower for high-priority
- Same tree for all priorities
- Performance:
- Select task: O(1)
- Add task: O(log(n))
- Cache-affinity important!
- Keeps tasks on the same CPU as much as possible
- Hierarchical scheduler architecture
- Per-CPU run-queue and scheduler:
- Load balance across CPUs based on queue length or when CPU is idle
- NUMA (Non-uniform Memory Access):
- Multiple memory nodes
- Memory node closer to a socket of multiple processors:
- Access to local memory node faster than access to remote memory node
- Keep tasks on CPU closer to memory node where their state is
- NUMA-aware scheduling
- Multiple hardware-supported execution contexts
- Still one CPU but with very fast context switch:
- Hardware multi-threading
- Hyper-threading
- CMT (chip multi-threading)
- SMT (simultaneous multi-threading)
- Assumptions:
- A thread issues an instruction on each cycle (max IPC of 1, i.e., one instruction per cycle)
- Memory access (four cycles)
- Hardware switching instantaneous
- SMT with two hardware threads
- Co-scheduling two compute-bound threads: they interfere and contend for the CPU pipeline resource:
- Performance degrades
- Memory stays idle
- Co-scheduling two memory-bound threads: the CPU idles and cycles are wasted
- Mix of CPU and memory-intensive threads:
- Avoid/limit contention on processor pipeline
- All components (CPU and memory) well utilized
- However, still leads to interference and degradation but minimal
- Use historic information:
- Sleep time won't work:
- The thread is not sleeping when waiting on memory
- Tracking this in software would take too much time to compute
- What about hardware counters?
- Hardware counters estimate what kind of resources a thread needs
- Scheduler can make informed decisions:
- Typically multiple counters
- Models with per architecture thresholds
- Based on well-understood workloads
- Resource contention in SMTs for the processor pipeline
- Hardware counters can be used to characterize workload
- Schedulers should be aware of resource contention, not just load balancing
- Topics to be covered in this lesson:
- Physical and virtual memory management
- Review of memory management mechanisms
- Illustration of advanced services
- OS and toy shops each have memory (part) management systems:
- Uses intelligently sized containers:
- Memory pages or segments
- Not all memory is needed at once:
- Tasks operate on subset of memory
- Optimized for performance:
- Reduce time to access state in memory (better performance)
- Virtual vs physical memory:
- Allocate: Allocation, replacement, etc.
- Arbitrate: Address translation and validation
- Page-based memory management:
- Allocate: Pages to page frames
- Arbitrate: Page tables
- Segment-based memory management:
- Allocate: Segments
- Arbitrate: segment registers
- MMU (memory management unit):
- Translate VA (virtual address) to PA (physical addresses)
- Reports faults: illegal access, permission violations, page not present in memory
- Registers:
- Pointers to page table
- Base and limit size, number of segments, etc.
- Cache - TLB (translation look-aside buffer):
- Valid virtual address to physical address translations: TLB
- Translation:
- Actual PA generation done in hardware
- Virtual memory pages and physical memory page frames are the same size
- Useful acronyms for page tables:
- VPN (virtual page number)
- PFN (physical frame number)
- Page table has allocation on first touch!
- Unused pages are reclaimed:
- Mapping invalid
- Hardware will fault
- Re-establish on re-access
- In summary, the OS creates a page table for every process:
- On context switch, switch to valid page table
- Update register
- Flags:
- Present (valid/invalid)
- Dirty (written to)
- Accessed (for read or write)
- Protection bits (read, write, and execute)
- Page fault:
- Two options for PFN:
- Generate physical address and access
- Generate error code on kernel stack (trap into kernel)
- Page fault handler determines action based on error code and faulting address:
- Bring page from disk to memory
- Protection error (`SIGSEGV`)
- The error code is generated from the PTE flags
- The faulting address is stored in the CR2 register
- 64-bit Architecture:
- PTE (Page Table Entry): 8 bytes including PFN + flags
- VPN: 2^64 / page size
- Page table size: (2^64 / 2^12) * 8 bytes = 32 petabytes per process
- Process does not use entire address space:
- Even on a 32-bit architecture, a process will not always use all of the 4 GB
- But page table assumes an entry per VPN, regardless of whether corresponding virtual memory is needed or not
- Outer page table: page table directory
- Internal page table: only for valid virtual memory regions
- Additional layers:
- Page table directory pointer (third level)
- Page table directory pointer map (fourth level)
- Important on 64-bit architectures
- Larger and more sparse
- Larger gaps could save more internal page table components
- Multi-level page tables:
- Pros:
- Smaller internal page tables/directories
- Granularity of coverage (potential reduced page table size)
- Cons:
- More memory access required for translation (increased translation latency)
- Overhead of address translation:
- Single-level page table:
- 1x access to page table entry
- 1x access to memory
- Four-level page table:
- 4x accesses to page table entries
- 1x access to memory (can lead to slow down!)
- Page table cache (TLB):
- MMU-level address translation cache
- On TLB miss: page table access from memory
- Has protection/validity bits
- A small number of cached addresses can still give a high TLB hit rate, thanks to temporal and spatial locality
- Another way of organizing the address translation process (see lecture for the inverted page table diagram):
- Components:
- Logical address
- Physical address
- Physical memory
- Page table
- Search
- TLB to catch memory references
- Inverted page tables use hashing page tables (see lecture for diagram) to optimize efficiency:
- Hashing speeds up the linear search by narrowing it down to a few possible entries in the inverted page table, which speeds up address translation
- Segments are arbitrary and granular:
- E.g., code, heap, data, stack, etc.
- Address is equivalent to the segment selector + offset
- Segment is a contiguous physical memory:
- Segment size is described by the segment base and limit registers
- Segmentation + paging:
- IA x86_32: segmentation + paging
- Linux: up to 8K segments per process, plus global segments
- IA x86_64: paging
- 10-bit offset: 1 KB page size
- 12-bit offset: 4 KB page size
- Below is a table detailing large vs huge pages
| | Large | Huge |
| --- | --- | --- |
| Page size | 2 MB | 1 GB |
| Offset bits | 21 bits | 30 bits |
| Reduction factor (on page table size) | x512 | x1024 |
- In general, for larger pages:
- Pros: fewer page table entries, smaller page tables, more TLB hits, etc.
- Cons: internal fragmentation (wastes memory)
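- Below is a small worked calculation of how the offset width determines the page size and the number of page table entries needed to cover a hypothetical 1 GB virtual region:

```c
#include <stdio.h>

int main(void) {
    unsigned long long region = 1ULL << 30;    /* 1 GB region (illustrative) */
    int offsets[] = {12, 21, 30};              /* 4 KB, 2 MB, 1 GB pages */
    for (int i = 0; i < 3; i++) {
        unsigned long long page_size = 1ULL << offsets[i];
        printf("offset bits %2d -> page size %10llu bytes -> %7llu PTEs for 1 GB\n",
               offsets[i], page_size, region / page_size);
    }
    return 0;
}
/* 12 bits -> 4 KB pages -> 262144 PTEs
   21 bits -> 2 MB pages ->    512 PTEs (x512 fewer)
   30 bits -> 1 GB pages ->      1 PTE  */
```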
- Memory allocator:
- Determines the VA to PA mapping
- Address translation, page tables, etc. then simply determine the PA from a VA and check validity/permissions
- Kernel-level allocators:
- Kernel state, static process state
- User-level allocators:
- Dynamic process state (heap), malloc/free
- E.g., `dlmalloc`, `jemalloc`, `hoard`, `tcmalloc`
- Problem: external fragmentation
- Occurs with multiple interleaved allocate and free operations; as a result, we have holes of free memory that are not contiguous
- Requests for larger contiguous memory allocations cannot be satisfied
- Solution:
- When pages are freed, the allocator can aggregate adjacent areas of free pages into one larger free area, this allows for larger future requests
- The Linux kernel relies on two basic allocation mechanisms:
- Buddy:
- Starts with consecutive memory region that's free (2^x area)
- On request, sub-divide into 2^x chunks and find smallest 2^x chunk that can satisfy request (fragmentation still there)
- On free:
- Check buddy to see if it can aggregate into a larger chunk
- Aggregate more up the tree (aggregation works well and fast)
- Slab:
- Addresses 2^x granularity in Buddy
- Addresses internal fragmentation
- Slab allocator:
- Caches for common object types/sizes, on top of contiguous memory
- Pros:
- Internal fragmentation avoided
- External fragmentation not an issue
- Virtual memory >> physical memory:
- A virtual memory page is not always in physical memory
- Physical page frame saved and restored to/from secondary storage
- Demand paging: pages are swapped in/out between memory and a swap partition
- Original PA is not equal to PA after swap
- If a page is pinned, swapping is disabled
- When should pages be swapped out?
- OS runs page (out) daemon:
- When memory usage is above threshold (high watermark)
- When CPU usage is below threshold (low watermark)
- Which pages should be swapped out?
- Pages that won't be used
- History-based prediction:
- LRU (least-recently used) policy: access bit to track if page is referenced
- Pages that don't need to be written out
- Dirty bit to track if modified
- Avoid non-swappable pages
- In Linux:
- Parameters to tune thresholds: target page count, etc.
- Categorize pages into different types: e.g., claimable, swappable, etc.
- Second chance variation of LRU
- MMU Hardware:
- Perform translation, track access, enforce protection, etc.
- Useful to build other services and optimizations
- COW (copy-on-write):
- On process creation:
- Copy entire parent address space
- Many pages are static, don't change (why keep multiple copies?)
- On create:
- Map new VA to original page
- Write protect original page
- If only read: save memory and time to copy
- On write:
- A write triggers a page fault and then a copy
- Pay copy cost only if necessary
- Check-pointing: failure and recovery management technique
- Periodically save process state
- Failure may be unavoidable but can restart from checkpoint so recovery much faster
- Simple approach: pause and copy
- Better approach:
- Write-protect and copy everything once
- Copy diffs of dirtied pages for incremental checkpoints, rebuild from multiple diffs, or in background
- Debugging:
- RR (rewind-replay)
- Rewind means to restart from checkpoint
- Gradually go back to older checkpoints until error found
- Migration:
- Continue on another machine
- Disaster recovery
- Consolidation
- Repeated checkpoints in a fast loop until pause-and-copy becomes acceptable (or unavoidable)
- Topics to be covered in this lesson:
- IPC (inter-process communication)
- Shared Memory IPC
- An IPC is like working together in the toy shop:
- Processes share memory:
- Data shared in memory
- Processes exchange messages:
- Message passing via sockets
- Requires synchronization:
- Mutexes, writing, etc.
- IPC: OS-supported mechanisms for interaction among processes (coordination and communication)
- Message passing: e.g., sockets, pipes, message queues
- Memory-based IPC: shared memory, memory mapped files
- Higher-level semantics: files, RPC
- Synchronization primitives
- Processes create messages and send/receive them:
- Send/write messages to a port
- Receive/read messages from a port
- OS creates and maintains a channel (i.e., buffer, FIFO queue, etc.):
- OS provides interface to processes - a port
- Kernel required to:
- Establish a connection
- Perform each IPC operation
- Send: system call + data copy
- Receive: system call + data copy
- Pros:
- Simplicity: kernel does channel management and synchronization
- Cons: overheads
- Pipes:
- Carry byte stream between two processes (e.g., connect output from one process to input of another)
- Message queues:
- Carry messages among processes
- OS management includes priorities, scheduling of message delivery, etc.
- APIs: `SysV` and `POSIX`
- Sockets:
- `send()`, `recv()` to pass message buffers
- `socket()` to create a kernel-level socket buffer
- Associate necessary kernel-level processing (TCP/IP)
- If different machines, channel between process and network device
- If same machine, bypass the full protocol stack
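- Below is a minimal sketch of message-passing IPC over a pipe, assuming a POSIX system; note that each send/receive is a system call plus a data copy through the kernel-managed channel:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void) {
    int fd[2];
    if (pipe(fd) < 0) return 1;            /* kernel-managed channel */

    pid_t pid = fork();
    if (pid == 0) {                        /* child: the "receiver" */
        close(fd[1]);
        char buf[64] = {0};
        read(fd[0], buf, sizeof buf - 1);  /* system call + copy out of kernel */
        printf("child received: %s\n", buf);
        return 0;
    }
    close(fd[0]);                          /* parent: the "sender" */
    const char *msg = "hello via pipe";
    write(fd[1], msg, strlen(msg));        /* system call + copy into kernel */
    close(fd[1]);
    waitpid(pid, NULL, 0);
    return 0;
}
```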
- Read and write to shared memory region
- OS establishes shared channel between the processes
- Physical pages mapped into virtual address space
- VA (P1) and VA (P2) map to the same physical address (see lecture for diagram)
- VA (P1) is not equal to VA (P2)
- Physical memory does not need to be contiguous
- Pros:
- System calls are needed only for setup; data copies are potentially reduced (but not eliminated)
- Cons:
- Explicit synchronization
- Communication protocol, shared buffer management, etc. (programmer responsibility)
- Goal: transfer data from one into target address space
- Copy:
- CPU cycles to copy data to/from port
- Large data:
t(copy) >> t(map)
- Map:
- CPU cycles to map memory into address space
- CPU to copy data to channel
- Set up once use many times (good payoff)
- Can perform well for one-time use
- Tradeoff exercised in Windows LPC (Local Procedure Calls)
- Segments of shared memory: not necessarily contiguous physical pages
- Shared memory is system-wide: system limits on number of segments and total size
- Create: OS assigns unique key
- Attach: map VA to PA
- Detach: invalidate address mappings
- Destroy: only remove when explicitly deleted (or reboot)
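- Below is a minimal sketch of the SysV shared memory calls above, assuming a SysV-capable system (the key and sizes are illustrative; synchronization is intentionally omitted and remains the programmer's responsibility):

```c
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void) {
    /* create: the OS sets up a segment identified by a key (IPC_PRIVATE here) */
    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (shmid < 0) return 1;

    /* attach: map the segment into this process's virtual address space */
    char *addr = (char *)shmat(shmid, NULL, 0);
    if (addr == (char *)-1) return 1;

    strcpy(addr, "data visible to any process that attaches this segment");
    printf("%s\n", addr);

    shmdt(addr);                      /* detach: invalidate this mapping */
    shmctl(shmid, IPC_RMID, NULL);    /* destroy: removed only when explicitly deleted */
    return 0;
}
```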
- Like threads accessing shared state in a single address space but for processes
- Synchronization method:
- Mechanisms supported by process threading library (pthreads)
- OS-supported IPC for synchronization
- Either method must coordinate:
- Number of concurrent access to shared segment
- When data is available and ready for consumption
- Consider the following when designing for memory:
- Different API/mechanisms for synchronization
- OS provides shared memory and is out of the way
- Data passing/sync protocols are up to the programmer
- One large segment: manager for allocating/freeing memory from shared segment
- Many small segments:
- Use pool of segments, queue of segment ids
- Communicate segment IDs among processes
- Consider the following questions:
- What size segments?
- What if the data doesn't fit?
- Segment size is equivalent to data size:
- Works for well-known static sizes
- Limits max data size
- Segment size is greater than message size:
- Transfer data in rounds
- Include protocol to track progress
- Topics to be covered in this lesson:
- More synchronization constructs
- Hardware supported synchronization
- Synchronization is like waiting for a coworker to finish so you can continue working:
- May repeatedly check to continue:
- Sync using spinlocks
- May wait for a signal to continue:
- Sync using mutexes and condition variables
- Waiting hurts performance:
- CPUs waste cycles for checking
- Cache effects
- Limitation of mutexes and condition variables:
- Error prone/correctness/ease-of-use:
- Unlock wrong mutex, signal wrong condition variable
- Lack of expressive power:
- Helper variables for access or priority control
- Low level support:
- Hardware atomic instructions
- Spinlock is like a mutex:
- Mutual exclusion
- Lock and unlock (free)
- When the lock is busy, the thread spins (repeatedly checks the lock) instead of blocking
- Semaphores are a common sync construct in OS kernels:
- Similar to a traffic light (stop and go state)
- Similar to a mutex but more general
- Count-based sync (semaphores can be an integer value):
- On init: assigned a max value positive integer (maximum count)
- On try (wait): if non-zero, decrement and proceed (counting semaphore)
- If initialized with 1: the semaphore is equivalent to a mutex (binary semaphore)
- On exit (post): increment
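- Below is a minimal sketch of counting-semaphore behavior using POSIX `sem_init`/`sem_wait`/`sem_post` (the initial count of 2 is purely illustrative):

```c
#include <stdio.h>
#include <pthread.h>
#include <semaphore.h>

static sem_t slots;   /* counting semaphore: how many threads may proceed at once */

static void *worker(void *arg) {
    long id = (long)arg;
    sem_wait(&slots);                 /* try/wait: if non-zero, decrement and proceed */
    printf("thread %ld inside (at most 2 at a time)\n", id);
    sem_post(&slots);                 /* exit/post: increment */
    return NULL;
}

int main(void) {
    sem_init(&slots, 0, 2);           /* illustrative maximum count of 2 */
    pthread_t t[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    sem_destroy(&slots);
    return 0;
}
```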
- Syncing different types of accesses:
- Read (never modify): shared access
- Write (always modify): exclusive access
- Read/write locks:
- Specify the type of access then lock behaves accordingly
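- A sketch with POSIX reader/writer locks (illustrative; the shared counter stands in for any shared resource):

```c
#include <pthread.h>

pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;
int counter;                           // the shared resource

int read_counter(void) {               // read: never modifies => shared access
    pthread_rwlock_rdlock(&rwlock);
    int v = counter;
    pthread_rwlock_unlock(&rwlock);
    return v;
}

void write_counter(int v) {            // write: always modifies => exclusive access
    pthread_rwlock_wrlock(&rwlock);
    counter = v;
    pthread_rwlock_unlock(&rwlock);
}
```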
- Monitors are a high-level synchronization construct
- Monitors specify:
- Shared resource
- Entry procedure
- Possible condition variables
- On entry: lock, check, etc.
- On exit: unlock, check, signal, etc.
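- C has no built-in monitor construct; the sketch below approximates the entry/exit procedures with a mutex and condition variable (the names and the single `busy` flag are illustrative assumptions):

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  ok_to_enter;
    int busy;                          // state of the shared resource
} monitor_t;

void monitor_enter(monitor_t *m) {     // on entry: lock, check, possibly wait
    pthread_mutex_lock(&m->lock);
    while (m->busy)
        pthread_cond_wait(&m->ok_to_enter, &m->lock);
    m->busy = 1;
    pthread_mutex_unlock(&m->lock);
}

void monitor_exit(monitor_t *m) {      // on exit: update state, unlock, signal
    pthread_mutex_lock(&m->lock);
    m->busy = 0;
    pthread_cond_signal(&m->ok_to_enter);
    pthread_mutex_unlock(&m->lock);
}
```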
- Problem: concurrent check/update on different CPUs can overlap
- Solution: hardware-supported atomic instructions
- Hardware specific: `test_and_set()`, `read_and_compare()`, `compare_and_swap()`
- Guarantees:
- Atomicity
- Mutual exclusion
- Queue all concurrent instructions but one
- Atomic instructions are critical sections with hardware-supported synchronization:

```
// Specialize/optimize to available atomics
spinlock_lock(lock):    // spin until free
    while (test_and_set(lock) == busy);
```

- `test_and_set(lock)`: atomically returns (tests) the original value and sets the new value to 1 (busy)
- First thread: `test_and_set(lock) == 0` (free), so it acquires the lock
- Next ones: `test_and_set(lock) == 1` (busy); they reset the lock to 1 (busy), which changes nothing, and keep spinning
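- For reference, a concrete C11 version of the same spinlock (a sketch using `atomic_flag`, which provides test-and-set semantics; not the lecture's implementation):

```c
#include <stdatomic.h>

atomic_flag lock_flag = ATOMIC_FLAG_INIT;    // clear == free

void spinlock_lock(void) {
    while (atomic_flag_test_and_set(&lock_flag))
        ;                                    // previous value was busy: keep spinning
}

void spinlock_unlock(void) {
    atomic_flag_clear(&lock_flag);           // reset to free
}
```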
- A multi-processor system consists of more than one CPU and has memory accessible to all CPUs (see lecture slide for bus-based vs interconnect based)
- SMP (shared memory multi-processors): systems where a bus is shared across all modules which allows the system's memory to be accessible to all CPUs
- SMPs also have cache:
- Hides memory latency
- Memory further away due to contention
- Write policies: no-write, write-through, write-back
- Atomics always issued to the memory controller:
- Pros: can be ordered and synchronized
- Cons: takes much longer and generates coherence traffic regardless of change
- Atomics and SMP:
- Expensive because of bus or I/C contention
- Expensive because of cache bypass and coherence traffic
- Reduce latency:
- Time to acquire a free lock
- Ideally immediately execute atomic
- Reduce waiting time (delay):
- Time to stop spinning and acquire a lock that has been freed
- Ideally immediately
- Reduce contention:
- Bus/network I/C traffic
- Ideally zero
- Test and set spinlock implementation (see lecture):
- Pros:
- Latency: minimal (atomic)
- Delay: Potentially min (spinning continuously on the atomic)
- Cons:
- Contention: processors go to memory on each spin
- Test and test and set spinlock implementation (see lecture):
- Spin on read
- Spin on cached value
- Pros:
- Latency: ok
- Delay: ok
- Cons:
- Contention: better than test and set spinlock but...
- Non-cached coherent architecture: no difference
- Cache coherence with write update architecture: ok
- Cache coherence with write invalidate architecture: horrible
- Contention due to atomics + caches invalidated means more contention
- Everyone sees lock is free at the same time
- Everyone tries to acquire the lock at the same time
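- A hedged sketch of the "spin on read" idea above: spin on a plain (cacheable) load and issue the atomic only when the lock looks free:

```c
#include <stdatomic.h>

atomic_int lock = 0;                          // 0 = free, 1 = busy

void ttas_lock(void) {
    for (;;) {
        while (atomic_load(&lock) == 1)
            ;                                 // test: spin on the cached value
        if (atomic_exchange(&lock, 1) == 0)   // test_and_set only when the lock looked free
            return;
    }
}

void ttas_unlock(void) {
    atomic_store(&lock, 0);
}
```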
- Delay after lock release:
- Everyone sees lock is free
- Not everyone attempts to acquire it
- Pros:
- Contention improved
- Latency ok
- Cons:
- Delay is much worse
- Delay after each lock reference:
- Does not spin constantly
- Works on non-cached coherent architectures
- Can hurt delay even more however
- Pros:
- Contention improved
- Latency ok
- Cons:
- Delay is much worse
- Static Delay (based on fixed value, e.g., CPU ID):
- Simple approach
- Unnecessary delay under low contention
- Dynamic Delay (backoff-based):
- Random delay in a range that increases with perceived contention
- Perceived contention is estimated from failed `test_and_set()` attempts
- Delay after each reference will keep growing based on contention or length of critical section
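- A sketch combining "spin on read" with a delay before the atomic attempt; the initial delay and the doubling factor are illustrative (a static variant would instead derive the delay from, e.g., the CPU ID):

```c
#include <stdatomic.h>
#include <unistd.h>

atomic_int lock = 0;                          // 0 = free, 1 = busy

void delay_lock(unsigned delay_us) {
    for (;;) {
        while (atomic_load(&lock) == 1)
            ;                                 // wait until the lock looks free
        usleep(delay_us);                     // delay before attempting the atomic
        if (atomic_exchange(&lock, 1) == 0)
            return;
        delay_us *= 2;                        // dynamic/backoff variant: grow delay on failure
    }
}

void delay_unlock(void) {
    atomic_store(&lock, 0);
}
```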
- Common problem in spinlock implementations:
- Everyone tries to acquire lock at the same time once lock is freed: delay alternatives
- Everyone sees the lock is free at the same time (Anderson's Queueing Lock)
- Solution:
- Set unique ticket for each arriving thread
- Assigned `queue[ticket]` is a private lock
- Enter critical section when you have the lock:
    - `queue[ticket] == must_wait`: spin
    - `queue[ticket] == has_lock`: enter critical section
- Signal/set next lock holder on exit: `queue[ticket + 1] = has_lock`
- Cons:
    - Assumes `read_and_increment` atomic
    - O(n) size
- Pros:
- Delay: directly signal next CPU/thread to run
- Contention: better but requires cache coherence and cache line aligned elements
- Only one CPU/thread sees the lock is free and tries to acquire lock!
- Cons:
- Latency: more costly read and increment
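- A sketch of the array-based queueing lock described above; the slot count, padding, and ticket handling are illustrative and assume at most `N_SLOTS` concurrent waiters:

```c
#include <stdatomic.h>

#define N_SLOTS 64
enum { MUST_WAIT = 0, HAS_LOCK = 1 };

struct slot { _Alignas(64) atomic_int flag; };   // one cache line per slot (avoid false sharing)

static struct slot queue[N_SLOTS] = { [0] = { HAS_LOCK } };
static atomic_uint next_ticket;

unsigned qlock_lock(void) {
    unsigned t = atomic_fetch_add(&next_ticket, 1) % N_SLOTS;  // read_and_increment
    while (atomic_load(&queue[t].flag) == MUST_WAIT)
        ;                                                       // spin on a private slot
    return t;                                                   // caller keeps its ticket
}

void qlock_unlock(unsigned t) {
    atomic_store(&queue[t].flag, MUST_WAIT);                    // recycle my slot
    atomic_store(&queue[(t + 1) % N_SLOTS].flag, HAS_LOCK);     // signal the next waiter
}
```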
- Setup (see lecture for figure):
- N processes running critical section one million times
- N varied based on system
- Metrics:
- Overhead compared to ideal performance
- Theoretical limit based on number of critical sections to be run
- Under high loads:
    - Queue best (most scalable), `test_and_test_and_set` worst
    - Static better than dynamic, reference better than release (avoids extra invalidations)
- Under light loads:
    - `test_and_test_and_set` good (low latency)
    - Dynamic better than static (lower delay)
    - Queueing lock worst (high latency due to read and increment)
- Topics to be covered in this lesson:
- OS support for I/O devices
- Block device stack
- File system architecture
- I/O is like a toy shop shipping department:
- Have protocols:
- Interfaces for device I/O
- Have dedicated handlers:
- Device drivers, interrupt handlers, etc.
- Decouple I/O details from core processing:
- Abstract I/O device detail from applications
- Basic I/O device features:
- Control registers:
- Command
- Data transfers
- Status
- Micro-controller (device's CPU)
- On device memory
- Other logic: e.g., analog to digital converters
- Peripheral Component Interconnect (PCI):
- PCI express (PCIe)
- Other types of interconnects:
- SCSI (small computer system interface) bus
- Peripheral bus
- Bridges handle differences
- Per each device type
- Responsible for device access, management and control
- Provided by device manufacturers per OS/version
- Each OS standardizes interfaces:
- Device independence
- Device diversity
- Block: disk
- Read/write blocks of data
- Direct access to arbitrary block
- Character: keyboard
- Get/put character
- Network devices
- OS representation of a device: special device file
- Access device registers: memory load/store
- Memory-mapped I/O:
- Part of host physical memory dedicated for device interactions
- BAR (base address registers)
- I/O port:
- Dedicated in/out instructions for device access
- Target device (I/O port) and value in register
- Interrupt:
- Pros: can be generated as soon as possible
- Cons: interrupt handling steps
- Polling:
- Pros: When convenient for OS
- Cons: delay or CPU overhead
- No additional hardware support
- CPU programs the device:
- Via command registers
- Via data movement
- An example of a PIO (programmed I/O): NIC data (network packet shown in lecture)
- Write command to request packet transmission
- Copy packet to data registers
- Repeat until packet sent
- Relies on DMA (direct memory access) controller
- CPU programs the device:
- Via command registers
- Via DMA controls
- An example of a DMA: NIC data (network packet shown in lecture)
- Write command to request packet transmission
- Configure DMA controller with in-memory address and size of packet buffer
- Fewer steps, but DMA configuration is more complex
- For DMAs:
- Data buffer must be in physical memory until transfer completes: pinning regions (non-swappable)
- See lecture for diagram
- Typical device access includes the following:
- System call
- In-kernel stack
- Driver invocation
- Device request configuration
- Device performs request
- See lecture for diagram
- Device regs/data directly accessible
- OS configures then out-of-the-way
- User-level driver:
- OS retains coarse-grain control
- Relies on device features:
- Sufficient registers
- Demux capability
- See lecture for diagram
- Synchronous I/O operations: process blocks
- Asynchronous I/O operations: process continues
- Process checks and retrieves result
- Process is notified that the operation completed and results are ready
- See lecture for diagram
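- A sketch contrasting the two models with POSIX calls (illustrative; real code would handle errors and use completion notification instead of busy-waiting):

```c
#include <aio.h>
#include <errno.h>
#include <unistd.h>

char buf[4096];

void sync_read(int fd) {
    read(fd, buf, sizeof(buf));               // synchronous: process blocks until data is ready
}

void async_read(int fd) {
    struct aiocb cb = { .aio_fildes = fd, .aio_buf = buf, .aio_nbytes = sizeof(buf) };
    aio_read(&cb);                            // request issued, process continues
    while (aio_error(&cb) == EINPROGRESS)
        ;                                     // check whether the operation completed
    ssize_t n = aio_return(&cb);              // retrieve the result
    (void)n;
}
```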
- Processes use files: logical storage unit
- Kernel file system (FS):
- Where and how to find and access file
- OS specifies interface
- Generic block layer:
- OS standardized block interface
- Device driver
- Problem: how to address the following?
- What if files are on more than one device?
- What if devices work better with different file system implementations?
- What if files are not on a local device (accessed via network)?
- Solution: use a file system
- File: the element on which VFS (virtual file system) operations are performed
- File descriptor: OS representation of file
- Open, read, write, send file, lock, close, etc.
- Inode: persistent representation of file index
- List of all data blocks
- Device, permissions, size, etc.
- Dentry: directory entry, corresponds to single path component
- Superblock: file system specific information regarding the file system layout
- File: data blocks on disk
- Inode: track files' blocks and also resides on disk in some block
- Superblock: overall map of disk blocks
- Inode blocks
- Data blocks
- Free blocks
- For each block group:
- Superblock: number of inodes, disk blocks, start of free blocks
- Group descriptor: bitmaps, number of free nodes, directories
- Bitmaps: tracks free blocks and inodes
- Inodes: one to max number (one per file)
- Data blocks: file data
- Inodes: index of all disk blocks corresponding to a file
- File: identified by inode
- Inode: list of all blocks plus other metadata
- Pros: easy to perform sequential or random access
- Cons: limit on file size
- Inodes with indirect pointers: index of all disk blocks corresponding to a file
- Inodes contain:
- Metadata
- Pointers to blocks
- Direct pointer: points to data block
- Indirect pointer: block of pointers
- Double indirect pointer: block of block of pointers
- Pros: a small inode can address a large file
- Cons: file access slow down
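- A back-of-the-envelope illustration (block size and pointer width assumed, not from the lecture): with 1 KB blocks and 4-byte block pointers, one indirect block holds 256 pointers, so an inode with 12 direct, 1 indirect, and 1 double-indirect pointer can address roughly $12 \times 1\text{ KB} + 256 \times 1\text{ KB} + 256^2 \times 1\text{ KB} \approx 64\text{ MB}$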
- Caching/buffering: reduce number of disk accesses
- Buffer cache in main memory
- Read/write from cache
- Periodically flush to disk (`fsync()`)
- I/O scheduling: reduce disk head movement
- Maximize sequential vs random access
- Prefetching: increase cache hits
- Leverages locality
- Journaling/logging: reduce random access
- Describe write in log: block, offset, value, etc.
- Periodically apply updates to proper disk locations
- Topics to be covered in this lesson:
- Overview of virtualization
- Main technical approaches in popular virtualization solutions
- Virtualization-related hardware advances
- Virtualization allows concurrent execution of multiple OS (and their applications) on the same physical machine
- Virtual resources: each OS thinks that it owns hardware resources
- Virtual machine (VM): OS applications and virtual resources (guest domain)
- Virtualization layer: management of physical hardware (virtual machine monitor, hypervisor)
- A virtual machine is an efficient, isolated duplicate of the real machine supported by a VMM (virtual machine monitor):
- Provides environment essentially identical with the original machine
- Programs show at worst only minor decrease in speed
- VMM is in complete control of system resources
- Pros:
- Consolidation: decrease cost; improve manageability
- Migration: availability, reliability
- Security, debugging, support for legacy OS
- Bare-metal: hypervisor-based
- VMM (hypervisor) manages all hardware resources and supports execution of VMs
- Privileged, service VM to deal with devices
- Xen (open source or Citrix XenServer):
- Dom0 and DomU
- Drivers in Dom0
- ESX (VMWare):
- Many open APIs
- Drivers in VMM
- Used to have Linux control core, now remote APIs
- Hosted:
- Host OS owns all hardware
- Special VMM module provides hardware interfaces to VMs and deals with VM context switching
- Example: KVM (kernel-based VM shown in lecture)
- Based on Linux
- KVM kernel module plus QEMU (Quick Emulator) for hardware virtualization
- Leverages Linux open-source community
- Trap-and-emulate:
- Guest instructions are executed directly by hardware
- For non-privileged operations: run at hardware speeds for efficiency
- For privileged operations: trap to hypervisor
- Hypervisor determines what needs to be done:
- If illegal operation: terminate VM
- If legal operation: emulate the behavior the guest OS was expecting from the hardware
- x86 virtualization pre-2005
- Four rings, no root/non-root modes yet
- Hypervisor in ring 0, guest OS in ring 1
- However, 17 privileged instructions do not trap, they fail silently!
- Cons:
- Hypervisor does not know so it does not try to change settings
- OS does not know, so it assumes change was successful
- Main idea: rewrite the VM binary to never issue those 17 instructions
- Pioneered by Mendel Rosenblum's group at Stanford, commercialized as VMware
- Binary translation:
- Goal: full virtualization (guest OS not modified)
- Approach: dynamic binary translation
- Inspect code blocks to be executed
- If needed, translate to alternate instruction sequence
- Otherwise, run at hardware speeds
- Goal: performance, give up on unmodified guests
- Approach: paravirtualization, modify guest so that...
- It knows it's running virtualized
- It makes explicit calls to the hypervisor (hypercalls)
- Hypercalls (analogous to system calls):
- Package context info
- Specify desired hypercall
- Trap to VMM
- All guests expect contiguous physical memory, starting at 0
- Virtual vs physical vs machine addresses (VA vs PA vs MA) and page frame numbers
- Still leverages hardware MMU, TLB, etc.
- Option 1:
- Guest page table: VA to PA
- Hypervisor: PA to MA
- Too expensive!
- Option 2:
- Guest page tables: VA to PA
- Hypervisor shadow PT: VA to MA
- Hypervisor maintains consistency
- Paravirtualized:
- Guest aware of virtualization
- No longer strict requirement on contiguous physical memory starting at 0
- Explicitly registers page tables with hypervisor
- Can batch page table updates to reduce VM exits
- Other optimizations
- Bottom line: overheads eliminated or reduced on newer platforms
- For CPUs and memory:
- Less diversity
- ISA level (instruction set architecture level) standardization of interface
- For devices:
- High diversity
- Lack of standard specification of device interface and behavior
- Three key models for device virtualization (see later slides)
- Approach: VMM-level driver configures device access permissions
- Pros:
- VM provided with exclusive access to the device
- VM can directly access the device (VMM-bypass)
- Cons:
- Device sharing difficult
- VMM must have exact type of device as what VM expects
- VM migration tricky
- Approach:
- VMM intercepts all device accesses
- Emulate device operation:
- Translate to generic I/O operation
- Traverse VMM-resident I/O stack
- Invoke VMM-resident driver
- Cons:
- Latency of device operations
- Device driver ecosystem complexities in hypervisor
- Approach:
- Device access control split between:
- Front-end driver in guest VM (device API)
- Back-end driver in service VM (or host)
- Modified guest drivers
- Device access control split between:
- Pros:
- Eliminate emulation overhead
- Allow for better management of shared devices
- AMD Pacifica and Intel Vanderpool Technology (Intel-VT), 2005:
- Close holes in x86 ISA
- Modes: root/non-root (or host and guest mode)
- VM control structure
- Extended page tables and tagged TLB with VM ids
- Multi-queue devices and interrupt routing
- Security and management support
- Additional instructions to exercise previously mentioned features
- Topics to be covered in this lesson:
- RPC (remote procedure calls)
- Example 1: Get File App
- Client-server
- Create and init sockets
- Allocate and populate buffers
- Include protocol info (e.g., get file, size, etc.)
- Copy data into buffers (e.g., filename, file, etc.)
- Example 2: Mod Image App
- Client-server
- Create and init sockets
- Allocate and populate buffers
- Include protocol info (e.g., algorithm, parameters, etc.)
- Copy data into buffers (e.g., image data, etc.)
- Common steps (highlighted in bold) related to remote IPC (inter-process communication): RPC
- RPC is intended to simplify the development of cross-address space and cross-machine interactions
- Pros:
- Higher-level interface for data movement and communication
- Error handling
- Hiding complexities of cross-machine interactions
- Some requirements of RPCs include:
- Client/server interactions
- Procedure call interface:
- Sync call semantics
- Type checking
- Error handling
- Packet bytes interpretation
- Cross-machine conversion
- E.g., big/little endian
- Higher-level protocol
- Access control, fault tolerance, etc.
- Different transport protocols
- See lecture slides for figure
- The execution order of an RPC is as follows:
- Client call to procedure
- Stub builds message
- Message is sent across the network
- Server OS hands message to server stub
- Stub unpacks message
- Stub makes local call to add
- There are some general steps in RPC:
- Register: server registers procedure, argument types, location, etc.
- Bind: client finds and binds to desired server
- Call: client makes RPC call; control passed to stub, client code blocks
- Marshal: client stub marshals arguments (serialize arguments into buffer)
- Send: client sends message to server
- Receive: server receives message; passes message to server-stub; access control
- Unmarshal: server stub unmarshals arguments (extracts arguments and creates data structures)
- Actual call: server stub calls local procedure implementation
- Result: server performs operation and computes result of RPC operation
- The above steps are similar on return
- What can the server do?
- What arguments are required for the various operations (need agreement!)?
- Why:
- Client-side bind decision
- Run-time to automate stub generation: IDL (interface definition language)
- An IDL is used to describe the interface the server exports
- RPC can use IDL that is:
- Language-agnostic
- Language-specific
- Remember that an IDL is just an interface not an implementation!
- Client determines:
- Which server should it connect to?
- How will it connect to that server?
- Registry: database of available services
- Search for service name to find a service (which) and contact details (how)
- Distributed:
- Any RPC service can register
- Machine-specific:
- For services running on same machine
- Clients must know the machine address: registry provides the port number needed for connection
- Needs naming protocol:
- Exact match for add
- Or consider summation, sum, addition, etc.
- Applications use binding and registries like toy shops use directories of outsourcing services:
- Who can provide services?
- Look up registry for image processing
- What services are provided?
- Compress, filter, version number, etc. (IDL)
- How will they send/receive?
- TCP/UDP (registry)
- Solutions:
- No pointers!
- Serialize pointers: copy the referenced (pointed-to) data structure into the send buffer
- When a client hangs, what is the problem?
- Server, service, or network down? Message lost?
- Timeout and retry (no guarantees!)
- Special RPC error notification (signal, exception, etc.):
- Catch all possible ways in which the RPC call can (partially) fail
- Design decisions for RPC systems (e.g., Sun RPC, Java RMI)
- Binding: how to find the server
- IDL: how to talk to the server; how to package data
- Pointers as arguments: disallow or serialize pointed data
- Partial failures: special error notifications
- Sun RPC was developed in the 80s by Sun for UNIX; now widely available on other platforms
- Design choices:
- Binding: per-machine registry daemon
- IDL: XDR (for interface specification and for encoding)
- Pointers as arguments: allowed and serialized
- Partial failures: retries; return as much information as possible
- See lecture for figure
- Client-server via procedure calls
- Interface specified via XDR (x file)
- `rpcgen` compiler: converts .x to language-specific stubs
- Server registers with local registry daemon
- Registry:
- Name of service, version, protocol(s), port number, etc.
- Binding creates handle:
- Client uses handle in calls
- RPC run-time uses handle to track per-client RPC state
- Client and server on same or different machines
- Documentation, tutorials and examples now maintained by Oracle
- TI-RPC: Transport-independent Sun RPC
- Provides Sun RPC/XDR documentation and code examples
- Older online references still relevant
- Linux man pages for rpc
- `rpcgen` compiler output:
    - `square.h`: data types and function definitions
    - `square_svc.c`: server stub and skeleton (main)
    - `square_clnt.c`: client stub
    - `square_xdr.c`: common marshalling routines
- See lecture for figure
- Developer:
- Server code: implementation of `squareproc_1_svc`
- Client side: call `squareproc_1()`
- `#include` the generated `.h` file
- Link with stub objects
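- A hedged sketch of the client side for the square example; the generated header name and the `square_in`/`square_out` field names are assumptions about the .x file, not quoted from the lecture:

```c
#include <rpc/rpc.h>
#include <stdio.h>
#include "square.h"                 /* assumed rpcgen-generated header */

int main(int argc, char **argv) {
    /* Bind: create a client handle for host, program, version, transport */
    CLIENT *clnt = clnt_create(argv[1], SQUARE_PROG, SQUARE_VERS, "tcp");
    if (clnt == NULL) { clnt_pcreateerror(argv[1]); return 1; }

    square_in  arg = { .arg1 = 4 };               /* field name assumed */
    square_out *res = squareproc_1(&arg, clnt);   /* RPC call via the client stub */
    if (res == NULL) { clnt_perror(clnt, "call failed"); return 1; }

    printf("square(%d) = %d\n", arg.arg1, res->res1);
    clnt_destroy(clnt);                           /* release the handle */
    return 0;
}
```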
- RPC run-time - the rest:
- OS interactions, communication management, etc.
- `rpcgen -C square.x`: not thread safe!
- `rpcgen -C -M square.x`: multi-threading safe!
    - Does not make a multi-threaded `svc.c` server
- RPC daemon: port mapper
- Query with `rpcinfo -p`: `/usr/sbin/rpcinfo -p`
- Program id, version, protocol (tcp, udp), socket port number, service name, etc.
- Port mapper runs with tcp and udp on port 111
- Client type:
- Client handle
- Status, error, authentication, etc.
- Default types: `char`, `byte`, `int`, `float`
- Additional XDR types:
    - `const` (`#define`)
    - `hyper` (64-bit integer)
    - `quadruple` (128-bit float)
    - `opaque` (C `byte`): uninterpreted binary data
- Fixed-length array (e.g., `int data[80]`)
- Variable-length array (e.g., `int data<80>`): translates into a data structure with `len` and `val` fields
    - Except for strings:
        - `string line<80>`: C pointer to `char`
        - Stored in memory as a normal null-terminated string
        - Encoded (for transmission) as a pair of length and data
- Marshalling/unmarshalling: found in `square_xdr.c`
- Clean-up:
    - `xdr_free()`
    - User-defined `freeresult` procedure (e.g., `square_prog_1_freeresult`)
    - Called after results are returned
- What goes on the wire?
- Transport header (e.g., TCP, UDP)
- RPC header: service procedure ID, version number, request ID, etc.
- Actual data:
- Arguments or results
- Encoded into a byte stream depending on data type
- XDR: IDL + the encoding (i.e., the binary representation of data on-the-wire)
- XDR encoding rules:
- All data types are encoded in multiples of four bytes
- Big endian is the transmission standard
- Two's complement is used for integers
- IEEE format is used for floating point
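- Worked example (assumed values, follows from the rules above): a 5-character string is encoded as a 4-byte length field plus 5 data bytes padded to 8 (the next multiple of four), i.e., 12 bytes on the wire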
- Java RMI (Remote Method Invocations):
- Among address spaces in JVM(s)
- Matches Java OO semantics
- IDL: Java (language specific)
- RMI run-time:
- Remote reference layer:
- Unicast, broadcast, return-first response, return-if-all-match
- Transport:
- TCP, UDP, shared memory, etc.
- Remote reference layer:
- Topics to be covered in this lesson:
- DFS (distributed file systems) design and implementation
- NFS (network file systems)
- Distributed file systems are like distributed storage facilities:
- Accessed via well-defined interface:
- Access via VFS
- Focus on consistent state:
- Tracking state, file updates, cache coherence, etc.
- Mix distribution models possible:
- Replicated vs partitioned, peer-like systems, etc.
- Accessed via well-defined interface:
- See lecture for figure
- DFS: an environment where multiple machines are involved in the delivery of the file system service
- Includes file-system interfaces which use VFS interface to abstract and hide file system organizations
- Uses OS to hide local physical storage on a machine
- Has files maintained on a remote machine or on a remote file system that is being accessed over the network
- Client/server on different machines
- File server distributed on multiple machines
- Replicated (each server: all files)
- Partitioned (each server: part of files)
- Both (files partitioned; each partition replicated)
- Files stored on and served from all machines (peers)
- Blurred distinction between clients and servers
- Extreme 1: upload/download
- Like FTP, SVN, etc.
- Extreme 2: true remote file access
- Every access to remote file, nothing done locally
- Pros:
- File accesses centralized, easy to reason about consistency
- Cons:
- Every file operation pays network cost
- Limits server scalability
- A more practical remote file access (with caching)
- Allow clients to store parts of files locally (blocks)
- Pros:
- Low latency on file operations
- Server load reduced: is more scalable
- Force clients to interact with server (frequently)
- Pros:
- Server has insights into what clients are doing
- Server has control into which accesses can be permitted: easier to maintain consistency
- Cons:
- Server more complex, requires different file sharing semantics
- Stateless: keeps no state, ok with extreme models but cannot support practical model
- Pros:
- No resources are used on server side (CPU/memory)
- On failure, just restart
- Cons:
- Cannot support caching and consistency management
- Every request self-contained: more bits transferred
- Stateful: keeps client state, needed for practical model to track what is cached/accessed
- Pros:
- Can support locking, caching, incremental operations
- Cons:
- On failure, need check-pointing and recovery mechanisms
- Overheads to maintain state and consistency: depends on caching mechanism and consistency protocol
- Locally clients maintain portion of state (e.g., file blocks)
- Locally clients perform operations on cached state (e.g. open/read/write, etc.): requires coherence mechanisms
- How?
- SMP: write-update/write-invalidate
- DFS: client/server-driven
- When?
- SMP: on write
- DFS: on demand, periodically, on open, etc.
- Details depend on file sharing semantics
- UNIX semantics: every write visible immediately
- Session semantics: (between open-close: session)
- Write-back on `close()`, update on `open()`
- Easy to reason about, but may be insufficient
- Periodic updates:
- Client writes-back periodically: clients have a lease on cached data (not exclusive necessarily)
- Server invalidates periodically: provides bounds on inconsistency
- Augment with `flush()`/`sync()` API
- Immutable files: never modify, new files created instead
- Transactions: all changes atomic
- Too many options?
- Sharing frequency
- Write frequency
- Importance of consistent view
- Optimize for common case
- Two types of files:
- Regular files vs directories
- Choose different policies for each
- E.g., session-semantics for files, UNIX for directories
- E.g., less frequent write-back for files than directories
- Replication: each machine holds all files
- Pros: load balancing, availability, fault tolerance
- Cons:
- Writes become more complex:
- Synchronously to all
- Or write to one then propagated to others
- Replicas must be reconciled:
- E.g., voting
- Partitioning: each machine has subset of files
- Pros:
- Availability vs single server DFS
- Scalability with file system size
- Single file writes simpler
- Cons:
- On failure, lose portion of data
- Load balancing harder; if not balanced then hot-spots possible
- Can combine both techniques: replicate each partition
- See lecture for figure
- An NFS setup typically includes clients and a server; clients access files stored on the remote server over the network as if they were local
- Client:
- Client requests for file access starts at system call layer and moves to VFS layer
- At the VFS layer, a decision will be made for where the file belongs to (the local file system interface or the NFS client)
- If NFS client is chosen, it will move on to the RPC client stub which communicates with the RPC server sub
- Server:
- Continuing from the RPC server stub, the call makes its way to the NFS server, which resides on the remote machine
- The NFS server could communicate with the VFS layer on the server side to get access to the file
- From the VFS layer, the layout is about the same as the client side
- When an open request comes from the client, the NFS server creates a file handle (a byte sequence that encodes both the server machine and the server-local file information), which is returned to the client
- If the file is deleted or the server machine dies, subsequent use of the file handle returns a stale file handle error (the handle is no longer valid)
- Since 80s, currently NFSv3 and NFSv4
- NFSv3: stateless, NFSv4: stateful
- Caching:
- Session-based (non-concurrent)
- Periodic updates
- Default: three seconds for files; 30 seconds for directories
- NFSv4: delegation to client for a period of time (avoids update checks)
- Locking:
- Lease-based
- NFSv4: also share reservation, reader/writer lock
- Caching in the Sprite Network File System, by Nelson et al.
- Research DFS
- Great value in the explanation of the design process
- Takeaway: used trace data on usage/file access patterns to analyze DFS design requirements and justify decisions
- Access pattern (workload) analysis:
- 33% of all file accesses are writes
- Caching ok but write-though not sufficient
- 75% of files are open less than 0.5 seconds
- 90% of files are open less than 10 seconds
- Session semantics still too high overhead
- 20-30% of new data deleted within 30 seconds
- 50% of new data deleted within 5 minutes
- File sharing is rare!
- Write-back on close not necessary
- No need to optimize for concurrent access but must support it
- From analysis to design:
- Cache with write-back
- Every 30 seconds, write back blocks that have NOT been modified for the last 30 seconds
- When another client opens file: get dirty blocks
- Open goes to server, directories not cached
- On concurrent write: disable caching
- Sprite sharing semantics:
- Sequential write sharing: caching and sequential semantics
- Concurrent write sharing: no caching
- $R_1 \dots R_n$ readers, $w_1$ writer:
    - All `open()` go through the server
    - All clients cache blocks
    - Writer keeps timestamps for each modified block
- $w_2$ sequential writer (sequential sharing):
    - Server contacts the last writer for dirty blocks
    - Since $w_2$ has not closed: disable caching!
- Topics to be covered in this lesson:
- DSM (distributed shared memory)
- Distributed state management and design alternatives
- Consistency model
- Managing distributed shared memory is like managing tools/parts across all workspaces in a toy shop:
- Must decide placement:
- Place memory (pages) close to relevant processes
- Must decide migration:
- When to copy memory (pages) from remote to local
- Must decide sharing rules:
- Ensure memory operations are properly ordered
- Clients:
- Send requests to file service
- Caching:
- Improve performance (seen by clients) and scalability (supported by servers)
- Servers:
- Own and manage state (files)
- Provide service (file access)
- Each node:
- Owns state
- Provides service
- All nodes are peers
- In peer-to-peer even overall control and management done by all
- Each node:
- Owns state: memory
- Provides service:
- Memory reads/writes from any node
- Consistency protocols
- Permits scaling beyond single machine memory limits:
- More shared memory at lower cost
- Slower overall memory access
- Commodity interconnect technologies support this (e.g., RDMA)
- Hardware-supported (expensive!):
- Relies on interconnect
- OS manages larger physical memory
- NICs (network interface cards) translate remote memory accesses to messages
- NICs involved in all aspects of memory management; support atomics, etc.
- Software-supported:
- Everything done by software
- OS, or language run-time
- Cache line granularity?
- Overheads too high for DSM
- Instead, coarser (variable) granularity:
    - Page granularity (OS-level)
    - Object granularity (language run-time)
- Cons: beware of false sharing, e.g., X and Y end up on the same page!
- What types of applications use DSM?
- Application access algorithm:
- Single reader/single writer (SRSW)
- Multiple readers/single writer (MRSW)
- Multiple readers/multiple writers (MRMW)
- DSM performance metric: access latency
- Achieving low latency through:
- Migration:
- Makes sense for SRSW
- Requires data movement
- Replication (caching):
- More general
- Requires consistency management
- DSM: shared memory in SMPs
- In SMP:
- Write-invalidate
- Write-update
- Coherence operations on each write: overhead too high
- Push invalidations when data is written to:
- Proactive
- Eager
- Pessimistic
- Pull modification info periodically
- On-demand (reactive)
- Lazy
- Optimistic
- When these methods get triggered depends on the consistency model for the shared state!
- DSM architecture (page-based, OS-supported):
- Each node contributes part of memory pages to DSM
- Need local caches for performance (latency)
- All nodes responsible for part of distributed memory
- Home node manages access and tracks page ownership
- Exact replicas (explicit replication):
- For load balancing, performance, or reliability
- home/manager node controls management
- Each page (object) has:
- Address: node ID and page frame number
- Node ID: home node
- Global map (replicated):
- Object (page) ID: manager node ID
- Manager map available on each node!
- Metadata for local pages (partitioned):
- Per-page metadata is distributed across managers
- Global mapping table:
- Object ID: index into mapping table: manager node
- Problem: DSM must intercept accesses to DSM state
- To send remote messages requesting access
- To trigger coherence messages
- Overheads should be avoided for local, non-shared state (pages)
- Dynamically engage and disengage DSM when necessary
- Solution: Use hardware MMU support!
- Trap in OS if mapping invalid or access not permitted
- Remote address mapping: trap and pass to DSM to send message
- Cached content: trap and pass to DSM to perform necessary coherence operations
- Other MMU information useful (e.g., dirty page)
- Consistency model: agreement between memory (state) and upper software layers
- Memory behaves correctly if and only if software follows specific rules
- Memory (state) guarantees to behave correctly:
- Access ordering
- Propagation/visibility of updates
- Strict consistency: updates visible everywhere immediately
- In practice:
- Even on single SMP, no guarantees on order without extra locking and synchronization
- In distributed systems, latency and message reorder/loss make this even harder (impossible to guarantee)
- Sequential consistency:
- Memory updates from different processors may be arbitrarily interleaved
- All processes will see the same interleaving
- Operations from same process always appear in order they were issued
- Causal consistency:
- Concurrent writes: no guarantees
- Causally related writes: ordered
- Synchronization points: operations (read, write, sync)
- All updates prior to a sync point will be visible
- No guarantee what happens in between
- Variations:
- Single sync operations (sync)
- Separate sync per subset of state (page)
- Separate entry/acquire vs exit/release operations
- Pros: limit data movement and coherence operations
- Cons: maintain extra state for additional operations
- Topics to be covered in this lesson:
- Brief and high-level overview of challenges and technologies facing data centers
- Goal: provide context for mechanisms from previous lessons
- Multi-tier architectures for internet services
- Cloud computing
- Cloud and big data technologies
- Internet service: any type of service provided via web interface
- Presentation: static content
- Business logic: dynamic content
- Database tier: data store
- Not necessarily separate processes on separate machines
- Many available open source and proprietary technologies
- Middleware: supporting, integrative or value-added software technologies
- In multi-process configurations: some form of IPC used, including RPC/RMI, shared memory, etc.
- For scale: multi-process, multi-node (scale out architecture)
- Boss-worker: front-end distributes requests to nodes
- All equal: all nodes execute any possible step in request processing, for any request (functionally homogeneous)
- Specialized nodes: nodes execute some specific step(s) in request processing for some request type (functionally heterogeneous)
- Functionally homogeneous:
- Each node can do any processing step
- Pros:
- Keeps front-end simple
- Does not mean that each node has all data, just each node can get to all data
- Cons:
- How to benefit from caching?
- Functionally heterogeneous:
- Different nodes, different tasks/requests
- Data does not have to be uniformly accessible everywhere
- Pros:
- Benefit of locality and caching
- Cons:
- More complex front-end
- More complex management
- Traditional approach:
- Buy and configure resources: determine capacity based on expected demand (peak)
- When demand exceeds capacity:
- Dropped requests
- Lost opportunity
- Ideal cloud:
- Pros:
- Capacity scales elastically with demand
- Scaling instantaneous, both up and down
- Cost is proportional to demand, to revenue opportunity
- All of this happens automatically, no need for hacking wizardry
- Can access anytime, anywhere
- Cons:
- Don't own resources
- Cloud computing requirements (summarized):
- On-demand, elastic resources and services
- Fine-grained pricing based on usage
- Professionally managed and hosted
- API-based access
- Shared resources:
- Infrastructure and software/services
- APIs for access and configuration:
- Web-based, libraries, command line, etc.
- Billing/accounting services:
- Many models: spot, reservation, entire marketplace
- Typically discrete quantities: tiny, medium, x-large, etc.
- Managed by cloud provider
- Law of large numbers:
- Per customer there is large variation in resource needs
- Average across many customers is roughly constant
- Economies of scale:
- Unit cost of providing resources or service drops at bulk
If computers of the kind I have advocated become computers of the future, then computing may some day be organized as a public utility, just as the telephone system is a public utility... The computer utility could become the basis of a new and important industry. (John McCarthy, MIT Centennial, 1961)
- Computing: fungible utility
- Limitations exist: API lock-in, hardware dependence, latency, privacy, security, etc.
- Public: third-party customers/tenants
- Private: leverage technology internally
- Hybrid (public + private): fail over, dealing with spikes, testing
- Community: used by certain type of users
- On-premise:
- You must manage all components and services
- IaaS (Infrastructure as a Service):
- You manage components such as applications, data, run-time, middleware, OS
- Others manage virtualization, servers, storage, and networking
- PaaS (Platform as a Service):
- You manage components such as applications and data
- Others manage run-time, middleware, OS, virtualization, servers, storage, and networking
- SaaS (Software as a Service):
- Opposite of on-premise, others manage all components and services
- Fungible resources
- Elastic, dynamic resource allocation methods
- Scale: management at scale, scalable resources allocations
- Dealing with failures
- Multi-tenancy: performance and isolation
- Security
- Virtualization:
- Resource provisioning (scheduling)
- Big data processing (Hadoop MapReduce, Spark, etc.)
- Storage:
- Distributed front-end (append only)
- NoSQL, distributed in-memory caches
- Software-defined networking, storage, data centers, etc.
- Monitoring: real-time log processing (e.g., AWS CloudWatch)
- Data storage layer
- Data processing layer
- Caching layer
- Language front-ends (e.g., querying)
- Analytics libraries (e.g., ML)
- Continuously streaming data
No notes from this lesson