Notes and labs for the course 15-213 Introduction to Computer Systems at CMU
- Data types
- char, short, int, long, float, double, pointer
- Word size equals length of pointer datatype
- Bit-level operations
- Signed / unsigned conversion
- Byte ordering
- Big Endian: Sun, PPC Mac, Internet
- Little Endian: x86, ARM
- Numeric form: (-1)^s*M*(2^E)
- Encoding
- s: sign bit
- exp: E
- frac: M
- Three kinds of values
- Denormalized: exp = 0
- Normalized: 0 < exp < 11..11
- Special: exp = 11...11 (e.g. inf & NaN)
- Roundings
- History
- Registers
- Addressing modes
- Normal:
(R)->Mem[Reg[R]] - Displacement:
D(R)->Mem[Reg[R] + D] - Complete:
D(Rb,Ri,S)->Mem[Reg[Rb] + S*Reg[Ri] + D](Rb,Ri)->Mem[Reg[Rb] + Reg[Ri]]D(Rb,Ri)->Mem[Reg[Rb] + Reg[Ri] + D](Rb,Ri,S)->Mem[Reg[Rb] + S*Reg[Ri]]
- Normal:
- Some instructions
movq Src, Dst- Cannot do memory-memory transfer with a single instruction
- Intel docs use
mov Dst, Src
leaq Src, Dst- Src is address mode expression, set Dst to address denoted by expression
- Similar to
p = &x[i] - Used for arithmetics for form like
x + k * y - Does not change condition codes
addq/subq Src, Dstimulq Src, Dstsalq/sarq/shrq Src, Dstxorq/andq/orq Src, Dstpushq srcpopq destincr dest
- Compiler, Assembler, Linker & Loader
- Compiler
- Translates C files (.c) into assembly files (.s)
- Assembler
- Translates assembly files (.s) into object files (.o)
- Missing linkage between compilation units
- Linker
- Resolve references between object files
- Combine with static libraries (malloc, printf, etc)
- Dynamic linked libraries
- Linking occurs at runtime
- Does not take too much disk space
- Compiler
- Controls
- Procedures
- Passing control
- Procedure call:
call label- Push return address into stack
- Jump to label
- Procedure return:
ret- Pop return address from stack
- Jump to this address
- Return address: Address of next instruction after the call statement
- Procedure call:
- Passing data
- First 6 arguments:
%rdi,%rsi,%rdx,%rcx,%r8,%r9 - Other arguments passed using stack
- Return value:
%rax - IA-32 pass all arguments in stack
- Concept of stack frames:
- Marked by
%rbp(optional) and%rsp - No additional mechanism for recursion is needed
- Marked by
- Register saving conditions
- Caller saved
%rdi,%rsi,%rdx,%rcx,%r8,%r9,%rax,%r10,%r11
- Callee saved
%rbx,%r12,%r13,%r14,%rbp%rspis also a special form of callee-saved
- Caller saved
- First 6 arguments:
- Memory management
- ABI: Application Binary Interface
- Passing control
- Data
- Address space
- Vulnerablities
- Protection
- Use routines limiting string lengths (user-level)
- Randomized stack offsets
- Nonexecutable code segments
- Stack canaries
- Optimization by programmer
- Optimization blockers
- Procedures: Seen as a "black box"
- Procedures may have side effects
- May not return same result with same argument
- Fix: Use inline functions (GCC with -O1 within single file)
- Memory aliasing: Two memory references specify single location
- The following code does memory load and store every time, because compiler assume possibility of memory aliasing:

- Load and store take multiple clock cycles
- Easily caused by direct access to storage structures
- Fix: Define local variable to tell compiler not to check for aliasing

- Get in habit of introducing local variables accumulating within loops
- The following code does memory load and store every time, because compiler assume possibility of memory aliasing:
- Procedures: Seen as a "black box"
- Optimization (by programmer) limitations
- Most performed within procedures. Newer versions of GCC do interprocedual optimization, but not between codes in different files
- Based on static information
- Conservative: Must not change program behavior
- Instruction-level parallelism
- Superscalar processor: Issue and execute multuple instructions per cycle, and instructions are scheduled dynamically
- Some instruction have >1 clock cycle latency, but can be pipelined:

- Unrolling
- Break sequential dependency to break through latency bound (to approach throughput bound)
can be optimized to:for(int i = 0; i < limit; ++i) x = x + d[i];but to break sequential dependency:for(int i = 0; i < limit; i += 2) x = (x + d[i]) + d[i + 1];for(int i = 0; i < limit; i += 2) x = x + (d[i] + d[i + 1]); - adding separate accumulators
- Break sequential dependency to break through latency bound (to approach throughput bound)
- Branch prediction
- Backward branches are often loops, predict taken
- Forward branches are often if, predict not taken
- Average better than 95% accuracy
- Storage technologies
- RAMs
- Volatile: SRAM & DRAM (caches & main memories)
- Nonvolatile: ROM, PROM, EPROM, EEPROM (firmware, ssd & disk caches)
- Rotating disks
- SSDs
- Page can be written only after its block has been erased
- RAMs
- Locality
- Temporal locality
- Spatial locality
- Hierarchy

- Caches
- Cache memories
- Concept of locality

- General organization
- Metrics
- Miss rate
- Hit time
- Miss penalty
- Write cache-friendly code
- Make the common cases go first
- Minimize the misses in inner loops
- Try to maximize spatial locality by reading objects sequentially with stride 1
- Try to maximize temporal locality by using an object as often as possible once it's read from memory
- Example of matrix multiplication
- In which order to arrange the loops? Do miss rate analysis!
- It turns out: kij/ikj > ijk/jik > jki/kji
- Use blocking: multiplying by sub-matrices
- Concept of locality
- Why linkers?
- Modularity
- Efficiency (separate complilation)
- Two kind of linking
- Static linking
- Dynamic linking
- What does linker do?
- Symbol resolution
- Functions,
globalvars,staticvars - Definitions are stored in symbol table, an array of entries (name, size, location)
- Three kind of symbols:
- Global symbols: non-static functions and non-static vars
- External symbols: defined in other modules
- Local symbols: static functions and static vars
- Note: Do not confuse local symbols with local variables. Local variables are allocated in stack at runtime, and have nothing to do with linker.
- Symbol resolution
- Symbols are strong or weak:
- Strong: functions and initialized globals
- Weak: uninitialized globals
- Multiple strong symbols are not allowed
- Choose the strong symbol over weak symbols
- If there are multiple weak symbols, choose arbitrary one
- May cause undefined behavior over different compilers
- Fix: use
staticand explicitextern
- Symbols are strong or weak:
- Functions,
- Relocation
- Merge text and data segment
- Relative location -> absolute location
- Updates symbol table
- Relocation entries are used to aid symbol resolving:
a: R_X86_64_32 array
- Relocation entries are used to aid symbol resolving:
- Symbol resolution
- Three kinds of object files
- Relocatable object file (.o file)
- Executable object file (a.out file)
- Shared object file (.so file or .dll file)
- ELF format (Executable and Linkable Format)
- All 3 object files use ELF format
- Static libraries (.a archive files)
- Concatenate related relocatable object files into a single file with an index (called an archive)
- During linking, only referenced .o files are linked
- Command line order matters!
- During scan, keep a list of currently unresolved references
- If any entries in the unresolved list at end of scan, then error
- Fix: put libraries at the end of command line
- Commonly used libraries:
libc.a(the C standard library)limb.a(the C math library)
- Disadvantages
- Duplication in storage
- Bug fixes require relink
- Fix: shared libraries
- Shared libraries
- Dynamic linking can happen at:
- Load time
- Handled by the dynamic linker
libc.sousually dynamically linked
- Run time
dlopen()interface in linux
- Load time
- Dynamic linking can happen at:
- Library interpositioning
- Can happen at:
- Compile time
- Link time
- Load/run time
- Can be used for:
- Detecting memory leaks
- Generating address traces
- Can happen at:
- ECFs exists in all levels:
- Exceptions (low level)
- Processor responses to external events
- Exception tables
- Context switch
- Signals
- Nonlocal jumps
- Exceptions (low level)
- Exceptions (equivalent to user-kernel transition)
- Asynchronous (Interrupts)
- Indicated by INT pin
- Control flow returns to next instruction
- Synchronous
- Traps
- Intentional (syscall, breakpoints)
- Control flow returns to next instruction
- Faults
- Unintentional but possibly recoverable
- Control flow returns to current instruction or aborts
- Aborts
- Unintentional and unrecoverable
- Traps
- Asynchronous (Interrupts)
- Context switches
- From a programmer's perspective, a process can be:
- Running: Executing or will be scheduled
- Stopped: Suspended and will not be scheduled until further notice
- Terminated: Stopped permanently (zombie)
- Process terminates when:
SIGTERMreceived- Return from
main() - Called
exit()
- Process terminates when:
- Creating process:
fork() - Reaping child processes:
wait()- Terminated processes become zombies, because its parent may use its exit status or OS tables
wait()andwaitpid()reap zombie child processes- If parent don't reap:
- If parent doesn't terminate: Never diminishes (a kind of memory leak)
- If parent does terminate: Reaped by
initprocess (pid == 1)
- So only need to explicitly reap long-running processes
- Loading and running processes:
execve()int execve(char *filename, char *argv[], char *envp[])- Loads and runs in the current process
- Overwrites code, data and stack
- Retains PID, open files (e.g.
stdout), and signal context - Called once and never return (except error)
- Process groups
- Can be get and set by
getpgrp()andsetpgid() - Kill all process in a group with
kill -n -<pid>
- Can be get and set by
- Unix shell: An application that runs program on behalf of the user
- Shell contains a basic loop and a
eval()function - Two cases in
eval():- Shell built-in command
- Not build-in, use
fork()andexecve()
- Motivation: How to reap both foreground and background jobs?
- Basic loop: Only reaps foreground jobs
- Fix: Signals
- Shell contains a basic loop and a
- Signals
- Akin to exceptions and interrupts
- Sent from signal (sometimes at the request of another process via
kill) - Identified by an integer
- Controlled by per-process
pendingandblockedbit vectorspendingvector set and cleared by kernel when signals is sent or receivedblockedvector can be manipulated bysigprocmask()function- So, signals cannot be queued
- Send:
pendingbit set - Receive: process reacts to the signal, clears
pendingbit- Ignore
- Terminate
- Catch (using user-level function called signal handler)
- Kernels checks for
pnb = pending & ~blockedat beginning of a time-slice- If
pnb == 0:- Pass control to next instruction in the process logical flow
- Else
- Choose lease non-zero bit in
pnband forces the process to receive the signal - The receipt of the signal triggers some action by the process (clears
pendingbit) - Repeat for all remaining nonzero bits
- Pass control to next instruction in the process logical flow
- Choose lease non-zero bit in
- If
- Default action can be one of:
- Termination
- Stop until restarted by
SIGCONT - Ignore
- Override default action by installing
signal handlers:handler_t *signal(int signum, handler_t *handler)handlercan be one of:SIG_IGN: IgnoreSIG_DFL: Revert to default- Function pointer to a user-level signal handler
- Signal handlers are a form of concurrency

- Signal handlers can be nested
- So we need blocking
- Implicit blocking: blocks pendings signals of same type
- Explicit blocking:
sigprogmask()with supporting functions of:sigemptyset()sigfillset()sigaddset()sigdelset()
- So we need blocking
- How to write safe handlers?
- Keep handlers as simple as possible
- Call only
async-signal-safefunction in handlersasync-signal-safefunctions are reentrent (access only local variables on stack), or cannot be interrupted by another signal handlerprintf(),malloc()andexit()are not safewrite()is the only signal-safe output function
- Save and restore
errnoon entry and exit - Protect accesses to shared data structures by temporarily blocking all signals in both handler and
main() - Declare global variables to be
volatile, to prevent from being optimized into registers - Declare global flags as
volatile sig_atomic_t- Flag: variable only read or written (not
flag++orflag+=10) volatile sig_atomic_tare ints on most systems
- Flag: variable only read or written (not
- Avoid race conditions
- Cannot make
anyassumption regarding execution order - However, we can control when handlers run by blocking
- Cannot make
- Explicitly waiting for signals: suppose handler sets global variable
pid:- Spin wait:
while(!pid) {}- Wasteful
- Pause:
while(!pid) pause()- Race condition
- Sleep:
while(!pid) sleep(1)- Too slow
- Solution:
sigsuspendint sigsuspend(const sigset_t *mask)- Equivalent to atomic:
sigprocmask(SIG_BLOCK, &mask, &prev); pause() sigprocmask(SIG_BLOCK, &prev, NULL);
- Spin wait:
- Portable signal handling
- Problem: Different versions of unix have different signal handling semantics
- Solution: Use
sigaction
- Physical Addressing: Used in microcontrollers, embedded systems, etc.
- Mentality: Main memory is a fully-associative cache for disk
- Load doesn't necessarily happen with
execve(). It only allocates virtual address space with valid bit of 0 - Loading is a result of a page fault (demand paging)
- Load doesn't necessarily happen with
- Kernel memory invisible to application program. Kernel's address space starts with 1.
- Every memory access go through cache memory:
- Address translation: Multi-level page tables
- TLB: Small set-associative hardware cache in MMU
- Works only because of locality
- Unix I/O
- Opening and closing files:
open(),close() - Reading and writing files:
read(),write() - Changing file position:
lseek() - View file metadata:
stat()stat()are both a syscall and a linux program- Syscalls are in second section of man:
man 2 stat
- Always check return codes for these syscalls
- Opening and closing files:
- File types: Regular, directory, socket, named pipes, symlinks, character and block devices
- Short counts: (
nbytes < sizeof(buf)) are possible - Wrapper: RIO (robust I/O) package
- Unbuffered I/O of binary data:
rio_readn()andrio_writen() - Buffered I/O of text or binary:
rio_readlineb()andrio_readnb()
- RIO package is better for input and output on network sockets
- Unbuffered I/O of binary data:
- Standard I/O
- Opening and closing:
fopen()andfclose() - Reading and writing bytes:
fread()andfwrite() - Reading and writing text lines:
fgets()andfputs() - Formatted reading and writing:
fscanf()andfprintf()
- C program begin with 3 open files:
stdin(descriptor 0)stdout(descriptor 1)stderr(descriptor 2)
- Opening and closing:
- Trace syscalls with the Linux
straceprogram - Choosing I/O functions
- General: Use highest-level functions
- When to use Unix I/O: Signal handlers because unix I/O functions are
async-signal-safe - When to use standard I/O: Disks, terminals
- When to use RIO: Network sockets
- How kernel represents open files
- Open file table: An instance of opening file
- If a process opens a file twice, there are two open file tables pointing to the same v-node table
- V-node table: File metadata (regardless of whether file is open)
- After
fork(),refcntis incremented:
- Two processes share a same instance of opened file (including file position)
dup2(int oldfd, int newfd): Used for I/O redirection
- Open file table: An instance of opening file
- Recommended references:
- W. Richard Stevens & Stephen A. Rago, Advanced Programming in the Unix Environment, 2 nd Edition, Addison Wesley, 2005
- End-to-end Core i7 Address Translation

- L1 d-cache
indexandoffsethave 12 bits is NOT a coincidence: Speed up address translation

- Linux organizes VM as collections of areas:
- Fault handling: Traverse the
vm_area_structs to check if page is allocated
- Fault handling: Traverse the
- Private Copy-on-write (COW)
- Memory Mapping:
void *mmap(void *start, int len, int prot, int flags, int fd, int offset)start: A hint addressprot:PROT_READ,PROT_WRITE,PROT_EXECflags:MAP_ANON,MAP_PRIVATE,MAP_SHARED- Returns a pointer to the start of mapped area (may not be start)
- Allocators: Maintain the heap as a collection of variable sized blocks, which are either allocated or free
- Explicit allocator: Application allocates and frees
- Implicit allocator: Application allocates but not frees
- The
mallocpackagevoid *malloc(size_t size)void free(void *p)calloc: Initializes allocated blocks to 0realloc: Changes size of previously allocated blocksbrk: Used internally by allocators to grow and shrink the heap
- Constraints:
- Applications have few constraints
- Allocators have many constraints:
- Can't assume allocation patterns
- Must respond immediately to
malloc(can't defer allocation) - Can't relocate allocated memory
- Performance goal (2 conflicting goals)
- Throughput: Number of completed requests per unit time
- Peak memory utilization: How to efficiently use memory
- Fragmentation
- Internal fragmentation:
- Payload smaller than block size
- Easy to measure
- External fragmentation:
- Enough aggregate heap memory, but no single free block large enough
- Difficult to measure
- Internal fragmentation:
- Keeping track of free blocks

- How to find a free block
- First fit
- Next fit
- Best fit
- Know how much to free: Header
- Implicit list
- Allocating a free block: May need to split the block
- Freeing a block: Have to coalesce free blocks (4 cases):

- Singly-linked list cannot free previous block in constant time
- Fix: Doubly-linked list (head and footer)
- Optimization:
- Allocated blocks doesn't need coalescing
- We have extra bits to encode whether previous block is allocated
- So, allocated blocks doesn't need footer
- Implicit lists are not commonly used because of linear time. However, the concepts of splitting and coalescing are general to all allcators
- Explicit free list
- Maintain list of free blocks using payload area
- Blocks can be in any order (depending on insertion policy)
- Unordered: LIFO, FIFO
- Address-ordered
- Much faster than implicit lists when memory is full
- Segregated list

- Garbage collection
- Mark and sweep collecting
- Allocate using
mallocuntil run out of space - Use extra mark bit for each block
- Root nodes: Pointers in stack/data section
- Does not distinguish between pointers/non-pointers, thus "safe"
- Mark: Start at root nodes and do DFS
- Sweep: Start at beginning of VM, and free unmarked blocks
- How to find beginning of block? -- Use a balanced tree
- Allocate using
- Mark and sweep collecting
- Client-Server Architecture
- Network Architecture
- Socket
- Host and Service Conversion:
getaddrinfo - Client/Server interface

telnet: Testing servers- Creates TCP connection with a server (starts a session)
- Since the encoding of HTTP is ascii, we can hard-code http requests
- HTTP
- Content: A sequence of bytes in MIME (Multipurpose Internet Mail Extension) type
- The contents can be either static or dynamic
- Dynamic content:
- Produced by server-side program
- If URI containts
cgi-binthen serve dynamic content - Use
fork()andexec()to execute new program - Use env-var
QUERY_STRINGto pass parameters
- Iterative servers have serious flaws.
- Easily get blocked by single misbehaving client
- Note: Blocking does not happen upon client calling
connect()orwrite(), but uponread(). This is because server's kernel provides buffering
- Note: Blocking does not happen upon client calling
- So we need concurrent servers
- Easily get blocked by single misbehaving client
- Process-based servers
- Parent must close connected socket (parent doesn't get reaped)
- Child should close listening socket (child gets reaped)
- Reap child with
SIGCHLDhandler
- Event-based servers
- Manage multiple connections in user space
- Determine events using
select()orepoll() - Design of choice for high-performance web servers
- However, hard to provide find-grained concurrency
- Cannot take advantage of multi-core
- Thread-based servers
- Can run threads in
detachedmode. It will run independently, and get reaped automatically - Possible race conditions when passing parameters to new thread in
pthread_create()
- Can run threads in


























