/** @mainpage 15-410 Project 2 @author Jian Wang (jianwan3) @author Ke Wu (kewu) 1. Mutex: Mutex is a low level construct that serves the purpose of mutual exclusion. We have devised both a basic spinlock mutex and an advanced mutex that approximates bounded waiting. 1.1 Spinlock: The spinlock mutex is used in basic libraries like malloc and serves as the inner lock of the advanced mutex. It simply tries to acqure a lock (using xchg) for a limited times until it defers trying through yielding. The design choice related to it is whether to yield immediately if a lock is not available. In a single core machine, if a lock is not available, then it's likely other thread not running is holding the lock, so the current thread should yield immediately; in a multi-core machine, the thread that is holding the lock may be running as well and is likely to release the lock in a short time, so it makes sense for the current thread to try acquring for a few times instead of yielding immediately. To adapt to work well in a multi-threaded environment, our spinlock tries a few times before it yields. 1.2 Advanced mutex approximating bounded waiting: The advanced mutex uses a FIFO queue to manage threads that want the lock object and a spinlock as an inner lock to guard against accesses to the queue and the mutex object. When a thread tries to acquire a mutex lock and can't make it because other thread is holding it, it enqueues itself to the waiting queue, yields itself and checks for lock availability next time it awakes. When the lock holder thread unlocks the mutex, it deques the first thread in the waiting queue of the lock, and yields the exection to it. The advanced mutex achives bounded waiting through the usage of a waiting queue and a spinlock. Although spinlock itself doesn't satisfy bounded waitting, the critical section (which is the code of advanced mutex) that protected by spinlock are guaranteed to be short. Becuase only when one thread is in the advanced mutex code need to obtain the spinlock. It is unlikely that there are always some threads that hold the spinlock of mutex. So it is better than we use spinlock (xchg) as implementation of mutex directly because we can not predict what the critical section mutex is trying to protect. So spinlock help mutex somewhat approximate bounded waiting. 2. Condition variable: A condition variable is used to wait for an event with efficiency by relinquishing CPU voluntarily to other threads until the event changes signals. In our implementation, it uses a mutex to guard against its accessing to the waiting FIFO queue, which contains the threads that are waiting for an event to change. There are race conditions involved when threads dechedule to wait for an event and when threads make others runnable to signal an event. The atomicity is achieved through the usage of a flag called reject and syscall deschedule(reject). 3. Semaphore: A semaphore is implemented with mutex, condition variable, and a counter. The counter gives the idea of how many resources are left. The mutex is used to guard against accessing and modifying counter, while the condition variable blocks and signals threads as number of resources fluctuates. 4. Readers/writers locks: Our implementation favors writer locks. The drawback of this approach may be that readers may be starved, though. To achieve our goal, rwlock uses a variable lock_state to store the state of the rwlock (unlocked, shared, exclusive, destroied). rwlock is shared by multiple readers, lock_state also cotains the info that how many readers are sharing. Rwlock also records how many readers and/or writers are waiting for the rwlock. When a new lock() request comes in, rwlock will decide whether to block the thread based on the lock type it requests, the state of the rwlock and how many readers/writers are waiting. Similarly, when a thread unlock(), rwlock will decide who will get the lock next based on the info above. To achieve blocking, two conditional variables are used for readers and writers. To protect rwlock data structre, a mutex is used. We use mutex/cvar to implement rwlock because it gives us more flexibility to design the rwlock. 5. Thread library: 5.1 Data structures related to thread management: An expandable array called arraytcb is used to manage thread's information. Each thread has an associated tcb (thread control block) to manage it. The tcb of a thread contains information like: state (RUNNING, JOINED), ktid (thread id in the kernel side), tid (Thread id used by the thread lib, increamented monotonically with the root thread's tid as 0, the first new thread's tid as 1, the second as 2, etc), and a condition variable for it (so that other threads can join on it). When a thread is created, its personal tcb is created and inserted into the arraytcb. Later, the thread's information can be achieved from the arraytcb using its index, which is the same as its stack position index. A hashtable is used to manage threads's exit status. When a thread exits, either by explicitly calling thr_exit(), or returns directly, thr_exit() will be called anyway to delete the tcb of the thread in the arraytcb, put in the hashtable with thread id as key, and return value as value. The stack space of the exiting thread is released immediately by remove_pages() except the pages that are shared by other alive threads. Later, when a thread joins other thread, it will look up the hashtable to get the exit status of the thread. 5.2 Stack space: Thread stack space management: Each thread's stack space is ajacent to each other, with the highest stack being originally the root thread's stack. Each thread get's an index specifying where its stack region is when it's created. The index is determined by the current available slots on the stack, which corresponds to an available slot in the arraytcb. After the index is assigned, the pages of the thread's stack region are allocated and the thread's stack pointer is set to the top of its private stack. To achieve memory-efficient thr_exit(), the stack space of the exiting thread will be deallcated at the end of thr_exit() so that "zombie thread" will not hold onto large amounts of memory. we use a hash table to store the exit status of the exiting thread. Stack memory management: Stack spaces are allocated through new_pages() syscall each time a new thread is created and removed through remove_pages() syscall when it exits. Our implementation favors this approach over the one that preserves allocated stack spaces for future threads' use without deallocating them until the entire task vanishes, for the reason that though there may be some overhead of repetedly calling syscalls of new_pages() and remove_pages(), in the real application usage, if the user application is a long running program like a server or database, it may ocasionally create a lot of worker threads that will exit after their works are done, but itself, the root thread, will live for a long time and the task will not vanish, thus the previously allocated stack space for work threads will be held by this task wastefully, and no other tasks can use these spaces until the server or database vanishes. Autostack for single root thread: Autostack is supported for single-threaded programs. The root thread's stack can grow down beyond the orinal limit allocated by the kernel until there's not enough memory resources. This function is achieved through the usage of an exception handler dedicated to page fault caused by accessing memory outside of stack space previously allocated. The exception handler is registered for the root thread at the entry point of the program and re-registered again after it handles a fixable page fault opearation on the stack. Autostack is disabled when a new thread is created and the program enters multi-threaded mode. The root thread's stack low will become fixated and will not be extended any more. If in the future the root thread uses more space than its current max stack size and causes a page fault, the kernel will handle it instead of the user exception handler. Tricky parts of thr_create() and thr_exit(): When creating new threads in thr_create(), at the very beginning, a new thread's %esp is not moved down to its own stack, it shouldn't use the caller's stack, so some of the goals need to be done in registers to avoid using stack. Similarly, When thr_exit() is called, the calling thread needs to remove pages of its stack space which also means after a point, it cannot use its original stack but it still have some clean up work to do. Those works need to be written in Assembly with the help of registers. The challenging parts are that there are a limited amount of registers availale, (some of them needs to be avoided to preserve program states). 5.3 Function call failure handling: For function calls that have a non-void return type, we report the error to callers. For function calls that have a void return type, For some situations that it is possible to recover (e.g. malloc calls, lock an unlocked mutex) we print a log message, wait a while and retry. Because it is possiable that memory becomes enough for malloc or mutex is locked by other threads. For some situations that it is impossible to recover (e.g. try to use a destroied mutex), the program will call panic(). 5.4 Thr_exit for root thread: Ideally, root thread of multi-thread program should also call thr_exit() to get its return status collected by other threads join on it, but there's no guarantee the root thread will do so. So our implementation modifies the return address of the root thread's main in thr_init() to achieve this purpose. 5.5 Efficient way to implement thr_getid() and thr_getktid(): When a thread is created, its personal tcb is created and inserted into the arraytcb. Later, the thread's information (e.g. tid, ktid) can be achieved from the arraytcb using its index, which is the same as its stack position index. Because stack position index of the current thread can be calculated easily (by looking at the value of %esp and do some math), it gives an very efficient way to impelement thr_getid() and thr_getktid(). 6. Discussions: 6.1 Buffer zone: In the current implementation, stack spaces of different threads are ajacent to each other without a "buffer zone", say, a page that doesn't belong to any thread. In other words, there's no protection and warning if a thread accesses stack space that doesn't belong to it due to stack autogrwoing and that stack space happenes to have been allocated for another thread. Although, a thread can always access other's stack space if it really wants to by skipping the buffer zone, but the idea of buffer zone gives protection to some extent. 6.2 Multi-threaded autostack: The other idea we have considered but not implemented is mutlti-threaded autostack. The idea is to allocate exact amout of virtual address space that each thread asks for, but at most one page physical space initially for each thread and only after it uses more than that amount of memory do we give them more. This approach is likely to work well in a running environment with limited physical space that needs a lot of threads set with large stack size but don't actually use that many. 6.3 TCB with more granularity: One of the goals of the thread library is to achieve high concurrency with thread-safety. Currently, many thread lib operations needs to access and modify arraytcb, which is a concurrency bottle neck in the sense that the the arraytcb is guarded by a single mutex. Although we try our best to reduce the critical section of arraytcb (e.g. avoid some syscall in critical section), it may still cause some threads to wait on the global mutex. A better approach would be to have a private mutex for each tcb, which will increase memory usage as a side effect though. */