mapreduce

Author	: 	Anshil Gandhi
CCID	:	1523205

Overview

mapreduce library is a programming model and a distributed computing paradigm for large-scale data processing. It allows for applications to run tasks in parallel, making them scalable and fault-tolerant.

To use the mapreduce library, developers must implement two callback functions, namely map and reduce.

The map function takes in filename as input and generates intermediate key/value pairs from the data stored in that file.
The reduce function processes a subset of intermediate key/value pairs, summarizing all values associated with the same key.

Modifications

changed threadpool function types to void and pass threadpool by reference for in-place update, for computation speedup
removed ThreadPool_destroy function, each thread is destroyed as soon as there is no job left for it to work on
ThreadPool_add_work is changed from being a bool function to a void function
ThreadPool_work_t *ThreadPool_get_work(ThreadPool_t *tp) is replaced as a method within the ThreadPool_work_queue_t struct for better code style

Additional notes

Threads in the thread pool are referred by their thread ID, each of them stored in std::vector
Task queue for the thread pool is implemented using std::priority_queue, with priority given to tasks with longer expected processing time, pushing and popping of a job takes O(log(n)) time where n is the number of jobs in the queue. Access of a job takes O(1) time
A mutex lock is used for thread-safe access and pop from the wait queue
std::vector stores partitions, where each partition is a std::multimap<std::string, std::string> storing the intermediate key/value pairs resulting from execution of the map function. std::multimap provides a convenience of storing duplicate keys while maintaining the ascending order of the key/value pairs
MR_Emit(key, value) takes an asymptotic running time of O(k + log(|T|)) where k is the length of the string key and |T| is the size of the partition.
Running time of MR_GetNext(key, partition_number) is dominated by an erase call upon a partition given an iterator as argument, hence the worst-case running time for this function is O(log|T|), where |T| is the size of the partition.
Testing of this library was done by generating test-cases and running on them. Furthermore, test-cases generated by the class TAs were also used to evaluate the performance.

Resources

https://eclass.srv.ualberta.ca/course/view.php?id=53653 (CMPUT 379 class examples and notes)
https://en.cppreference.com/w/ (cpp reference)
Stevens, W. Richard, and Stephen A. Rago. Advanced Programming in the UNIX Environment. Addison-Wesley, 2014.

gandhi56/mapreduce

mapreduce

Overview

Modifications

Additional notes

Resources