/mapreduce

Primary LanguageC++MIT LicenseMIT

mapreduce

Author	: 	Anshil Gandhi
CCID	:	1523205

Overview

mapreduce library is a programming model and a distributed computing paradigm for large-scale data processing. It allows for applications to run tasks in parallel, making them scalable and fault-tolerant.

To use the mapreduce library, developers must implement two callback functions, namely map and reduce.

  • The map function takes in filename as input and generates intermediate key/value pairs from the data stored in that file.
  • The reduce function processes a subset of intermediate key/value pairs, summarizing all values associated with the same key.

Modifications

  • changed threadpool function types to void and pass threadpool by reference for in-place update, for computation speedup
  • removed ThreadPool_destroy function, each thread is destroyed as soon as there is no job left for it to work on
  • ThreadPool_add_work is changed from being a bool function to a void function
  • ThreadPool_work_t *ThreadPool_get_work(ThreadPool_t *tp) is replaced as a method within the ThreadPool_work_queue_t struct for better code style

Additional notes

  • Threads in the thread pool are referred by their thread ID, each of them stored in std::vector
  • Task queue for the thread pool is implemented using std::priority_queue, with priority given to tasks with longer expected processing time, pushing and popping of a job takes O(log(n)) time where n is the number of jobs in the queue. Access of a job takes O(1) time
  • A mutex lock is used for thread-safe access and pop from the wait queue
  • std::vector stores partitions, where each partition is a std::multimap<std::string, std::string> storing the intermediate key/value pairs resulting from execution of the map function. std::multimap provides a convenience of storing duplicate keys while maintaining the ascending order of the key/value pairs
  • MR_Emit(key, value) takes an asymptotic running time of O(k + log(|T|)) where k is the length of the string key and |T| is the size of the partition.
  • Running time of MR_GetNext(key, partition_number) is dominated by an erase call upon a partition given an iterator as argument, hence the worst-case running time for this function is O(log|T|), where |T| is the size of the partition.
  • Testing of this library was done by generating test-cases and running on them. Furthermore, test-cases generated by the class TAs were also used to evaluate the performance.

Resources