a. You need to design one or more MapReduce job(s) to perform the matrix multiplication described above in the small dataset
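
Design sketch:

The hw1/tmp directory used below suggests a two-job design: a first job that joins M and N on their shared index j and emits partial products keyed by the output cell (i,k), and a second job that sums those partial products per cell. The following is a minimal sketch of that design, not the actual source: the "row,col,value" line format, the class names, and the file-name-based matrix tagging are all assumptions.

// Sketch only. Assumed input line format: "row,col,value"; matrix M entries are
// distinguished from N entries by the input file name (M_*.dat vs N_*.dat).
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class MatrixMultiplySketch {

    // Job 1 map: key M(i,j) and N(j,k) by the shared index j so that matching
    // entries meet in the same reduce call.
    public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String file = ((FileSplit) ctx.getInputSplit()).getPath().getName();
            String[] f = line.toString().split(",");          // "row,col,value"
            if (file.startsWith("M")) {
                ctx.write(new Text(f[1]), new Text("M," + f[0] + "," + f[2]));
            } else {
                ctx.write(new Text(f[0]), new Text("N," + f[1] + "," + f[2]));
            }
        }
    }

    // Job 1 reduce: cross every M entry with every N entry sharing index j and
    // emit partial products keyed by the output cell "i,k" (written to hw1/tmp).
    public static class JoinReducer extends Reducer<Text, Text, Text, DoubleWritable> {
        @Override
        protected void reduce(Text j, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            List<String[]> ms = new ArrayList<>();
            List<String[]> ns = new ArrayList<>();
            for (Text v : values) {
                String[] f = v.toString().split(",");
                (f[0].equals("M") ? ms : ns).add(f);
            }
            for (String[] m : ms) {
                for (String[] n : ns) {
                    double product = Double.parseDouble(m[2]) * Double.parseDouble(n[2]);
                    ctx.write(new Text(m[1] + "," + n[1]), new DoubleWritable(product));
                }
            }
        }
    }

    // Job 2 reduce: sum all partial products for each output cell (i,k).
    // Job 2's mapper just re-parses the "i,k <tab> product" lines from hw1/tmp.
    public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text cell, Iterable<DoubleWritable> products, Context ctx)
                throws IOException, InterruptedException {
            double sum = 0.0;
            for (DoubleWritable p : products) {
                sum += p.get();
            }
            ctx.write(cell, new DoubleWritable(sum));
        }
    }
}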

Commands:

# compile source code and generate jar
mvn compile
mvn install

# download and prepare data
hadoop dfs -mkdir -p hw1/input
hadoop dfs -mkdir -p hw1/tmp
hadoop dfs -mkdir -p hw1/output

hadoop dfs -copyFromLocal *.dat hw1/input

# execute job
hadoop jar ybai-1.0-SNAPSHOT.jar Main hw1/input/M_small.dat hw1/input/N_small.dat hw1/tmp/small hw1/output/small

# query result
# my student id is: 1155114481, so the line should end with 81
hadoop dfs -cat hw1/output/small/part-r-00000 | awk '{if(($1-81)%100==0) print}' > small.txt
cat small.txt

Results:

The results can be found in result/small.txt.

b. Perform the matrix multiplication in the median dataset[2] and large dataset[3].

Commands:

# median dataset
hadoop jar ybai-1.0-SNAPSHOT.jar Main hw1/input/M_median.dat hw1/input/N_median.dat hw1/tmp/median hw1/output/median

# large dataset
hadoop jar ybai-1.0-SNAPSHOT.jar Main hw1/input/M_large.dat hw1/input/N_large.dat hw1/tmp/large hw1/output/large

# my student id is: 1155114481, so the line should end with 481
hadoop dfs -cat hw1/output/median/part-r-00000 | awk '{if(($1-481)%1000==0) print}' > median.txt
cat median.txt

hadoop dfs -cat hw1/output/large/part-r-00000 | awk '{if(($1-481)%1000==0) print}' > large.txt
cat large.txt

Results:

The results for the median and large datasets can be found in result/median.txt and result/large.txt.

c. Re-run your program in b) multiple times using different number of mappers and reducers for your MapReduce job each time. You need to examine and report the run-time performance statistics of your program for at least 4 different combinations of number of mappers and reducers.

Here I use the large dataset for testing purposes.

Commands:

# 55 mapper, 1 reducer
hadoop jar ybai-1.0-SNAPSHOT.jar Main hw1/input/M_large.dat hw1/input/N_large.dat hw1/tmp/large hw1/output/large_55m_1r 1

# 55 mapper, 10 reducer
hadoop jar ybai-1.0-SNAPSHOT.jar Main hw1/input/M_large.dat hw1/input/N_large.dat hw1/tmp/large hw1/output/large_55m_10r 10

# 55 mapper, 20 reducer
hadoop jar ybai-1.0-SNAPSHOT.jar Main hw1/input/M_large.dat hw1/input/N_large.dat hw1/tmp/large hw1/output/large_55m_20r 20

# change mapred.min.split.size from 128MB to 64MB to double the mapper count (see the configuration sketch after these commands)
# 110 mapper, 20 reducer
hadoop jar ybai-1.0-SNAPSHOT.jar Main hw1/input/M_large.dat hw1/input/N_large.dat hw1/tmp/large hw1/output/large_110m_20r 20 64

# 110 mapper, 40 reducer
hadoop jar ybai-1.0-SNAPSHOT.jar Main hw1/input/M_large.dat hw1/input/N_large.dat hw1/tmp/large hw1/output/large_110m_40r 40 64

# change mapred.min.split.size further, to 32MB, to double the mapper count again
# 220 mapper, 40 reducer
hadoop jar ybai-1.0-SNAPSHOT.jar Main hw1/input/M_large.dat hw1/input/N_large.dat hw1/tmp/large hw1/output/large_220m_40r 40 32
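
The optional trailing arguments passed to Main above (a reducer count, and for the last three runs a split size in MB) are what vary the task counts. A plausible way to wire them into the second job is sketched below; the class and method names here are hypothetical, not taken from the actual Main implementation.

// Hypothetical sketch of how Main's optional arguments could be applied.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobTuningSketch {
    public static Job configureSecondJob(String tmpDir, String outDir,
                                         int numReducers, Long splitSizeMB) throws Exception {
        Job job = Job.getInstance(new Configuration(), "matrix-multiply-sum");
        job.setJarByClass(JobTuningSketch.class);

        // Reducer count comes straight from the command line (1, 10, 20 or 40 above).
        job.setNumReduceTasks(numReducers);

        // Optional split size override. The commands above describe tuning
        // mapred.min.split.size; since the effective split size is
        // max(minSize, min(maxSize, blockSize)), this sketch sets both bounds
        // so the requested value (64MB or 32MB) actually takes effect.
        if (splitSizeMB != null) {
            long bytes = splitSizeMB * 1024L * 1024L;
            FileInputFormat.setMinInputSplitSize(job, bytes);
            FileInputFormat.setMaxInputSplitSize(job, bytes);
        }

        // The second job reads the intermediate data from hw1/tmp.
        FileInputFormat.addInputPath(job, new Path(tmpDir));
        FileOutputFormat.setOutputPath(job, new Path(outDir));
        return job;
    }
}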

Results:

Mapper Count  Reducer Count  Mapper Time Cost  Reducer Time Cost  Total Time Cost
55            1              15min             22min              37min
55            10             15min             7min               22min
55            20             15min             4min               19min
110           20             16min             4.5min             20.5min
110           40             16min             5.5min             21.5min
220           40             18min             6.5min             24.5min
                    Max      Min      Avg
Mapper time         18min    15min    15.8min
Reducer time        22min    4min     8.25min
Total job time      37min    19min    24.08min

Explanations:

  1. A mapper count of 55 is optimal. When I increase the mapper count to 110 and 220, the mapper time cost also increases. The intermediate data produced by the first job for the large dataset is around 6995 MB and the default block size is 128 MB, so the default mapper count for the second job is 6995/128, i.e. 55 mappers (see the split-count arithmetic after this list). The total CPU and memory power in the cluster is fixed, so increasing the mapper count does not add much real parallelism. However, as the mapper count increases, the HDFS read overhead grows, since a single map task may need to read multiple HDFS data blocks, and more tasks mean more scheduling and context-switch overhead. This overhead outweighs the extra parallelism, which is the main reason the mapper time cost increases.
  2. A reducer count of 20 is optimal. The reducer time cost decreases as the reducer count grows from 1 to 20, but increases again from 20 to 40. The main reason for the decrease from 1 to 20 reducers is the higher compute parallelism: more reducers can utilize more of the cluster's CPU power. The main reason for the increase from 20 to 40 reducers is similar to the mapper case: the overhead of the extra tasks outweighs the parallelism gained.
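
For reference, the mapper counts above line up with the intermediate data size listed in the Notes section divided by the split size (splits are computed per input file and rounded up, so the real count can be slightly higher):

6995.32 MB / 128 MB ≈ 55 splits  -> 55 mappers
6995.32 MB /  64 MB ≈ 110 splits -> 110 mappers
6995.32 MB /  32 MB ≈ 219 splits -> ~220 mappers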

Notes

Intermediate data size

Type    Intermediate Data Size (MB)
Small   8.45
Median  874.59
Large   6995.32