Parallel Travelling Salesman Problem using MPI (Message Passing Interface)

1. Overview

In this repo, I had developed a pipeline implementation for the Dynamic Programming solution of Travelling Salesman Problem. Which show performance 2 times faster than the serial version in terms of timing runs.

2. Serial Solution

Reconstructed from this Leetcode post by DBabichev which solves the problem of Find the Shortest Superstring. Under the hood, it's actuall a Travelling Salesman Problem.

Time Complexity:

The implementation is identical to the Java solution provided in the assignment by williamfiset. The overall Time Complexity is O(n^2 * 2^n):

Assuming the number of nodes are n.

We first initialize a 2-D matrix dp[state][node] representing the minimal cost/path visiting all nodes from bit mask state and last visited node node.

Then we construct the dp row by row starting from dp[0] to dp[(1 << n) - 1]. Inside of each row current_mask, we will construct it by reusing all the previous row state_dest_not_visited which with Hamming Distance of 1 comparing to current_mask.

At last, we will return dp[(1 << n) - 1][0] which mean all nodes have been visited and last visited node is 0, i.e, visiting all n nodes and back to the starting node.

Hence, the time complexity, to construct the dp matrix, is number of possible bit mask by number of permutation of any 2 nodes, i.e., O(n^2 * 2^n).

3. Multi-Threaded Solution

Reconstructed from the iterative version. Instead of solving the dp matrix row by row, we put each row's processing to a helper function in a thread pool created by ThreadPoolExecutor with thread count of p = min(os.cpu_count(), len(weights)). Inside of each thread, it will call helper function which update the dp matrix in the share memory for each row.

Time Complexity:

Instead of processing each row in sequential, we solve a p number of sub-problems in parallel. Updating the dp matrix is happening in shared memory, hence no extra communication overhead needed. Hence the overall Time Complexity is O(n^2 * 2^n / p), when p == 2^n, the time complexity becomes O(2^n)

4. MPI Pipeline Design

Overview

In this setup, I first let the rank 0 processor do a OneToAll broadcast of the weight matrix with size of n².

Doing this will need:

$$T_{OneToAllBroadcast} = (T_{s} + T_{m} * n ^ 2) * \log p$$

Then, the rank 0 processor will generate the initial task with time:

$$T_{InitialTaskGeneration} = n$$

After that, all the processors will start working on solving the sub problem of the whole DP table as shown below.

We can separate the DP table into layers, each layer representing number of nodes have been visited and among them having task dependency:

The 1^stlayer representing no node has been visited and needed by the 2^nd layer

The 2^nd represeniting all masks with 1 node been visited and needed by the 3^rd layer

The 3^rd layer represeniting all masks with 2 nodes been visited and needed by the 4^th layer

...

![4NODE TSP 5](README.assets/4NODE TSP 5.png)

For each iteration, all processors will work on one layer simultaneously.

For each bit mask being handled, Let's assume the mask having b 1 bit in the bit mask,the processor will do the following 3 tasks:

PointToPoint Recv b number of dependent tasks with size n from above layer
Calculate the optimal path. For each un-visited node, find the optimal path from visited nodes. Time will be n²
PointToPoint Send n - b number of optimal path with size n to the corresponding below layers' processors.

All together, the time need to process one bit mask:

$$T_{Bit Mask} = n * (T_{s} + T_{m} * n) + n^2$$

Processor Load Balancing

Assuming our number of processor p is the power of 2, i.e.: 2, 4, 8, 16...

I find the folling mapping rule will give a better load balancing across all processor:

(mask + 1) // 2 % p

For example, for a 4 nodes 4 processors mapping:

Decimal Mask	MapProcessor	Bit Mask
0	0	0000

2	1	0010
4	2	0100
8	0	1000

6	3	0110
10	1	1010
12	2	1100

14	3	1110

15	0	1111

Each processor will handle equal number of mask:

Processor Rank	Task Count
0	2
1	2
2	2
3	2

Similarly for a 5 nodes 4 processors mapping:

Decimal Mask	MapProcessor	Bit Mask
0	0	00000

2	1	00010
4	2	00100
8	0	01000
16	0	10000

6	3	00110
10	1	01010
12	2	01100
18	1	10010
20	2	10100
24	0	11000

14	3	01110
22	3	10110
26	1	11010
28	2	11100

30	3	11110

31	0	11111

Processor Rank	Task Count
0	4
1	4
2	4
3	4

5. Performance Analysis

Time complexity

Below table is a preview of the overall time complexity:

	Large Number of Processor	Small Number of Processor
$$T_{\text{Computing TSP}}$$	$$O(n^3) $$	$${n ^ 2 \cdot 2 ^ {n - 1} \over p} $$
$$T_{p}$$	$$O(n ^ 3) $$	$${n ^ 2 \cdot 2 ^ {n} \over p} + n^2\log p + n$$

As we can see from the above task dependency graph figure, the middle layer requires the most computation and processors $$N_{Processor} = {n - 1 \choose {n \over 2}}$$.

Case 1: $$p \geq N_{Processor}$$

If we have number of processor $$p \geq N_{Processor}$$, the number of processor is large enough to handle each layer in one step concurrently, i.e:

$$ N_{\text{Step}} = 1 $$

Then the time complexity will be:

$$ T_{\text{Computing TSP}} = T_{Bit Mask} * N_{Step} * Layer = n^2(T_{s} + T_{m}n) + n^3 = O(n^3) $$

$$ T_{\text{p}} = T_{\text{Computing TSP}} + T_{OneToAllBroadcast} + T_{InitialTaskGeneration} $$

$$ = n^3 + n^2\log p + n = O(n^3)$$

Case 2: $$p < N_{Processor}$$

For the case that p is lesser, which is most possibly the case as we can seen from the table below.

The processors needed will grow near exponentially. It's imposisble to have this many number of processors:

Number of Node	Mask count of the middle layer
10	126
11	252
12	462
13	924
14	1,716
15	3,432
16	6,435
17	12,870
18	24,310
19	48,620
20	92,378
21	184,756
22	352,716

If we assume each layer will have the average identical number of masks, i.e.,

$$N_{Average} = {N_{Total Number Of Mask Of All Layers} \over N_{Number Of Layer}} = {2 ^ {n} \ \over n}$$

Then each layer will need N_Step iterations to finish computing all masks:

$$ N_{\text{Step}} = {N_{Average} \over p} = {2^{n} \over np} $$

Then for this case the overall time needed will be :

$$ T_{\text{Computing TSP}} = T_{Bit Mask} * N_{Step} * Layer = (n (T_{s} + T_{m} * n) + n^2) * {2^{n} \over np} * n = {n ^ 2 \cdot 2 ^ {n} \over p} $$