Project 3 Sorting

Author Zhen Tong 120090694@link.cuhk.edu.cn

Projects Home: https://58191554.github.io/project_page.html

Quick Sort with MPI

In the multi-process quick sort, the sorting is done in chunks: the output of the local sort stage is P sorted arrays, where P is the number of processes. To fully use the computing resources, the master works alongside the slaves and is counted in P. With an evenly divided array, each process sorts its chunk with the quick sort algorithm, which costs $O(\frac{N}{P}\log\frac{N}{P})$. Then every process sends its data back to the master node, which takes $t_{\mathrm{trans}} \times N$ time, where $t_{\mathrm{trans}}$ is the per-element transfer cost. Finally, the P sorted arrays are merged into the output array. The merge visits each of the N elements once, and we use a heap to hold the smallest unconsumed element of each of the P arrays. When the smallest element pops out, we push the next element of the same array onto the heap, which takes $O(\log P)$. So the overall merge is $O(N\log P)$, and the overall time complexity of Quick Sort with MPI is

$$O\!\left(\frac{N}{P}\log\frac{N}{P}\right) + t_{\mathrm{trans}} \times N + O(N\log P)$$
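
Below is a minimal sketch of the heap-based P-way merge described above, written as standalone C++ (the chunk layout and variable names are illustrative assumptions, not the exact project code). Each heap entry carries the value together with the chunk it came from, so popping the minimum and pushing that chunk's next element both cost $O(\log P)$.

```cpp
#include <queue>
#include <tuple>
#include <vector>

// Merge P sorted chunks into one sorted array with a min-heap of size P.
// Each heap node is (value, chunk id, position inside that chunk).
std::vector<int> mergeSortedChunks(const std::vector<std::vector<int>>& chunks) {
    using Node = std::tuple<int, std::size_t, std::size_t>;
    std::priority_queue<Node, std::vector<Node>, std::greater<Node>> heap;

    std::size_t total = 0;
    for (std::size_t p = 0; p < chunks.size(); ++p) {
        total += chunks[p].size();
        if (!chunks[p].empty())
            heap.emplace(chunks[p][0], p, 0);   // smallest element of every chunk
    }

    std::vector<int> result;
    result.reserve(total);
    while (!heap.empty()) {
        auto [value, p, i] = heap.top();        // global minimum, O(log P) to pop
        heap.pop();
        result.push_back(value);
        if (i + 1 < chunks[p].size())           // next element of the same chunk
            heap.emplace(chunks[p][i + 1], p, i + 1);
    }
    return result;                               // N pops and pushes: O(N log P)
}
```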

Bucket Sort with MPI

The idea of Bucket Sort is to first divide the array into bucket intervals, sort each bucket, and then merge the buckets with O(1) comparison cost, since the buckets are already ordered by their value ranges. The motivation for this algorithm is that the merging time is small; the price is that distributing the data into buckets costs $O(N)$. In the MPI code, we first find the range of the data, which is $O(N)$, and then obtain each bucket's interval by dividing the data range by the number of buckets. Then we allocate the buckets to the processes and let them do the data division. After that, insertion sort is applied inside each bucket. The master prepares $M \times N$ space to receive data, where M is the total number of buckets and N is the size of the whole array. The MPI_Send() call sends a structure of the form [#data, data]: we need to tell how many elements are in each bucket so that the master can move the data into the output vector.
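
As a rough sketch of that bucket stage (variable names, the tag choice, and the use of std::sort instead of the project's insertion sort are assumptions for brevity), each worker fills its assigned buckets and ships them back in the [#data, data] layout, so the master knows how many elements of its pre-allocated buffer each bucket actually uses:

```cpp
#include <mpi.h>
#include <algorithm>
#include <vector>

// A worker owns buckets [firstBucket, lastBucket). Bucket b covers the value
// interval [minVal + b*width, minVal + (b+1)*width), where width is the data
// range divided by the total number of buckets.
void fillAndSendBuckets(const std::vector<int>& data, int minVal, int maxVal,
                        int numBuckets, int firstBucket, int lastBucket,
                        int masterRank) {
    double width = double(maxVal - minVal + 1) / numBuckets;

    for (int b = firstBucket; b < lastBucket; ++b) {
        int lo = minVal + int(b * width);
        int hi = (b == numBuckets - 1) ? maxVal + 1 : minVal + int((b + 1) * width);

        // Data division: collect the elements that fall into this bucket.
        std::vector<int> bucket;
        for (int x : data)
            if (x >= lo && x < hi) bucket.push_back(x);

        // Sort inside the bucket (the report uses insertion sort here).
        std::sort(bucket.begin(), bucket.end());

        // [#data, data]: prepend the count so the master can tell how many
        // elements to copy out of its receive buffer for this bucket.
        int count = static_cast<int>(bucket.size());
        bucket.insert(bucket.begin(), count);
        MPI_Send(bucket.data(), count + 1, MPI_INT, masterRank,
                 /*tag=*/b, MPI_COMM_WORLD);
    }
}
```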

 

Indeed, the number of buckets in a bucket sort algorithm can significantly impact its performance. The choice of the number of buckets affects the balance between the size of each bucket and the number of comparisons and movements required during the sorting process.

Bucket Size:

Having too few buckets leads to large bucket sizes, which makes the sorting within each bucket inefficient. If a bucket becomes too large, the insertion sort used inside it becomes slow, and a more efficient sorting algorithm would be preferable for that bucket. In the experiment, for an array size of $10^8$, the bucket size is $10^6$.

Numerical Analysis

 

Odd-Even Sort with MPI

The Odd-Even Sort is not a good sequential algorithm, because it takes $O(N^2)$ time. However, because the compare-exchanges within one iteration have no data dependence on each other, it parallelizes well. The key point in MPI is to check the edges of each local chunk to see whether two adjacent processes need to exchange data.

There is a trick in this task. MPI communication can be very expensive, and the final sorted = true/false flag is too trivial a message to send every iteration. Therefore, we check the sorted flag only once every 10 to 100 iterations (24 in the experiment). We may perform a few useless extra iterations in return for saving expensive MPI communication. For example, let $t_1$ be the time of one iteration without communicating with the master, let $t_2$ be the total time of one round of communication between the master and all slaves, and suppose we run $n$ iterations per central check. Then the time saved overall is roughly

$$T \approx n \times t_2 - t_1$$
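
A simplified sketch of this batched check follows (assumed names; it also assumes every rank holds the same number of elements and, for brevity, exchanges its boundary elements in every phase). The local sorted flags are combined with a single MPI_Allreduce once every CHECK_INTERVAL phases instead of once per phase.

```cpp
#include <mpi.h>
#include <algorithm>
#include <vector>

const int CHECK_INTERVAL = 24;   // check the global "sorted" flag once per 24 phases

// One odd or even phase on the local chunk plus the boundary exchange with the
// neighbouring ranks. Returns 1 if nothing moved, 0 otherwise.
static int phase(std::vector<int>& a, int parity, int rank, int size) {
    int sorted = 1;
    for (std::size_t i = parity; i + 1 < a.size(); i += 2)        // local compare-exchange
        if (a[i] > a[i + 1]) { std::swap(a[i], a[i + 1]); sorted = 0; }

    if (rank + 1 < size) {                                        // edge with right neighbour
        int mine = a.back(), theirs;
        MPI_Sendrecv(&mine, 1, MPI_INT, rank + 1, 0, &theirs, 1, MPI_INT,
                     rank + 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (mine > theirs) { a.back() = theirs; sorted = 0; }     // keep the smaller one
    }
    if (rank > 0) {                                               // edge with left neighbour
        int mine = a.front(), theirs;
        MPI_Sendrecv(&mine, 1, MPI_INT, rank - 1, 0, &theirs, 1, MPI_INT,
                     rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (theirs > mine) { a.front() = theirs; sorted = 0; }    // keep the larger one
    }
    return sorted;
}

void oddEvenSort(std::vector<int>& a, int rank, int size) {
    while (true) {
        int locallySorted = 1;
        for (int k = 0; k < CHECK_INTERVAL; ++k)                  // a batch of cheap phases
            locallySorted &= phase(a, k % 2, rank, size);
        int globallySorted = 0;                                   // one expensive reduction
        MPI_Allreduce(&locallySorted, &globallySorted, 1, MPI_INT,
                      MPI_LAND, MPI_COMM_WORLD);
        if (globallySorted) break;                                // every rank stayed sorted
    }
}
```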

Merge Sort with OMP (Bonus)

The merge sort is divided into two steps: independent sorting of neighbouring sub-arrays, and recursive merging. Parallelizing the sorting step is simple: with a maximum of n threads we sort n sub-arrays independently. The most difficult part is parallelizing the merge. To merge independently in parallel, we first find the position of an element x of the left array X in the right array Y, and its position in the output array Z, using binary search. Assuming t threads are available inside one merge call, we need to find the positions of the splitters $x_1, x_2, \dots, x_{t-1}$; thread i can then perform the sub-merge of $X[x_i, x_{i+1})$ with $Y[\mathrm{pos}_Y(x_i), \mathrm{pos}_Y(x_{i+1}))$ and write the output to $Z[\mathrm{pos}_Z(x_i), \mathrm{pos}_Z(x_{i+1}))$. The time complexity is $O(t\log N)$ for the binary searches plus $O(N/t)$ for the parallel merge.

(Figure: merge)

The binary search mentioned above is not quite a standard binary search: it returns the rightmost position at which the element can be placed in the array (an upper-bound search), so that equal keys end up in a consistent partition.
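
A sketch of this splitter idea with OpenMP (helper and variable names are assumptions; std::upper_bound stands in for the report's rightmost binary search): pick $t$ evenly spaced split points in X, locate each splitter's rightmost position in Y, and let every thread merge one independent slice of X and Y directly into its own region of Z.

```cpp
#include <omp.h>
#include <algorithm>
#include <vector>

// Merge the sorted arrays X and Y into Z with t threads. Thread i merges
// X[xs[i], xs[i+1]) and Y[ys[i], ys[i+1]) into Z starting at xs[i] + ys[i];
// the slices are disjoint, so the threads never touch the same output cells.
void parallelMerge(const std::vector<int>& X, const std::vector<int>& Y,
                   std::vector<int>& Z, int t) {
    std::vector<std::size_t> xs(t + 1), ys(t + 1);
    for (int i = 0; i <= t; ++i) {
        xs[i] = X.size() * i / t;                       // equal slices of X
        if (i == 0)              ys[i] = 0;
        else if (i == t)         ys[i] = Y.size();
        else if (xs[i] == 0)     ys[i] = 0;
        else                     // rightmost (upper-bound) position of the splitter in Y
            ys[i] = std::upper_bound(Y.begin(), Y.end(), X[xs[i] - 1]) - Y.begin();
    }

    Z.resize(X.size() + Y.size());
#pragma omp parallel for num_threads(t)
    for (int i = 0; i < t; ++i)                          // O(N/t) work per thread
        std::merge(X.begin() + xs[i], X.begin() + xs[i + 1],
                   Y.begin() + ys[i], Y.begin() + ys[i + 1],
                   Z.begin() + xs[i] + ys[i]);
}
```

Computing the split points costs $O(t\log N)$ and each thread's merge is $O(N/t)$, matching the complexity quoted above.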

Although the performance degrades when the thread count exceeds 16, I tried my best to control the thread distribution. The thread allocation strategy is as follows: suppose there are T threads.

Performance

My Time Cost

| Workers | QuickSort (MPI) | BucketSort (MPI) | Odd-Even Sort (MPI) | MergeSort (Threads) | QuickSort (Threads) |
| ------- | --------------- | ---------------- | ------------------- | ------------------- | ------------------- |
| 1       | 14391           | 11500            | 37240               | 12070               | 12642               |
| 2       | 12690           | 7877             | 31547               | 6741                | 18162               |
| 4       | 9145            | 4476             | 19561               | 4837                | 18079               |
| 8       | 8177            | 2559             | 11544               | 4180                | 18069               |
| 16      | 7886            | 1691             | 6476                | 9176                | 18162               |
| 32      | 8397            | 1276             | 3598                | 17972               | 18298               |

Baseline

| Workers | QuickSort (MPI) | BucketSort (MPI) | Odd-Even Sort (MPI) | MergeSort (Threads) | QuickSort (Threads) |
| ------- | --------------- | ---------------- | ------------------- | ------------------- | ------------------- |
| 1       | 13708           | 11669            | 37077               | 25510               | 13006               |
| 2       | 12451           | 8901             | 29069               | 20431               | 10879               |
| 4       | 9093            | 5107             | 18603               | 10539               | 7804                |
| 8       | 7979            | 3457             | 11561               | 5537                | 4566                |
| 16      | 8014            | 2583             | 7418                | 3447                | 3808                |
| 32      | 8849            | 2469             | 5919                | 2063                | 3597                |


Profiling Results & Analysis with perf

 

Recommendations:

How to Run My Code

You can compile and run the code with one line.

Or you can compile the code manually and run the sbatch script; the output will be in log/Project3.txt and log/bonus.txt.

Or you can run the tasks separately after compiling. The output of perf will be in log/task*.txt.

Or you can execute the code with your own (DIY) parameters.

perf analysis

My Python script analysis.py will extract the important data and compute the statistics for you, using regular expressions.

Appendix

Tables of the data collected with the perf command.

Quick Sort with MPI

| Cores               | 1           | 2           | 4           | 8           | 16          | 32          |
| ------------------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| cache miss          | 158025980.0 | 80729914.0  | 44110473.0  | 24038915.0  | 15048559.5  | 10519799.4  |
| cache reference     | 230906748.0 | 118512881.0 | 64081817.5  | 33488426.8  | 19911586.6  | 14133144.1  |
| cache miss rate (%) | 68.437      | 66.221      | 65.43725    | 68.983625   | 74.95175    | 74.2        |
| instruction num     | 542293108.0 | 670668560.0 | 593783905.0 | 602958409.5 | 437816684.6 | 667951908.6 |

Bucket Sort with MPI

| Cores               | 1           | 2           | 4           | 8           | 16          | 32          |
| ------------------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| cache miss          | 334679367.0 | 181321147.0 | 95741341.7  | 62101736.12 | 46433108.5  | 38768590.8  |
| cache reference     | 477131827.0 | 259117596.0 | 141852335.7 | 84185615.5  | 56938498.0  | 43872005.5  |
| cache miss rate (%) | 70.144      | 69.769      | 66.81775    | 73.7505     | 82.3455625  | 89.3684375  |
| instruction num     | 609065074.0 | 547796687.0 | 449616108.0 | 588080301.3 | 509110032.1 | 456636856.3 |

Odd-Even Sort with MPI

| Cores               | 1           | 2           | 4            | 8           | 16          | 32            |
| ------------------- | ----------- | ----------- | ------------ | ----------- | ----------- | ------------- |
| cache miss          | 189906.0    | 411674.0    | 2327726.5    | 3162682.5   | 3545464.625 | 3524768.625   |
| cache reference     | 857956879.0 | 345477901.0 | 6620118.25   | 3424899.625 | 3998650.375 | 4695187.75    |
| cache miss rate (%) | 0.022       | 0.121       | 55.1         | 91.1        | 88.4        | 67.6          |
| instruction num     | 222015385.0 | 554048876.5 | 485595380.25 | 477382203.2 | 629073048.3 | 473586805.375 |