Author Zhen Tong 120090694@link.cuhk.edu.cn
Projects Home: https://58191554.github.io/project_page.html
In the quick sort with multi-process parallel code, the sorting is done in chunks: each process sorts its own chunk, and the output of the sort function is a set of locally sorted chunks that the master then merges.
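As a reference, here is a minimal sketch of that structure (illustrative, not the exact project code; the array size and variable names are assumptions): each rank sorts its scattered chunk locally, and the master gathers the sorted chunks back for merging.

```cpp
#include <mpi.h>
#include <algorithm>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;                 // assumed total size, divisible by size
    std::vector<int> data;
    if (rank == 0) data.resize(n);         // master holds the full array
    std::vector<int> chunk(n / size);

    // Scatter equal chunks, sort each chunk locally, gather them back sorted.
    MPI_Scatter(data.data(), n / size, MPI_INT,
                chunk.data(), n / size, MPI_INT, 0, MPI_COMM_WORLD);
    std::sort(chunk.begin(), chunk.end());
    MPI_Gather(chunk.data(), n / size, MPI_INT,
               data.data(), n / size, MPI_INT, 0, MPI_COMM_WORLD);
    // ...the master then k-way merges the sorted chunks...
    MPI_Finalize();
    return 0;
}
```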
The idea of Bucket Sort is to first divide the array into bucket intervals, sort each bucket, and merge the data with MPI_Send(). Each send carries a structure of [#data, data]: we need to tell the master how many elements are in the bucket so that it can move the data into the result vector.
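A minimal sketch of this [#data, data] exchange (the helper names send_bucket and recv_bucket are hypothetical, not the project's actual functions): the count travels first so the master can size its receive buffer.

```cpp
#include <mpi.h>
#include <vector>

// Worker side: send the element count first, then the bucket payload.
void send_bucket(const std::vector<int>& bucket, int dest) {
    int count = static_cast<int>(bucket.size());
    MPI_Send(&count, 1, MPI_INT, dest, /*tag=*/0, MPI_COMM_WORLD);
    MPI_Send(bucket.data(), count, MPI_INT, dest, /*tag=*/1, MPI_COMM_WORLD);
}

// Master side: read the count, size the buffer, receive exactly that many
// elements, and append them to the merged result vector.
void recv_bucket(std::vector<int>& result, int src) {
    int count = 0;
    MPI_Recv(&count, 1, MPI_INT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    std::vector<int> buf(count);
    MPI_Recv(buf.data(), count, MPI_INT, src, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    result.insert(result.end(), buf.begin(), buf.end());
}
```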
Indeed, the number of buckets in a bucket sort algorithm can significantly impact its performance. The choice of the number of buckets affects the balance between the size of each bucket and the number of comparisons and movements required during the sorting process.
Bucket Size:
Having too few buckets leads to large buckets, which makes the sort within each bucket inefficient: insertion sort, which is commonly used inside buckets, only performs well while the buckets stay small (see the sketch below). In the experiment, for array size as
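For reference, a small sketch of the per-bucket insertion sort this trade-off refers to (illustrative):

```cpp
#include <cstddef>
#include <vector>

// Insertion sort within one bucket: cost grows quadratically with the
// bucket size, so it is only efficient while buckets stay small.
void insertion_sort(std::vector<int>& bucket) {
    for (std::size_t i = 1; i < bucket.size(); ++i) {
        int key = bucket[i];
        std::size_t j = i;
        // Shift larger elements right until key's position is found.
        while (j > 0 && bucket[j - 1] > key) {
            bucket[j] = bucket[j - 1];
            --j;
        }
        bucket[j] = key;
    }
}
```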
Numerical Analysis
The Odd-Even Sort is not a good algorithm sequentially, because it takes O(n^2) time.
There is a trick in this question. We know that MPI communication can be very expensive, and the final sorted = true/false signal is too small a message to be worth sending every iteration. Therefore, we first assume the array is not yet sorted, and check the global sorted flag only once every several iterations, checking less often as the number of processes grows (see the sketch below).
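A minimal sketch of this reduced-frequency convergence check, assuming the local flags are combined with MPI_Allreduce (the project code may combine them differently):

```cpp
#include <mpi.h>

// All ranks are done only if every local sorted flag is true.
bool globally_sorted(bool locally_sorted) {
    int local = locally_sorted ? 1 : 0, global = 0;
    MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_LAND, MPI_COMM_WORLD);
    return global != 0;
}

// Inside the odd-even loop, with nprocs = number of ranks:
//   if (phase % check_interval == 0 && globally_sorted(local_flag)) break;
// where check_interval grows with nprocs, so the per-iteration
// communication cost is amortized as more slaves join.
```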
The merge sort is divided into two steps: the independent neighborhood sort, and the recursive merge. The sorting step is simple to parallelize: we launch the maximum number of threads and let each thread sort one sub-array. The most difficult part is parallelizing the merge. To merge two sorted sub-arrays with several threads independently, we first find the position of each boundary element of one sub-array inside the other by binary search.
The binary search mentioned above is not a standard membership search: it returns the rightmost insertion position, i.e. the first index whose element is strictly greater than the value.
```cpp
#include <vector>

using vi = std::vector<int>;

// Returns the rightmost insertion position for val: the first index whose
// element is strictly greater than val (an upper-bound search).
int binarySearch(const vi &vec, int val) {
    int low = 0;
    int high = static_cast<int>(vec.size());
    while (low < high) {
        int mid = low + (high - low) / 2;
        if (vec[mid] <= val) {
            low = mid + 1;
        } else {
            high = mid;
        }
    }
    return low; // rightmost position
}
```
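For example, here is a sketch of how that rightmost position splits a two-way merge across threads (an assumed helper reusing the vi alias and binarySearch above, not the exact project code): each thread owns an equal slice of a, the binary search finds the matching split point in b, and every thread writes a disjoint range of the output.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Merge sorted a and b into out (pre-sized to a.size() + b.size())
// using k threads; writes from different threads never overlap.
void parallel_merge(const vi &a, const vi &b, vi &out, int k) {
    std::vector<int> ai(k + 1), bi(k + 1);
    for (int t = 0; t <= k; ++t) {
        ai[t] = static_cast<int>((long long)a.size() * t / k);
        bi[t] = (t == 0) ? 0
              : (t == k) ? static_cast<int>(b.size())
                         : binarySearch(b, a[ai[t]]);   // split point in b
    }
    std::vector<std::thread> pool;
    for (int t = 0; t < k; ++t)
        pool.emplace_back([&, t] {
            std::merge(a.begin() + ai[t], a.begin() + ai[t + 1],
                       b.begin() + bi[t], b.begin() + bi[t + 1],
                       out.begin() + ai[t] + bi[t]);
        });
    for (auto &th : pool) th.join();
}
```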
Although the performance degrades when the thread number exceeds 16, I try my best to control the thread distribution. The strategy is as follows: suppose a fixed number of threads is available in total. When there are several merge jobs at the same level of the recursion, the threads are divided among them, and in each merging job the assigned threads split the merge range with the binary-search partition above. Therefore, the whole thread pool stays busy at every level of the merge.
My Time Cost
Workers | QuickSort(MPI) | BucketSort(MPI) | Odd-Even-Sort(MPI) | MergeSort(Threads) | QuickSort(Threads) |
---|---|---|---|---|---|
1 | 14391 | 11500 | 37240 | 12070 | 12642 |
2 | 12690 | 7877 | 31547 | 6741 | 18162 |
4 | 9145 | 4476 | 19561 | 4837 | 18079 |
8 | 8177 | 2559 | 11544 | 4180 | 18069 |
16 | 7886 | 1691 | 6476 | 9176 | 18162 |
32 | 8397 | 1276 | 3598 | 17972 | 18298 |
Baseline
Workers | QuickSort(MPI) | BucketSort(MPI) | Odd-Even-Sort(MPI) | MergeSort(Threads) | QuickSort(Threads) |
---|---|---|---|---|---|
1 | 13708 | 11669 | 37077 | 25510 | 13006 |
2 | 12451 | 8901 | 29069 | 20431 | 10879 |
4 | 9093 | 5107 | 18603 | 10539 | 7804 |
8 | 7979 | 3457 | 11561 | 5537 | 4566 |
16 | 8014 | 2583 | 7418 | 3447 | 3808 |
32 | 8849 | 2469 | 5919 | 2063 | 3597 |
In the speedup results, the parallel odd-even algorithm is the most linear case. This is because the MPI communication is evenly distributed across the iterations, and we use the trick described above: the more slaves we need to communicate with, the less frequently the loop communicates to check the stop condition.
The speedup of Quick Sort and Bucket Sort is not as good as Odd-Even Sort's, because the merge step of quick sort is sequential, and the heap used in the k-way merge grows as the number of sorted chunks grows.
Bucket Sort's efficiency is better than that of the other two algorithms, because its sequential part does not grow with the number of workers.
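To make that merge bottleneck concrete, here is a minimal sketch of a heap-based k-way merge of the kind described above (illustrative, not the exact project code): the heap holds one cursor per sorted chunk, so it grows with the worker count while the merge loop itself stays serial.

```cpp
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Merge k sorted chunks with a min-heap of (value, chunk index) pairs.
std::vector<int> kway_merge(const std::vector<std::vector<int>>& chunks) {
    using Item = std::pair<int, int>;
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    std::vector<std::size_t> pos(chunks.size(), 0);
    std::vector<int> out;
    for (int c = 0; c < (int)chunks.size(); ++c)
        if (!chunks[c].empty()) heap.push({chunks[c][0], c});
    while (!heap.empty()) {
        auto [v, c] = heap.top();       // smallest remaining element
        heap.pop();
        out.push_back(v);
        if (++pos[c] < chunks[c].size())
            heap.push({chunks[c][pos[c]], c});  // advance that chunk's cursor
    }
    return out;
}
```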
Profiling Results & Analysis with perf
Quick Sort
Cache Misses and References:
The number of cache misses decreases as the number of cores increases, since each process works on a smaller chunk of the data.
The percentage column is the cache-miss ratio reported by perf: cache-misses as a share of cache-references, not a hit rate.
Cache-Miss Percentage:
The miss ratio stays around 65-69% up to 8 cores and rises to about 75% at 16 and 32 cores: the absolute number of misses keeps falling, but the remaining references hit the cache less often once the data is split across many processes.
Bucket Sort
Cache Miss:
The absolute number of cache misses decreases steadily with more cores, because each process handles less data and incurs fewer cold misses. The cache-miss ratio, however, first dips and then climbs to about 89% at 32 cores: dividing the data across processes destroys data locality, so a larger share of the remaining references miss.
Odd-Even Sort
Cache Misses and Cache References:
The number of cache references drops sharply once the work is split across 4 or more cores.
The percentage is again the cache-miss ratio: cache-misses as a share of cache-references.
The ratio is near zero for 1 and 2 cores, likely because the sequential sweep over the array prefetches well, but it jumps to 55-91% from 4 cores onward, as partitioning the data and exchanging boundary elements between neighbors disturbs locality.
Number of Instructions:
The number of instructions executed generally increases with the number of cores, peaking at 16 cores. This increase may be attributed to parallelization overhead: the communication and convergence-check code executed across multiple cores.
Cache Optimization:
Analyze cache miss patterns to identify opportunities for optimization, such as improving cache locality or restructuring data.
Parallel Efficiency:
Evaluate parallel efficiency and scalability by comparing cache references and misses across different core counts.
Load Balancing:
Consider load balancing to ensure an even distribution of work among cores, reducing the potential for cache contention.
You can compile and run the code with one line:
```bash
chmod +x run.sh
./run.sh
```
Or you can compile the code manually and submit the sbatch jobs; the output will be in log/Project3.txt and log/bonus.txt:
```bash
# Compile the src code in build
mkdir build
cd build
cmake ..
make -j4
# Run
cd ..
sbatch src/sbach.sh
sbatch src/sbach_bonus.sh
```
Or you can run the tasks separately after compiling. The perf output will be in log/task*.txt:
```bash
# change permission to +x first
sbatch src/scripts/task1.sh
sbatch src/scripts/task2.sh
sbatch src/scripts/task3.sh
```
Or you can execute the code with your own (DIY) parameters:
```bash
perf_command="perf stat -e cycles,instructions,cache-references,cache-misses"

srun -n $num_cores --cpus-per-task 1 --mpi=pmi2 $perf_command ${CURRENT_DIR}/../build/src/quicksort/quicksort_mpi ${big_size}

srun -n $num_cores --cpus-per-task 1 --mpi=pmi2 $perf_command ${CURRENT_DIR}/../build/src/bucketsort/bucketsort_mpi ${big_size} ${big_bucket}

srun -n $num_cores --cpus-per-task 1 --mpi=pmi2 $perf_command ${CURRENT_DIR}/../build/src/odd-even-sort/odd-even-sort_mpi ${oddeven_size}
```
For perf analysis, use my Python script analysis.py; it extracts the important data and computes the statistics for you using regular expressions:
```bash
python3 analysis.py
```
Tables of the data collected by the perf command.
Quick Sort with MPI
Cores | 1 | 2 | 4 | 8 | 16 | 32 |
---|---|---|---|---|---|---|
cache miss | 158025980.0 | 80729914.0 | 44110473.0 | 24038915.0 | 15048559.5 | 10519799.4 |
cache reference | 230906748.0 | 118512881.0 | 64081817.5 | 33488426.8 | 19911586.6 | 14133144.1 |
cache miss % (of references) | 68.437 | 66.221 | 65.43725 | 68.983625 | 74.95175 | 74.2 |
instruction num | 542293108.0 | 670668560.0 | 593783905.0 | 602958409.5 | 437816684.6 | 667951908.6 |
Bucket Sort with MPI
Cores | 1 | 2 | 4 | 8 | 16 | 32 |
---|---|---|---|---|---|---|
cache miss | 334679367.0 | 181321147.0 | 95741341.7 | 62101736.12 | 46433108.5 | 38768590.8 |
cache reference | 477131827.0 | 259117596.0 | 141852335.7 | 84185615.5 | 56938498.0 | 43872005.5 |
cache miss % (of references) | 70.144 | 69.769 | 66.81775 | 73.7505 | 82.3455625 | 89.3684375 |
instruction num | 609065074.0 | 547796687.0 | 449616108.0 | 588080301.3 | 509110032.1 | 456636856.3 |
Odd-Even Sort with MPI
Cores | 1 | 2 | 4 | 8 | 16 | 32 |
---|---|---|---|---|---|---|
cache miss | 189906.0 | 411674.0 | 2327726.5 | 3162682.5 | 3545464.625 | 3524768.625 |
cache reference | 857956879.0 | 345477901.0 | 6620118.25 | 3424899.625 | 3998650.375 | 4695187.75 |
cache miss % (of references) | 0.022 | 0.121 | 55.1 | 91.1 | 88.4 | 67.6 |
instruction num | 222015385.0 | 554048876.5 | 485595380.25 | 477382203.2 | 629073048.3 | 473586805.375 |