
Matrix Multiplication Parallelism

Author: Zhen Tong 120090694@link.cuhk.edu.cn

Before Everything

This project implements several programs for matrix multiplication: a naive baseline, a memory-locality version (loop reordering and tiling), a SIMD (AVX2) version, an OpenMP version, an MPI version, and a CUDA version.

You can find the performance numbers in the /profiling directory, and check the output matrices in the /results directory.

The matrix multiplication in math form is:

$$c_{ij} = \sum_{k=0}^{K-1} a_{ik}\, b_{kj}$$

Here $a_{ik}$ and $b_{kj}$ are elements of matrices A and B, and $c_{ij}$ is the corresponding element of the result matrix. A very naive way to do this is with three nested for loops.
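A minimal sketch of the naive version, assuming square row-major matrices of `int` stored in contiguous arrays (names and types are illustrative, not the project's exact code):

```cpp
#include <cstddef>

// Naive triple loop: result[i][j] = sum_k a[i][k] * b[k][j].
// Matrices are square (n x n) and stored row-major in contiguous arrays.
void matmul_naive(const int* a, const int* b, int* result, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            int sum = 0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];  // walks column j of b with stride n
            }
            result[i * n + j] = sum;
        }
    }
}
```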

However, this implementation becomes troublesome when the matrices are large.

Task1: Memory Locality

The naive implementation causes problems because it does not exploit memory locality, so many accesses miss the cache. More specifically, the elements of matrix B being read cannot stay in the cache for long. For example, to compute the first iteration of the outer loop, matrix B is traversed from red to purple, crossing the rainbow🌈 from left to right. By the time the program wants to read the element $b_{0,0}$ again, that deep red element is no longer in the cache, causing a cache miss. What's worse, if the matrix is too large for the cache to hold, the whole matrix has to be loaded into the cache again. Another problem is that the inner loop walks matrix B vertically, while the matrix is laid out in memory row by row; visiting the data in column order destroys spatial locality.

There are two ways to improve memory locality: reordering the for loops, or using tiled matrix multiplication.

Reordering the loop

The simplest approach is to reorder the loops so that the j loop is innermost (i-k-j order), as sketched below.
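The following is a reconstruction of that reordering, reusing the scalar and pointer names discussed in the list below (`mat1_ik`, `mat2_ptr_k`, `mem_result_ptr_i`); it assumes row-major `int` matrices with `result` zero-initialized:

```cpp
#include <cstddef>

// i-k-j order: a[i][k] is hoisted into a scalar, and the innermost loop walks
// row k of matrix2 and row i of result with unit stride.
void matmul_reordered(const int* matrix1, const int* matrix2, int* result, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        int* mem_result_ptr_i = result + i * n;          // row i of the result
        for (std::size_t k = 0; k < n; ++k) {
            const int mat1_ik = matrix1[i * n + k];      // reused across the whole j loop
            const int* mat2_ptr_k = matrix2 + k * n;     // row k of matrix2
            for (std::size_t j = 0; j < n; ++j) {
                mem_result_ptr_i[j] += mat1_ik * mat2_ptr_k[j];
            }
        }
    }
}
```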

There are three things to note about this implementation:

  1. Cache Efficiency: In this code, accessing elements of matrix1, matrix2, and result in a contiguous manner gives us the benefit of whole cache lines. This sequential access pattern takes better advantage of the cache hierarchy. Modern CPUs have caches at different levels (L1, L2, L3), and they work most efficiently when memory is accessed in a predictable, linear fashion. By iterating through the data in this way, we are more likely to reuse data already stored in the cache.

  2. Data Reuse: The code minimizes the number of cache misses by reusing the values stored in mat1_ik, mat2_ptr_k, and mem_result_ptr_i within inner loops. This means that the data is loaded into the cache once and reused multiple times, reducing the need to fetch data from main memory.

  3. Spatial Locality: The innermost loop accesses elements of mat1_ik and mat2_ptr_k sequentially. Spatial locality is the principle that data items that are stored close together in memory tend to be accessed together. In this case, the elements are accessed in the order they are stored in memory, which improves spatial locality.

Using tiled matrix multiplication

Each element of the result matrix is the inner product of a whole row of matrix A with a whole column of matrix B, and that row or column may be too large for the cache to hold until its next use. Dividing the large matrices into small tiles therefore helps, as sketched below.
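A minimal sketch of tiling under the same layout assumptions; `TILE` is an illustrative block size, not necessarily the value used in the project:

```cpp
#include <algorithm>
#include <cstddef>

constexpr std::size_t TILE = 64;  // block edge; tuned so a few blocks fit in cache

// Blocked (tiled) multiplication: each block-level step multiplies a TILE x TILE
// piece of matrix1 with a TILE x TILE piece of matrix2 while both stay cached.
void matmul_tiled(const int* matrix1, const int* matrix2, int* result, std::size_t n) {
    for (std::size_t ii = 0; ii < n; ii += TILE) {
        for (std::size_t kk = 0; kk < n; kk += TILE) {
            for (std::size_t jj = 0; jj < n; jj += TILE) {
                const std::size_t i_end = std::min(ii + TILE, n);
                const std::size_t k_end = std::min(kk + TILE, n);
                const std::size_t j_end = std::min(jj + TILE, n);
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const int a_ik = matrix1[i * n + k];
                        for (std::size_t j = jj; j < j_end; ++j) {
                            result[i * n + j] += a_ik * matrix2[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```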

| Methods | Matrices 1024*1024 | Matrices 2048*2048 |
| --- | --- | --- |
| Naive | 8098 ms | 84232 ms |
| Reordering the loop | 776 ms | 6451 ms |
| Speedup | 1043.6% | 1305.7% |
Profiling

In the 2048x2048 matrix case:

  1. Cache Performance: The memory-locality implementation has a lower cache miss rate (23.919%) than the naive algorithm (33.424%). Fewer cache misses generally lead to better performance.

  2. Page Faults: The locality-aware algorithm has fewer page faults (30,690) compared to the naive algorithm (54,517). Fewer page faults indicate better memory access patterns.

  3. IPC: The locality-aware algorithm achieves a higher Instructions per Cycle (IPC) of 2.79, while the naive algorithm has a lower IPC of 1.10. A higher IPC suggests better instruction-level parallelism and more efficient execution.

Task2: Data-Level Parallelism

Using AVX2 we can get further parallelism through vectorization. The SIMD version is built on top of the memory-locality approach. Each element $a_{i,k}$ of row i of matrix A is broadcast and multiplied with a vector of 8 consecutive elements of row k of matrix B, $[b_{k,j}, b_{k,j+1}, \dots, b_{k,j+7}]$. To keep memory locality we sweep the row of matrix B contiguously; then we multiply the next element of row i in A with the next row of matrix B, and so on. Finally, adding these scaled vectors up along the K dimension gives one row of the result. One more tip: unrolling the short inner loop gives a further speedup.
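A minimal sketch of this scheme, assuming float elements (8 per 256-bit AVX2 register), n divisible by 8, and a zero-initialized result; the function and variable names are illustrative:

```cpp
#include <immintrin.h>
#include <cstddef>

// Broadcast a[i][k] and multiply it against 8 consecutive floats of row k of
// matrix2, accumulating into 8 consecutive floats of row i of result.
void matmul_avx2(const float* matrix1, const float* matrix2, float* result, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        float* c_row = result + i * n;
        for (std::size_t k = 0; k < n; ++k) {
            const __m256 a_ik = _mm256_set1_ps(matrix1[i * n + k]);  // broadcast the scalar
            const float* b_row = matrix2 + k * n;
            for (std::size_t j = 0; j + 8 <= n; j += 8) {            // assumes n % 8 == 0
                __m256 b = _mm256_loadu_ps(b_row + j);
                __m256 c = _mm256_loadu_ps(c_row + j);
                _mm256_storeu_ps(c_row + j, _mm256_add_ps(c, _mm256_mul_ps(a_ik, b)));
            }
        }
    }
}
```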

| Methods | Matrices 1024*1024 | Matrices 2048*2048 |
| --- | --- | --- |
| Reordering the loop | 776 ms | 6451 ms |
| SIMD (AVX2) | 233 ms | 1839 ms |
| Speedup | 333.05% | 350.79% |
Profiling

In the 2048x2048 matrix case:

CPU Cycles and Instructions: The SIMD implementation had fewer CPU cycles (7,786,145,349) and fewer instructions (21,257,516,541) compared to the mem_local implementation (20,733,443,452 cycles and 57,790,022,221 instructions). This suggests that the SIMD implementation was more efficient in terms of CPU resource usage.

Task3: Thread-Level Parallelism

In my experiments, the tiled method did not combine well with OpenMP thread-level parallelism, so I use OpenMP on top of loop reordering and vectorization only. Because the loop bounds are fixed and every iteration does the same amount of work, I use a static schedule to speed up the `omp parallel for`, as sketched below.
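A minimal sketch of the OpenMP layer on top of the reordered loop, again assuming row-major matrices and a zero-initialized result; the pragma and the schedule are the only OpenMP-specific parts:

```cpp
#include <omp.h>

// Parallelize the outer i loop: each thread owns whole rows of the result, so
// there are no write conflicts; schedule(static) splits the fixed range evenly.
void matmul_omp(const int* matrix1, const int* matrix2, int* result, long long n) {
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < n; ++i) {
        for (long long k = 0; k < n; ++k) {
            const int a_ik = matrix1[i * n + k];
            const int* b_row = matrix2 + k * n;
            int* c_row = result + i * n;
            for (long long j = 0; j < n; ++j) {
                c_row[j] += a_ik * b_row[j];
            }
        }
    }
}
```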

| Methods | Matrices 1024*1024 | Matrices 2048*2048 |
| --- | --- | --- |
| SIMD (AVX2) | 233 ms | 1839 ms |
| openmp | 44 ms | 204 ms |
| Speedup | 529.54% | 901.47% |

Efficiency

| #threads | 1024x1024 time (ms) | Efficiency | 2048x2048 time (ms) | Efficiency |
| --- | --- | --- | --- | --- |
| 1 | 202 | 1 | 1869 | 1 |
| 2 | 200 | 0.505 | 1873 | 0.498932 |
| 4 | 115 | 0.43913 | 995 | 0.469598 |
| 8 | 69 | 0.365942 | 515 | 0.453641 |
| 16 | 43 | 0.293605 | 321 | 0.363902 |
| 32 | 41 | 0.153963 | 206 | 0.283525 |
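The Efficiency column matches the standard definition of parallel efficiency, speedup divided by thread count; for example, with 32 threads on the 1024x1024 case:

$$E(p) = \frac{T_1}{p\,T_p}, \qquad E(32) = \frac{202\ \text{ms}}{32 \times 41\ \text{ms}} \approx 0.154$$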

(Figures: openmp scaling, time and efficiency versus number of threads.)

Profiling

In 2048x2048 matrix case:

  1. Execution Time: The OpenMP approach is significantly faster, taking only 206 milliseconds, while SIMD AVX2 vectorization took 1833 milliseconds. OpenMP outperforms SIMD in terms of execution time.

  2. Cache Performance: The OpenMP approach has a lower cache miss rate (12.841%) compared to SIMD (27.148%). A lower cache miss rate is generally desirable for better performance, so OpenMP has an advantage here.

  3. Instructions per Cycle: The OpenMP approach shows a lower IPC (1.49) than SIMD (2.73). A lower IPC does not imply worse overall performance here: the work is divided across 32 threads, so the wall-clock time is far lower even though each core retires fewer instructions per cycle on average.

  4. User and System Time: OpenMP has a higher user time (6.228936 seconds) than SIMD (2.9611 seconds). User time is summed across all threads, so with 32 threads the total CPU time grows even though the wall-clock time shrinks. The system time of both approaches is relatively low.

Task4: Process-Level Parallelism

The MPI implementation first divides the rows of the result matrix into `process_num` blocks, one block per process. Each process computes its block, and the blocks are collected with MPI communication (a minimal sketch follows the table). The performance is:

| Methods | Matrices 1024*1024 | Matrices 2048*2048 |
| --- | --- | --- |
| P = 1, T = 32 | 49 ms | 220 ms |
| P = 2, T = 16 | 45 ms | 274 ms |
| P = 4, T = 8 | 33 ms | 213 ms |
| P = 8, T = 4 | 35 ms | 200 ms |
| P = 16, T = 2 | 36 ms | 259 ms |
| P = 32, T = 1 | 64 ms | 463 ms |
| Best speedup (vs. openmp) | 133.33% | 102% |
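The sketch below shows the row partitioning under simplifying assumptions: every rank already holds both input matrices (e.g. broadcast beforehand), n is divisible by the number of processes, and the per-process OpenMP threading is omitted. Names are illustrative, not the project's exact code.

```cpp
#include <mpi.h>
#include <cstddef>
#include <vector>

// Each rank computes rows [rank*rows_per_proc, (rank+1)*rows_per_proc) of the
// result, then rank 0 gathers the row blocks in order. Assumes n % num_procs == 0.
void matmul_mpi(const int* matrix1, const int* matrix2, int* result, int n) {
    int rank = 0, num_procs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

    const int rows_per_proc = n / num_procs;
    const int row_begin = rank * rows_per_proc;

    // Local block of the result (rows_per_proc x n), zero-initialized.
    std::vector<int> local(static_cast<std::size_t>(rows_per_proc) * n, 0);
    for (int i = 0; i < rows_per_proc; ++i) {
        for (int k = 0; k < n; ++k) {
            const int a_ik = matrix1[(row_begin + i) * n + k];
            for (int j = 0; j < n; ++j) {
                local[static_cast<std::size_t>(i) * n + j] += a_ik * matrix2[k * n + j];
            }
        }
    }

    // Rank 0 receives the blocks back-to-back, which is exactly the full result matrix.
    MPI_Gather(local.data(), rows_per_proc * n, MPI_INT,
               result, rows_per_proc * n, MPI_INT, 0, MPI_COMM_WORLD);
}
```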

The figure below shows the time cost of the different process/thread settings.

(Figure: hw2-mpi, time cost for each process/thread configuration.)

Profiling Analysis

Comparing 32-thread openmp with 16-process, 2-thread MPI:

  1. Execution Time: The OpenMP implementation is faster with an execution time of 206 milliseconds compared to 259 milliseconds for the MPI with OpenMP implementation.

  2. Cache Performance: The OpenMP implementation has a lower cache miss rate (12.841%) compared to the MPI with OpenMP implementation (90.471%). Lower cache miss rates are generally better for performance.

  3. Page Faults: The OpenMP implementation has more page faults (17,028) than the MPI with OpenMP implementation (10,182).

  4. Total Cycles and Instructions: The OpenMP implementation has higher total cycles and instructions, indicating more processing work. The MPI with OpenMP implementation has fewer cycles and instructions, which could be due to fewer threads and processes.

  5. User and System Time: The OpenMP implementation has a higher user time, meaning more total CPU time is spent in application code across its threads. The system time is low for both implementations.

Comparing two MPI configurations, Configuration 1 (32 processes, 1 thread each) and Configuration 2 (2 processes, 16 threads each):

  1. Execution Time:

    • Configuration 2 (2 processes, 16 threads) is significantly faster, taking only 47 milliseconds, compared to Configuration 1 (32 processes, 1 thread), which took 463 milliseconds. Configuration 2 is approximately 10 times faster.

  2. Cache Performance:

    • Configuration 1 experienced a high cache miss rate (89.27% of cache references), indicating a high level of cache inefficiency.

    • Configuration 2 had a much lower cache miss rate (1.56% of cache references), suggesting more efficient cache utilization.

  3. Page Faults:

    • Configuration 2 had fewer page faults (5,045) compared to Configuration 1 (14,126), indicating a better memory access pattern in Configuration 2.

  4. CPU Utilization:

    • Configuration 2 used less CPU time in user mode (1.038 seconds) than Configuration 1 (3.059 seconds).

    • System time was lower in Configuration 2 (0.048 seconds) compared to Configuration 1 (0.126 seconds).

If we fix the thread number at 1 and vary the process number:

| #process | 1024x1024 time (ms) | Efficiency | 2048x2048 time (ms) | Efficiency |
| --- | --- | --- | --- | --- |
| 1 | 257 | 1 | 2583 | 1 |
| 2 | 234 | 0.455253 | 2378 | 0.460317 |
| 4 | 132 | 0.128405 | 1220 | 0.11808 |
| 8 | 89 | 0.043288 | 747 | 0.03615 |
| 16 | 61 | 0.014835 | 427 | 0.010332 |
| 32 | 64 | 0.007782 | 463 | 0.005602 |

The figure below plots the table above.

(Figures: mpi scaling, time and efficiency versus number of processes.)

Finally, this figure summarizes everything we did on the CPU.

Extra Task: GPU Parallelism

For GPU acceleration, I implemented the multiplication in CUDA, choosing a 16x16 thread block.

Then the kernel is launched over a grid of such blocks covering the whole result matrix.

In the kernel, each thread computes one element of the result matrix. A tiled implementation can do better, because the GPU cache is small and only the small range of data within a tile can be reused. A minimal sketch of the simple one-element-per-thread kernel and its launch is shown below.
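The sketch assumes square float matrices already resident in device memory; names are illustrative, not the project's exact code.

```cpp
#include <cuda_runtime.h>

// Each thread computes a single element of the n x n result matrix.
__global__ void matmul_kernel(const float* a, const float* b, float* c, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k) {
            sum += a[row * n + k] * b[k * n + col];
        }
        c[row * n + col] = sum;
    }
}

// Launch with 16x16 thread blocks and enough blocks to cover the whole matrix.
void launch_matmul(const float* d_a, const float* d_b, float* d_c, int n) {
    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    matmul_kernel<<<grid, block>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();  // wait for completion before timing or copying back
}
```

A tiled variant would stage sub-blocks of A and B in `__shared__` memory so that each loaded element is reused many times within the block.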

The CUDA timings are:

| Methods | Matrices 1024*1024 | Matrices 2048*2048 |
| --- | --- | --- |
| cuda | 31.6947 ms | 205.236 ms |

How to Run

Run in bash

Test robustness: I generated some test cases for matrix multiplication to check the program's robustness.

Then, check the result matrices in the results directory.

Run each task

Important: If you are testing data that is not covered by results/answer, please switch the debug argument to 0 in the bash script or command. The debug flag is used by the developer (me) to check answer correctness.

debug=0

Appendix

My Profiling


Instruction on Profiling with perf and nsys