Project 4: Parallel Programming with Machine Learning

Author: Zhen Tong 120090694@link.cuhk.edu.cn

Before Everything

This project is a C++ implementation of neural networks, specifically softmax regression and a multilayer perceptron (MLP), trained with stochastic gradient descent (SGD) on the MNIST dataset. The project is organized into four tasks.

Task 1

Figure: softmax regression forward computation

As shown in the figure above, there are only three steps in the forward computation:

  1. Use the matrix dot product to get the linear transformation of the input.

  2. Apply the softmax σ(·) to get the normalized probability matrix.

  3. Take the index of the maximum probability as the predicted class.

Because we are doing classification, the loss is the cross-entropy loss. Training and the parameter update follow these formulas:

$$
\begin{aligned}
\ell_{\text{softmax}}(z, i) &= -\log \sigma(z)_i \quad \text{(where $i$ is the label index)} \\
&= -\log \frac{\exp(f(x)_i)}{\sum_{j=0}^{k-1} \exp(f(x)_j)} \\
&= -f(x)_i + \log \sum_{j=0}^{k-1} \exp(f(x)_j)
\end{aligned}
$$

$$
\nabla_\Theta \ell_{\text{softmax}}(X\Theta, y) = \frac{1}{m} X^T (Z - I_y), \qquad
Z = \text{normalize}(\exp(X\Theta)), \quad I_y \in \mathbb{R}^{m \times k}
$$

where $m$ is the batch size, $Z$ is the row-wise softmax of the logits, and $I_y$ is the concatenation of the one-hot label vectors. The update is $\Theta \leftarrow \Theta - \alpha \nabla_\Theta \ell_{\text{softmax}}$.
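To make the update rule concrete, here is a minimal C++ sketch of one SGD step that follows these formulas directly; the function name `softmax_sgd_step`, the raw-pointer row-major layout, and the omission of the usual max-subtraction for numerical stability are illustrative assumptions, not the project's actual API.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One SGD step of softmax regression on a batch.
// X: m x n (row-major), y: m labels in [0, k), theta: n x k (row-major).
void softmax_sgd_step(const float* X, const unsigned char* y, float* theta,
                      size_t m, size_t n, size_t k, float lr) {
    std::vector<float> Z(m * k, 0.0f);     // Z = normalize(exp(X * theta))
    std::vector<float> grad(n * k, 0.0f);  // grad = (1/m) X^T (Z - I_y)

    // Logits Z = X * theta (i-t-c ordering keeps the inner accesses contiguous).
    for (size_t i = 0; i < m; ++i)
        for (size_t t = 0; t < n; ++t) {
            const float a = X[i * n + t];
            for (size_t c = 0; c < k; ++c) Z[i * k + c] += a * theta[t * k + c];
        }

    // Row-wise softmax, then subtract the one-hot label: Z <- normalize(exp(Z)) - I_y.
    for (size_t i = 0; i < m; ++i) {
        float sum = 0.0f;
        for (size_t c = 0; c < k; ++c) { Z[i * k + c] = std::exp(Z[i * k + c]); sum += Z[i * k + c]; }
        for (size_t c = 0; c < k; ++c) Z[i * k + c] /= sum;
        Z[i * k + y[i]] -= 1.0f;
    }

    // grad = X^T (Z - I_y), then theta <- theta - (lr/m) * grad.
    for (size_t i = 0; i < m; ++i)
        for (size_t t = 0; t < n; ++t) {
            const float a = X[i * n + t];
            for (size_t c = 0; c < k; ++c) grad[t * k + c] += a * Z[i * k + c];
        }
    for (size_t p = 0; p < n * k; ++p)
        theta[p] -= lr * grad[p] / static_cast<float>(m);
}
```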

In the sequential implementation, the main way to speed up the program is to optimize locality. When the size of the last dimension (the second dimension of the second matrix) is unknown, a general form is used. Two key strategies are precomputing row pointers so offsets are not recomputed inside the inner loop, and reordering the nested loops so that the inner loop walks a matrix row contiguously.
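As an illustration, a row-major C = A·B can be written so that the innermost loop is unit-stride and the row pointers are hoisted; this is a sketch of the idea rather than the project's exact routine.

```cpp
#include <cstddef>

// C (m x k) = A (m x n) * B (n x k), all row-major.
// The inner loop runs over contiguous elements of B and C,
// and the row pointers are precomputed outside the inner loop.
void matrix_dot(const float* A, const float* B, float* C,
                size_t m, size_t n, size_t k) {
    for (size_t i = 0; i < m; ++i) {
        float* Ci = C + i * k;              // precomputed row pointer of C
        for (size_t j = 0; j < k; ++j) Ci[j] = 0.0f;
        for (size_t t = 0; t < n; ++t) {
            const float a = A[i * n + t];   // reused across the whole inner loop
            const float* Bt = B + t * k;    // precomputed row pointer of B
            for (size_t j = 0; j < k; ++j)
                Ci[j] += a * Bt[j];         // unit-stride accesses: good locality
        }
    }
}
```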

For matrices that would need to be transposed before multiplication, we multiply them directly, leveraging locality according to one principle: inner-loop locality is served first. For example, for $A^TB$ we exchange the $m$ and $n$ dimensions.

And for $AB^T$, we exchange the $n$ and $k$ dimensions, as shown in the sketch below.
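The same idea covers both transposed products: $A^TB$ and $AB^T$ can be computed without materializing a transpose by choosing the loop order so the innermost accesses stay unit-stride. The sketch below assumes row-major storage; the function names loosely echo the kernel names that appear in the nsys profile later, but the signatures here are assumptions.

```cpp
#include <cstddef>

// C (n x k) = A^T * B, with A stored as m x n and B as m x k (row-major).
// Looping over i (rows of A and B) outermost lets both matrices be read row by row.
void matrix_dot_trans(const float* A, const float* B, float* C,
                      size_t m, size_t n, size_t k) {
    for (size_t p = 0; p < n * k; ++p) C[p] = 0.0f;
    for (size_t i = 0; i < m; ++i)
        for (size_t t = 0; t < n; ++t) {
            const float a = A[i * n + t];
            for (size_t j = 0; j < k; ++j)
                C[t * k + j] += a * B[i * k + j];   // unit stride in B and C
        }
}

// C (m x n) = A * B^T, with A stored as m x k and B as n x k (row-major).
// Each C[i][t] is a dot product of two contiguous rows, so no transpose is needed.
void matrix_trans_dot(const float* A, const float* B, float* C,
                      size_t m, size_t n, size_t k) {
    for (size_t i = 0; i < m; ++i)
        for (size_t t = 0; t < n; ++t) {
            float dot = 0.0f;
            for (size_t j = 0; j < k; ++j)
                dot += A[i * k + j] * B[t * k + j]; // both rows are contiguous
            C[i * n + t] = dot;
        }
}
```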

If the last dimension is known, such as the 10 categories of the matrix θ, we can exploit that knowledge and unroll the last dimension. Unfolding the loop over the last dimension lets the compiler apply software pipelining and generate more efficient code.
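For instance, with $k = 10$ the ten accumulators fit in registers and the inner loop can be fully unrolled by hand; a sketch under the same row-major assumptions:

```cpp
#include <cstddef>

// C (m x 10) = A (m x n) * B (n x 10), with the k = 10 loop fully unrolled.
// The ten accumulators stay in registers for the whole row of A.
void matrix_dot_k10(const float* A, const float* B, float* C, size_t m, size_t n) {
    for (size_t i = 0; i < m; ++i) {
        float c0 = 0, c1 = 0, c2 = 0, c3 = 0, c4 = 0,
              c5 = 0, c6 = 0, c7 = 0, c8 = 0, c9 = 0;
        const float* Ai = A + i * n;
        for (size_t t = 0; t < n; ++t) {
            const float a = Ai[t];
            const float* Bt = B + t * 10;
            c0 += a * Bt[0]; c1 += a * Bt[1]; c2 += a * Bt[2]; c3 += a * Bt[3];
            c4 += a * Bt[4]; c5 += a * Bt[5]; c6 += a * Bt[6]; c7 += a * Bt[7];
            c8 += a * Bt[8]; c9 += a * Bt[9];
        }
        float* Ci = C + i * 10;
        Ci[0] = c0; Ci[1] = c1; Ci[2] = c2; Ci[3] = c3; Ci[4] = c4;
        Ci[5] = c5; Ci[6] = c6; Ci[7] = c7; Ci[8] = c8; Ci[9] = c9;
    }
}
```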

Task 2

In OpenACC, we transfer data to the GPU with #pragma acc enter data copyin(). This pragma allocates memory on the GPU and initializes it with the specified data. We need to avoid unnecessary copyin() operations, because CPU-GPU communication is expensive.

Following that, we perform the parallel computation by first stating #pragma acc data present() and then adding #pragma acc parallel loop independent. Note that matrix multiplication requires a reduction to aggregate the inner-product sum.
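For example, a matrix-multiplication kernel might look like the sketch below, assuming the operands were already placed on the device with enter data. The function name `matrix_dot_openacc` echoes the kernel seen in the nsys profile, but the signature and clauses here are assumptions.

```cpp
#include <cstddef>

// C (m x k) = A (m x n) * B (n x k); all three arrays are assumed
// to already be resident on the GPU.
void matrix_dot_openacc(const float* A, const float* B, float* C,
                        size_t m, size_t n, size_t k) {
#pragma acc data present(A[0 : m * n], B[0 : n * k], C[0 : m * k])
    {
#pragma acc parallel loop independent collapse(2)
        for (size_t i = 0; i < m; ++i) {
            for (size_t j = 0; j < k; ++j) {
                float sum = 0.0f;
#pragma acc loop reduction(+ : sum)
                for (size_t t = 0; t < n; ++t)
                    sum += A[i * n + t] * B[t * k + j];
                C[i * k + j] = sum;
            }
        }
    }
}
```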

After all the data has been used, we call #pragma acc exit data delete() to free the memory on the GPU.
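Putting the three pieces together, a training driver could manage device memory roughly as sketched here; the function name, array names, and sizes are placeholders, not the project's real interface.

```cpp
#include <cstddef>

// Hypothetical training driver: copy the parameters and data to the GPU once,
// run all epochs with the data kept resident, then free the device copies.
void train_softmax_openacc(float* theta, const float* X_train, const unsigned char* y_train,
                           size_t m, size_t n, size_t k, size_t epochs) {
#pragma acc enter data copyin(theta[0 : n * k], X_train[0 : m * n], y_train[0 : m])
    for (size_t e = 0; e < epochs; ++e) {
        // ... launch the OpenACC kernels; no HtoD/DtoH copies inside the loop ...
    }
#pragma acc update self(theta[0 : n * k])   // bring the trained weights back once
#pragma acc exit data delete(theta[0 : n * k], X_train[0 : m * n], y_train[0 : m])
}
```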

The output of the CPU and GPU (OpenACC) versions on an NVIDIA Turing GPU is shown below. However, Task 2 could still be faster according to the baseline.

Figure: Task 1 and Task 2 output

Task 3

The core concept of a neural network is applying a learnable linear transformation followed by a non-linear transformation to the input data. The non-linearity used in this project is ReLU().

$$
Z = \mathrm{ReLU}(XW_1)W_2, \qquad \hat{Y} = \text{normalize}(\exp(Z))
$$

Because this is still a classification task, the loss is still the cross-entropy loss. The gradients are:

$$
\begin{aligned}
Z_1 &= \mathrm{ReLU}(XW_1), && Z_1 \in \mathbb{R}^{m \times d} \\
G_2 &= \text{normalize}(\exp(Z_1 W_2)) - I_y, && G_2 \in \mathbb{R}^{m \times k} \\
G_1 &= \mathbb{1}(Z_1 > 0) \circ (G_2 W_2^T), && G_1 \in \mathbb{R}^{m \times d} \quad (\circ \text{ is the element-wise product}) \\
\nabla_{W_1} \ell_{\text{softmax}}(\hat{Y}, Y) &= \frac{1}{m} X^T G_1 \\
\nabla_{W_2} \ell_{\text{softmax}}(\hat{Y}, Y) &= \frac{1}{m} Z_1^T G_2
\end{aligned}
$$
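A compact CPU sketch of one SGD step for the two-layer network that follows these formulas is shown below; the function name `nn_sgd_step`, the raw-pointer row-major interface, and the fused loops are illustrative assumptions, not the project's actual code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One SGD step of the two-layer MLP: X (m x n), W1 (n x d), W2 (d x k), y labels.
// Implements Z1 = ReLU(X W1), G2 = normalize(exp(Z1 W2)) - I_y,
// G1 = 1(Z1 > 0) o (G2 W2^T), grads (1/m) X^T G1 and (1/m) Z1^T G2.
void nn_sgd_step(const float* X, const unsigned char* y, float* W1, float* W2,
                 size_t m, size_t n, size_t d, size_t k, float lr) {
    std::vector<float> Z1(m * d, 0.0f), G2(m * k, 0.0f), G1(m * d, 0.0f);

    // Z1 = ReLU(X W1)
    for (size_t i = 0; i < m; ++i)
        for (size_t t = 0; t < n; ++t) {
            const float a = X[i * n + t];
            for (size_t j = 0; j < d; ++j) Z1[i * d + j] += a * W1[t * d + j];
        }
    for (float& z : Z1) z = z > 0.0f ? z : 0.0f;

    // G2 = softmax(Z1 W2) - I_y
    for (size_t i = 0; i < m; ++i) {
        for (size_t t = 0; t < d; ++t) {
            const float a = Z1[i * d + t];
            for (size_t c = 0; c < k; ++c) G2[i * k + c] += a * W2[t * k + c];
        }
        float sum = 0.0f;
        for (size_t c = 0; c < k; ++c) { G2[i * k + c] = std::exp(G2[i * k + c]); sum += G2[i * k + c]; }
        for (size_t c = 0; c < k; ++c) G2[i * k + c] /= sum;
        G2[i * k + y[i]] -= 1.0f;
    }

    // G1 = 1(Z1 > 0) o (G2 W2^T); the A B^T form keeps both rows contiguous.
    for (size_t i = 0; i < m; ++i)
        for (size_t t = 0; t < d; ++t) {
            float dot = 0.0f;
            for (size_t c = 0; c < k; ++c) dot += G2[i * k + c] * W2[t * k + c];
            G1[i * d + t] = Z1[i * d + t] > 0.0f ? dot : 0.0f;
        }

    // W1 -= (lr/m) X^T G1,  W2 -= (lr/m) Z1^T G2
    for (size_t i = 0; i < m; ++i)
        for (size_t t = 0; t < n; ++t) {
            const float a = lr * X[i * n + t] / static_cast<float>(m);
            for (size_t j = 0; j < d; ++j) W1[t * d + j] -= a * G1[i * d + j];
        }
    for (size_t i = 0; i < m; ++i)
        for (size_t t = 0; t < d; ++t) {
            const float a = lr * Z1[i * d + t] / static_cast<float>(m);
            for (size_t c = 0; c < k; ++c) W2[t * k + c] -= a * G2[i * k + c];
        }
}
```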

As mentioned above, the $AB^T$ product (here $G_2W_2^T$) can be sped up by exchanging the $n$ and $k$ dimensions.

Task 4

Utilize OpenACC to speed up the NN training. We again follow the principle of minimizing CPU-GPU data transfers. Because of SGD and floating-point precision, the loss and error during training can differ slightly between the CPU and GPU versions.
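As one example of applying the same pattern to the NN kernels, an element-wise ReLU over device-resident data needs only a single parallel loop; again, this is a hedged sketch rather than the project's exact kernel.

```cpp
#include <cstddef>

// In-place ReLU over a matrix that already resides on the GPU.
void relu_openacc(float* A, size_t size) {
#pragma acc data present(A[0 : size])
    {
#pragma acc parallel loop independent
        for (size_t i = 0; i < size; ++i)
            A[i] = A[i] > 0.0f ? A[i] : 0.0f;
    }
}
```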

Figure: Task 3 and Task 4 output

Performance

| #(params) | Seq: softmax (784) | OpenACC: softmax (784) | Seq: NN (317,600) | OpenACC: NN (317,600) |
|---|---|---|---|---|
| Time (ms) | 7169 | 2017 | 731968 | 72851 |

The 784 listed for softmax is the input dimension of an image. The NN has 784 × 400 + 400 × 10 = 317,600 parameters.

|  | softmax | NN |
|---|---|---|
| Speedup | 355.43% | 1004.75% |
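For reference, these percentages follow directly from the timing table above:

$$
\frac{7169\ \text{ms}}{2017\ \text{ms}} \approx 3.5543 \;(355.43\%), \qquad
\frac{731968\ \text{ms}}{72851\ \text{ms}} \approx 10.05 \;(1004.75\%)
$$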

Profiling OpenACC with nsys for Task 2

CUDA API Summary:

GPU Kernel Summary:

GPU Memory Transfer Summary:

GPU Memory Size Summary:

OS Runtime Summary:

Profiling OpenACC with nsys for Task 4

Key entries from the nsys report are summarized below.

CUDA API Summary:

| Time (%) | Total Time (ns) | Num Calls | Average | Minimum | Maximum | StdDev | Name |
|---|---|---|---|---|---|---|---|
| 98.0 | 68,196,909,108 | 216,461 | 315,054.0 | 713 | 649,018,594 | 6,234,591.1 | cuStreamSynchronize |
| 2.0 | 1,368,962,408 | 216,360 | 6,327.2 | 3,843 | 6,350,763 | 14,415.3 | cuLaunchKernel |
| ... | ... | ... | ... | ... | ... | ... | ... |

GPU Kernel Summary:

| Name | Time (%) | Total Time (ns) | Instances | Average | Minimum | Maximum | StdDev |
|---|---|---|---|---|---|---|---|
| matrix_dot_trans_openacc_34_gpu | 54.5 | 37,103,407,971 | 24,000 | 1,545,975.3 | 32,512 | 5,489,037 | 1,514,046.1 |
| matrix_dot_openacc_14_gpu | 39.1 | 26,641,504,435 | 24,080 | 1,106,374.8 | 9,696 | 649,013,563 | 18,557,523.1 |

GPU Memory Transfer Summary:

| Operation | Time (%) | Total Time (ns) | Operations | Average | Minimum | Maximum | StdDev |
|---|---|---|---|---|---|---|---|
| [CUDA memcpy HtoD] | 99.6 | 34,530,320 | 167 | 206,768.4 | 896 | 1,375,484 | 425,223.9 |
| [CUDA memcpy DtoH] | 0.2 | 78,782 | 80 | 984.8 | 864 | 1,568 | 149.1 |

GPU Memory Size Summary:

| Operation | Total | Operations | Average | Minimum | Maximum | StdDev |
|---|---|---|---|---|---|---|
| [CUDA memcpy HtoD] | 408,933.984 | 167 | 2,448.706 | 3.906 | 16,384.000 | 5,095.396 |
| [CUDA memcpy DtoH] | 0.313 | 80 | 0.004 | 0.004 | 0.004 | 0.000 |

OS Runtime Summary:

| Name | Time (%) | Total Time (ns) | Num Calls | Average | Minimum | Maximum | StdDev |
|---|---|---|---|---|---|---|---|
| poll | 50.1 | 72,827,685,292 | 739 | 98,548,965.2 | 10,649 | 100,825,159 | 12,287,789.5 |
| pthread_cond_timedwait | 49.8 | 72,516,307,726 | 145 | 500,112,467.1 | 499,824,488 | 500,264,410 | 31,418.3 |

The provided analysis includes summaries for CUDA API, GPU Kernels, GPU Memory Transfer, GPU Memory Size, and OS Runtime.

 

Compile and Execute

Compile the C++ (OpenACC) code using the bash script; the runtime output will be recorded in Project4-Results.txt.

The test.sh bash script contains sbatch commands, so you can also run it on its own.

View the performance of the Task 2 softmax OpenACC version and store the output in report1.txt.

View the performance of the Task 4 NN OpenACC version and store the output in report2.txt.

Reference

CUHKSZ Project Description: https://github.com/tonyyxliu/CSC4005-2023Fall/tree/main/project4

CMU Code: https://github.com/hanquanjushi/10-714/blob/main/hw0/src/simple_ml_ext.cpp