Profiling
Naive
Matrix size | Execution Time (ms) | Cache References | Cache Misses (%) | Page Faults | Cycles | Instructions (insn per cycle) | Time Elapsed (s) | User Time (s) | Sys Time (s) |
---|---|---|---|---|---|---|---|---|---|
4x4 | 0 | 15,059 | 7,880 (52.33%) | 350 | 1,645,064 | 1,499,957 (0.91) | 0.006876428 | 0.001486000 | 0.000743000 |
128x128 | 12 | 20,542 | 7,989 (38.89%) | 401 | 45,638,597 | 92,867,897 (2.03) | 0.024913198 | 0.015714000 | 0.001964000 |
1024x1024 | 8098 | 68,405,312 | 153,895 (0.23%) | 10,069 | 23,902,613,961 | 35,919,507,009 (1.50) | 8.410229329 | 8.265018000 | 0.029934000 |
2048x2048 | 84405 | 542,915,007 | 129,186,888 (23.80%) | 54,722 | 245,535,142,712 | 281,103,204,474 (1.14) | 85.622553390 | 84.912009000 | 0.132696000 |
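The parenthesized values in these tables can be recomputed from the raw counters: the miss percentage is cache misses divided by cache references, and instructions per cycle is instructions divided by cycles. Below is a minimal Python sketch using the Naive rows above. The helper names `miss_rate_pct` and `ipc` are illustrative assumptions and are not part of the benchmark code.

```python
# Raw counters copied from the Naive table:
# size -> (cache references, cache misses, cycles, instructions)
NAIVE = {
    "4x4": (15_059, 7_880, 1_645_064, 1_499_957),
    "128x128": (20_542, 7_989, 45_638_597, 92_867_897),
    "1024x1024": (68_405_312, 153_895, 23_902_613_961, 35_919_507_009),
    "2048x2048": (542_915_007, 129_186_888, 245_535_142_712, 281_103_204_474),
}


def miss_rate_pct(references, misses):
    """Cache misses as a percentage of all cache references."""
    return 100.0 * misses / references


def ipc(cycles, instructions):
    """Instructions retired per cycle."""
    return instructions / cycles


for size, (refs, misses, cycles, insns) in NAIVE.items():
    print(
        f"{size:>9}: miss rate {miss_rate_pct(refs, misses):6.2f}%  "
        f"IPC {ipc(cycles, insns):.2f}"
    )
```

The recomputed values agree with the table to within rounding of the last printed digit.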
Memory Locality
Matrix size | Execution Time (ms) | Cache References | Cache Misses (%) | Page Faults | Cycles | Instructions (insn per cycle) | Time Elapsed (s) | User Time (s) | Sys Time (s) |
---|---|---|---|---|---|---|---|---|---|
4x4 | 0 | 14,927 | 7,575 (50.75%) | 349 | 1,634,691 | 1,508,247 (0.92) | 0.006073757 | 0.000703000 | 0.001406000 |
128x128 | 1 | 24,096 | 8,014 (33.26%) | 450 | 14,265,534 | 41,337,323 (2.90) | 0.010781986 | 0.004723000 | 0.002361000 |
1024x1024 | 592 | 68,703,460 | 583,328 (0.85%) | 6,552 | 2,508,334,843 | 9,916,745,223 (3.95) | 1.011168101 | 0.878656000 | 0.035945000 |
2048x2048 | 5049 | 545,134,374 | 122,248,794 (22.43%) | 51,633 | 16,556,640,721 | 66,554,110,649 (4.02) | 6.263003508 | 5.742302000 | 0.148697000 |
SIMD + Memory Locality
Matrix Size | Execution Time (ms) | Cache References | Cache Misses (%) | Page Faults | Cycles | Instructions (insn per cycle) | Time Elapsed (s) | User Time (s) | Sys Time (s) |
---|---|---|---|---|---|---|---|---|---|
4x4 | 0 | 15,324 | 8,036 (52.44%) | 352 | 1,658,805 | 1,525,188 (0.92) | 0.005522463 | 0.001371000 | 0.000685000 |
128x128 | 0 | 24,126 | 8,013 (33.21%) | 465 | 11,883,415 | 29,497,713 (2.48) | 0.009254048 | 0.006170000 | 0.000000000 |
1024x1024 | 231 | 9,741,688 | 574,065 (5.89%) | 6,652 | 1,382,821,719 | 3,855,495,348 (2.79) | 0.646677788 | 0.531073000 | 0.021961000 |
2048x2048 | 1834 | 71,703,054 | 18,617,438 (25.97%) | 44,224 | 6,640,389,416 | 18,087,217,743 (2.72) | 3.051195702 | 2.553388000 | 0.133072000 |
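To compare the three variants directly, the execution times above can be turned into speedup factors over the naive version. This is a small Python sketch over the table values only. The dictionary layout and the `speedup` helper are assumptions made for illustration, and the 4x4 and 128x128 rows are left out because their times round to 0 ms at this resolution.

```python
# Execution times in ms, copied from the Naive, Memory Locality and
# SIMD + Memory Locality tables above.
TIMES_MS = {
    "1024x1024": {"naive": 8098, "locality": 592, "simd": 231},
    "2048x2048": {"naive": 84405, "locality": 5049, "simd": 1834},
}


def speedup(baseline_ms, optimized_ms):
    """How many times faster the optimized run is than the baseline."""
    return baseline_ms / optimized_ms


for size, t in TIMES_MS.items():
    print(
        f"{size}: locality {speedup(t['naive'], t['locality']):.1f}x, "
        f"SIMD + locality {speedup(t['naive'], t['simd']):.1f}x over naive"
    )
```

With these numbers the memory-locality version is roughly 14x (1024x1024) and 17x (2048x2048) faster than the naive one, and adding SIMD brings the totals to roughly 35x and 46x.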
OpenMP + SIMD + Memory Locality
Size = (1024x1024) x (1024x1024)
Number of Cores | Execution Time (ms) | Cache References | Cache Misses (%) | Page Faults | Cycles | Instructions (insn per cycle) | Time Elapsed (s) | User Time (s) | Sys Time (s) |
---|---|---|---|---|---|---|---|---|---|
1 | 180 | 35,735,697 | 529,023 (1.48%) | 6,720 | 1,719,419,319 | 3,745,134,379 (2.18) | 0.608006180 | 0.655374000 | 0.033915000 |
2 | 180 | 35,828,139 | 532,107 (1.49%) | 6,719 | 1,721,031,788 | 3,745,488,445 (2.18) | 0.607123866 | 0.664244000 | 0.024934000 |
4 | 107 | 35,742,443 | 722,581 (2.02%) | 6,725 | 1,788,278,975 | 3,756,960,867 (2.10) | 0.541872765 | 0.713217000 | 0.034010000 |
8 | 68 | 50,515,828 | 778,685 (1.54%) | 6,737 | 1,857,770,853 | 3,777,157,390 (2.03) | 0.533851150 | 0.789563000 | 0.042976000 |
16 | 52 | 51,659,963 | 720,180 (1.39%) | 6,763 | 2,176,615,555 | 3,820,804,788 (1.76) | 0.516569947 | 0.990392000 | 0.044151000 |
32 | 43 | 43,359,618 | 666,146 (1.54%) | 6,814 | 2,972,420,024 | 3,922,431,309 (1.32) | 0.512431069 | 1.430720000 | 0.069196000 |
Size = (2048x2048) x (2048x2048)
Number of Cores | Execution Time (ms) | Cache References | Cache Misses (%) | Page Faults | Cycles | Instructions (insn per cycle) | Time Elapsed (s) | User Time (s) | Sys Time (s) |
---|---|---|---|---|---|---|---|---|---|
1 | 1694 | 281,435,292 | 67,810,759 (24.10%) | 45,192 | 11,600,419,676 | 20,329,444,499 (1.75) | 3.342514567 | 4.497190000 | 0.134855000 |
2 | 1694 | 282,296,131 | 68,767,488 (24.36%) | 46,130 | 11,607,809,489 | 20,337,826,665 (1.75) | 3.341796224 | 4.511263000 | 0.123787000 |
4 | 1041 | 337,747,913 | 140,765,153 (41.68%) | 37,685 | 12,044,433,638 | 20,348,731,391 (1.69) | 2.795482975 | 4.696208000 | 0.148626000 |
8 | 512 | 293,565,151 | 72,124,546 (24.57%) | 25,313 | 12,358,982,875 | 20,375,709,121 (1.65) | 2.268546157 | 4.904354000 | 0.129824000 |
16 | 378 | 323,499,810 | 62,453,073 (19.30%) | 25,347 | 12,937,708,531 | 20,421,322,398 (1.58) | 2.066236412 | 5.331375000 | 0.131614000 |
32 | 221 | 301,917,831 | 38,546,078 (12.77%) | 25,414 | 13,058,545,816 | 20,467,635,895 (1.57) | 1.912390218 | 5.734781000 | 0.143543000 |
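The two OpenMP tables can be read as a strong-scaling experiment: the problem size is fixed and only the core count changes. The Python sketch below derives speedup (time on 1 core divided by time on p cores) and parallel efficiency (speedup divided by p) from the execution times above. The `scaling` helper and the dictionary layout are illustrative assumptions.

```python
# Execution times in ms from the two OpenMP tables above,
# keyed by the "Number of Cores" column.
OPENMP_MS = {
    "1024x1024": {1: 180, 2: 180, 4: 107, 8: 68, 16: 52, 32: 43},
    "2048x2048": {1: 1694, 2: 1694, 4: 1041, 8: 512, 16: 378, 32: 221},
}


def scaling(times_ms):
    """Return {cores: (speedup, efficiency)} relative to the 1-core run."""
    t1 = times_ms[1]
    return {p: (t1 / t, t1 / (t * p)) for p, t in times_ms.items()}


for size, times in OPENMP_MS.items():
    print(size)
    for p, (s, e) in scaling(times).items():
        print(f"  {p:>2} cores: speedup {s:5.2f}x, efficiency {e:6.1%}")
```

At 32 cores this gives a speedup of about 4.2x (13% efficiency) for 1024x1024 and about 7.7x (24% efficiency) for 2048x2048, so the larger problem scales better.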
MPI + OpenMP + SIMD + Memory Locality
Number of Processes: 1, Number of Threads: 32
| | cache-references | cache-misses:u | % of all cache refs | page-faults:u | cycles:u | instructions:u | insn per cycle | seconds time elapsed | seconds user | seconds sys |
|---|---|---|---|---|---|---|---|---|---|---|
| MPI | 53,531,599 | 573,448 | 1.071% | 8,193 | 2,631,554,077 | 3,145,888,030 | 1.2 | 0.40927466 | 1.313346 | 0.047192 |
Number of Processes: 2, Number of Threads: 16
| | cache-references | cache-misses:u | % of all cache refs | page-faults:u | cycles:u | instructions:u | insn per cycle | seconds time elapsed | seconds user | seconds sys |
|---|---|---|---|---|---|---|---|---|---|---|
| Run 1 | 18,387,521 | 638,882 | 3.475% | 7,201 | 1,794,540,131 | 3,103,282,451 | 1.73 | 0.403728107 | 0.863765000 | 0.022835000 |
| Run 2 | 18,512,037 | 433,639 | 2.342% | 8,032 | 1,285,180,822 | 2,355,963,191 | 1.83 | 0.544637732 | 0.674788000 | 0.046545000 |
Number of Processes: 4, Number of Threads: 8
Process | cache-references | cache-misses:u | % of all cache refs | page-faults:u | cycles:u | instructions:u | insn per cycle | seconds time elapsed | seconds user | seconds sys |
---|---|---|---|---|---|---|---|---|---|---|
1 | 11,624,426 | 347,147 | 2.986% | 6,148 | 1,285,759,934 | n/a | 2.13 | 0.411279954 | 0.549887000 | 0.039560000 |
2 | 11,373,436 | 428,685 | 3.769% | 6,661 | 1,212,875,856 | n/a | 2.24 | 0.798180252 | 0.555392000 | 0.030021000 |
3 | 11,580,639 | 1,132,947 | 9.783% | 6,660 | 1,297,436,685 | n/a | 2.10 | 0.405662909 | 0.568450000 | 0.036899000 |
4 | 10,044,076 | 676,713 | 6.737% | 7,847 | 959,176,494 | n/a | 2.09 | 0.864433109 | 0.450446000 | 0.049157000 |
Number of Processes: 8, Number of Threads: 4
Process | Cache References | Cache Misses (%) | Page Faults | Cycles | Instructions | Instructions per Cycle |
---|---|---|---|---|---|---|
1 | 7,103,554 | 5.643% | 5,887 | 1,095,412,904 | 2,644,775,225 | 2.41 |
2 | 4,974,409 | 8.437% | 5,887 | 2,644,775,225 | 2,565,722,673 | 2.49 |
3 | 4,965,116 | 29.409% | 5,887 | 1,059,540,343 | 2,714,194,905 | 2.42 |
4 | 4,952,993 | 28.320% | 5,887 | 1,076,613,438 | 2,562,567,134 | 2.39 |
5 | 4,916,100 | 38.812% | 6,399 | 1,042,072,492 | 2,440,878,243 | 2.27 |
6 | 4,913,112 | 8.880% | 6,398 | 1,816,047,284 | 2,562,567,134 | 2.46 |
7 | 4,918,485 | 33.651% | 6,398 | 1,074,992,911 | 2,440,878,243 | 2.27 |
8 | 5,224,037 | 34.665% | 8,109 | 812,317,982 | 1,816,047,284 | 2.24 |
Number of Processes: 16, Number of Threads: 2
Process | Cache References | Cache Misses (%) | Page Faults | Cycles | Instructions | Instructions per Cycle |
---|---|---|---|---|---|---|
Average | 2,726,474 | 43% | 6,373 | 986,932,298 | 2,410,212,529 | 2.43 |
Number of Processes: 32, Number of Threads: 1
Process | Cache References | Cache Misses (%) | Page Faults | Cycles | Instructions | Instructions per Cycle |
---|---|---|---|---|---|---|
Average | 2,716,216 | 72.204% | 6,231 | 1,606,922,431 | 2,286,784,103 | 1.42 |