hw1: final submission

Claudio Maggioni 2022-10-12 11:38:39 +02:00
parent 980eb5f0b9
commit 84d437f06c
4 changed files with 568 additions and 540 deletions

@@ -320,12 +320,23 @@ In order to achieve a correct and fast execution, my implementation:
hits are maximised. The copy operation is implemented with bulk copy
\texttt{memcpy} calls. This optimization achieves an extra half of a
percentage point in terms of CPU utilization (i.e. from the $6\%$
discussed above to a final $6.5\%$).
discussed above to roughly $6.5\%$).
\item Exploits some compiler optimizations, namely enabling the compiler
optimizer at the \texttt{-O3} level and using the \texttt{-ffast-math}
and \texttt{-march=haswell} flags to, respectively, apply some floating
point arithmetic optimizations and set the compiler target to the exact
ISA of the cluster's processor. Note that these flags are applied to all
implementations used in the benchmark, as they were added in the
\texttt{Makefile}. In addition, the \texttt{naivemm} algorithm was
inlined directly in the \texttt{dgemm-blocked.c} source code, instead of
being kept as a separate function, to achieve better performance and
enable further compiler optimizations. All these changes increased the
CPU utilization from $6.5\%$ to an average of $13.45\%$.
\end{itemize}
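
The block-copy and inlining strategy listed above can be sketched in C as follows. This is only a minimal illustration under stated assumptions, not the actual \texttt{dgemm-blocked.c}: the row-major layout, the block side of 32, the function name \texttt{dgemm\_blocked\_sketch}, the helper \texttt{block\_footprint\_bytes} and the buffer names \texttt{a\_blk} and \texttt{b\_blk} are all hypothetical choices made for the example.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK 32 /* block side s; assumed value for this sketch */

/* Bytes needed by the three s x s blocks of doubles (A, B and C_temp). */
size_t block_footprint_bytes(size_t s)
{
    return sizeof(double) * s * s * 3;
}

/* Computes C += A * B for n x n matrices.
 * Assumption: row-major storage (the real code may use another layout). */
void dgemm_blocked_sketch(size_t n, const double *A, const double *B, double *C)
{
    /* Contiguous scratch blocks so that the inner kernel walks dense memory. */
    double a_blk[BLOCK * BLOCK];
    double b_blk[BLOCK * BLOCK];
    double c_temp[BLOCK * BLOCK];

    for (size_t bi = 0; bi < n; bi += BLOCK) {
        size_t mi = (n - bi < BLOCK) ? n - bi : BLOCK; /* rows in this block */
        for (size_t bj = 0; bj < n; bj += BLOCK) {
            size_t mj = (n - bj < BLOCK) ? n - bj : BLOCK; /* columns in this block */

            /* Bulk-copy the current C block into the temporary buffer. */
            for (size_t i = 0; i < mi; i++)
                memcpy(&c_temp[i * BLOCK], &C[(bi + i) * n + bj], mj * sizeof(double));

            for (size_t bk = 0; bk < n; bk += BLOCK) {
                size_t mk = (n - bk < BLOCK) ? n - bk : BLOCK;

                /* Bulk-copy the A and B sub-blocks, one row at a time. */
                for (size_t i = 0; i < mi; i++)
                    memcpy(&a_blk[i * BLOCK], &A[(bi + i) * n + bk], mk * sizeof(double));
                for (size_t k = 0; k < mk; k++)
                    memcpy(&b_blk[k * BLOCK], &B[(bk + k) * n + bj], mj * sizeof(double));

                /* Naive multiplication written inline instead of as a call. */
                for (size_t i = 0; i < mi; i++)
                    for (size_t k = 0; k < mk; k++) {
                        double a = a_blk[i * BLOCK + k];
                        for (size_t j = 0; j < mj; j++)
                            c_temp[i * BLOCK + j] += a * b_blk[k * BLOCK + j];
                    }
            }

            /* Write the accumulated block back to C. */
            for (size_t i = 0; i < mi; i++)
                memcpy(&C[(bi + i) * n + bj], &c_temp[i * BLOCK], mj * sizeof(double));
        }
    }
}
```

Keeping the three scratch blocks contiguous is what makes the row-wise \texttt{memcpy} calls possible; partial blocks at the matrix edges are handled by clamping the row and column counts.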
The chosen matrix block size for running the benchmark on the cluster is:
$$s = 26$$
$$s = 32$$
as shown in the pseudocode. This value has been obtained by running an empirical
binary search on the value using the benchmark as a metric, i.e. by running
@@ -333,7 +344,7 @@ binary search on the value using the benchmark as a metric, i.e. by running
blocks (i.e. the worst case) the total size for the matrix $A$ and $B$ sub-block
and the \texttt{C\_temp} temporary matrix block for $C$ is:
$$\mathrm{Bytes} = \mathrm{cellSize} * s^2 * 3 = 8 * 26^2 * 3 = 16224$$
$$\mathrm{Bytes} = \mathrm{cellSize} * s^2 * 3 = 8 * 32^2 * 3 = 24576$$
given that a double-precision floating point number, the data type used for
matrix cells in the scope of this project, is 8 bytes long. The obtained total
@@ -396,11 +407,9 @@ caches found on most processors.
The results of the matrix multiplication benchmark for the naive, blocked, and
BLAS implementations are shown in Figure \ref{fig:bench} as a graph of GFlop/s
over matrix size or in Figure \ref{fig:benchtab} as a table. The blocked
implementation achieves on average 50\% more FLOPS than the naive
implementation thanks to the optimisations in spatial and temporal cache locality
described. However, the blocked implementation achieves less than a tenth of
the FLOPS of the Intel MKL BLAS based one, due to the microarchitecture
optimizations the latter is able to exploit.
implementation achieves up to 200\% more FLOPS than the naive implementation for
the largest matrix dimensions. However, the blocked implementation achieves
roughly an eighth of the FLOPS achieved by the Intel MKL BLAS based implementation.
I was unable to run this benchmark suite on my personal machine due to Intel MKL
installation issues that prevented the code from compiling.
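
The relative figures quoted above can be recomputed from the measured MFLOPS values. The sketch below is only an illustration: the helper names \texttt{percent\_more} and \texttt{fraction\_of} are hypothetical, and any values passed to them are assumed to come from the MFLOPS columns of the benchmark table, read in the order naive, blocked, MKL.

```c
/* Percentage by which `fast` exceeds `slow`
 * (100.0 means `fast` is twice `slow`). */
double percent_more(double fast, double slow)
{
    return (fast / slow - 1.0) * 100.0;
}

/* Fraction of a reference throughput reached by an implementation
 * (0.125 means one eighth of the reference). */
double fraction_of(double mflops, double reference_mflops)
{
    return mflops / reference_mflops;
}
```

For instance, \texttt{percent\_more(4686.21, 1825.38)} compares the second and first MFLOPS values of the $n = 769$ row, and \texttt{fraction\_of(4686.21, 37076.1)} compares the second value against the third.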
@@ -423,32 +432,32 @@ installation issues that prevented the code from compiling.
\makecell{\% CPU} & \makecell{MFLOPS} &
\makecell{\% CPU} \\
\midrule
31 & 2393.33 & 6.50 & 2112.63 & 5.74 & 23449.20 & 63.72 \\
32 & 2400.13 & 6.52 & 2187.44 & 5.94 & 28198.90 & 76.63 \\
96 & 1998.74 & 5.43 & 2325.39 & 6.32 & 32542.30 & 88.43 \\
97 & 1996.01 & 5.42 & 2322.81 & 6.31 & 29801.30 & 80.98 \\
127 & 1923.81 & 5.23 & 2330.30 & 6.33 & 28557.80 & 77.60 \\
128 & 1731.98 & 4.71 & 2282.93 & 6.20 & 32643.30 & 88.70 \\
129 & 1903.31 & 5.17 & 2334.25 & 6.34 & 31198.20 & 84.78 \\
191 & 1736.78 & 4.72 & 2345.91 & 6.37 & 32247.30 & 87.63 \\
192 & 1694.44 & 4.60 & 2345.38 & 6.37 & 32830.60 & 89.21 \\
229 & 1715.10 & 4.66 & 2351.01 & 6.39 & 34360.90 & 93.37 \\
255 & 1720.39 & 4.67 & 2335.21 & 6.35 & 33477.70 & 90.97 \\
256 & 777.65 & 2.11 & 2306.48 & 6.27 & 33473.90 & 90.96 \\
257 & 1729.27 & 4.70 & 2330.68 & 6.33 & 33686.50 & 91.54 \\
319 & 1704.80 & 4.63 & 2360.03 & 6.41 & 34335.20 & 93.30 \\
320 & 1414.84 & 3.84 & 2364.53 & 6.43 & 36438.10 & 99.02 \\
321 & 1741.30 & 4.73 & 2366.38 & 6.43 & 35433.70 & 96.29 \\
417 & 1733.00 & 4.71 & 2378.34 & 6.46 & 36133.70 & 98.19 \\
479 & 1731.17 & 4.70 & 2233.05 & 6.07 & 32951.40 & 89.54 \\
480 & 1678.77 & 4.56 & 2187.87 & 5.95 & 37260.00 & 101.25 \\
511 & 1733.60 & 4.71 & 2224.61 & 6.05 & 34128.00 & 92.74 \\
512 & 782.96 & 2.13 & 2284.85 & 6.21 & 36526.40 & 99.26 \\
639 & 1714.42 & 4.66 & 2292.78 & 6.23 & 35249.20 & 95.79 \\
640 & 663.42 & 1.80 & 2264.70 & 6.15 & 36538.70 & 99.29 \\
767 & 1690.82 & 4.59 & 2324.83 & 6.32 & 35718.50 & 97.06 \\
768 & 792.04 & 2.15 & 2363.92 & 6.42 & 32116.80 & 87.27 \\
769 & 1696.95 & 4.61 & 2321.31 & 6.31 & 33033.90 & 89.77 \\
31 & 3140.45 & 8.53 & 3844.56 & 10.45 & 25677.4 & 69.78 \\
32 & 3364.78 & 9.14 & 5342.55 & 14.52 & 28952.1 & 78.67 \\
96 & 2703.08 & 7.35 & 5620.08 & 15.27 & 32816.4 & 89.18 \\
97 & 2729.68 & 7.42 & 4754.1 & 12.92 & 31699.2 & 86.14 \\
127 & 2556.58 & 6.95 & 4977.82 & 13.53 & 30274.5 & 82.27 \\
128 & 1803.41 & 4.90 & 4817.8 & 13.09 & 32721.7 & 88.92 \\
129 & 2669.26 & 7.25 & 4594.25 & 12.48 & 31746.4 & 86.27 \\
191 & 2290.09 & 6.22 & 4931.27 & 13.40 & 32263.1 & 87.67 \\
192 & 1801.66 & 4.90 & 5549.67 & 15.08 & 35491.2 & 96.44 \\
229 & 2218.61 & 6.03 & 4982.59 & 13.54 & 34557.2 & 93.91 \\
255 & 2178.15 & 5.92 & 4528.43 & 12.31 & 33771.3 & 91.77 \\
256 & 808.413 & 2.20 & 4652.68 & 12.64 & 35221.1 & 95.71 \\
257 & 2238.93 & 6.08 & 4512.33 & 12.26 & 33807.9 & 91.87 \\
319 & 2174.45 & 5.91 & 5093.38 & 13.84 & 34415.8 & 93.52 \\
320 & 1612.13 & 4.38 & 5674.61 & 15.42 & 36500.2 & 99.19 \\
321 & 2173.64 & 5.91 & 5111.09 & 13.89 & 35508.1 & 96.49 \\
417 & 2125.36 & 5.78 & 5143.98 & 13.98 & 36157.6 & 98.25 \\
479 & 2107.13 & 5.73 & 5152.51 & 14.00 & 36186.4 & 98.33 \\
480 & 1848.43 & 5.02 & 5703 & 15.50 & 37971.3 & 103.18 \\
511 & 2112.99 & 5.74 & 4479.96 & 12.17 & 35144 & 95.50 \\
512 & 801.127 & 2.18 & 4596.26 & 12.49 & 37362.5 & 101.53 \\
639 & 1881.94 & 5.11 & 5168.59 & 14.05 & 36989.1 & 100.51 \\
640 & 815.847 & 2.22 & 5232.97 & 14.22 & 38267.8 & 103.99 \\
767 & 1825.75 & 4.96 & 4701.09 & 12.77 & 37220.8 & 101.14 \\
768 & 812.933 & 2.21 & 4826.12 & 13.11 & 38744 & 105.28 \\
769 & 1825.38 & 4.96 & 4686.21 & 12.73 & 37076.1 & 100.75 \\
\bottomrule
\end{tabular}
\end{center}

File diff suppressed because it is too large

Binary file not shown.