hw1: final submission
parent 980eb5f0b9
commit 84d437f06c
4 changed files with 568 additions and 540 deletions
Binary file not shown.
@@ -320,12 +320,23 @@ In order to achieve a correct and fast execution, my implementation:
 hits are maximised. The copy operation is implemented with bulk copy
 \texttt{memcpy} calls. This optimization achieves an extra half of a
 percentage point in terms of CPU utilization (i.e. from the $6\%$
-discussed above to a final $6.5\%$).
+discussed above to roughly $6.5\%$).
+\item Exploits some compiler optimizations, namely using the compiler
+optimizer at the \texttt{-O3} setting and using the \texttt{-ffast-math}
+and \texttt{-march=haswell} flags to, respectively, apply some floating-point
+arithmetic optimizations and set the compiler target to the exact ISA
+of the cluster's processor. Note that these flags are applied to all
+implementations used in the benchmark as the flags were added in the
+\texttt{Makefile}. In addition, the \texttt{naivemm} algorithm was
+inlined in the actual \texttt{dgemm-blocked.c} source code, instead of
+being kept as a separate function to be called, to achieve better
+performance and compiler optimizations. All these changes increased the
+CPU utilization from $6.5\%$ to an average of $13.45\%$.
 \end{itemize}
 
 The chosen matrix block size for running the benchmark on the cluster is:
 
-$$s = 26$$
+$$s = 32$$
 
 as shown in the pseudocode. This value has been obtained by running an empirical
 binary search on the value using the benchmark as a metric, i.e. by running
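The blocking scheme described in this hunk can be sketched in C as follows. This is only a minimal illustration under stated assumptions, not the code from \texttt{dgemm-blocked.c}: the names `blocked_dgemm_sketch` and `pack_block`, the column-major storage, and the `C := C + A*B` convention are hypothetical choices made here. Each sub-block of $A$ and $B$ is packed into a contiguous buffer with one `memcpy` per column, multiplied by an inlined naive kernel into a temporary block, and then accumulated into $C$.

```c
#include <assert.h>
#include <string.h>

#define S 32 /* block size s, the value chosen in the text */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Copy the h-by-w sub-block starting at (row0, col0) of the n-by-n
 * column-major matrix M into dst (leading dimension h).
 * Each block column is contiguous in M, so one memcpy per column suffices. */
static void pack_block(int n, const double *M, int row0, int col0,
                       int h, int w, double *dst)
{
    for (int j = 0; j < w; ++j)
        memcpy(dst + (size_t)j * h,
               M + (size_t)(col0 + j) * n + row0,
               (size_t)h * sizeof(double));
}

/* Hypothetical sketch: C := C + A * B for n-by-n column-major matrices. */
void blocked_dgemm_sketch(int n, const double *A, const double *B, double *C)
{
    assert(n >= 0);
    double a_blk[S * S], b_blk[S * S], c_tmp[S * S];

    for (int j0 = 0; j0 < n; j0 += S) {
        int w = min_int(S, n - j0);
        for (int i0 = 0; i0 < n; i0 += S) {
            int h = min_int(S, n - i0);
            memset(c_tmp, 0, (size_t)h * w * sizeof(double));

            for (int k0 = 0; k0 < n; k0 += S) {
                int d = min_int(S, n - k0);
                pack_block(n, A, i0, k0, h, d, a_blk); /* h x d */
                pack_block(n, B, k0, j0, d, w, b_blk); /* d x w */

                /* Naive multiply, inlined rather than called. */
                for (int j = 0; j < w; ++j)
                    for (int k = 0; k < d; ++k) {
                        double b = b_blk[k + j * d];
                        for (int i = 0; i < h; ++i)
                            c_tmp[i + j * h] += a_blk[i + k * h] * b;
                    }
            }

            /* Accumulate the temporary block into C. */
            for (int j = 0; j < w; ++j)
                for (int i = 0; i < h; ++i)
                    C[(size_t)(j0 + j) * n + i0 + i] += c_tmp[i + j * h];
        }
    }
}
```

Edge blocks are handled by clamping `h`, `w`, and `d`, so `n` need not be a multiple of the block size. Compiling with `-O3 -ffast-math -march=haswell` would match the flags named in the text.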
@@ -333,7 +344,7 @@ binary search on the value using the benchmark as a metric, i.e. by running
 blocks (i.e. the worst case) the total size for the matrix $A$ and $B$ sub-blocks
 and the \texttt{C\_temp} temporary matrix block for $C$ is:
 
-$$\mathrm{Bytes} = \mathrm{cellSize} \cdot s^2 \cdot 3 = 8 \cdot 26^2 \cdot 3 = 16224$$
+$$\mathrm{Bytes} = \mathrm{cellSize} \cdot s^2 \cdot 3 = 8 \cdot 32^2 \cdot 3 = 24576$$
 
 given that a double-precision floating point number, the data type used for
 matrix cells in the scope of this project, is 8 bytes long. The obtained total
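The footprint arithmetic in this hunk can be checked with a few lines of C. This is a sketch only: the function names `block_footprint_bytes` and `blocks_fit_in_cache` are hypothetical, and the cache size is left as a parameter because the target cache is not visible in this hunk.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for the A sub-block, the B sub-block, and the temporary
 * C block, each holding s * s double-precision cells. */
size_t block_footprint_bytes(size_t s)
{
    return sizeof(double) * s * s * 3;
}

/* Nonzero if the three blocks together fit in cache_bytes bytes.
 * cache_bytes is an assumption supplied by the caller. */
int blocks_fit_in_cache(size_t s, size_t cache_bytes)
{
    assert(s > 0);
    return block_footprint_bytes(s) <= cache_bytes;
}
```

On a platform where `sizeof(double)` is 8, `block_footprint_bytes(26)` gives 16224 and `block_footprint_bytes(32)` gives 24576, matching both sides of the change. With a 32 KiB cache, used here only as an example value, the largest block size that still fits is 36.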
@@ -396,11 +407,9 @@ caches found on most processors.
 The results of the matrix multiplication benchmark for the naive, blocked, and
 BLAS implementations are shown in Figure \ref{fig:bench} as a graph of GFlop/s
 over matrix size or in Figure \ref{fig:benchtab} as a table. The blocked
-implementation achieves on average 50\% more FLOPS than the naive
-implementation thanks to the optimisations in spatial and temporal cache locality
-described. However, the blocked implementation achieves less than a tenth of
-the FLOPS of the Intel MKL BLAS-based one due to the microarchitecture
-optimizations the latter is able to exploit.
+implementation achieves up to 200\% more FLOPS than the naive implementation for
+the largest matrix dimensions. However, the blocked implementation achieves
+roughly an eighth of the FLOPS the Intel MKL BLAS-based implementation achieves.
 
 I was unable to run this benchmark suite on my personal machine due to Intel MKL
 installation issues that prevented the code from compiling.
@@ -423,32 +432,32 @@ installation issues that prevented the code to compile.
 \makecell{\% CPU} & \makecell{MFLOPS} &
 \makecell{\% CPU} \\
 \midrule
-31 & 2393.33 & 6.50 & 2112.63 & 5.74 & 23449.20 & 63.72 \\
-32 & 2400.13 & 6.52 & 2187.44 & 5.94 & 28198.90 & 76.63 \\
-96 & 1998.74 & 5.43 & 2325.39 & 6.32 & 32542.30 & 88.43 \\
-97 & 1996.01 & 5.42 & 2322.81 & 6.31 & 29801.30 & 80.98 \\
-127 & 1923.81 & 5.23 & 2330.30 & 6.33 & 28557.80 & 77.60 \\
-128 & 1731.98 & 4.71 & 2282.93 & 6.20 & 32643.30 & 88.70 \\
-129 & 1903.31 & 5.17 & 2334.25 & 6.34 & 31198.20 & 84.78 \\
-191 & 1736.78 & 4.72 & 2345.91 & 6.37 & 32247.30 & 87.63 \\
-192 & 1694.44 & 4.60 & 2345.38 & 6.37 & 32830.60 & 89.21 \\
-229 & 1715.10 & 4.66 & 2351.01 & 6.39 & 34360.90 & 93.37 \\
-255 & 1720.39 & 4.67 & 2335.21 & 6.35 & 33477.70 & 90.97 \\
-256 & 777.65 & 2.11 & 2306.48 & 6.27 & 33473.90 & 90.96 \\
-257 & 1729.27 & 4.70 & 2330.68 & 6.33 & 33686.50 & 91.54 \\
-319 & 1704.80 & 4.63 & 2360.03 & 6.41 & 34335.20 & 93.30 \\
-320 & 1414.84 & 3.84 & 2364.53 & 6.43 & 36438.10 & 99.02 \\
-321 & 1741.30 & 4.73 & 2366.38 & 6.43 & 35433.70 & 96.29 \\
-417 & 1733.00 & 4.71 & 2378.34 & 6.46 & 36133.70 & 98.19 \\
-479 & 1731.17 & 4.70 & 2233.05 & 6.07 & 32951.40 & 89.54 \\
-480 & 1678.77 & 4.56 & 2187.87 & 5.95 & 37260.00 & 101.25 \\
-511 & 1733.60 & 4.71 & 2224.61 & 6.05 & 34128.00 & 92.74 \\
-512 & 782.96 & 2.13 & 2284.85 & 6.21 & 36526.40 & 99.26 \\
-639 & 1714.42 & 4.66 & 2292.78 & 6.23 & 35249.20 & 95.79 \\
-640 & 663.42 & 1.80 & 2264.70 & 6.15 & 36538.70 & 99.29 \\
-767 & 1690.82 & 4.59 & 2324.83 & 6.32 & 35718.50 & 97.06 \\
-768 & 792.04 & 2.15 & 2363.92 & 6.42 & 32116.80 & 87.27 \\
-769 & 1696.95 & 4.61 & 2321.31 & 6.31 & 33033.90 & 89.77 \\
+31 & 3140.45 & 8.53 & 3844.56 & 10.45 & 25677.4 & 69.78 \\
+32 & 3364.78 & 9.14 & 5342.55 & 14.52 & 28952.1 & 78.67 \\
+96 & 2703.08 & 7.35 & 5620.08 & 15.27 & 32816.4 & 89.18 \\
+97 & 2729.68 & 7.42 & 4754.1 & 12.92 & 31699.2 & 86.14 \\
+127 & 2556.58 & 6.95 & 4977.82 & 13.53 & 30274.5 & 82.27 \\
+128 & 1803.41 & 4.90 & 4817.8 & 13.09 & 32721.7 & 88.92 \\
+129 & 2669.26 & 7.25 & 4594.25 & 12.48 & 31746.4 & 86.27 \\
+191 & 2290.09 & 6.22 & 4931.27 & 13.40 & 32263.1 & 87.67 \\
+192 & 1801.66 & 4.90 & 5549.67 & 15.08 & 35491.2 & 96.44 \\
+229 & 2218.61 & 6.03 & 4982.59 & 13.54 & 34557.2 & 93.91 \\
+255 & 2178.15 & 5.92 & 4528.43 & 12.31 & 33771.3 & 91.77 \\
+256 & 808.413 & 2.20 & 4652.68 & 12.64 & 35221.1 & 95.71 \\
+257 & 2238.93 & 6.08 & 4512.33 & 12.26 & 33807.9 & 91.87 \\
+319 & 2174.45 & 5.91 & 5093.38 & 13.84 & 34415.8 & 93.52 \\
+320 & 1612.13 & 4.38 & 5674.61 & 15.42 & 36500.2 & 99.19 \\
+321 & 2173.64 & 5.91 & 5111.09 & 13.89 & 35508.1 & 96.49 \\
+417 & 2125.36 & 5.78 & 5143.98 & 13.98 & 36157.6 & 98.25 \\
+479 & 2107.13 & 5.73 & 5152.51 & 14.00 & 36186.4 & 98.33 \\
+480 & 1848.43 & 5.02 & 5703 & 15.50 & 37971.3 & 103.18 \\
+511 & 2112.99 & 5.74 & 4479.96 & 12.17 & 35144 & 95.50 \\
+512 & 801.127 & 2.18 & 4596.26 & 12.49 & 37362.5 & 101.53 \\
+639 & 1881.94 & 5.11 & 5168.59 & 14.05 & 36989.1 & 100.51 \\
+640 & 815.847 & 2.22 & 5232.97 & 14.22 & 38267.8 & 103.99 \\
+767 & 1825.75 & 4.96 & 4701.09 & 12.77 & 37220.8 & 101.14 \\
+768 & 812.933 & 2.21 & 4826.12 & 13.11 & 38744 & 105.28 \\
+769 & 1825.38 & 4.96 & 4686.21 & 12.73 & 37076.1 & 100.75 \\
 \bottomrule
 \end{tabular}
 \end{center}
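The relative figures quoted in the prose ("\% more FLOPS", "an eighth of the FLOPS") come down to two ratios over the MFLOPS columns of this table. The helpers below are a hypothetical sketch of that arithmetic; `percent_more` and `fraction_of` are illustrative names, and reading the three MFLOPS columns as naive, blocked, and BLAS (in that order) is an assumption, since the full table header is not part of this hunk.

```c
#include <assert.h>

/* Percentage by which x exceeds base: 100 * (x / base - 1). */
double percent_more(double base, double x)
{
    assert(base > 0.0);
    return 100.0 * (x / base - 1.0);
}

/* Fraction of a reference throughput that x reaches. */
double fraction_of(double reference, double x)
{
    assert(reference > 0.0);
    return x / reference;
}
```

For the updated $n = 769$ row (1825.38, 4686.21, and 37076.1 MFLOPS), `percent_more(1825.38, 4686.21)` is about 157 and `fraction_of(37076.1, 4686.21)` is about 0.126, i.e. roughly an eighth. Rows where the first column collapses (for example 256, 512, 640, and 768) give much larger percentages, so the result depends strongly on which sizes are compared.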
1031
Project1/timing.pdf
File diff suppressed because it is too large
Binary file not shown.