hw1: final submission

2022-10-12 11:38:39 +02:00
commit 84d437f06c
parent 980eb5f0b9
4 changed files with 568 additions and 540 deletions
--- a/Project1/project_1_maggioni_claudio.pdf
+++ b/Project1/project_1_maggioni_claudio.pdf
--- a/Project1/project_1_maggioni_claudio.tex
+++ b/Project1/project_1_maggioni_claudio.tex
@@ -320,12 +320,23 @@ In order to achieve a correct and fast execution, my implementation:
        hits are maximised. The copy operation is implemented with bulk copy
        \texttt{memcpy} calls. This optimization achieves an extra half of a
        percentage point in terms of CPU utilization (i.e. from the $6\%$
-        discussed above to a final $6.5\%$).
+        discussed above to roughly $6.5\%$).
+    \item Exploits compiler optimizations: the optimizer runs at the
+        \texttt{-O3} level, and the \texttt{-ffast-math} and
+        \texttt{-march=haswell} flags respectively enable additional floating
+        point arithmetic optimizations and set the compiler target to the exact
+        ISA of the cluster's processor. Note that these flags apply to all
+        implementations used in the benchmark, as they were added to the
+        \texttt{Makefile}. In addition, the \texttt{naivemm} algorithm was
+        inlined in the \texttt{dgemm-blocked.c} source code instead of
+        being kept as a separate function, to give the compiler more room
+        to optimize. Together, these changes increased the
+        CPU utilization from $6.5\%$ to an average of $13.45\%$.
 \end{itemize}

 The chosen matrix block size for running the benchmark on the cluster is:

-$$s = 26$$
+$$s = 32$$

 as shown in the pseudocode. This value has been obtained by running an empirical
 binary search on the value using the benchmark as a metric, i.e. by running
@@ -333,7 +344,7 @@ binary search on the value using the benchmark as a metric, i.e. by running
 blocks (i.e. the worst case) the total size for the matrix $A$ and $B$ sub-block
 and the \texttt{C\_temp} temporary matrix block for $C$ is:

-$$\mathrm{Bytes} = \mathrm{cellSize} * s^2 * 3 = 8 * 26^2 * 3 = 16224$$
+$$\mathrm{Bytes} = \mathrm{cellSize} * s^2 * 3 = 8 * 32^2 * 3 = 24576$$

 given that a double-precision floating point number, the data type used for
 matrix cells in the scope of this project, is 8 bytes long. The obtained total
@@ -396,11 +407,9 @@ caches found on most processors.
 The results of the matrix multiplication benchmark for the naive, blocked, and
 BLAS implementations are shown in Figure \ref{fig:bench} as a graph of GFlop/s
 over matrix size or in Figure \ref{fig:benchtab} as a table. The blocked
-implementation achieves on average 50\% more FLOPS than the naive
-implementation thanks to the optimisations in space and temporal cache locality
-described. However, the blocked implementation achieves less than a tenth of
-FLOPS compared to Intel MKL BLAS based one due to the microarchitecture
-optimization the latter one is able to exploit.
+implementation achieves up to 200\% more FLOPS than the naive implementation for
+the largest matrix dimensions. However, the blocked implementation reaches only
+roughly an eighth of the FLOPS of the Intel MKL BLAS-based implementation.

 I was unable to run this benchmark suite on my personal machine due to Intel MKL
 installation issues that prevented the code from compiling.
@@ -423,32 +432,32 @@ installation issues that prevented the code to compile.
    \makecell{\% CPU} & \makecell{MFLOPS} &
    \makecell{\% CPU} \\
    \midrule
-        31 & 2393.33 & 6.50 & 2112.63 & 5.74 & 23449.20 & 63.72 \\
-        32 & 2400.13 & 6.52 & 2187.44 & 5.94 & 28198.90 & 76.63 \\
-        96 & 1998.74 & 5.43 & 2325.39 & 6.32 & 32542.30 & 88.43 \\
-        97 & 1996.01 & 5.42 & 2322.81 & 6.31 & 29801.30 & 80.98 \\
-        127 & 1923.81 & 5.23 & 2330.30 & 6.33 & 28557.80 & 77.60 \\
-        128 & 1731.98 & 4.71 & 2282.93 & 6.20 & 32643.30 & 88.70 \\
-        129 & 1903.31 & 5.17 & 2334.25 & 6.34 & 31198.20 & 84.78 \\
-        191 & 1736.78 & 4.72 & 2345.91 & 6.37 & 32247.30 & 87.63 \\
-        192 & 1694.44 & 4.60 & 2345.38 & 6.37 & 32830.60 & 89.21 \\
-        229 & 1715.10 & 4.66 & 2351.01 & 6.39 & 34360.90 & 93.37 \\
-        255 & 1720.39 & 4.67 & 2335.21 & 6.35 & 33477.70 & 90.97 \\
-        256 & 777.65 & 2.11 & 2306.48 & 6.27 & 33473.90 & 90.96 \\
-        257 & 1729.27 & 4.70 & 2330.68 & 6.33 & 33686.50 & 91.54 \\
-        319 & 1704.80 & 4.63 & 2360.03 & 6.41 & 34335.20 & 93.30 \\
-        320 & 1414.84 & 3.84 & 2364.53 & 6.43 & 36438.10 & 99.02 \\
-        321 & 1741.30 & 4.73 & 2366.38 & 6.43 & 35433.70 & 96.29 \\
-        417 & 1733.00 & 4.71 & 2378.34 & 6.46 & 36133.70 & 98.19 \\
-        479 & 1731.17 & 4.70 & 2233.05 & 6.07 & 32951.40 & 89.54 \\
-        480 & 1678.77 & 4.56 & 2187.87 & 5.95 & 37260.00 & 101.25 \\
-        511 & 1733.60 & 4.71 & 2224.61 & 6.05 & 34128.00 & 92.74 \\
-        512 & 782.96 & 2.13 & 2284.85 & 6.21 & 36526.40 & 99.26 \\
-        639 & 1714.42 & 4.66 & 2292.78 & 6.23 & 35249.20 & 95.79 \\
-        640 & 663.42 & 1.80 & 2264.70 & 6.15 & 36538.70 & 99.29 \\
-        767 & 1690.82 & 4.59 & 2324.83 & 6.32 & 35718.50 & 97.06 \\
-        768 & 792.04 & 2.15 & 2363.92 & 6.42 & 32116.80 & 87.27 \\
-        769 & 1696.95 & 4.61 & 2321.31 & 6.31 & 33033.90 & 89.77 \\
+31	 & 3140.45	 & 8.53 & 3844.56	 & 10.45 & 25677.4	 & 69.78 \\
+32	 & 3364.78	 & 9.14 & 5342.55	 & 14.52 & 28952.1	 & 78.67 \\
+96	 & 2703.08	 & 7.35 & 5620.08	 & 15.27 & 32816.4	 & 89.18 \\
+97	 & 2729.68	 & 7.42 & 4754.1	 & 12.92 & 31699.2	 & 86.14 \\
+127	 & 2556.58	 & 6.95 & 4977.82	 & 13.53 & 30274.5	 & 82.27 \\
+128	 & 1803.41	 & 4.90 & 4817.8	 & 13.09 & 32721.7	 & 88.92 \\
+129	 & 2669.26	 & 7.25 & 4594.25	 & 12.48 & 31746.4	 & 86.27 \\
+191	 & 2290.09	 & 6.22 & 4931.27	 & 13.40 & 32263.1	 & 87.67 \\
+192	 & 1801.66	 & 4.90 & 5549.67	 & 15.08 & 35491.2	 & 96.44 \\
+229	 & 2218.61	 & 6.03 & 4982.59	 & 13.54 & 34557.2	 & 93.91 \\
+255	 & 2178.15	 & 5.92 & 4528.43	 & 12.31 & 33771.3	 & 91.77 \\
+256	 & 808.413	 & 2.20 & 4652.68	 & 12.64 & 35221.1	 & 95.71 \\
+257	 & 2238.93	 & 6.08 & 4512.33	 & 12.26 & 33807.9	 & 91.87 \\
+319	 & 2174.45	 & 5.91 & 5093.38	 & 13.84 & 34415.8	 & 93.52 \\
+320	 & 1612.13	 & 4.38 & 5674.61	 & 15.42 & 36500.2	 & 99.19 \\
+321	 & 2173.64	 & 5.91 & 5111.09	 & 13.89 & 35508.1	 & 96.49 \\
+417	 & 2125.36	 & 5.78 & 5143.98	 & 13.98 & 36157.6	 & 98.25 \\
+479	 & 2107.13	 & 5.73 & 5152.51	 & 14.00 & 36186.4	 & 98.33 \\
+480	 & 1848.43	 & 5.02 & 5703	 & 15.50 & 37971.3	 & 103.18 \\
+511	 & 2112.99	 & 5.74 & 4479.96	 & 12.17 & 35144	 & 95.50 \\
+512	 & 801.127	 & 2.18 & 4596.26	 & 12.49 & 37362.5	 & 101.53 \\
+639	 & 1881.94	 & 5.11 & 5168.59	 & 14.05 & 36989.1	 & 100.51 \\
+640	 & 815.847	 & 2.22 & 5232.97	 & 14.22 & 38267.8	 & 103.99 \\
+767	 & 1825.75	 & 4.96 & 4701.09	 & 12.77 & 37220.8	 & 101.14 \\
+768	 & 812.933	 & 2.21 & 4826.12	 & 13.11 & 38744	 & 105.28 \\
+769	 & 1825.38	 & 4.96 & 4686.21	 & 12.73 & 37076.1	 & 100.75 \\
    \bottomrule
 \end{tabular}
 \end{center}
--- a/Project1/timing.pdf
+++ b/Project1/timing.pdf
--- a/project_1_maggioni_claudio.zip
+++ b/project_1_maggioni_claudio.zip
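
The report text in this commit describes a blocked multiplication that copies sub-blocks with \texttt{memcpy} and sizes them so that three $s \times s$ blocks of doubles fit in cache. The following is only a minimal sketch of that idea, not the contents of \texttt{dgemm-blocked.c}: the row-major layout, the square $n \times n$ matrices, and all function names (`working_set_bytes`, `copy_in`, `copy_out`, `dgemm_blocked_sketch`) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Bytes needed for the A block, the B block and the C_temp block:
 * cellSize * s^2 * 3, as in the formula from the report. */
static size_t working_set_bytes(size_t s) {
    return sizeof(double) * s * s * 3;
}

/* Copy a rows x cols sub-block of the row-major n x n matrix M, starting at
 * (r0, c0), into the contiguous buffer dst, one memcpy per row. */
static void copy_in(double *dst, const double *M, size_t n,
                    size_t r0, size_t c0, size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; r++)
        memcpy(dst + r * cols, M + (r0 + r) * n + c0, cols * sizeof(double));
}

/* Inverse of copy_in: write the contiguous block src back into M. */
static void copy_out(double *M, const double *src, size_t n,
                     size_t r0, size_t c0, size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; r++)
        memcpy(M + (r0 + r) * n + c0, src + r * cols, cols * sizeof(double));
}

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Hypothetical blocked C += A * B with block size s (assumed layout:
 * row-major). Returns 0 on success, -1 if the buffer allocation fails. */
static int dgemm_blocked_sketch(size_t n, size_t s,
                                const double *A, const double *B, double *C) {
    assert(s > 0);
    double *buf = malloc(working_set_bytes(s));
    if (buf == NULL) return -1;
    double *a = buf, *b = buf + s * s, *c = buf + 2 * s * s;

    for (size_t i = 0; i < n; i += s) {
        size_t mi = min_sz(s, n - i);           /* rows in this block of C */
        for (size_t j = 0; j < n; j += s) {
            size_t mj = min_sz(s, n - j);       /* cols in this block of C */
            copy_in(c, C, n, i, j, mi, mj);
            for (size_t k = 0; k < n; k += s) {
                size_t mk = min_sz(s, n - k);   /* shared dimension */
                copy_in(a, A, n, i, k, mi, mk);
                copy_in(b, B, n, k, j, mk, mj);
                /* Naive multiplication inlined on the contiguous blocks. */
                for (size_t ii = 0; ii < mi; ii++)
                    for (size_t kk = 0; kk < mk; kk++) {
                        double aik = a[ii * mk + kk];
                        for (size_t jj = 0; jj < mj; jj++)
                            c[ii * mj + jj] += aik * b[kk * mj + jj];
                    }
            }
            copy_out(C, c, n, i, j, mi, mj);
        }
    }
    free(buf);
    return 0;
}
```

With these assumptions, `working_set_bytes(32)` evaluates to 24576 and `working_set_bytes(26)` to 16224, matching the two values of the formula that appear in the diff.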