commit d2d8fb38d2 (parent bc6d82f7b0): wip

2 changed files with 43 additions and 13 deletions

Binary file not shown.

@@ -125,7 +125,6 @@ and a total of 20 cores (10 per socket). The cache topology diagram reported by
 \fill (0.5,-2.5) circle [radius=2pt];
 \fill (6,-2.5) circle [radius=2pt];
-\fill (11.5,-2.5) circle [radius=2pt];
 \foreach \r in {0,1,2}{
   \draw (\r + 0.5, -3.5) node {$\r$};
 }
@@ -133,14 +132,11 @@ and a total of 20 cores (10 per socket). The cache topology diagram reported by
   \FPeval{l}{round(\r - 12, 0)}
   \draw (\r + 0.5, -3.5) node {\tiny $2^{20} \l$};
 }
-\foreach \r in {4.5,5.5}{
-  \FPeval{l}{round(\r - 6.5, 0)}
-  \draw (\r + 0.5, -3.5) node {\tiny $2^{19} \l$};
-}
-\draw (7,-3.5) node {\tiny $2^{19}$};
+\draw (5, -3.5) node {\tiny $2^{19} - 1$};
+\draw (6, -3.5) node {\tiny $2^{19}$};
+\draw (7,-3.5) node {\tiny $2^{19} + 1$};
 \draw[->-] (-0.5,-2.5) to[bend left] (0.5,-2.5);
 \draw[->-] (0.5,-2.5) to[bend left] (6,-2.5);
-\draw[->-] (6,-2.5) to[bend left] (11.5,-2.5);
 \end{tikzpicture}
 \end{center}
 \caption{Memory access patterns of \texttt{membench.c} for \texttt{csize =
@@ -158,14 +154,33 @@ highest index used to access the array in the pattern. \texttt{stride}
 determines the difference between array indexes over access iterations, i.e. a
 \texttt{stride} of 1 will access every array index, a \texttt{stride} of 2 will
 skip every other index, a \texttt{stride} of 4 will access one index then skip 3,
-and so on and so forth.
+and so on. The benchmark stops when the index to access is strictly
+greater than \texttt{csize - stride}.
 
 Therefore, for \texttt{csize = 128} and \texttt{stride = 1} the benchmark will
 access all array indexes between 0 and 127 sequentially, and for \texttt{csize =
 $2^{20}$} and \texttt{stride = $2^{19}$} the benchmark will access index 0, then
-index $2^{19}-1$, and finally index $2^{20}-1$. The access patterns for these
+index $2^{19}$. The access patterns for these
 two configurations are shown visually in Figure \ref{fig:access}.
 
+By running \texttt{membench.c} both on my personal laptop and on the
+cluster, the results shown in Figure \ref{fig:mem} are obtained. \textit{csize}
+values are shown as different data series labeled by byte size, and
+\textit{stride} values are mapped on the $x$ axis by their byte-equivalent value
+as well\footnote{Byte values are a factor of 4 greater than the values used in
+the code and in Figure \ref{fig:mem}, because the array
+elements used in the benchmark are 32-bit signed integers, which take up 4 bytes
+each.}. For $\texttt{csize = 128} = 512$ bytes and $\texttt{stride = 1} = 4$
+bytes the mean access time is $0.124$ nanoseconds, while for \texttt{csize =
+$2^{20}$} $= 4$MB and \texttt{stride = $2^{19}$} $= 2$MB the mean access time
+is $1.156$ nanoseconds. The first set of parameters performs well thanks to the
+low \textit{stride} value, thus achieving very good spatial locality and
+maximizing cache hits. However, the second set of parameters achieves good
+performance as well thanks to the few values accessed in each pass, thus
+improving the temporal locality of each address accessed. This observation
+applies to the last few data points in each data series of Figure
+\ref{fig:mem}, i.e. for \textit{stride} values close to \textit{csize}.
+
 \subsection{Analyzing Benchmark Results}
 
 \begin{figure}[t]
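
The stopping rule and stride semantics described in the hunk above boil down to a very small loop. A minimal C sketch of that access pattern, using hypothetical names (x, csize, stride; only csize and stride come from the prose, the rest is made up and is not the actual membench.c code):

    #include <stdint.h>

    /* Visit indexes 0, stride, 2*stride, ... and stop once the next index would
     * be strictly greater than csize - stride, as stated in the report.
     * Sketch of the pattern only; the actual benchmark times these accesses. */
    static void access_pattern(int32_t *x, long csize, long stride)
    {
        for (long index = 0; index <= csize - stride; index += stride)
            x[index] += 1;   /* one read + one write of a 4-byte element */
    }

With csize = 128 and stride = 1 this touches indexes 0 through 127; with csize = $2^{20}$ and stride = $2^{19}$ it touches only indexes 0 and $2^{19}$, consistent with the updated Figure \ref{fig:access}.
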
@@ -196,8 +211,20 @@ In particular, for array memory sizes of 16MB or lower (\texttt{csize} of $4
 read+write time is less than 10 nanoseconds. Temporal locality is worst for
 large sizes and strides, although the largest values of \texttt{stride} for each
 size (like \texttt{csize / 2} and \texttt{csize / 4}) achieve better mean times
-due to the few elements accessed in the pattern (this observation is also valid
-for the largest strides of each size series shown in the graph).
+due to the aforementioned effect of \textit{stride} values close to
+\textit{csize}.
+
+The pattern that can be read from the graphs, especially the one for the
+cluster, shows that the \textit{stride} axis is divided into regions with
+memory access times of similar magnitude. The boundary between the first and the
+second region is a \textit{stride} value of roughly 2KB, while a \textit{stride}
+of 512KB roughly separates the second and the third region. The difference in
+performance between regions and the similarity of performance within regions
+suggest the threshold stride values are related to changes in the use of the
+cache hierarchy. In particular, the first region may correspond to strides for which
+the L1 cache, the fastest non-register memory available, is predominantly used.
+Then the second region might overlap with a more intense use of the L2 cache,
+and likewise the third region with the L3 cache.
 \marginpar[right text]{\color{white}\url{https://youtu.be/JzJlzGaQFoc}}
 
 \section{Optimize Square Matrix-Matrix Multiplication \punkte{60}}
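
One way to sanity-check the L1/L2/L3 reading of the three regions is to compare the threshold strides with the cache sizes the machine itself reports. A small sketch using glibc's non-portable sysconf extensions (not part of the assignment code):

    #include <stdio.h>
    #include <unistd.h>

    /* Print the cache sizes glibc knows about, for comparison with the ~2KB and
     * ~512KB stride thresholds seen in the graphs. The _SC_LEVEL*_CACHE_SIZE
     * names are glibc extensions and may return 0 or -1 on other platforms. */
    int main(void)
    {
        printf("L1d: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
        printf("L2:  %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
        printf("L3:  %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
        return 0;
    }
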
@@ -296,7 +323,7 @@ In order to achieve a correct and fast execution, my implementation:
 discussed above to a final $6.5\%$).
 \end{itemize}
 
-The chosen matrix block size for running the benchmark on the cluster is
+The chosen matrix block size for running the benchmark on the cluster is:
 
 $$s = 26$$
 
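
For reference, the standard blocked (tiled) multiplication that a block size of $s = 26$ refers to has roughly the shape below. This is a sketch of the generic technique with row-major storage assumed; the implementation in the repository may order its loops and store its matrices differently.

    #include <stddef.h>

    #define BLOCK_SIZE 26   /* s = 26, the block size chosen in the report */

    static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

    /* C += A * B for n-by-n matrices of doubles, row-major (an assumption).
     * Each (ii, jj, kk) step works on BLOCK_SIZE x BLOCK_SIZE tiles so that the
     * three active tiles can stay resident in fast memory. */
    void blocked_dgemm(size_t n, const double *A, const double *B, double *C)
    {
        for (size_t ii = 0; ii < n; ii += BLOCK_SIZE)
            for (size_t jj = 0; jj < n; jj += BLOCK_SIZE)
                for (size_t kk = 0; kk < n; kk += BLOCK_SIZE)
                    for (size_t i = ii; i < min_sz(ii + BLOCK_SIZE, n); i++)
                        for (size_t k = kk; k < min_sz(kk + BLOCK_SIZE, n); k++) {
                            double a = A[i * n + k];
                            for (size_t j = jj; j < min_sz(jj + BLOCK_SIZE, n); j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }

Keeping the three tiles touched by the inner loops small enough to fit in L1 is exactly the point made in the next hunk.
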
@@ -314,7 +341,7 @@ bytes size is fairly close to the L1 cache size of the processor used in the
 cluster ($32\mathrm{KB} = 32768$ bytes), which is expected given that the
 algorithm needs to exploit fast memory as much as possible. The reason the
 empirically best value results in a theoretical cache allocation that is only
-half of the complete cache size is due to some real-life factors. For example,
+half of the complete L1 cache size is due to some practical factors. For example,
 cache misses typically result in aligned cache-line loads which may bring in
 unnecessary data.
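
The "half of the L1 cache" figure can be made concrete. Assuming double-precision elements (8 bytes) and three resident $s \times s$ tiles (one each for A, B and C), which is the usual accounting for blocked multiplication and is my assumption rather than a number stated in the report:

$$3 \times s^2 \times 8 = 3 \times 26^2 \times 8 = 16224 \ \mathrm{bytes} \approx \tfrac{1}{2} \times 32768 \ \mathrm{bytes}$$

which matches the observation that the empirically best block uses roughly half of the 32KB L1 cache.
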
@@ -375,6 +402,9 @@ described. However, the blocked implementation achieves less than a tenth of
 the FLOPS compared to the Intel MKL BLAS based one, due to the microarchitecture
 optimizations the latter is able to exploit.
 
+I was unable to run this benchmark suite on my personal machine due to Intel MKL
+installation issues that prevented the code from compiling.
+
 \begin{figure}[t]
 \includegraphics[width=\textwidth]{timing.pdf}
 \caption{GFlop/s per matrix size of the matrix multiplication benchmark for the naive,
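
For completeness, the "Intel MKL BLAS based" variant mentioned above essentially hands the whole product to the vendor library. A minimal sketch of such a call through the standard CBLAS interface (row-major layout and C = A*B assumed; the benchmark's actual driver, sizes and layout may differ):

    #include <mkl.h>   /* Intel MKL's CBLAS header; a generic <cblas.h> exposes the same call */

    /* Compute C = A * B for n-by-n double matrices by delegating to the BLAS. */
    void dgemm_blas(int n, const double *A, const double *B, double *C)
    {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,
                    1.0, A, n,   /* alpha = 1, A with leading dimension n */
                    B, n,
                    0.0, C, n);  /* beta = 0: overwrite C */
    }
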