Claudio Maggioni 2022-10-11 16:42:49 +02:00
parent bc6d82f7b0
commit d2d8fb38d2
2 changed files with 43 additions and 13 deletions


@@ -125,7 +125,6 @@ and a total of 20 cores (10 per socket). The cache topology diagram reported by
 \fill (0.5,-2.5) circle [radius=2pt];
 \fill (6,-2.5) circle [radius=2pt];
-\fill (11.5,-2.5) circle [radius=2pt];
 \foreach \r in {0,1,2}{
 \draw (\r + 0.5, -3.5) node {$\r$};
 }
@@ -133,14 +132,11 @@ and a total of 20 cores (10 per socket). The cache topology diagram reported by
 \FPeval{l}{round(\r - 12, 0)}
 \draw (\r + 0.5, -3.5) node {\tiny $2^{20} \l$};
 }
-\foreach \r in {4.5,5.5}{
-\FPeval{l}{round(\r - 6.5, 0)}
-\draw (\r + 0.5, -3.5) node {\tiny $2^{19} \l$};
-}
-\draw (7,-3.5) node {\tiny $2^{19}$};
+\draw (5, -3.5) node {\tiny $2^{19} - 1$};
+\draw (6, -3.5) node {\tiny $2^{19}$};
+\draw (7,-3.5) node {\tiny $2^{19} + 1$};
 \draw[->-] (-0.5,-2.5) to[bend left] (0.5,-2.5);
 \draw[->-] (0.5,-2.5) to[bend left] (6,-2.5);
-\draw[->-] (6,-2.5) to[bend left] (11.5,-2.5);
 \end{tikzpicture}
 \end{center}
 \caption{Memory access patterns of \texttt{membench.c} for \texttt{csize =
@@ -158,14 +154,33 @@ highest index used to access the array in the pattern. \texttt{stride}
 determines the difference between array indexes over access iterations, i.e. a
 \texttt{stride} of 1 will access every array index, a \texttt{stride} of 2 will
 skip every other index, a \texttt{stride} of 4 will access one index then skip 3
-and so on and so forth.
+and so on. The benchmark stops when the index to access is strictly greater
+than \texttt{csize - stride}.
 
 Therefore, for \texttt{csize = 128} and \texttt{stride = 1} the benchmark will
 access all indexes between 0 and 127 sequentially, and for \texttt{csize =
 $2^{20}$} and \texttt{stride = $2^{19}$} the benchmark will access index 0, then
-index $2^{19}-1$, and finally index $2^{20}-1$. The access patterns for these
+index $2^{19}$. The access patterns for these
 two configurations are shown visually in Figure \ref{fig:access}.
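For concreteness, the loop below is a minimal sketch of the access pattern described above; it is not the actual \texttt{membench.c} source, and the function and variable names are illustrative only:

    #include <stdint.h>

    /* Sketch of the benchmark's index pattern: indexes 0, stride, 2*stride, ...
     * are touched, and the loop stops once the next index would be strictly
     * greater than csize - stride. */
    void touch_pattern(int32_t *a, long csize, long stride)
    {
        for (long index = 0; index <= csize - stride; index += stride)
            a[index] += 1; /* one read and one write per visited element */
    }

    /* csize = 128,  stride = 1    -> indexes 0, 1, 2, ..., 127
     * csize = 2^20, stride = 2^19 -> indexes 0 and 2^19           */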
+Running \texttt{membench.c} both on my personal laptop and on the cluster
+yields the results shown in Figure \ref{fig:mem}. \textit{csize} values are
+shown as separate data series labelled by their byte size, and \textit{stride}
+values are likewise mapped on the $x$ axis by their byte-equivalent
+value\footnote{Byte values are a factor of 4 greater than the values used in
+the code and in Figure \ref{fig:mem}, because the array elements used in the
+benchmark are 32-bit signed integers, which take up 4 bytes each.}.
+For $\texttt{csize} = 128$ ($512$ bytes) and $\texttt{stride} = 1$ ($4$ bytes)
+the mean access time is $0.124$ nanoseconds, while for $\texttt{csize} =
+2^{20}$ ($4$MB) and $\texttt{stride} = 2^{19}$ ($2$MB) the mean access time is
+$1.156$ nanoseconds. The first set of parameters performs well thanks to the
+low \textit{stride} value, which yields very good spatial locality and
+maximizes cache hits. The second set of parameters achieves good performance
+as well because only a few values are accessed in each pass, which improves
+the temporal locality of each accessed address. This observation applies to
+the last few data points in each data series of Figure \ref{fig:mem}, i.e. to
+\textit{stride} values close to \textit{csize}.
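Spelling out the factor-of-4 conversion from the footnote (a worked example only, assuming the 32-bit elements stated there):

    % element counts converted to byte sizes (4 bytes per element)
    $$ 128 \cdot 4\,\mathrm{B} = 512\,\mathrm{B} \qquad
       2^{20} \cdot 4\,\mathrm{B} = 4\,\mathrm{MB} \qquad
       2^{19} \cdot 4\,\mathrm{B} = 2\,\mathrm{MB} $$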
 \subsection{Analyzing Benchmark Results}
 \begin{figure}[t]
@@ -196,8 +211,20 @@ In particular, for array memory sizes of 16MB or lower (\texttt{csize} of $4
 read+write time is less than 10 nanoseconds. Temporal locality is worst for
 large sizes and strides, although the largest values of \texttt{stride} for each
 size (like \texttt{csize / 2} and \texttt{csize / 4}) achieve better mean times
-due to the few elements accessed in the pattern (this observation is also valid
-for the largest strides of each size series shown in the graph).
+due to the aforementioned effect of having \textit{stride} values close to
+\textit{csize}.
 
+The pattern that emerges from the graphs, especially the one for the cluster,
+shows that the \textit{stride} axis is divided into regions with memory access
+times of similar magnitude. The boundary between the first and the second
+region is a \textit{stride} value of roughly 2KB, while a \textit{stride} of
+512KB roughly separates the second and the third region. The difference in
+performance between regions and the similarity of performance within regions
+suggest that the threshold stride values are related to changes in the use of
+the cache hierarchy. In particular, the first region may correspond to strides
+where the L1 cache, the fastest non-register memory available, is predominantly
+used. The second region might then overlap with heavier use of the L2 cache,
+and likewise the third region with the L3 cache.
 \marginpar[right text]{\color{white}\url{https://youtu.be/JzJlzGaQFoc}}
 \section{Optimize Square Matrix-Matrix Multiplication \punkte{60}}
@@ -296,7 +323,7 @@ In order to achieve a correct and fast execution, my implementation:
 discussed above to a final $6.5\%$).
 \end{itemize}
-The chosen matrix block size for running the benchmark on the cluster is
+The chosen matrix block size for running the benchmark on the cluster is:
 $$s = 26$$
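As a point of reference, the following is a minimal sketch of a blocked square \texttt{dgemm} of the kind being tuned here, with the block size set to $s = 26$. It assumes column-major storage, omits the buffering and loop-unrolling refinements described above, and all names are illustrative rather than taken from the actual implementation:

    #define BLOCK_SIZE 26 /* the empirically chosen block size s */

    /* Illustrative blocked C += A * B for n x n column-major matrices,
     * where element (i, j) of a matrix M is stored at M[i + j * n]. */
    static void do_block(int n, int si, int sj, int sk,
                         const double *A, const double *B, double *C)
    {
        int imax = si + BLOCK_SIZE < n ? si + BLOCK_SIZE : n;
        int jmax = sj + BLOCK_SIZE < n ? sj + BLOCK_SIZE : n;
        int kmax = sk + BLOCK_SIZE < n ? sk + BLOCK_SIZE : n;

        for (int j = sj; j < jmax; ++j)
            for (int k = sk; k < kmax; ++k) {
                double b = B[k + j * n];
                for (int i = si; i < imax; ++i)
                    C[i + j * n] += A[i + k * n] * b;
            }
    }

    void square_dgemm_blocked(int n, const double *A, const double *B, double *C)
    {
        /* Iterate over s x s blocks so the three active blocks of A, B and C
         * can stay resident in fast memory (ideally the L1 cache). */
        for (int sj = 0; sj < n; sj += BLOCK_SIZE)
            for (int sk = 0; sk < n; sk += BLOCK_SIZE)
                for (int si = 0; si < n; si += BLOCK_SIZE)
                    do_block(n, si, sj, sk, A, B, C);
    }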
@@ -314,7 +341,7 @@ bytes size is fairly close to the L1 cache size of the processor used in the
 cluster ($32\mathrm{KB} = 32768$ bytes), which is expected given that the
 algorithm needs to exploit fast memory as much as possible. The reason the
 empirically best value results in a theoretical cache allocation that is only
-half of the complete cache size is due to some real-life factors. For example,
+half of the complete L1 cache size is due to some real-life factors. For example,
 cache misses typically result in aligned cache-line loads which may bring in
 unnecessary data.
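Assuming the kernel keeps three $s \times s$ blocks of 8-byte doubles in cache at once (one block each of $A$, $B$ and $C$), the footprint for $s = 26$ indeed works out to roughly half of the 32 KB L1 data cache:

    % three resident blocks of 26 x 26 doubles
    $$ 3 \cdot 26^2 \cdot 8\,\mathrm{B} = 16224\,\mathrm{B}
       \approx 16\,\mathrm{KB} \approx 0.5 \cdot 32768\,\mathrm{B} $$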
@@ -375,6 +402,9 @@ described. However, the blocked implementation achieves less than a tenth of
 the FLOPS of the Intel MKL BLAS-based one due to the microarchitecture
 optimizations the latter is able to exploit.
 
+I was unable to run this benchmark suite on my personal machine due to Intel
+MKL installation issues that prevented the code from compiling.
+
 \begin{figure}[t]
 \includegraphics[width=\textwidth]{timing.pdf}
 \caption{GFlop/s per matrix size of the matrix multiplication benchmark for the naive,