wip
This commit is contained in:
parent bc6d82f7b0
commit d2d8fb38d2
2 changed files with 43 additions and 13 deletions
Binary file not shown.
@@ -125,7 +125,6 @@ and a total of 20 cores (10 per socket). The cache topology diagram reported by
  \fill (0.5,-2.5) circle [radius=2pt];
  \fill (6,-2.5) circle [radius=2pt];
  \fill (11.5,-2.5) circle [radius=2pt];
  \foreach \r in {0,1,2}{
    \draw (\r + 0.5, -3.5) node {$\r$};
  }
@@ -133,14 +132,11 @@ and a total of 20 cores (10 per socket). The cache topology diagram reported by
    \FPeval{l}{round(\r - 12, 0)}
    \draw (\r + 0.5, -3.5) node {\tiny $2^{20} \l$};
  }
  \foreach \r in {4.5,5.5}{
    \FPeval{l}{round(\r - 6.5, 0)}
    \draw (\r + 0.5, -3.5) node {\tiny $2^{19} \l$};
  }
  \draw (7,-3.5) node {\tiny $2^{19}$};
  \draw (5, -3.5) node {\tiny $2^{19} - 1$};
  \draw (6, -3.5) node {\tiny $2^{19}$};
  \draw (7,-3.5) node {\tiny $2^{19} + 1$};
  \draw[->-] (-0.5,-2.5) to[bend left] (0.5,-2.5);
  \draw[->-] (0.5,-2.5) to[bend left] (6,-2.5);
  \draw[->-] (6,-2.5) to[bend left] (11.5,-2.5);
\end{tikzpicture}
\end{center}
\caption{Memory access patterns of \texttt{membench.c} for \texttt{csize =
@@ -158,14 +154,33 @@ highest index used to access the array in the pattern. \texttt{stride}
determines the difference between array indexes over access iterations, i.e.\ a
\texttt{stride} of 1 will access every array index, a \texttt{stride} of 2 will
skip every other index, a \texttt{stride} of 4 will access one index and then
skip three, and so on. The benchmark stops when the index to access is strictly
greater than \texttt{csize - stride}.
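
As a point of reference, the access loop implied by this description can be
sketched as follows. This is a minimal approximation rather than the actual
\texttt{membench.c} source: the real benchmark also times the accesses and
repeats the sweep, and the \texttt{CSIZE}/\texttt{STRIDE} values below are
placeholders.

\begin{verbatim}
#include <stdio.h>

#define CSIZE  128  /* placeholder values; the benchmark sweeps many pairs */
#define STRIDE 4

static int x[CSIZE]; /* 32-bit signed integers, 4 bytes each */

int main(void) {
    /* Advance by STRIDE; stop once the next index would be strictly
       greater than CSIZE - STRIDE. */
    int last = 0;
    for (int index = 0; index <= CSIZE - STRIDE; index += STRIDE) {
        x[index]++; /* one read and one write per visited element */
        last = index;
    }
    printf("last index accessed: %d\n", last);
    return 0;
}
\end{verbatim}

With \texttt{CSIZE = 128} and \texttt{STRIDE = 4} this touches indices $0, 4,
8, \dots, 124$, matching the skipping behaviour described above.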
Therefore, for \texttt{csize = 128} and \texttt{stride = 1} the benchmark will
access all indexes between 0 and 127 sequentially, while for \texttt{csize =
$2^{20}$} and \texttt{stride = $2^{19}$} it will access index 0 and then
index $2^{19}-1$. The access patterns for these two configurations are shown
visually in Figure \ref{fig:access}.

Running \texttt{membench.c} both on my personal laptop and on the cluster
produces the results shown in Figure \ref{fig:mem}. \textit{csize} values are
shown as different data series labeled by byte size, and \textit{stride} values
are mapped on the $x$ axis by their byte-equivalent value as
well\footnote{Byte values are a factor of 4 greater than the values used in the
code and in Figure \ref{fig:mem}. This is because the array elements used in
the benchmark are 32-bit signed integers, which take up 4 bytes each.}. For
\texttt{csize = 128} (512 bytes) and \texttt{stride = 1} (4 bytes) the mean
access time is $0.124$ nanoseconds, while for \texttt{csize = $2^{20}$} (4MB)
and \texttt{stride = $2^{19}$} (2MB) the mean access time is $1.156$
nanoseconds. The first set of parameters performs well thanks to the low
\textit{stride} value, thus achieving very good spatial locality and maximizing
cache hits. However, the second set of parameters achieves good performance as
well thanks to the few values accessed in each pass, which improves the
temporal locality of each accessed address. This observation applies to the
last few data points in each data series of Figure \ref{fig:mem}, i.e.\ to
\textit{stride} values close to \textit{csize}.

\subsection{Analyzing Benchmark Results}

\begin{figure}[t]
@@ -196,8 +211,20 @@ In particular, for array memory sizes of 16MB or lower (\texttt{csize} of $4
read+write time is less than 10 nanoseconds. Temporal locality is worst for
large sizes and strides, although the largest values of \texttt{stride} for each
size (like \texttt{csize / 2} and \texttt{csize / 4}) achieve better mean times
due to the aforementioned effect of \textit{stride} values being close to
\textit{csize}.

The pattern that can be read from the graphs, especially the one for the
cluster, shows that the \textit{stride} axis is divided into regions with
memory access times of similar magnitude. The boundary between the first and
the second region is a \textit{stride} value of roughly 2KB, while a
\textit{stride} of roughly 512KB separates the second and the third region. The
difference in performance between regions and the similarity of performance
within regions suggest that the threshold stride values are related to changes
in the use of the cache hierarchy. In particular, the first region may
correspond to access patterns where the L1 cache, the fastest non-register
memory available, is predominantly used. The second region might then
correspond to a more intense use of the L2 cache, and likewise the third
region to the L3 cache.
\marginpar[right text]{\color{white}\url{https://youtu.be/JzJlzGaQFoc}}
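
One way to check this interpretation against the hardware (assuming the cluster
nodes run Linux with glibc, which exposes cache sizes through \texttt{sysconf})
is to query the cache sizes directly and compare them with the threshold
strides observed above; a minimal sketch:

\begin{verbatim}
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* glibc-specific queries; each returns the cache size in bytes,
       or 0/-1 if the value is not exposed on this system. */
    printf("L1d cache: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L2 cache:  %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 cache:  %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}
\end{verbatim}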
\section{Optimize Square Matrix-Matrix Multiplication \punkte{60}}
@@ -296,7 +323,7 @@ In order to achieve a correct and fast execution, my implementation:
discussed above to a final $6.5\%$).
\end{itemize}

The chosen matrix block size for running the benchmark on the cluster is:

$$s = 26$$
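
For context, the loop structure of a blocked multiplication with block size $s$
is sketched below. This is an illustrative row-major version, not the submitted
implementation, and the name \texttt{blocked\_dgemm} is made up for the
example.

\begin{verbatim}
#include <stddef.h>

#define S 26 /* block size s discussed above */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B for n-by-n row-major matrices of doubles. */
void blocked_dgemm(size_t n, const double *A, const double *B, double *C) {
    for (size_t ii = 0; ii < n; ii += S)
        for (size_t jj = 0; jj < n; jj += S)
            for (size_t kk = 0; kk < n; kk += S)
                /* The three s-by-s blocks touched here are what should
                   fit in fast memory at the same time. */
                for (size_t i = ii; i < min_sz(ii + S, n); ++i)
                    for (size_t k = kk; k < min_sz(kk + S, n); ++k) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < min_sz(jj + S, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
\end{verbatim}

Each innermost pass works on one $s \times s$ block of each operand, which is
what ties the choice of $s$ to the cache size discussed next.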
@@ -314,7 +341,7 @@ bytes size is fairly close to the L1 cache size of the processor used in the
cluster ($32\,\mathrm{KB} = 32768$ bytes), which is expected given that the
algorithm needs to exploit fast memory as much as possible. The empirically
best value results in a theoretical cache allocation that is only about half of
the complete L1 cache size because of some real-life factors. For example,
cache misses typically result in aligned cache-line loads which may bring in
unnecessary data.
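
As a rough sanity check, assuming one $s \times s$ block of each of $A$, $B$
and $C$ is kept in fast memory at the same time, with 8-byte double-precision
elements (the exact accounting is done earlier in the report):

$$3 \cdot s^2 \cdot 8 \,\mathrm{bytes} = 3 \cdot 26^2 \cdot 8 \,\mathrm{bytes} = 16224 \,\mathrm{bytes} \approx 0.5 \cdot 32768 \,\mathrm{bytes}$$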
@@ -375,6 +402,9 @@ described. However, the blocked implementation achieves less than a tenth of
the FLOPS of the Intel MKL BLAS based one, due to the microarchitectural
optimizations the latter is able to exploit.

I was unable to run this benchmark suite on my personal machine due to Intel
MKL installation issues that prevented the code from compiling.
\begin{figure}[t]
\includegraphics[width=\textwidth]{timing.pdf}
\caption{GFlop/s per matrix size of the matrix multiplication benchmark for the naive,