commit d2d8fb38d2 (parent bc6d82f7b0): wip

2 changed files with 43 additions and 13 deletions

Binary file not shown.

@@ -125,7 +125,6 @@ and a total of 20 cores (10 per socket). The cache topology diagram reported by
 \fill (0.5,-2.5) circle [radius=2pt];
 \fill (6,-2.5) circle [radius=2pt];
-\fill (11.5,-2.5) circle [radius=2pt];
 \foreach \r in {0,1,2}{
   \draw (\r + 0.5, -3.5) node {$\r$};
 }
@@ -133,14 +132,11 @@ and a total of 20 cores (10 per socket). The cache topology diagram reported by
   \FPeval{l}{round(\r - 12, 0)}
   \draw (\r + 0.5, -3.5) node {\tiny $2^{20} \l$};
 }
-\foreach \r in {4.5,5.5}{
-  \FPeval{l}{round(\r - 6.5, 0)}
-  \draw (\r + 0.5, -3.5) node {\tiny $2^{19} \l$};
-}
-\draw (7,-3.5) node {\tiny $2^{19}$};
+\draw (5, -3.5) node {\tiny $2^{19} - 1$};
+\draw (6, -3.5) node {\tiny $2^{19}$};
+\draw (7,-3.5) node {\tiny $2^{19} + 1$};
 \draw[->-] (-0.5,-2.5) to[bend left] (0.5,-2.5);
 \draw[->-] (0.5,-2.5) to[bend left] (6,-2.5);
-\draw[->-] (6,-2.5) to[bend left] (11.5,-2.5);
 \end{tikzpicture}
 \end{center}
 \caption{Memory access patterns of \texttt{membench.c} for \texttt{csize =
@@ -158,14 +154,33 @@ highest index used to access the array in the pattern. \texttt{stride}
 determines the difference between array indexes over access iterations, i.e. a
 \texttt{stride} of 1 will access every array index, a \texttt{stride} of 2 will
 skip every other index, a \texttt{stride} of 4 will access one index then skip 3,
-and so on and so forth.
+and so on. The benchmark stops when the index to access is strictly
+greater than \texttt{csize - stride}.
 
 Therefore, for \texttt{csize = 128} and \texttt{stride = 1} the benchmark will
 access all array indexes between 0 and 127 sequentially, and for \texttt{csize =
 $2^{20}$} and \texttt{stride = $2^{19}$} the benchmark will access index 0, then
-index $2^{19}-1$, and finally index $2^{20}-1$. The access patterns for these
+index $2^{19}$. The access patterns for these
 two configurations are shown visually in Figure \ref{fig:access}.
 
+By running \texttt{membench.c} both on my personal laptop and on the
+cluster, the results shown in Figure \ref{fig:mem} are obtained. \textit{csize}
+values are shown as different data series labeled by byte size, and
+\textit{stride} values are mapped on the $x$ axis by their byte-equivalent value
+as well\footnote{Byte values are a factor of 4 greater than the values used in
+the code and in Figure \ref{fig:mem}, because the array
+elements used in the benchmark are 32-bit signed integers, which take up 4 bytes
+each.}. For $\texttt{csize = 128} = 512$ bytes and $\texttt{stride = 1} = 4$
+bytes the mean access time is $0.124$ nanoseconds, while for \texttt{csize =
+$2^{20}$} $= 4$MB and \texttt{stride = $2^{19}$} $= 2$MB the mean access time
+is $1.156$ nanoseconds. The first set of parameters performs well thanks to the
+low \textit{stride} value, thus achieving very good spatial locality and
+maximizing cache hits. However, the second set of parameters achieves good
+performance as well thanks to the few values accessed in each pass, thus
+improving the temporal locality of each address accessed. This observation
+applies to the last few data points in each data series of Figure
+\ref{fig:mem}, i.e. for \textit{stride} values close to \textit{csize}.
+
 \subsection{Analyzing Benchmark Results}
 
 \begin{figure}[t]
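
The stopping rule and stride semantics described in the hunk above boil down to a very small loop. A minimal C sketch of that access pattern, using hypothetical names (x, csize, stride; only csize and stride come from the prose, the rest is made up and is not the actual membench.c code):

    #include <stdint.h>

    /* Visit indexes 0, stride, 2*stride, ... and stop once the next index would
     * be strictly greater than csize - stride, as stated in the report.
     * Sketch of the pattern only; the actual benchmark times these accesses. */
    static void access_pattern(int32_t *x, long csize, long stride)
    {
        for (long index = 0; index <= csize - stride; index += stride)
            x[index] += 1;   /* one read + one write of a 4-byte element */
    }

With csize = 128 and stride = 1 this touches indexes 0 through 127; with csize = $2^{20}$ and stride = $2^{19}$ it touches only indexes 0 and $2^{19}$, consistent with the updated Figure \ref{fig:access}.
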
@@ -196,8 +211,20 @@ In particular, for array memory sizes of 16MB or lower (\texttt{csize} of $4
 read+write time is less than 10 nanoseconds. Temporal locality is worst for
 large sizes and strides, although the largest values of \texttt{stride} for each
 size (like \texttt{csize / 2} and \texttt{csize / 4}) achieve better mean times
-due to the few elements accessed in the pattern (this observation is also valid
-for the largest strides of each size series shown in the graph).
+due to the aforementioned effect of \textit{stride} values close to
+\textit{csize}.
+
+The pattern that can be read from the graphs, especially the one for the
+cluster, shows that the \textit{stride} axis is divided into regions with
+memory access times of similar magnitude. The boundary between the first and the
+second region is a \textit{stride} value of roughly 2KB, while a \textit{stride}
+of 512KB roughly separates the second and the third region. The difference in
+performance between regions and the similarity of performance within regions
+suggest the threshold stride values are related to changes in the use of the
+cache hierarchy. In particular, the first region may correspond to strides for which
+the L1 cache, the fastest non-register memory available, is predominantly used.
+Then the second region might overlap with a more intense use of the L2 cache,
+and likewise the third region with the L3 cache.
 \marginpar[right text]{\color{white}\url{https://youtu.be/JzJlzGaQFoc}}
 
 \section{Optimize Square Matrix-Matrix Multiplication \punkte{60}}
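
One way to sanity-check the L1/L2/L3 reading of the three regions is to compare the threshold strides with the cache sizes the machine itself reports. A small sketch using glibc's non-portable sysconf extensions (not part of the assignment code):

    #include <stdio.h>
    #include <unistd.h>

    /* Print the cache sizes glibc knows about, for comparison with the ~2KB and
     * ~512KB stride thresholds seen in the graphs. The _SC_LEVEL*_CACHE_SIZE
     * names are glibc extensions and may return 0 or -1 on other platforms. */
    int main(void)
    {
        printf("L1d: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
        printf("L2:  %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
        printf("L3:  %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
        return 0;
    }
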
@@ -296,7 +323,7 @@ In order to achieve a correct and fast execution, my implementation:
 discussed above to a final $6.5\%$).
 \end{itemize}
 
-The chosen matrix block size for running the benchmark on the cluster is
+The chosen matrix block size for running the benchmark on the cluster is:
 
 $$s = 26$$
 
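
For reference, the standard blocked (tiled) multiplication that a block size of $s = 26$ refers to has roughly the shape below. This is a sketch of the generic technique with row-major storage assumed; the implementation in the repository may order its loops and store its matrices differently.

    #include <stddef.h>

    #define BLOCK_SIZE 26   /* s = 26, the block size chosen in the report */

    static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

    /* C += A * B for n-by-n matrices of doubles, row-major (an assumption).
     * Each (ii, jj, kk) step works on BLOCK_SIZE x BLOCK_SIZE tiles so that the
     * three active tiles can stay resident in fast memory. */
    void blocked_dgemm(size_t n, const double *A, const double *B, double *C)
    {
        for (size_t ii = 0; ii < n; ii += BLOCK_SIZE)
            for (size_t jj = 0; jj < n; jj += BLOCK_SIZE)
                for (size_t kk = 0; kk < n; kk += BLOCK_SIZE)
                    for (size_t i = ii; i < min_sz(ii + BLOCK_SIZE, n); i++)
                        for (size_t k = kk; k < min_sz(kk + BLOCK_SIZE, n); k++) {
                            double a = A[i * n + k];
                            for (size_t j = jj; j < min_sz(jj + BLOCK_SIZE, n); j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }

Keeping the three tiles touched by the inner loops small enough to fit in L1 is exactly the point made in the next hunk.
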
@@ -314,7 +341,7 @@ bytes size is fairly close to the L1 cache size of the processor used in the
 cluster ($32\mathrm{KB} = 32768$ bytes), which is expected given that the
 algorithm needs to exploit fast memory as much as possible. The reason the
 empirically best value results in a theoretical cache allocation that is only
-half of the complete cache size is due to some real-life factors. For example,
+half of the complete L1 cache size is due to some practical factors. For example,
 cache misses typically result in aligned cache-line loads which may bring in
 unnecessary data.
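
The "half of the L1 cache" figure can be made concrete. Assuming double-precision elements (8 bytes) and three resident $s \times s$ tiles (one each for A, B and C), which is the usual accounting for blocked multiplication and is my assumption rather than a number stated in the report:

$$3 \times s^2 \times 8 = 3 \times 26^2 \times 8 = 16224 \ \mathrm{bytes} \approx \tfrac{1}{2} \times 32768 \ \mathrm{bytes}$$

which matches the observation that the empirically best block uses roughly half of the 32KB L1 cache.
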
@@ -375,6 +402,9 @@ described. However, the blocked implementation achieves less than a tenth of
 the FLOPS compared to the Intel MKL BLAS based one, due to the microarchitecture
 optimizations the latter is able to exploit.
 
+I was unable to run this benchmark suite on my personal machine due to Intel MKL
+installation issues that prevented the code from compiling.
+
 \begin{figure}[t]
 \includegraphics[width=\textwidth]{timing.pdf}
 \caption{GFlop/s per matrix size of the matrix multiplication benchmark for the naive,
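
For completeness, the "Intel MKL BLAS based" variant mentioned above essentially hands the whole product to the vendor library. A minimal sketch of such a call through the standard CBLAS interface (row-major layout and C = A*B assumed; the benchmark's actual driver, sizes and layout may differ):

    #include <mkl.h>   /* Intel MKL's CBLAS header; a generic <cblas.h> exposes the same call */

    /* Compute C = A * B for n-by-n double matrices by delegating to the BLAS. */
    void dgemm_blas(int n, const double *A, const double *B, double *C)
    {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,
                    1.0, A, n,   /* alpha = 1, A with leading dimension n */
                    B, n,
                    0.0, C, n);  /* beta = 0: overwrite C */
    }
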