hw1: added figure for stride/csize graph

This commit is contained in:
Claudio Maggioni 2022-10-05 20:04:15 +02:00
parent bf57b5b6d6
commit 7e716e1db2
2 changed files with 135 additions and 13 deletions


@@ -6,6 +6,10 @@
\usepackage{graphicx}
\usepackage{tikz}
\usepackage{multirow}
\usepackage{makecell}
\usepackage{booktabs}
\usepackage[nomessages]{fp}
\usetikzlibrary{decorations.markings}
\begin{document}
@@ -13,20 +17,22 @@
\setduedate{12.10.2022 (midnight)}
\serieheader{High-Performance Computing Lab}{2022}{Student: Claudio
Maggioni}{Discussed with: --}{Solution for Project 1}{}
\newline
%\assignmentpolicy
%In this project you will practice memory access optimization,
%performance-oriented programming, and OpenMP parallelization on the ICS Cluster.
\tableofcontents
\section{Explaining Memory Hierarchies \punkte{25}}
\subsection{Memory Hierarchy Parameters of the Cluster}
By invoking \texttt{likwid-topology} for the cache topology and \texttt{free -g}
for the amount of primary memory, the following memory hierarchy parameters are
found:
\begin{center}
\begin{tabular}{llll}
@@ -41,10 +47,11 @@ All values are reported using base 2 IEC byte units. The cluster has 2 sockets
and a total of 20 cores (10 per socket). The cache topology diagram reported by
\texttt{likwid-topology -g} is shown in Figure \ref{fig:topo}.
\pagebreak[4]
\begin{figure}[t]
\begin{center}
Socket 0:\vspace{0.3cm}
\begin{tabular}{|l|l|l|l|l|l|l|l|l|l|}
\hline 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\\hline
32 kB & 32 kB & 32 kB & 32 kB & 32 kB & 32 kB & 32 kB & 32 kB & 32
@@ -70,6 +77,75 @@ and a total of 20 cores (10 per socket). The cache topology diagram reported by
\subsection{Memory Access Pattern of \texttt{membench.c}}
\begin{figure}[t]
\begin{center}
\begin{tikzpicture}
\tikzset{->-/.style={decoration={
markings,
mark=at position .75 with {\arrow{>}}},postaction={decorate}}};
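% upper row: csize = 128, stride = 1 (cells 0--4 and 123--127 of the array)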
\draw (0,0) grid (5,1);
\draw [dashed] (5,0) -- (5.5,0);
\draw [dashed] (5,1) -- (5.5,1);
\draw [dashed] (6.5,0) -- (7,0);
\draw [dashed] (6.5,1) -- (7,1);
\draw (7,0) grid (12,1);
\foreach \r in {0,1,...,4}{
\fill (\r + 0.5,0.5) circle [radius=2pt];
\draw[->-] (\r-0.5,0.5) to[bend left] (\r+0.5,0.5);
\draw (\r + 0.5, -0.5) node {$\r$};
}
\draw[->-] (4.5,0.5) to[bend left] (5.5,0.5);
\foreach \r in {7,8,...,11}{
\fill (\r + 0.5,0.5) circle [radius=2pt];
\FPeval{l}{round(\r + 128 - 12, 0)}
\draw[->-] (\r-0.5,0.5) to[bend left] (\r+0.5,0.5);
\draw (\r + 0.5, -0.5) node {$\l$};
}
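% lower row: csize = 2^20, stride = 2^10 (cells near indices 0, 2^10 and 2^20)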
\draw (0,-3) grid (3,-2);
\draw [dashed] (3,-2) -- (3.5,-2);
\draw [dashed] (3,-3) -- (3.5,-3);
\draw [dashed] (4,-2) -- (4.5,-2);
\draw [dashed] (4,-3) -- (4.5,-3);
\draw (4.5,-2) -- (7.5,-2);
\draw (4.5,-3) -- (7.5,-3);
\foreach \r in {4.5,5.5,...,7.5}{
\draw (\r,-3) -- (\r,-2);
}
\draw [dashed] (7.5,-2) -- (8,-2);
\draw [dashed] (7.5,-3) -- (8,-3);
\draw [dashed] (8.5,-2) -- (9,-2);
\draw [dashed] (8.5,-3) -- (9,-3);
\draw (9,-3) grid (12,-2);
\fill (0.5,-2.5) circle [radius=2pt];
\fill (6,-2.5) circle [radius=2pt];
\fill (11.5,-2.5) circle [radius=2pt];
\foreach \r in {0,1,2}{
\draw (\r + 0.5, -3.5) node {$\r$};
}
\foreach \r in {9,10,11}{
\FPeval{l}{round(\r - 12, 0)}
\draw (\r + 0.5, -3.5) node {\tiny $2^{20} \l$};
}
\foreach \r in {4.5,5.5}{
\FPeval{l}{round(\r - 6.5, 0)}
\draw (\r + 0.5, -3.5) node {\tiny $2^{10} \l$};
}
\draw (7,-3.5) node {\tiny $2^{10}$};
\draw[->-] (-0.5,-2.5) to[bend left] (0.5,-2.5);
\draw[->-] (0.5,-2.5) to[bend left] (6,-2.5);
\draw[->-] (6,-2.5) to[bend left] (11.5,-2.5);
\end{tikzpicture}
\end{center}
\caption{Memory access patterns of \texttt{membench.c} for \texttt{csize =
128} and \texttt{stride = 1} (above) and for \texttt{csize = $2^{20}$} and
\texttt{stride = $2^{10}$} (below)}
\label{fig:access}
\end{figure}
The benchmark \texttt{membench.c} measures the average time of repeated read and
write operations across a set of indices of a stack-allocated array of 32-bit
signed integers. The indices vary according to the access pattern used, which in
@@ -84,7 +160,8 @@ and so on and so forth.
Therefore, for \texttt{csize = 128} and \texttt{stride = 1} the benchmark will
access all indices between 0 and 127 sequentially, and for \texttt{csize =
$2^{20}$} and \texttt{stride = $2^{10}$} the benchmark will access index 0, then
index $2^{10}-1$, and finally index $2^{20}-1$. The access patterns for these
two configurations are shown visually in Figure \ref{fig:access}.
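For reference, the timed kernel can be summarised as a strided read-modify-write
sweep over the array. The following is only a minimal sketch: the exact loop
bounds and update used in the provided \texttt{membench.c} may differ.
\begin{verbatim}
#include <stdint.h>

/* Sketch of a strided sweep of the kind membench.c times; the handout's
 * exact bounds and increment may differ. */
void strided_sweep(int32_t *x, int csize, int stride) {
    /* one read and one write of x[index] per visited index */
    for (int index = 0; index < csize - stride + 1; index += stride)
        x[index]++;
}
\end{verbatim}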
\subsection{Analyzing Benchmark Results}
@@ -212,8 +289,9 @@ implementing the pseudocode, my implementation:
\end{figure}
The results of the matrix multiplication benchmark for the naive, blocked, and
BLAS implementations are shown in Figure \ref{fig:bench} as a graph of GFlop/s
over matrix size and in Figure \ref{fig:benchtab} as a table. The blocked
implementation achieves on average 50\% more FLOPS than the naive
implementation thanks to the spatial and temporal cache locality optimisations
described above. However, the blocked implementation achieves less than a tenth
of the FLOPS of the Intel MKL BLAS-based one due to the microarchitecture
@@ -221,9 +299,53 @@ optimization the latter one is able to exploit.
\begin{figure}[t]
\includegraphics[width=\textwidth]{timing.pdf}
\caption{GFlop/s per matrix size of the matrix multiplication benchmark for the
naive, blocked, and BLAS implementations. The Y-axis is log-scaled.}
\label{fig:bench}
\end{figure}
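To make the locality argument concrete, the blocking idea can be sketched as
follows. This is a hypothetical illustration: the block size \texttt{BLOCK}, the
loop order, and the handling of partial tiles in the submitted implementation
may differ.
\begin{verbatim}
/* Illustrative sketch of loop blocking for C += A * B with n x n
 * row-major matrices; BLOCK is an assumed tuning parameter. */
#define BLOCK 64

void dgemm_blocked(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int jj = 0; jj < n; jj += BLOCK)
            for (int kk = 0; kk < n; kk += BLOCK)
                /* update one BLOCK x BLOCK tile of C while the tiles of
                 * A and B it needs stay resident in cache */
                for (int i = ii; i < ii + BLOCK && i < n; i++)
                    for (int k = kk; k < kk + BLOCK && k < n; k++) {
                        double a = A[i * n + k];
                        for (int j = jj; j < jj + BLOCK && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
\end{verbatim}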
\begin{figure}[t]
\begin{center}
\begin{tabular}{c|cc|cc|cc}
\toprule
& \multicolumn{2}{c|}{Naive} & \multicolumn{2}{c|}{Blocked} &
\multicolumn{2}{c}{BLAS} \\
\makecell{Size} & \makecell{MFLOPS} &
\makecell{\% CPU} & \makecell{MFLOPS} &
\makecell{\% CPU} & \makecell{MFLOPS} &
\makecell{\% CPU} \\
\midrule
31 & 2393.33 & 6.50 & 2112.63 & 5.74 & 23449.20 & 63.72 \\
32 & 2400.13 & 6.52 & 2187.44 & 5.94 & 28198.90 & 76.63 \\
96 & 1998.74 & 5.43 & 2325.39 & 6.32 & 32542.30 & 88.43 \\
97 & 1996.01 & 5.42 & 2322.81 & 6.31 & 29801.30 & 80.98 \\
127 & 1923.81 & 5.23 & 2330.30 & 6.33 & 28557.80 & 77.60 \\
128 & 1731.98 & 4.71 & 2282.93 & 6.20 & 32643.30 & 88.70 \\
129 & 1903.31 & 5.17 & 2334.25 & 6.34 & 31198.20 & 84.78 \\
191 & 1736.78 & 4.72 & 2345.91 & 6.37 & 32247.30 & 87.63 \\
192 & 1694.44 & 4.60 & 2345.38 & 6.37 & 32830.60 & 89.21 \\
229 & 1715.10 & 4.66 & 2351.01 & 6.39 & 34360.90 & 93.37 \\
255 & 1720.39 & 4.67 & 2335.21 & 6.35 & 33477.70 & 90.97 \\
256 & 777.65 & 2.11 & 2306.48 & 6.27 & 33473.90 & 90.96 \\
257 & 1729.27 & 4.70 & 2330.68 & 6.33 & 33686.50 & 91.54 \\
319 & 1704.80 & 4.63 & 2360.03 & 6.41 & 34335.20 & 93.30 \\
320 & 1414.84 & 3.84 & 2364.53 & 6.43 & 36438.10 & 99.02 \\
321 & 1741.30 & 4.73 & 2366.38 & 6.43 & 35433.70 & 96.29 \\
417 & 1733.00 & 4.71 & 2378.34 & 6.46 & 36133.70 & 98.19 \\
479 & 1731.17 & 4.70 & 2233.05 & 6.07 & 32951.40 & 89.54 \\
480 & 1678.77 & 4.56 & 2187.87 & 5.95 & 37260.00 & 101.25 \\
511 & 1733.60 & 4.71 & 2224.61 & 6.05 & 34128.00 & 92.74 \\
512 & 782.96 & 2.13 & 2284.85 & 6.21 & 36526.40 & 99.26 \\
639 & 1714.42 & 4.66 & 2292.78 & 6.23 & 35249.20 & 95.79 \\
640 & 663.42 & 1.80 & 2264.70 & 6.15 & 36538.70 & 99.29 \\
767 & 1690.82 & 4.59 & 2324.83 & 6.32 & 35718.50 & 97.06 \\
768 & 792.04 & 2.15 & 2363.92 & 6.42 & 32116.80 & 87.27 \\
769 & 1696.95 & 4.61 & 2321.31 & 6.31 & 33033.90 & 89.77 \\
\bottomrule
\end{tabular}
\end{center}
\caption{MFlop/s and CPU utilisation per matrix size of the matrix
multiplication benchmark for the naive, blocked, and BLAS implementations.}
\label{fig:benchtab}
\end{figure}
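For completeness, the rates in Figure \ref{fig:benchtab} are presumably derived
from the standard operation count of a dense matrix-matrix update, roughly
$2n^3$ floating-point operations for size $n$; assuming the benchmark driver
(not shown here) uses this count, the reported rate for an average run time $t$
is
\[
  \mathrm{MFlop/s} \approx \frac{2 n^3}{10^6 \, t}.
\]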
\end{document}