\documentclass[unicode,11pt,a4paper,oneside,numbers=endperiod,openany]{scrartcl}

\input{assignment.sty}
\usepackage{float}
\usepackage{subcaption}
\usepackage{graphicx}
\usepackage{fancyvrb}
\usepackage{tikz}

\begin{document}

\setassignment
\setduedate{12.10.2022 (midnight)}

\serieheader{High-Performance Computing Lab}{2022}{Student: Claudio
Maggioni}{Discussed with: ---}{Solution for Project 1}{}
\newline

\assignmentpolicy
In this project you will practice memory access optimization,
performance-oriented programming, and OpenMP parallelization on the ICS Cluster.

\section{Explaining Memory Hierarchies \punkte{25}}

\subsection{Memory Hierarchy Parameters of the Cluster}

Using \texttt{likwid-topology} for the cache topology and \texttt{free -g} for
the amount of main memory, I find the following memory hierarchy parameters:

\begin{center}
  \begin{tabular}{ll}
    Main memory & 62 GB \\
    L3 cache & 25 MB per socket \\
    L2 cache & 256 kB per core \\
    L1 cache & 32 kB per core
  \end{tabular}
\end{center}

All values are reported in base-2 (IEC) byte units. The cluster has 2 sockets
and a total of 20 cores (10 per socket). The cache topology diagram reported by
\texttt{likwid-topology -g} is the following:

\pagebreak[4]
% https://tex.stackexchange.com/a/171818
\begin{Verbatim}[fontsize=\tiny]
Socket 0:
+---------------------------------------------------------------------------------------------------------------+
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| |    0   | |    1   | |    2   | |    3   | |    4   | |    5   | |    6   | |    7   | |    8   | |    9   | |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| +-----------------------------------------------------------------------------------------------------------+ |
| |                                                   25 MB                                                   | |
| +-----------------------------------------------------------------------------------------------------------+ |
+---------------------------------------------------------------------------------------------------------------+
Socket 1:
+---------------------------------------------------------------------------------------------------------------+
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| |   10   | |   11   | |   12   | |   13   | |   14   | |   15   | |   16   | |   17   | |   18   | |   19   | |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| +-----------------------------------------------------------------------------------------------------------+ |
| |                                                   25 MB                                                   | |
| +-----------------------------------------------------------------------------------------------------------+ |
+---------------------------------------------------------------------------------------------------------------+
\end{Verbatim}

\subsection{Memory Access Pattern of \texttt{membench.c}}

The benchmark \texttt{membench.c} measures the average time of repeated read and
write operations across a set of indices of a stack-allocated array of 32-bit
signed integers. The indices vary according to the access pattern used, which in
turn is defined by two variables, \texttt{csize} and \texttt{stride}.
\texttt{csize} is an upper bound on the index value, i.e. one more than the
highest index that the pattern may access. \texttt{stride} determines the
difference between consecutive array indices, i.e. a \texttt{stride} of 1
accesses every array index, a \texttt{stride} of 2 skips every other index, a
\texttt{stride} of 4 accesses one index and then skips 3, and so on.

Therefore, for \texttt{csize = 128} and \texttt{stride = 1} the benchmark
accesses all indices between 0 and 127 sequentially, and for \texttt{csize =
$2^{20}$} and \texttt{stride = $2^{10}$} the benchmark accesses index 0, then
index $2^{10}$, then index $2 \cdot 2^{10}$, and so on in steps of $2^{10}$ up
to index $2^{20} - 2^{10}$.
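
The access pattern described above can be sketched in C as follows. This is a
minimal illustration rather than the code of \texttt{membench.c}: the function
names \texttt{strided\_sweep} and \texttt{last\_index} are hypothetical, and the
loop bound \texttt{index < csize} is an assumption based on the description of
\texttt{csize} as an upper bound on the index.

\begin{Verbatim}[fontsize=\small]
#include <stddef.h>

/* One sweep of the strided pattern: touch x[0], x[stride], x[2*stride], ...
 * for every index strictly below csize (stride must be at least 1).
 * Returns the number of elements touched.
 * Hypothetical helper, not the benchmark's own code. */
size_t strided_sweep(int *x, size_t csize, size_t stride)
{
    size_t touched = 0;
    for (size_t index = 0; index < csize; index += stride) {
        x[index] = x[index] + 1;  /* one read and one write per access */
        touched++;
    }
    return touched;
}

/* Highest index reached by a sweep (requires csize >= 1 and stride >= 1). */
size_t last_index(size_t csize, size_t stride)
{
    return ((csize - 1) / stride) * stride;
}
\end{Verbatim}

Under these assumptions a sweep with \texttt{csize = 128} and \texttt{stride =
1} performs 128 accesses, while a sweep with \texttt{csize = $2^{20}$} and
\texttt{stride = $2^{10}$} performs 1024 accesses.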

\subsection{Analyzing Benchmark Results}

\begin{figure}[t]
  \begin{subfigure}{0.5\textwidth}
    \includegraphics[width=\textwidth]{generic_macos.pdf}
    \caption{Personal laptop}
    \label{fig:mem:laptop}
  \end{subfigure}
  \begin{subfigure}{0.5\textwidth}
    \includegraphics[width=\textwidth]{generic_cluster.pdf}
    \caption{Cluster}
    \label{fig:mem:cluster}
  \end{subfigure}
  \caption{Results of the \texttt{membench.c} benchmark for both my personal
  laptop (in Figure \ref{fig:mem:laptop}) and the cluster (in Figure
  \ref{fig:mem:cluster}).}
  \label{fig:mem}
\end{figure}

The \texttt{membench.c} benchmark results for my personal laptop (MacBook Pro
2018 with a Core i7-8750H CPU) and the cluster are shown in Figure
\ref{fig:mem}.

The memory access graph for the cluster's benchmark results shows that cache
locality is best for small array sizes and for small \texttt{stride} values:
small arrays fit in cache and are reused across repetitions (temporal
locality), while small strides keep consecutive accesses within the same cache
lines (spatial locality). In particular, for arrays of 16 MB or less
(\texttt{csize} of $4 \cdot 2^{20}$ or lower, which fits in the 25 MB L3
cache) and \texttt{stride} values of 2048 or lower the mean read+write time is
less than 10 nanoseconds. Locality is worst for large sizes and strides,
although the largest values of \texttt{stride} for each size (like
\texttt{csize / 2} and \texttt{csize / 4}) achieve better mean times because
the pattern then touches only a few elements, which easily stay in cache. This
effect is visible at the largest strides of every size series shown in the
graph.
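
The relation between \texttt{csize} and the cache levels can be checked with a
short calculation. The sketch below is only an illustration: the helper names
are hypothetical, the capacities are the values from the table in the first
subsection read as binary units, and the model ignores that caches also hold
other data and that large strides touch only a fraction of the array.

\begin{Verbatim}[fontsize=\small]
#include <stddef.h>
#include <stdint.h>

/* Cache capacities from the memory hierarchy table, as binary units. */
#define L1_BYTES ((size_t)32 * 1024)
#define L2_BYTES ((size_t)256 * 1024)
#define L3_BYTES ((size_t)25 * 1024 * 1024)

/* Bytes spanned by an array of csize 32-bit integers. */
size_t footprint_bytes(size_t csize)
{
    return csize * sizeof(int32_t);
}

/* Smallest cache level (1, 2 or 3) whose capacity holds the whole array,
 * or 0 if the array only fits in main memory. */
int smallest_fitting_level(size_t csize)
{
    size_t bytes = footprint_bytes(csize);
    if (bytes <= L1_BYTES) return 1;
    if (bytes <= L2_BYTES) return 2;
    if (bytes <= L3_BYTES) return 3;
    return 0;
}
\end{Verbatim}

For \texttt{csize} $= 4 \cdot 2^{20}$ the footprint is $4 \cdot 2^{20} \cdot 4$
bytes, i.e. 16 MB, which this model places in the L3 cache.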

\section{Optimize Square Matrix-Matrix Multiplication \punkte{60}}

The file \texttt{matmult/dgemm-blocked.c} contains a C implementation of the
blocked matrix multiplication algorithm presented in the project. In addition to
implementing the pseudocode, my implementation:

\begin{figure}[t]
  \begin{center}
    \begin{tikzpicture}
      \fill[blue!60!white] (4,0) rectangle (5,-2);
      \fill[blue!40!white] (4,-2) rectangle (5,-4);
      \fill[blue!60!white] (0,-4) rectangle (2,-5);
      \fill[blue!40!white] (2,-4) rectangle (4,-5);
      \fill[green!40!white] (4,-4) rectangle (5,-5);
      \draw[step=1,gray,very thin] (0,0) grid (5,-5);
      \draw[step=2] (0,0) grid (5,-5);
      \draw[step=5] (0,0) grid (5,-5);
    \end{tikzpicture}
  \end{center}
  \caption{Result of the block division process of a square matrix of size 5
  using a block size of 2. The 2-by-1 and 1-by-2 rectangular remainders are
  shown in blue and the square matrix of remainder size (i.e. 1) is shown in
  green.}
  \label{fig:matrix}
\end{figure}

\begin{itemize}
  \item Handles the edge cases related to the ``remainders'' in the matrix
    block division, i.e. when the matrix size is not a multiple of the block
    size. Assuming only square matrices are multiplied (as in the test suite
    provided), the block division can yield rectangular blocks in the last
    rows and columns of each matrix, and the bottom-right corner of the
    matrix is then covered by a square block whose side is the remainder. The
    result of this process is shown in Figure \ref{fig:matrix};
  \item Converts matrix A into row-major format. As shown in Figure
    \ref{fig:iter}, with A in row-major format and B in column-major
    format, the iterations across matrix blocks in the innermost loop of the
    algorithm (the one calling \textit{naivemm}) maximise cache hits by
    achieving spatial locality between the blocks used;
  \item Caches the result of each innermost iteration in a temporary matrix
    of block size before storing it into matrix C. This achieves better
    spatial locality when \textit{naivemm} needs to store values in matrix C.
    The block-size temporary matrix is stored contiguously, with no stride
    between its columns, so cache hits are maximised. The copy operation is
    implemented with bulk \texttt{memcpy} calls.
\end{itemize}
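
The three points above can be combined in a sketch like the following. It is
not the code of \texttt{matmult/dgemm-blocked.c}: the names
\texttt{blocked\_dgemm}, \texttt{BLOCK}, \texttt{At} and \texttt{tmp} are
hypothetical, the block size of 32 is an arbitrary placeholder, and the sketch
assumes that the operation is $C := C + A \cdot B$ on column-major
$n \times n$ matrices.

\begin{Verbatim}[fontsize=\small]
#include <stdlib.h>
#include <string.h>

#define BLOCK 32  /* hypothetical block size, to be tuned to the cache */

static int min_int(int a, int b) { return a < b ? a : b; }

/* C := C + A * B for n-by-n column-major matrices.
 * Returns 0 on success, -1 if the scratch copy of A cannot be allocated. */
int blocked_dgemm(int n, const double *A, const double *B, double *C)
{
    if (n <= 0) return 0;

    /* Row-major copy of A: At[i*n + k] holds A(i,k). */
    double *At = malloc((size_t)n * (size_t)n * sizeof *At);
    if (At == NULL) return -1;
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            At[(size_t)i * n + k] = A[(size_t)k * n + i];

    double tmp[BLOCK * BLOCK];  /* contiguous column-major scratch block */

    for (int j0 = 0; j0 < n; j0 += BLOCK) {
        int nj = min_int(BLOCK, n - j0);      /* narrower at the edge */
        for (int i0 = 0; i0 < n; i0 += BLOCK) {
            int ni = min_int(BLOCK, n - i0);  /* shorter at the edge */

            /* Load the C block into the scratch buffer, column by column. */
            for (int j = 0; j < nj; j++)
                memcpy(&tmp[j * ni], &C[(size_t)(j0 + j) * n + i0],
                       (size_t)ni * sizeof(double));

            for (int k0 = 0; k0 < n; k0 += BLOCK) {
                int nk = min_int(BLOCK, n - k0);
                /* Naive product of an ni-by-nk block of A and an
                 * nk-by-nj block of B; both are read with unit stride. */
                for (int j = 0; j < nj; j++) {
                    const double *b_col = &B[(size_t)(j0 + j) * n + k0];
                    for (int i = 0; i < ni; i++) {
                        const double *a_row = &At[(size_t)(i0 + i) * n + k0];
                        double sum = tmp[j * ni + i];
                        for (int k = 0; k < nk; k++)
                            sum += a_row[k] * b_col[k];
                        tmp[j * ni + i] = sum;
                    }
                }
            }

            /* Store the finished block back into C. */
            for (int j = 0; j < nj; j++)
                memcpy(&C[(size_t)(j0 + j) * n + i0], &tmp[j * ni],
                       (size_t)ni * sizeof(double));
        }
    }
    free(At);
    return 0;
}
\end{Verbatim}

The \texttt{min\_int} calls produce the smaller edge blocks: with $n = 5$ and a
block size of 2 they yield the 2-by-1, 1-by-2 and 1-by-1 blocks of Figure
\ref{fig:matrix}.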

\begin{figure}[t]
  \begin{center}
    \begin{tikzpicture}
      \node[align=center] at (2.5,0.5) {Matrix A};
      \fill[orange!10!white] (0,0) rectangle (2,-2);
      \fill[orange!25!white] (2,0) rectangle (4,-2);
      \fill[orange!40!white] (4,0) rectangle (5,-2);

      \draw[step=1,gray,very thin] (0,0) grid (5,-5);
      \draw[step=2,black,thick] (0,0) grid (5,-5);
      \draw[step=5,black,thick] (0,0) grid (5,-5);

      \draw[-to,step=1,red,very thick] (0.5,-0.5) -- (4.5,-0.5);
      \draw[-to,step=1,red,very thick] (0.5,-1.5) -- (4.5,-1.5);
      \draw[-to,step=1,red,very thick] (0.5,-2.5) -- (4.5,-2.5);
      \draw[-to,step=1,red,very thick] (0.5,-3.5) -- (4.5,-3.5);
      \draw[-to,step=1,red,very thick] (0.5,-4.5) -- (4.5,-4.5);

      \node[align=center] at (8.5,0.5) {Matrix B};
      \fill[orange!10!white] (6,0) rectangle (8,-2);
      \fill[orange!25!white] (6,-2) rectangle (8,-4);
      \fill[orange!40!white] (6,-4) rectangle (8,-5);

      \draw[step=1,gray,very thin] (6,0) grid (11,-5);
      \draw[step=2,black,thick] (6,0) grid (11,-5);
      \draw[step=5,black,thick] (6,0) grid (11,-5);
      \draw[black,thick] (11,0) -- (11,-5);

      \draw[-to,step=1,red,very thick] (6.5,-0.5) -- (6.5,-4.5);
      \draw[-to,step=1,red,very thick] (7.5,-0.5) -- (7.5,-4.5);
      \draw[-to,step=1,red,very thick] (8.5,-0.5) -- (8.5,-4.5);
      \draw[-to,step=1,red,very thick] (9.5,-0.5) -- (9.5,-4.5);
      \draw[-to,step=1,red,very thick] (10.5,-0.5) -- (10.5,-4.5);
    \end{tikzpicture}
  \end{center}
  \caption{Innermost loop iteration of the blocked GEMM algorithm across
  matrices A and B. The red lines represent the ``majorness'' of each matrix
  (A is converted by the algorithm into row-major form, while B is given and
  used in column-major form). The shades of orange represent the blocks used
  in each iteration.}
  \label{fig:iter}
\end{figure}

The results of the matrix multiplication benchmark for the naive, blocked, and
BLAS implementations are shown in Figure \ref{fig:bench}. The blocked
implementation achieves approximately 50\% more FLOPS than the naive
implementation thanks to the optimisations in spatial and temporal cache
locality described above. However, the blocked implementation achieves less
than a tenth of the FLOPS of the Intel MKL BLAS implementation, owing to the
microarchitecture-specific optimizations that the latter is able to exploit.
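
For reference, a FLOPS figure for a matrix multiplication can be derived as in
the sketch below. The report does not state how the benchmark computes its
rates, so the conventional count of $2n^3$ floating-point operations per
$n \times n$ product is an assumption here, and the helper names
\texttt{dgemm\_flops} and \texttt{gflops\_rate} are hypothetical.

\begin{Verbatim}[fontsize=\small]
/* Floating-point operations in one n-by-n matrix product, assuming the
 * conventional count of n^3 multiplications and n^3 additions. */
double dgemm_flops(int n)
{
    double d = (double)n;
    return 2.0 * d * d * d;
}

/* Rate in GFLOP/s, given the wall-clock seconds of one multiplication. */
double gflops_rate(int n, double seconds)
{
    return dgemm_flops(n) / seconds / 1e9;
}
\end{Verbatim}

With this convention, a multiplication of size $n = 1000$ that takes 2 seconds
runs at 1 GFLOP/s.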

\begin{figure}[t]
  \includegraphics[width=\textwidth]{timing.pdf}
  \caption{Results of the matrix multiplication benchmark for the naive,
  blocked, and BLAS implementations}
  \label{fig:bench}
\end{figure}

\end{document}