Claudio Maggioni 2022-10-11 15:28:57 +02:00
parent 885e517e41
commit bc6d82f7b0
2 changed files with 85 additions and 10 deletions

@ -12,6 +12,7 @@
\usepackage{multirow}
\usepackage{makecell}
\usepackage{booktabs}
\usepackage{algorithm2e}
\usepackage[nomessages]{fp}
\begin{document}
@ -20,7 +21,7 @@
\setduedate{12.10.2022 (midnight)}
\serieheader{High-Performance Computing Lab}{2022}{Student: Claudio Maggioni}{
Discussed with: --}{Solution for Project 1}{}
Discussed with: Gianmarco De Vita}{Solution for Project 1}{}
\newline
%\assignmentpolicy
@ -134,9 +135,9 @@ and a total of 20 cores (10 per socket). The cache topology diagram reported by
}
\foreach \r in {4.5,5.5}{
\FPeval{l}{round(\r - 6.5, 0)}
\draw (\r + 0.5, -3.5) node {\tiny $2^{10} \l$};
\draw (\r + 0.5, -3.5) node {\tiny $2^{19} \l$};
}
\draw (7,-3.5) node {\tiny $2^{10}$};
\draw (7,-3.5) node {\tiny $2^{19}$};
\draw[->-] (-0.5,-2.5) to[bend left] (0.5,-2.5);
\draw[->-] (0.5,-2.5) to[bend left] (6,-2.5);
\draw[->-] (6,-2.5) to[bend left] (11.5,-2.5);
@ -144,7 +145,7 @@ and a total of 20 cores (10 per socket). The cache topology diagram reported by
\end{center}
\caption{Memory access patterns of \texttt{membench.c} for \texttt{csize =
128} and \texttt{stride = 1} (above) and for \texttt{csize = $2^{20}$} and
\texttt{stride = $2^{10}$} (below)}
\texttt{stride = $2^{19}$} (below)}
\label{fig:access}
\end{figure}
@ -161,8 +162,8 @@ and so on and so forth.
Therefore, for \texttt{csize = 128} and \texttt{stride = 1} the array will
access all indexes between 0 and 127 sequentially, and for \texttt{csize =
$2^{20}$} and \texttt{stride = $2^{10}$} the benchmark will access index 0, then
index $2^{10}-1$, and finally index $2^{20}-1$. The access patterns for these
$2^{20}$} and \texttt{stride = $2^{19}$} the benchmark will access index 0, then
index $2^{19}-1$, and finally index $2^{20}-1$. The access patterns for these
two configurations are shown visually in Figure \ref{fig:access}.
\subsection{Analyzing Benchmark Results}
@ -201,9 +202,52 @@ for the largest strides of each size series shown in the graph).
\section{Optimize Square Matrix-Matrix Multiplication \punkte{60}}
\begin{figure}[t]
\begin{verbatim}
INPUT: A (n by n), B (n by n), n
OUTPUT: C (n by n)
s := 26 # block dimension
A_row := <matrix A converted to row major form>
C_temp := <empty s by s matrix>
for i := 0 to n by s:
    i_next := min(i + s, n)
    for j := 0 to n by s:
        j_next := min(j + s, n)
        <set all cells in C_temp to 0>
        for k := 0 to n by s:
            k_next := min(k + s, n)
            # Perform naive matrix multiplication, incrementing cells of C_temp
            # with each multiplication result
            naivemm(A_row[i, k][i_next, k_next], B[k, j][k_next, j_next],
                    C_temp[0, 0][i_next - i, j_next - j])
        end for
        C[i, j][i_next, j_next] := C_temp[0, 0][i_next - i, j_next - j]
    end for
end for
\end{verbatim}
\caption{Pseudocode listing of my blocked matrix multiplication
implementation. Matrix indices start from 0 (i.e. row $0$ and column $0$
denotes the top-left-most cell in a matrix). \\ \texttt{M[a, b][c, d]} denotes
a rectangular region of the matrix $M$ whose top-left-most cell is the cell
in $M$ at row $a$ and column $b$ and whose bottom-right-most cell is the
cell in $M$ at row $c - 1$ and column $d - 1$.}
\label{fig:algo}
\end{figure}
The file \texttt{matmult/dgemm-blocked.c} contains a C implementation of the
blocked matrix multiplication algorithm presented in the project. Other than
implementing the pseudocode, my implementation:
blocked matrix multiplication algorithm presented in the project. A pseudocode
listing of the implementation is provided in Figure \ref{fig:algo}.
In order to achieve a correct and fast execution, my implementation:
\begin{figure}[t]
\begin{center}
@ -239,15 +283,46 @@ implementing the pseudocode, my implementation:
\ref{fig:iter}, by having A in row major format and B in column major
format, cache hits are maximised across the matrix block iterations of the
innermost loop of the algorithm (the one calling \textit{naivemm}) by
achieving space locality between the blocks used;
achieving spatial locality between the blocks used. This increased
performance by approximately two percentage points of CPU utilization
(i.e. from a baseline of $4\%$ to $6\%$);
\item Caches the result of each innermost iteration in a temporary
block-sized matrix before storing it into matrix C. This achieves better
spatial locality when \textit{naivemm} needs to store values in matrix C.
The block-sized temporary matrix has virtually no stride and thus cache
hits are maximised. The copy operation is implemented with bulk copy
\texttt{memcpy} calls.
\texttt{memcpy} calls. This optimization gains a further half of a
percentage point of CPU utilization (i.e. from the $6\%$
discussed above to a final $6.5\%$).
\end{itemize}
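The two optimizations listed above can be sketched in C roughly as follows.
This is a minimal illustration under stated assumptions, not the contents of
\texttt{matmult/dgemm-blocked.c}: the names \texttt{blocked\_dgemm\_sketch} and
\texttt{min\_int} are hypothetical, all three matrices are assumed to be passed
in column major order, and, following the pseudocode, $C$ is overwritten rather
than accumulated into.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define S 26 /* block dimension, as in the pseudocode */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Hypothetical sketch: computes C := A * B for n-by-n matrices.
 * Assumption: A, B and C are stored in column-major order. */
void blocked_dgemm_sketch(int n, const double *A, const double *B, double *C)
{
    assert(n >= 0 && A != NULL && B != NULL && C != NULL);

    /* Row-major copy of A, so that a row of A and a column of B are both
     * contiguous in memory along the k dimension. */
    double *A_row = malloc((size_t)n * n * sizeof *A_row);
    if (A_row == NULL)
        return; /* allocation failed: give up silently in this sketch */
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            A_row[(size_t)i * n + k] = A[i + (size_t)k * n];

    double C_temp[S * S]; /* column-major, leading dimension S */

    for (int i = 0; i < n; i += S) {
        int bi = min_int(S, n - i); /* rows in this block */
        for (int j = 0; j < n; j += S) {
            int bj = min_int(S, n - j); /* columns in this block */
            memset(C_temp, 0, sizeof C_temp);
            for (int k = 0; k < n; k += S) {
                int bk = min_int(S, n - k);
                /* Naive multiplication of one pair of blocks,
                 * accumulating into C_temp. */
                for (int jj = 0; jj < bj; jj++)
                    for (int ii = 0; ii < bi; ii++) {
                        double acc = C_temp[ii + jj * S];
                        for (int kk = 0; kk < bk; kk++)
                            acc += A_row[(size_t)(i + ii) * n + (k + kk)]
                                 * B[(k + kk) + (size_t)(j + jj) * n];
                        C_temp[ii + jj * S] = acc;
                    }
            }
            /* Bulk-copy each column of the finished block into C. */
            for (int jj = 0; jj < bj; jj++)
                memcpy(&C[i + (size_t)(j + jj) * n], &C_temp[jj * S],
                       (size_t)bi * sizeof(double));
        }
    }
    free(A_row);
}
```

In this sketch the innermost loop walks \texttt{A\_row} and \texttt{B} with unit
stride, and writes to $C$ happen only once per block through \texttt{memcpy}.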
The chosen matrix block size for running the benchmark on the cluster is
$$s = 26$$
as shown in the pseudocode. This value was obtained through an empirical
binary search using the benchmark as a metric, i.e. by running
\texttt{./run\_matrixmult.sh} several times with different values. For square
blocks (i.e. the worst case) the total size of the sub-blocks of matrices $A$
and $B$ and of the \texttt{C\_temp} temporary block for $C$ is:
$$\mathrm{Bytes} = \mathrm{cellSize} \cdot s^2 \cdot 3 = 8 \cdot 26^2 \cdot 3 = 16224$$
given that a double-precision floating point number, the data type used for
matrix cells in the scope of this project, is 8 bytes long. The resulting total
size is of the same order of magnitude as the L1 cache size of the processor
used in the cluster ($32\,\mathrm{KiB} = 32768$ bytes), which is expected given
that the algorithm needs to exploit fast memory as much as possible. The
empirically best value probably yields a theoretical cache allocation of only
half of the complete cache size because of real-life factors. For example,
cache misses typically result in aligned cache line loads, which may bring in
unnecessary data.
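As a rough upper bound under the same assumptions (three square $s \times s$
blocks of 8-byte cells, a cache of 32768 bytes, and no other data competing
for the cache), the largest block size whose working set would still fit can be
estimated as:
$$3 \cdot 8 \cdot s^2 \leq 32768 \implies s \leq \sqrt{32768 / 24} \approx 36.95 \implies s_{\max} = 36$$
where $s_{\max}$ is a symbol introduced here only for illustration. The
empirically chosen $s = 26$ therefore sits well below this theoretical ceiling.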
A potential way to exploit the different cache levels is to apply the blocked
matrix algorithm at multiple nested levels. For example, OpenBLAS implements
DGEMM by having two levels of matrix blocks to better exploit the L2 and L3
caches found on most processors.
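The idea of nesting two levels of blocking can be sketched in C as follows.
This is only an illustration of the loop structure and does not reproduce how
OpenBLAS is actually written: the function names, the helper
\texttt{min\_int} and the outer block size \texttt{S\_OUTER} are hypothetical
and untuned, and all matrices are assumed to be in column major order.

```c
#include <assert.h>
#include <stddef.h>

#define S_OUTER 104 /* hypothetical outer block size (untuned) */
#define S_INNER 26  /* inner block size, as used for the L1 cache */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Inner level: C += A * B restricted to rows [i0, i1), columns [j0, j1)
 * and inner dimension [k0, k1), using S_INNER-sized blocks.
 * Assumption: column-major storage with leading dimension n. */
static void inner_blocked(int n, int i0, int i1, int j0, int j1,
                          int k0, int k1,
                          const double *A, const double *B, double *C)
{
    for (int i = i0; i < i1; i += S_INNER)
        for (int j = j0; j < j1; j += S_INNER)
            for (int k = k0; k < k1; k += S_INNER) {
                int ie = min_int(i + S_INNER, i1);
                int je = min_int(j + S_INNER, j1);
                int ke = min_int(k + S_INNER, k1);
                for (int jj = j; jj < je; jj++)
                    for (int kk = k; kk < ke; kk++) {
                        double b = B[kk + (size_t)jj * n];
                        for (int ii = i; ii < ie; ii++)
                            C[ii + (size_t)jj * n] +=
                                A[ii + (size_t)kk * n] * b;
                    }
            }
}

/* Outer level: computes C := A * B by splitting the matrices into
 * S_OUTER-sized blocks and delegating each block product to inner_blocked. */
void two_level_dgemm_sketch(int n, const double *A, const double *B, double *C)
{
    assert(n >= 0 && A != NULL && B != NULL && C != NULL);

    for (size_t t = 0; t < (size_t)n * n; t++)
        C[t] = 0.0;

    for (int i = 0; i < n; i += S_OUTER)
        for (int j = 0; j < n; j += S_OUTER)
            for (int k = 0; k < n; k += S_OUTER)
                inner_blocked(n,
                              i, min_int(i + S_OUTER, n),
                              j, min_int(j + S_OUTER, n),
                              k, min_int(k + S_OUTER, n),
                              A, B, C);
}
```

Both block sizes would need to be tuned empirically against the actual cache
sizes, in the same way as $s$ was.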
\begin{figure}[t]
\begin{center}
\begin{tikzpicture}