wip

commit bc6d82f7b0
parent 885e517e41

2 changed files with 85 additions and 10 deletions

Binary file not shown.
@@ -12,6 +12,7 @@
 \usepackage{multirow}
 \usepackage{makecell}
 \usepackage{booktabs}
+\usepackage{algorithm2e}
 \usepackage[nomessages]{fp}
 
 \begin{document}
@@ -20,7 +21,7 @@
 \setduedate{12.10.2022 (midnight)}
 
 \serieheader{High-Performance Computing Lab}{2022}{Student: Claudio Maggioni}{
-Discussed with: --}{Solution for Project 1}{}
+Discussed with: Gianmarco De Vita}{Solution for Project 1}{}
 \newline
 
 %\assignmentpolicy
@@ -134,9 +135,9 @@ and a total of 20 cores (10 per socket). The cache topology diagram reported by
 }
 \foreach \r in {4.5,5.5}{
 \FPeval{l}{round(\r - 6.5, 0)}
-\draw (\r + 0.5, -3.5) node {\tiny $2^{10} \l$};
+\draw (\r + 0.5, -3.5) node {\tiny $2^{19} \l$};
 }
-\draw (7,-3.5) node {\tiny $2^{10}$};
+\draw (7,-3.5) node {\tiny $2^{19}$};
 \draw[->-] (-0.5,-2.5) to[bend left] (0.5,-2.5);
 \draw[->-] (0.5,-2.5) to[bend left] (6,-2.5);
 \draw[->-] (6,-2.5) to[bend left] (11.5,-2.5);
@@ -144,7 +145,7 @@ and a total of 20 cores (10 per socket). The cache topology diagram reported by
 \end{center}
 \caption{Memory access patterns of \texttt{membench.c} for \texttt{csize =
 128} and \texttt{stride = 1} (above) and for \texttt{csize = $2^{20}$} and
-\texttt{stride = $2^{10}$} (below)}
+\texttt{stride = $2^{19}$} (below)}
 \label{fig:access}
 \end{figure}
 
@@ -161,8 +162,8 @@ and so on and so forth.
 
 Therefore, for \texttt{csize = 128} and \texttt{stride = 1} the array will
 access all indexes between 0 and 127 sequentially, and for \texttt{csize =
-$2^{20}$} and \texttt{stride = $2^{10}$} the benchmark will access index 0, then
-index $2^{10}-1$, and finally index $2^{20}-1$. The access patterns for these
+$2^{20}$} and \texttt{stride = $2^{19}$} the benchmark will access index 0, then
+index $2^{19}-1$, and finally index $2^{20}-1$. The access patterns for these
 two configurations are shown visually in Figure \ref{fig:access}.
 
 \subsection{Analyzing Benchmark Results}
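
For reference, a minimal C sketch of the access pattern being described. The loop shape is an assumption based on the usual membench structure ("for (index = 0; index < csize; index += stride)"), not the verbatim membench.c source:

/* Sketch of the membench-style access loop; loop bound and update
 * are assumptions, not the verbatim membench.c source. */
#include <stdio.h>

#define CSIZE (1 << 20)

static int x[CSIZE];

int main(void) {
    /* csize = 128, stride = 1: indexes 0..127 are touched one after
     * the other, so consecutive accesses share cache lines */
    for (int index = 0; index < 128; index += 1)
        x[index]++;

    /* csize = 2^20, stride = 2^19: only a few widely separated
     * indexes are touched, so each access lands in a different
     * cache line */
    for (int index = 0; index < CSIZE; index += 1 << 19)
        x[index]++;

    printf("%d\n", x[0]); /* keep the loops from being optimized away */
    return 0;
}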
@@ -201,9 +202,52 @@ for the largest strides of each size series shown in the graph).
 
 \section{Optimize Square Matrix-Matrix Multiplication \punkte{60}}
 
 
+\begin{figure}[t]
+\begin{verbatim}
+INPUT: A (n by n), B (n by n), n
+OUTPUT: C (n by n)
+
+s := 26  # block dimension
+
+A_row := <matrix A converted to row major form>
+C_temp := <empty s by s matrix>
+
+for i := 0 to n by s:
+    i_next := min(i + s, n)
+
+    for j := 0 to n by s:
+        j_next := min(j + s, n)
+
+        <set all cells in C_temp to 0>
+
+        for k := 0 to n by s:
+            k_next := min(k + s, n)
+
+            # Perform naive matrix multiplication, incrementing cells
+            # of C_temp with each multiplication result
+            naivemm(A_row[i, k][i_next, k_next], B[k, j][k_next, j_next],
+                    C_temp[0, 0][i_next - i, j_next - j])
+        end for
+
+        C[i, j][i_next, j_next] := C_temp[0, 0][i_next - i, j_next - j]
+    end for
+end for
+\end{verbatim}
+\caption{Pseudocode listing of my blocked matrix multiplication
+implementation. Matrix indices start from 0 (i.e. row $0$ and column $0$
+denote the top-left-most cell in a matrix). \\ \texttt{M[a, b][c, d]} denotes
+the rectangular region of the matrix $M$ whose top-left-most cell is the cell
+in $M$ at row $a$ and column $b$ and whose bottom-right-most cell is the
+cell in $M$ at row $c - 1$ and column $d - 1$.}
+\label{fig:algo}
+\end{figure}
+
 The file \texttt{matmult/dgemm-blocked.c} contains a C implementation of the
-blocked matrix multiplication algorithm presented in the project. Other than
-implementing the pseudocode, my implementation:
+blocked matrix multiplication algorithm presented in the project. A pseudocode
+listing of the implementation is provided in Figure \ref{fig:algo}.
+
+In order to achieve a correct and fast execution, my implementation:
 
 \begin{figure}[t]
 \begin{center}
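
To make the pseudocode concrete, here is a minimal C sketch of the same blocking scheme, assuming column-major n-by-n operands. Names and layout choices are illustrative; this is not the verbatim matmult/dgemm-blocked.c source. As in the pseudocode, the block of C is overwritten with the contents of C_temp:

/* Sketch of the blocked scheme from the pseudocode figure; assumes
 * column-major inputs, illustration only. */
#include <stdlib.h>
#include <string.h>

#define S 26 /* block dimension, see the discussion of s below */

static int min(int a, int b) { return a < b ? a : b; }

void square_dgemm_blocked(int n, const double *A, const double *B,
                          double *C) {
    /* A_row: A repacked in row-major form, so the innermost loop
     * walks both A_row and B with unit stride */
    double *A_row = malloc((size_t)n * n * sizeof *A_row);
    for (int c = 0; c < n; c++)
        for (int r = 0; r < n; r++)
            A_row[(size_t)r * n + c] = A[r + (size_t)c * n];

    double C_temp[S * S]; /* block-sized accumulator, virtually stride-free */

    for (int i = 0; i < n; i += S) {
        int i_next = min(i + S, n);
        for (int j = 0; j < n; j += S) {
            int j_next = min(j + S, n);
            memset(C_temp, 0, sizeof C_temp);

            for (int k = 0; k < n; k += S) {
                int k_next = min(k + S, n);
                /* naivemm: multiply block (i,k) of A_row by block
                 * (k,j) of B, accumulating into C_temp */
                for (int ii = i; ii < i_next; ii++)
                    for (int jj = j; jj < j_next; jj++)
                        for (int kk = k; kk < k_next; kk++)
                            C_temp[(ii - i) + (jj - j) * S] +=
                                A_row[(size_t)ii * n + kk] *
                                B[kk + (size_t)jj * n];
            }

            /* bulk-copy the finished block into C, one column per
             * memcpy call */
            for (int jj = j; jj < j_next; jj++)
                memcpy(&C[i + (size_t)jj * n], &C_temp[(jj - j) * S],
                       (i_next - i) * sizeof(double));
        }
    }
    free(A_row);
}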
@@ -239,15 +283,46 @@ implementing the pseudocode, my implementation:
 \ref{fig:iter}, by having A in row major format and B in column major
 format, iterations across matrix block in the inner most loop of the
 algorithm (the one calling \textit{naivemm}) cache hits are maximised by
-achieving space locality between the blocks used;
+achieving space locality between the blocks used. This increased
+performance by approximately two percentage points of CPU utilization
+(i.e. from a baseline of $4\%$ to $6\%$);
 \item Caches the result of each innermost iteration into a temporary matrix
 of block size before storing it into matrix C. This achieves better
 space locality when \textit{naivemm} needs to store values in matrix C.
 The block size temporary matrix has virtually no stride and thus cache
 hits are maximised. The copy operation is implemented with bulk copy
-\texttt{memcpy} calls.
+\texttt{memcpy} calls. This optimization gains an extra half percentage
+point of CPU utilization (i.e. from the $6\%$ discussed above to a
+final $6.5\%$).
 \end{itemize}
 
+The chosen matrix block size for running the benchmark on the cluster is
+
+$$s = 26$$
+
+as shown in the pseudocode. This value was obtained by an empirical
+binary search, using the benchmark as the metric, i.e. by running
+\texttt{./run\_matrixmult.sh} several times with different values. For square
+blocks (i.e. the worst case), the combined size of the $A$ and $B$ sub-blocks
+and of the \texttt{C\_temp} temporary block for $C$ is
+
+$$\mathrm{Bytes} = \mathrm{cellSize} \cdot s^2 \cdot 3 = 8 \cdot 26^2 \cdot 3 = 16224$$
+
+given that a double-precision floating point number, the data type used for
+matrix cells in the scope of this project, is 8 bytes long. This total is
+fairly close to the L1 cache size of the processor used in the cluster
+($32\,\mathrm{KB} = 32768$ bytes), which is expected given that the
+algorithm needs to exploit fast memory as much as possible. The reason the
+empirically best value fills only about half of the cache lies in some
+real-life factors: for example, cache misses typically result in aligned
+loads which may bring in unnecessary data.
+
+A potential way to exploit the different cache levels is to apply the blocked
+matrix algorithm recursively multiple times. For example, OpenBLAS implements
+DGEMM by having two levels of matrix blocks to better exploit the L2 and L3
+caches found on most processors.
+
 \begin{figure}[t]
 \begin{center}
 \begin{tikzpicture}
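
As a quick check of the footprint arithmetic above, a small hypothetical helper (not part of the project):

/* Recomputes the worst-case block footprint: three s-by-s blocks of
 * 8-byte doubles (A sub-block, B sub-block, C_temp). */
#include <stdio.h>

int main(void) {
    const int s = 26;                   /* block dimension */
    const size_t cell = sizeof(double); /* 8 bytes per matrix cell */
    size_t bytes = cell * s * s * 3;
    printf("%zu bytes of %d (L1)\n", bytes, 32768); /* prints 16224 */
    return 0;
}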