wip

commit bc6d82f7b0
parent 885e517e41

2 changed files with 85 additions and 10 deletions

Binary file not shown.
@@ -12,6 +12,7 @@
 \usepackage{multirow}
 \usepackage{makecell}
 \usepackage{booktabs}
+\usepackage{algorithm2e}
 \usepackage[nomessages]{fp}
 
 \begin{document}
@@ -20,7 +21,7 @@
 \setduedate{12.10.2022 (midnight)}
 
 \serieheader{High-Performance Computing Lab}{2022}{Student: Claudio Maggioni}{
-Discussed with: --}{Solution for Project 1}{}
+Discussed with: Gianmarco De Vita}{Solution for Project 1}{}
 \newline
 
 %\assignmentpolicy
@@ -134,9 +135,9 @@ and a total of 20 cores (10 per socket). The cache topology diagram reported by
 }
 \foreach \r in {4.5,5.5}{
 \FPeval{l}{round(\r - 6.5, 0)}
-\draw (\r + 0.5, -3.5) node {\tiny $2^{10} \l$};
+\draw (\r + 0.5, -3.5) node {\tiny $2^{19} \l$};
 }
-\draw (7,-3.5) node {\tiny $2^{10}$};
+\draw (7,-3.5) node {\tiny $2^{19}$};
 \draw[->-] (-0.5,-2.5) to[bend left] (0.5,-2.5);
 \draw[->-] (0.5,-2.5) to[bend left] (6,-2.5);
 \draw[->-] (6,-2.5) to[bend left] (11.5,-2.5);
@@ -144,7 +145,7 @@ and a total of 20 cores (10 per socket). The cache topology diagram reported by
 \end{center}
 \caption{Memory access patterns of \texttt{membench.c} for \texttt{csize =
 128} and \texttt{stride = 1} (above) and for \texttt{csize = $2^{20}$} and
-\texttt{stride = $2^{10}$} (below)}
+\texttt{stride = $2^{19}$} (below)}
 \label{fig:access}
 \end{figure}
 
@@ -161,8 +162,8 @@ and so on and so forth.
 
 Therefore, for \texttt{csize = 128} and \texttt{stride = 1} the array will
 access all indexes between 0 and 127 sequentially, and for \texttt{csize =
-$2^{20}$} and \texttt{stride = $2^{10}$} the benchmark will access index 0, then
-index $2^{10}-1$, and finally index $2^{20}-1$. The access patterns for these
+$2^{20}$} and \texttt{stride = $2^{19}$} the benchmark will access index 0, then
+index $2^{19}-1$, and finally index $2^{20}-1$. The access patterns for these
 two configurations are shown visually in Figure \ref{fig:access}.
 
 \subsection{Analyzing Benchmark Results}
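
For reference, a minimal C sketch of the access pattern being described. The loop shape is an assumption based on the usual membench structure ("for (index = 0; index < csize; index += stride)"), not the verbatim membench.c source:

/* Sketch of the membench-style access loop; loop bound and update
 * are assumptions, not the verbatim membench.c source. */
#include <stdio.h>

#define CSIZE (1 << 20)

static int x[CSIZE];

int main(void) {
    /* csize = 128, stride = 1: indexes 0..127 are touched one after
     * the other, so consecutive accesses share cache lines */
    for (int index = 0; index < 128; index += 1)
        x[index]++;

    /* csize = 2^20, stride = 2^19: only a few widely separated
     * indexes are touched, so each access lands in a different
     * cache line */
    for (int index = 0; index < CSIZE; index += 1 << 19)
        x[index]++;

    printf("%d\n", x[0]); /* keep the loops from being optimized away */
    return 0;
}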
@@ -201,9 +202,52 @@ for the largest strides of each size series shown in the graph).
 
 \section{Optimize Square Matrix-Matrix Multiplication \punkte{60}}
 
 
+\begin{figure}[t]
+\begin{verbatim}
+INPUT: A (n by n), B (n by n), n
+OUTPUT: C (n by n)
+
+s := 26  # block dimension
+
+A_row := <matrix A converted to row major form>
+C_temp := <empty s by s matrix>
+
+for i := 0 to n by s:
+    i_next := min(i + s, n)
+
+    for j := 0 to n by s:
+        j_next := min(j + s, n)
+
+        <set all cells in C_temp to 0>
+
+        for k := 0 to n by s:
+            k_next := min(k + s, n)
+
+            # Perform naive matrix multiplication, incrementing cells
+            # of C_temp with each multiplication result
+            naivemm(A_row[i, k][i_next, k_next], B[k, j][k_next, j_next],
+                    C_temp[0, 0][i_next - i, j_next - j])
+        end for
+
+        C[i, j][i_next, j_next] := C_temp[0, 0][i_next - i, j_next - j]
+    end for
+end for
+\end{verbatim}
+\caption{Pseudocode listing of my blocked matrix multiplication
+implementation. Matrix indices start from 0 (i.e. row $0$ and column $0$
+denote the top-left-most cell in a matrix). \\ \texttt{M[a, b][c, d]} denotes
+the rectangular region of the matrix $M$ whose top-left-most cell is the cell
+in $M$ at row $a$ and column $b$ and whose bottom-right-most cell is the
+cell in $M$ at row $c - 1$ and column $d - 1$.}
+\label{fig:algo}
+\end{figure}
+
 The file \texttt{matmult/dgemm-blocked.c} contains a C implementation of the
-blocked matrix multiplication algorithm presented in the project. Other than
-implementing the pseudocode, my implementation:
+blocked matrix multiplication algorithm presented in the project. A pseudocode
+listing of the implementation is provided in Figure \ref{fig:algo}.
+
+In order to achieve a correct and fast execution, my implementation:
 
 \begin{figure}[t]
 \begin{center}
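
To make the pseudocode concrete, here is a minimal C sketch of the same blocking scheme, assuming column-major n-by-n operands. Names and layout choices are illustrative; this is not the verbatim matmult/dgemm-blocked.c source. As in the pseudocode, the block of C is overwritten with the contents of C_temp:

/* Sketch of the blocked scheme from the pseudocode figure; assumes
 * column-major inputs, illustration only. */
#include <stdlib.h>
#include <string.h>

#define S 26 /* block dimension, see the discussion of s below */

static int min(int a, int b) { return a < b ? a : b; }

void square_dgemm_blocked(int n, const double *A, const double *B,
                          double *C) {
    /* A_row: A repacked in row-major form, so the innermost loop
     * walks both A_row and B with unit stride */
    double *A_row = malloc((size_t)n * n * sizeof *A_row);
    for (int c = 0; c < n; c++)
        for (int r = 0; r < n; r++)
            A_row[(size_t)r * n + c] = A[r + (size_t)c * n];

    double C_temp[S * S]; /* block-sized accumulator, virtually stride-free */

    for (int i = 0; i < n; i += S) {
        int i_next = min(i + S, n);
        for (int j = 0; j < n; j += S) {
            int j_next = min(j + S, n);
            memset(C_temp, 0, sizeof C_temp);

            for (int k = 0; k < n; k += S) {
                int k_next = min(k + S, n);
                /* naivemm: multiply block (i,k) of A_row by block
                 * (k,j) of B, accumulating into C_temp */
                for (int ii = i; ii < i_next; ii++)
                    for (int jj = j; jj < j_next; jj++)
                        for (int kk = k; kk < k_next; kk++)
                            C_temp[(ii - i) + (jj - j) * S] +=
                                A_row[(size_t)ii * n + kk] *
                                B[kk + (size_t)jj * n];
            }

            /* bulk-copy the finished block into C, one column per
             * memcpy call */
            for (int jj = j; jj < j_next; jj++)
                memcpy(&C[i + (size_t)jj * n], &C_temp[(jj - j) * S],
                       (i_next - i) * sizeof(double));
        }
    }
    free(A_row);
}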
@@ -239,15 +283,46 @@ implementing the pseudocode, my implementation:
 \ref{fig:iter}, by having A in row major format and B in column major
 format, iterations across matrix block in the inner most loop of the
 algorithm (the one calling \textit{naivemm}) cache hits are maximised by
-achieving space locality between the blocks used;
+achieving space locality between the blocks used. This increased
+performance by approximately two percentage points of CPU utilization
+(i.e. from a baseline of $4\%$ to $6\%$);
 \item Caches the result of each innermost iteration into a temporary matrix
 of block size before storing it into matrix C. This achieves better
 space locality when \textit{naivemm} needs to store values in matrix C.
 The block size temporary matrix has virtually no stride and thus cache
 hits are maximised. The copy operation is implemented with bulk copy
-\texttt{memcpy} calls.
+\texttt{memcpy} calls. This optimization gains an extra half percentage
+point of CPU utilization (i.e. from the $6\%$ discussed above to a
+final $6.5\%$).
 \end{itemize}
 
+The chosen matrix block size for running the benchmark on the cluster is
+
+$$s = 26$$
+
+as shown in the pseudocode. This value was obtained by an empirical
+binary search, using the benchmark as the metric, i.e. by running
+\texttt{./run\_matrixmult.sh} several times with different values. For square
+blocks (i.e. the worst case), the combined size of the $A$ and $B$ sub-blocks
+and of the \texttt{C\_temp} temporary block for $C$ is
+
+$$\mathrm{Bytes} = \mathrm{cellSize} \cdot s^2 \cdot 3 = 8 \cdot 26^2 \cdot 3 = 16224$$
+
+given that a double-precision floating point number, the data type used for
+matrix cells in the scope of this project, is 8 bytes long. This total is
+fairly close to the L1 cache size of the processor used in the cluster
+($32\,\mathrm{KB} = 32768$ bytes), which is expected given that the
+algorithm needs to exploit fast memory as much as possible. The reason the
+empirically best value fills only about half of the cache lies in some
+real-life factors: for example, cache misses typically result in aligned
+loads which may bring in unnecessary data.
+
+A potential way to exploit the different cache levels is to apply the blocked
+matrix algorithm recursively multiple times. For example, OpenBLAS implements
+DGEMM by having two levels of matrix blocks to better exploit the L2 and L3
+caches found on most processors.
+
 \begin{figure}[t]
 \begin{center}
 \begin{tikzpicture}
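
As a quick check of the footprint arithmetic above, a small hypothetical helper (not part of the project):

/* Recomputes the worst-case block footprint: three s-by-s blocks of
 * 8-byte doubles (A sub-block, B sub-block, C_temp). */
#include <stdio.h>

int main(void) {
    const int s = 26;                   /* block dimension */
    const size_t cell = sizeof(double); /* 8 bytes per matrix cell */
    size_t bytes = cell * s * s * 3;
    printf("%zu bytes of %d (L1)\n", bytes, 32768); /* prints 16224 */
    return 0;
}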