wip
This commit is contained in:
parent bc6d82f7b0
commit d2d8fb38d2
2 changed files with 43 additions and 13 deletions
Binary file not shown.
@@ -125,7 +125,6 @@ and a total of 20 cores (10 per socket). The cache topology diagram reported by
  \fill (0.5,-2.5) circle [radius=2pt];
  \fill (6,-2.5) circle [radius=2pt];
  \fill (11.5,-2.5) circle [radius=2pt];
  \foreach \r in {0,1,2}{
    \draw (\r + 0.5, -3.5) node {$\r$};
  }
@@ -133,14 +132,11 @@ and a total of 20 cores (10 per socket). The cache topology diagram reported by
    \FPeval{l}{round(\r - 12, 0)}
    \draw (\r + 0.5, -3.5) node {\tiny $2^{20} \l$};
  }
  \foreach \r in {4.5,5.5}{
    \FPeval{l}{round(\r - 6.5, 0)}
    \draw (\r + 0.5, -3.5) node {\tiny $2^{19} \l$};
  }
  \draw (7,-3.5) node {\tiny $2^{19}$};
  \draw (5, -3.5) node {\tiny $2^{19} - 1$};
  \draw (6, -3.5) node {\tiny $2^{19}$};
  \draw (7,-3.5) node {\tiny $2^{19} + 1$};
  \draw[->-] (-0.5,-2.5) to[bend left] (0.5,-2.5);
  \draw[->-] (0.5,-2.5) to[bend left] (6,-2.5);
  \draw[->-] (6,-2.5) to[bend left] (11.5,-2.5);
\end{tikzpicture}
\end{center}
\caption{Memory access patterns of \texttt{membench.c} for \texttt{csize =
@@ -158,14 +154,33 @@ highest index used to access the array in the pattern. \texttt{stride}
determines the difference between array indexes over access iterations, i.e.\ a
\texttt{stride} of 1 will access every array index, a \texttt{stride} of 2 will
skip every other index, a \texttt{stride} of 4 will access one index and then
skip three, and so on. The benchmark stops when the index to access is strictly
greater than \texttt{csize - stride}.
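
As a point of reference, the access loop implied by this description can be
sketched as follows. This is a minimal approximation rather than the actual
\texttt{membench.c} source: the real benchmark also times the accesses and
repeats the sweep, and the \texttt{CSIZE}/\texttt{STRIDE} values below are
placeholders.

\begin{verbatim}
#include <stdio.h>

#define CSIZE  128  /* placeholder values; the benchmark sweeps many pairs */
#define STRIDE 4

static int x[CSIZE]; /* 32-bit signed integers, 4 bytes each */

int main(void) {
    /* Advance by STRIDE; stop once the next index would be strictly
       greater than CSIZE - STRIDE. */
    int last = 0;
    for (int index = 0; index <= CSIZE - STRIDE; index += STRIDE) {
        x[index]++; /* one read and one write per visited element */
        last = index;
    }
    printf("last index accessed: %d\n", last);
    return 0;
}
\end{verbatim}

With \texttt{CSIZE = 128} and \texttt{STRIDE = 4} this touches indices $0, 4,
8, \dots, 124$, matching the skipping behaviour described above.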
Therefore, for \texttt{csize = 128} and \texttt{stride = 1} the benchmark will
access all indexes between 0 and 127 sequentially, while for \texttt{csize =
$2^{20}$} and \texttt{stride = $2^{19}$} it will access index 0 and then
index $2^{19}-1$. The access patterns for these two configurations are shown
visually in Figure \ref{fig:access}.

Running \texttt{membench.c} both on my personal laptop and on the cluster
produces the results shown in Figure \ref{fig:mem}. \textit{csize} values are
shown as different data series labeled by byte size, and \textit{stride} values
are mapped on the $x$ axis by their byte-equivalent value as
well\footnote{Byte values are a factor of 4 greater than the values used in the
code and in Figure \ref{fig:mem}. This is because the array elements used in
the benchmark are 32-bit signed integers, which take up 4 bytes each.}. For
\texttt{csize = 128} (512 bytes) and \texttt{stride = 1} (4 bytes) the mean
access time is $0.124$ nanoseconds, while for \texttt{csize = $2^{20}$} (4MB)
and \texttt{stride = $2^{19}$} (2MB) the mean access time is $1.156$
nanoseconds. The first set of parameters performs well thanks to the low
\textit{stride} value, thus achieving very good spatial locality and maximizing
cache hits. However, the second set of parameters achieves good performance as
well thanks to the few values accessed in each pass, which improves the
temporal locality of each accessed address. This observation applies to the
last few data points in each data series of Figure \ref{fig:mem}, i.e.\ to
\textit{stride} values close to \textit{csize}.

\subsection{Analyzing Benchmark Results}

\begin{figure}[t]
@@ -196,8 +211,20 @@ In particular, for array memory sizes of 16MB or lower (\texttt{csize} of $4
read+write time is less than 10 nanoseconds. Temporal locality is worst for
large sizes and strides, although the largest values of \texttt{stride} for each
size (like \texttt{csize / 2} and \texttt{csize / 4}) achieve better mean times
due to the aforementioned effect of \textit{stride} values being close to
\textit{csize}.

The pattern that can be read from the graphs, especially the one for the
cluster, shows that the \textit{stride} axis is divided into regions with
memory access times of similar magnitude. The boundary between the first and
the second region is a \textit{stride} value of roughly 2KB, while a
\textit{stride} of roughly 512KB separates the second and the third region. The
difference in performance between regions and the similarity of performance
within regions suggest that the threshold stride values are related to changes
in the use of the cache hierarchy. In particular, the first region may
correspond to access patterns where the L1 cache, the fastest non-register
memory available, is predominantly used. The second region might then
correspond to a more intense use of the L2 cache, and likewise the third
region to the L3 cache.
\marginpar[right text]{\color{white}\url{https://youtu.be/JzJlzGaQFoc}}
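
One way to check this interpretation against the hardware (assuming the cluster
nodes run Linux with glibc, which exposes cache sizes through \texttt{sysconf})
is to query the cache sizes directly and compare them with the threshold
strides observed above; a minimal sketch:

\begin{verbatim}
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* glibc-specific queries; each returns the cache size in bytes,
       or 0/-1 if the value is not exposed on this system. */
    printf("L1d cache: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L2 cache:  %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 cache:  %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}
\end{verbatim}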
\section{Optimize Square Matrix-Matrix Multiplication \punkte{60}}
@@ -296,7 +323,7 @@ In order to achieve a correct and fast execution, my implementation:
discussed above to a final $6.5\%$).
\end{itemize}

The chosen matrix block size for running the benchmark on the cluster is:

$$s = 26$$
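
For context, the loop structure of a blocked multiplication with block size $s$
is sketched below. This is an illustrative row-major version, not the submitted
implementation, and the name \texttt{blocked\_dgemm} is made up for the
example.

\begin{verbatim}
#include <stddef.h>

#define S 26 /* block size s discussed above */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B for n-by-n row-major matrices of doubles. */
void blocked_dgemm(size_t n, const double *A, const double *B, double *C) {
    for (size_t ii = 0; ii < n; ii += S)
        for (size_t jj = 0; jj < n; jj += S)
            for (size_t kk = 0; kk < n; kk += S)
                /* The three s-by-s blocks touched here are what should
                   fit in fast memory at the same time. */
                for (size_t i = ii; i < min_sz(ii + S, n); ++i)
                    for (size_t k = kk; k < min_sz(kk + S, n); ++k) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < min_sz(jj + S, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
\end{verbatim}

Each innermost pass works on one $s \times s$ block of each operand, which is
what ties the choice of $s$ to the cache size discussed next.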
@@ -314,7 +341,7 @@ bytes size is fairly close to the L1 cache size of the processor used in the
cluster ($32\,\mathrm{KB} = 32768$ bytes), which is expected given that the
algorithm needs to exploit fast memory as much as possible. The empirically
best value results in a theoretical cache allocation that is only about half of
the complete L1 cache size because of some real-life factors. For example,
cache misses typically result in aligned cache-line loads which may bring in
unnecessary data.
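
As a rough sanity check, assuming one $s \times s$ block of each of $A$, $B$
and $C$ is kept in fast memory at the same time, with 8-byte double-precision
elements (the exact accounting is done earlier in the report):

$$3 \cdot s^2 \cdot 8 \,\mathrm{bytes} = 3 \cdot 26^2 \cdot 8 \,\mathrm{bytes} = 16224 \,\mathrm{bytes} \approx 0.5 \cdot 32768 \,\mathrm{bytes}$$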
@@ -375,6 +402,9 @@ described. However, the blocked implementation achieves less than a tenth of
the FLOPS of the Intel MKL BLAS based one, due to the microarchitectural
optimizations the latter is able to exploit.

I was unable to run this benchmark suite on my personal machine due to Intel
MKL installation issues that prevented the code from compiling.
\begin{figure}[t]
\includegraphics[width=\textwidth]{timing.pdf}
\caption{GFlop/s per matrix size of the matrix multiplication benchmark for the naive,