diff --git a/Project1/project_1_maggioni_claudio.pdf b/Project1/project_1_maggioni_claudio.pdf
index a8ebbe8..00744a3
Binary files a/Project1/project_1_maggioni_claudio.pdf and b/Project1/project_1_maggioni_claudio.pdf differ
diff --git a/Project1/project_1_maggioni_claudio.tex b/Project1/project_1_maggioni_claudio.tex
index 244f798..3d9c71b 100644
--- a/Project1/project_1_maggioni_claudio.tex
+++ b/Project1/project_1_maggioni_claudio.tex
@@ -125,7 +125,6 @@ and a total of 20 cores (10 per socket). The cache topology diagram reported by
     \fill (0.5,-2.5) circle [radius=2pt];
     \fill (6,-2.5) circle [radius=2pt];
-    \fill (11.5,-2.5) circle [radius=2pt];
     \foreach \r in {0,1,2}{
       \draw (\r + 0.5, -3.5) node {$\r$};
     }
@@ -133,14 +132,11 @@ and a total of 20 cores (10 per socket). The cache topology diagram reported by
       \FPeval{l}{round(\r - 12, 0)}
       \draw (\r + 0.5, -3.5) node {\tiny $2^{20} \l$};
     }
-    \foreach \r in {4.5,5.5}{
-      \FPeval{l}{round(\r - 6.5, 0)}
-      \draw (\r + 0.5, -3.5) node {\tiny $2^{19} \l$};
-    }
-    \draw (7,-3.5) node {\tiny $2^{19}$};
+    \draw (5, -3.5) node {\tiny $2^{19} - 1$};
+    \draw (6, -3.5) node {\tiny $2^{19}$};
+    \draw (7,-3.5) node {\tiny $2^{19} + 1$};
     \draw[->-] (-0.5,-2.5) to[bend left] (0.5,-2.5);
     \draw[->-] (0.5,-2.5) to[bend left] (6,-2.5);
-    \draw[->-] (6,-2.5) to[bend left] (11.5,-2.5);
   \end{tikzpicture}
 \end{center}
 \caption{Memory access patterns of \texttt{membench.c} for \texttt{csize =
@@ -158,14 +154,33 @@ highest index used to access the array in the pattern. \texttt{stride}
 determines the difference between array indexes over access iterations, i.e. a
 \texttt{stride} of 1 will access every array index, a \texttt{stride} of 2 will
 skip every other index, a \texttt{stride} of 4 will access one index then skip 3
-and so on and so forth.
+and so on. The benchmark stops when the next index to access is strictly
+greater than \texttt{csize - stride}.
 
 Therefore, for \texttt{csize = 128} and \texttt{stride = 1} the array will
 access all indexes between 0 and 127 sequentially, and for \texttt{csize =
 $2^{20}$} and \texttt{stride = $2^{19}$} the benchmark will access index 0, then
-index $2^{19}-1$, and finally index $2^{20}-1$. The access patterns for these
+index $2^{19}$. The access patterns for these
 two configurations are shown visually in Figure \ref{fig:access}.
 
+By running \texttt{membench.c} both on my personal laptop and on the
+cluster, the results shown in Figure \ref{fig:mem} are obtained. \textit{csize}
+values are shown as different data series labeled by byte size, and
+\textit{stride} values are mapped on the $x$ axis by their byte-equivalent value
+as well\footnote{Byte values are a factor of 4 greater than the values used in
+the code and in Figure \ref{fig:mem}. This is because the array elements used in
+the benchmark are 32-bit signed integers, which take up 4 bytes each.}.
+For \texttt{csize = 128} ($512$ bytes) and \texttt{stride = 1} ($4$ bytes) the
+mean access time is $0.124$ nanoseconds, while for \texttt{csize = $2^{20}$}
+($4$MB) and \texttt{stride = $2^{19}$} ($2$MB) the mean access time is $1.156$
+nanoseconds. The first set of parameters performs well thanks to the low
+\textit{stride} value, thus achieving very good spatial locality and
+maximizing cache hits. However, the second set of parameters achieves good
+performance as well thanks to the few values accessed in each pass, thus
+improving the temporal locality of each address accessed. This observation
+also applies to the last few data points in each data series of Figure
+\ref{fig:mem}, i.e. to \textit{stride} values close to \textit{csize}.
+
 \subsection{Analyzing Benchmark Results}
 
 \begin{figure}[t]
@@ -196,8 +211,20 @@ In particular, for array memory sizes of 16MB or lower (\texttt{csize} of $4
 read+write time is less than 10 nanoseconds. Temporal locality is worst for
 large sizes and strides, although the largest values of \texttt{stride} for each
 size (like \texttt{csize / 2} and \texttt{csize / 4}) achieve better mean times
-due to the few elements accessed in the pattern (this observation is also valid
-for the largest strides of each size series shown in the graph).
+due to the aforementioned effect of \textit{stride} values being close to
+\textit{csize}.
+
+The pattern visible in the graphs, especially the one for the cluster, shows
+that the \textit{stride} axis is divided into regions with memory access times
+of similar magnitude. The boundary between the first and the second region is a
+\textit{stride} value of roughly 2KB, while a \textit{stride} of 512KB roughly
+separates the second and the third region. The difference in performance
+between regions and the similarity of performance within regions suggest that
+the threshold \textit{stride} values are related to changes in the use of the
+cache hierarchy. In particular, the first region may correspond to strides
+where the L1 cache, the fastest non-register memory available, is predominantly
+used. The second region might then correspond to a more intense use of the L2
+cache, and likewise the third region to the L3 cache.
 
 \marginpar[right text]{\color{white}\url{https://youtu.be/JzJlzGaQFoc}}
 \section{Optimize Square Matrix-Matrix Multiplication \punkte{60}}
@@ -296,7 +323,7 @@ In order to achieve a correct and fast execution, my implementation:
   discussed above to a final $6.5\%$).
 \end{itemize}
 
-The chosen matrix block size for running the benchmark on the cluster is
+The chosen matrix block size for running the benchmark on the cluster is:
 
 $$s = 26$$
 
@@ -314,7 +341,7 @@ bytes size is fairly close to the L1 cache size of the processor used in the
 cluster ($32\mathrm{KB} = 32768$ bytes), which is expected given that the
 algorithm needs to exploit fast memory as much as possible. The reason the
 empirically best value results in a theoretical cache allocation that is only
-half of the complete cache size is due to some real-life factors. For example,
+half of the complete L1 cache size is due to some real-life factors. For example,
 cache misses typically result in aligned page loads which may load unnecessary
 data.
 
@@ -375,6 +402,9 @@ described. However, the blocked implementation achives less than a tenth of
 FLOPS compared to Intel MKL BLAS based one due to the microarchitecture
 optimization the latter one is able to exploit.
 
+I was unable to run this benchmark suite on my personal machine due to Intel MKL
+installation issues that prevented the code from compiling.
+
 \begin{figure}[t]
 \includegraphics[width=\textwidth]{timing.pdf}
 \caption{GFlop/s per matrix size of the matrix multiplication benchmark for the naive,
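
For reference, the \texttt{membench.c} access pattern discussed in the hunks above (indices 0, \texttt{stride}, 2*\texttt{stride}, and so on, stopping once the next index would exceed \texttt{csize - stride}) can be summarized by a short C sketch. This is a minimal reconstruction based only on the behaviour stated in the text, not the verbatim benchmark source; the function name \texttt{access\_pattern} and the global array \texttt{x} are illustrative.

#include <stdio.h>

#define CSIZE (1 << 20)          /* number of int elements, as in csize = 2^20 */

static int x[CSIZE];             /* 32-bit elements, 4 bytes each */

/* Visit indices 0, stride, 2*stride, ..., stopping once the next index
 * would be strictly greater than csize - stride. */
static void access_pattern(int csize, int stride)
{
    for (int index = 0; index <= csize - stride; index += stride)
        x[index]++;              /* one read and one write per visited element */
}

int main(void)
{
    access_pattern(128, 1);           /* touches every index from 0 to 127   */
    access_pattern(CSIZE, CSIZE / 2); /* touches only index 0 and index 2^19 */
    printf("x[0] = %d\n", x[0]);      /* keep the accesses observable        */
    return 0;
}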
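
The claim above that the chosen block size $s = 26$ corresponds to roughly half of the 32768-byte L1 cache can be checked with a short calculation. This assumes the footprint counts one $s \times s$ block for each of the three matrices involved, stored as 8-byte double-precision values; the exact accounting used in the report is not visible in this diff.

$$3 \cdot s^2 \cdot 8\ \mathrm{bytes} = 3 \cdot 26^2 \cdot 8\ \mathrm{bytes} = 16224\ \mathrm{bytes} \approx \frac{32768}{2}\ \mathrm{bytes}$$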