\documentclass[unicode,11pt,a4paper,oneside,numbers=endperiod,openany]{scrartcl} \input{assignment.sty} \usepackage{fancyvrb} \begin{document} \setassignment \setduedate{12.10.2022 (midnight)} \serieheader{High-Performance Computing Lab}{2022}{Student: Claudio Maggioni}{Discussed with: ---}{Solution for Project 1}{} \newline \assignmentpolicy In this project you will practice memory access optimization, performance-oriented programming, and OpenMP parallelizaton on the ICS Cluster . \section{Explaining Memory Hierarchies \punkte{25}} \subsection{Memory Hierarchy Parameters of the Cluster} By identifying the memory hierarchy parameters through \texttt{likwid-topology} for the cache topology and \texttt{free -g} for the amount of primary memory I find the following values: \begin{center} \begin{tabular}{llll} Main memory & 62 GB \\ L3 cache & 25 MB per socket \\ L2 cache & 256 kB per core \\ L1 cache & 32 kB per core \end{tabular} \end{center} All values are reported using base 2 IEC byte units. The cluster has 2 sockets and a total of 20 cores (10 per socket). The cache topology diagram reported by \texttt{likwid-topology -g} is the following: \pagebreak[4] % https://tex.stackexchange.com/a/171818 \begin{Verbatim}[fontsize=\tiny] Socket 0: +---------------------------------------------------------------------------------------------------------------+ | +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ | | | 0 | | 1 | | 2 | | 3 | | 4 | | 5 | | 6 | | 7 | | 8 | | 9 | | | +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ | | +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ | | | 32 kB | | 32 kB | | 32 kB | | 32 kB | | 32 kB | | 32 kB | | 32 kB | | 32 kB | | 32 kB | | 32 kB | | | +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ | | +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ | | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | | +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ | | +-----------------------------------------------------------------------------------------------------------+ | | | 25 MB | | | +-----------------------------------------------------------------------------------------------------------+ | +---------------------------------------------------------------------------------------------------------------+ Socket 1: +---------------------------------------------------------------------------------------------------------------+ | +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ | | | 10 | | 11 | | 12 | | 13 | | 14 | | 15 | | 16 | | 17 | | 18 | | 19 | | | +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ | | +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ | | | 32 kB | | 32 kB | | 32 kB | | 32 kB | | 32 kB | | 32 kB | | 32 kB | | 32 kB | | 32 kB | | 32 kB | | | +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ | | +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ | | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | | +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ | | +-----------------------------------------------------------------------------------------------------------+ | | | 25 MB | | | +-----------------------------------------------------------------------------------------------------------+ | +---------------------------------------------------------------------------------------------------------------+ \end{Verbatim} \subsection{Memory Access Pattern of \texttt{membench.c}} The benchmark \texttt{membench.c} measures the average time of repeated read and write overations across a set of indices of a stack-allocated array of 32-bit signed integers. The indices vary according to the access pattern used, which in turn is defined by two variables, \texttt{csize} and \texttt{stride}. \texttt{csize} is an upper bound on the index value, i.e. (one more of) the highest index used to access the array in the pattern. \texttt{stride} determines the difference between array indexes over access iterations, i.e. a \texttt{stride} of 1 will access every array index, a \texttt{stride} of 2 will skip every other index, a \texttt{stride} of 4 will access one index then skip 3 and so on and so forth. Therefore, for \texttt{csize = 128} and \texttt{stride = 1} the array will access all indexes between 0 and 127 sequentially, and for \texttt{csize = $2^{20}$} and \texttt{stride = $2^{10}$} the benchmark will access index 0, then index $2^{10}-1$, and finally index $2^{20}-1$. \subsection{Analyzing Benchmark Results} The \texttt{membench.c} benchmark results for my personal laptop (Macbook Pro 2018 with a Core i7-8750H CPU) and the cluster are shown below respectively: \begin{center} \includegraphics[width=12cm]{generic_macos.pdf} \includegraphics[width=12cm]{generic_cluster.pdf} \end{center} The memory access graph for the cluster's benchmark results shows that temporal locality is best for small array sizes and for small \texttt{stride} values. In particular, for array memory sizes of 16MB or lower (\texttt{csize} of $4 \cdot 2^{20}$ or lower) and \texttt{stride} values of 2048 or lower the mean read+write time is less than 10 nanoseconds. Temporal locality is worst for large sizes and strides, although the largest values of \texttt{stride} for each size (like \texttt{csize / 2} and \texttt{csize / 4}) achieve better mean times due to the few elements accessed in the pattern (this observation is also valid for the largest strides of each size series shown in the graph). \section{Optimize Square Matrix-Matrix Multiplication \punkte{60}} \section{Quality of the Report \punkte{15}} \end{document}