\documentclass[unicode,11pt,a4paper,oneside,numbers=endperiod,openany]{scrartcl}

\input{assignment.sty}
\usepackage{fancyvrb}
\begin{document}
\setassignment
\setduedate{12.10.2022 (midnight)}

\serieheader{High-Performance Computing Lab}{2022}{Student: Claudio
Maggioni}{Discussed with: ---}{Solution for Project 1}{}
\newline

\assignmentpolicy
In this project you will practice memory access optimization,
performance-oriented programming, and OpenMP parallelization
on the ICS Cluster.

\section{Explaining Memory Hierarchies \punkte{25}}

\subsection{Memory Hierarchy Parameters of the Cluster}
Using \texttt{likwid-topology} to identify the cache topology and
\texttt{free -g} to identify the amount of primary memory, I
find the following values:
\begin{center}
\begin{tabular}{ll}
Main memory & 62 GB \\
L3 cache & 25 MB per socket \\
L2 cache & 256 kB per core \\
L1 cache & 32 kB per core
\end{tabular}
\end{center}

All values are reported in base-2 (IEC) units, i.e. kB, MB and GB here denote
$2^{10}$, $2^{20}$ and $2^{30}$ bytes respectively. The examined cluster node
has 2 sockets and a total of 20 cores (10 per socket). The cache topology
diagram reported by \texttt{likwid-topology -g} is the following:

\pagebreak[4]
% https://tex.stackexchange.com/a/171818
\begin{Verbatim}[fontsize=\tiny]
Socket 0:
+---------------------------------------------------------------------------------------------------------------+
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| |    0   | |    1   | |    2   | |    3   | |    4   | |    5   | |    6   | |    7   | |    8   | |    9   | |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| +-----------------------------------------------------------------------------------------------------------+ |
| |                                                   25 MB                                                   | |
| +-----------------------------------------------------------------------------------------------------------+ |
+---------------------------------------------------------------------------------------------------------------+
Socket 1:
+---------------------------------------------------------------------------------------------------------------+
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| |   10   | |   11   | |   12   | |   13   | |   14   | |   15   | |   16   | |   17   | |   18   | |   19   | |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| +-----------------------------------------------------------------------------------------------------------+ |
| |                                                   25 MB                                                   | |
| +-----------------------------------------------------------------------------------------------------------+ |
+---------------------------------------------------------------------------------------------------------------+
\end{Verbatim}
\subsection{Memory Access Pattern of \texttt{membench.c}}

The benchmark \texttt{membench.c} measures the average time of repeated read and
write operations across a set of indices of a stack-allocated array of 32-bit
signed integers. The indices vary according to the access pattern used, which in
turn is defined by two variables, \texttt{csize} and \texttt{stride}.
\texttt{csize} is an exclusive upper bound on the index value, i.e. every index
accessed in the pattern is strictly smaller than it. \texttt{stride}
determines the difference between consecutive array indices, i.e. a
\texttt{stride} of 1 will access every array index, a \texttt{stride} of 2 will
skip every other index, a \texttt{stride} of 4 will access one index then skip 3,
and so on.

Therefore, for \texttt{csize = 128} and \texttt{stride = 1} the benchmark will
access all indices between 0 and 127 sequentially, and for \texttt{csize =
$2^{20}$} and \texttt{stride = $2^{10}$} the benchmark will access index 0, then
index $2^{10}$, then index $2 \cdot 2^{10}$, and so on up to index
$2^{20} - 2^{10}$.
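More generally, the pattern can be written down as a short formula. This is a
sketch that assumes the inner loop starts at index 0 and advances by
\texttt{stride} while the index stays below \texttt{csize}, and that both
values are powers of two; the symbols $i_k$ and $k$ are introduced here only
for illustration and do not appear in the benchmark source:
\[
i_k = k \cdot \mbox{\texttt{stride}}, \qquad
k = 0, 1, \dots, \frac{\mbox{\texttt{csize}}}{\mbox{\texttt{stride}}} - 1 .
\]
Under these assumptions, \texttt{csize = $2^{20}$} and \texttt{stride =
$2^{10}$} give $2^{20} / 2^{10} = 2^{10}$ accesses per pass, the last one at
index $(2^{10} - 1) \cdot 2^{10} = 2^{20} - 2^{10}$.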
\subsection{Analyzing Benchmark Results}

The \texttt{membench.c} benchmark results for my personal laptop (MacBook Pro
2018 with a Core i7-8750H CPU) and for the cluster are shown below, in that
order:

\begin{center}
\includegraphics[width=12cm]{generic_macos.pdf}
\includegraphics[width=12cm]{generic_cluster.pdf}
\end{center}

The memory access graph for the cluster's benchmark results shows that temporal
locality is best for small array sizes and for small \texttt{stride} values.
In particular, for array memory sizes of 16 MB or lower (\texttt{csize} of $4
\cdot 2^{20}$ or lower) and \texttt{stride} values of 2048 or lower the mean
read+write time is less than 10 nanoseconds. Temporal locality is worst for
large sizes and strides, although for each size the largest values of
\texttt{stride} (like \texttt{csize / 2} and \texttt{csize / 4}) achieve better
mean times because the pattern then touches only a few elements; this holds for
every size series shown in the graph.
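As a rough check of the sizes quoted above, the following arithmetic relates
\texttt{csize} to bytes and to the cache capacities listed in the first
subsection. It is only a sketch: it assumes 4-byte array elements (the 32-bit
integers mentioned earlier) and base-2 units throughout.
\[
\begin{array}{rcl}
4 \cdot 2^{20} \mbox{ elements} \times 4 \mbox{ B} & = & 2^{24} \mbox{ B} = 16 \mbox{ MB} \\
\mbox{L1: } 32 \mbox{ kB} / 4 \mbox{ B} & = & 2^{13} = 8192 \mbox{ elements} \\
\mbox{L2: } 256 \mbox{ kB} / 4 \mbox{ B} & = & 2^{16} = 65536 \mbox{ elements} \\
\mbox{L3: } 25 \mbox{ MB} / 4 \mbox{ B} & = & 6.25 \cdot 2^{20} = 6553600 \mbox{ elements}
\end{array}
\]
Under these assumptions a 16 MB array is still smaller than the 25 MB L3 cache
of one socket, which would be consistent with the low access times observed up
to that size; this is an interpretation of the figures, not an additional
measurement.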
\section{Optimize Square Matrix-Matrix Multiplication \punkte{60}}

\section{Quality of the Report \punkte{15}}
\end{document}