hw1: done ex1

Claudio Maggioni 2022-09-27 10:39:48 +02:00
parent 27fc66cf14
commit 262701b276
5 changed files with 6398 additions and 2 deletions

3183
Project1/generic_cluster.pdf Normal file

File diff suppressed because it is too large

3173
Project1/generic_macos.pdf Normal file

File diff suppressed because it is too large


@ -18,6 +18,8 @@ on the ICS Cluster .
\section{Explaining Memory Hierarchies \punkte{25}}
\subsection{Memory Hierarchy Parameters of the Cluster}
By identifying the memory hierarchy parameters through \texttt{likwid-topology}
for the cache topology and \texttt{free -g} for the amount of primary memory I
find the following values:
@ -70,6 +72,44 @@ Socket 1:
+---------------------------------------------------------------------------------------------------------------+
\end{Verbatim}
\subsection{Memory Access Pattern of \texttt{membench.c}}
The benchmark \texttt{membench.c} measures the average time of repeated read and
write operations across a set of indices of a stack-allocated array of 32-bit
signed integers. The indices vary according to the access pattern used, which in
turn is defined by two variables, \texttt{csize} and \texttt{stride}.
\texttt{csize} is an upper bound on the index value, i.e. one more than the
highest index that the pattern may access. \texttt{stride}
determines the difference between consecutive array indices accessed, i.e. a
\texttt{stride} of 1 will access every array index, a \texttt{stride} of 2 will
skip every other index, a \texttt{stride} of 4 will access one index then skip 3,
and so on.
Therefore, for \texttt{csize = 128} and \texttt{stride = 1} the benchmark will
access all indices between 0 and 127 sequentially, and for \texttt{csize =
$2^{20}$} and \texttt{stride = $2^{10}$} the benchmark will access index 0, then
index $2^{10}$, then index $2 \cdot 2^{10}$, and so on in steps of $2^{10}$ up
to index $2^{20} - 2^{10}$.
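The access pattern described in this subsection can be sketched in C as follows. This is a minimal illustration and not the actual source of \texttt{membench.c}: the helper names \texttt{touch\_pattern} and \texttt{last\_index} are hypothetical, and the loop bound (every multiple of \texttt{stride} strictly below \texttt{csize}) is an assumption derived from the description above, so the real benchmark may compute its limit differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of membench.c): performs one pass of the
 * access pattern over x and returns how many elements were touched.
 * Assumption: the pattern visits indices 0, stride, 2*stride, ... that are
 * strictly below csize. The caller must ensure x has at least csize elements. */
static size_t touch_pattern(int *x, size_t csize, size_t stride)
{
    assert(stride > 0); /* a zero stride would never terminate */

    size_t touched = 0;
    for (size_t index = 0; index < csize; index += stride) {
        x[index] = x[index] + 1; /* one read followed by one write */
        touched++;
    }
    return touched;
}

/* Hypothetical helper: the highest index visited by the pattern above,
 * i.e. the largest multiple of stride that is strictly below csize. */
static size_t last_index(size_t csize, size_t stride)
{
    assert(csize > 0 && stride > 0);
    return ((csize - 1) / stride) * stride;
}
```

Under these assumptions, calling \texttt{touch\_pattern} with \texttt{csize = 128} and \texttt{stride = 1} touches 128 elements, ending at index 127, while \texttt{csize = $2^{20}$} with \texttt{stride = $2^{10}$} touches 1024 elements, ending at index $2^{20} - 2^{10}$.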
\subsection{Analyzing Benchmark Results}
The \texttt{membench.c} benchmark results for my personal laptop (MacBook Pro
2018 with a Core i7-8750H CPU) and for the cluster are shown below, in that
order:
\begin{center}
\includegraphics[width=12cm]{generic_macos.pdf}
\includegraphics[width=12cm]{generic_cluster.pdf}
\end{center}
The memory access graph for the cluster's benchmark results shows that temporal
locality is best for small array sizes and for small \texttt{stride} values.
In particular, for array memory sizes of 16 MB or lower (\texttt{csize} of $4
\cdot 2^{20}$ or lower) and \texttt{stride} values of 2048 or lower the mean
read+write time is less than 10 nanoseconds. Temporal locality is worst for
large sizes and strides, although the largest values of \texttt{stride} for each
size series shown in the graph (like \texttt{csize / 2} and \texttt{csize / 4})
achieve better mean times, because the pattern then touches only a few elements,
which are likely to remain cached between repetitions.
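
As a quick check of the figures quoted in this subsection, the following worked arithmetic assumes the 4-byte integers described earlier and a \texttt{stride} that divides \texttt{csize} evenly. The symbol $N_{\mathrm{accessed}}$ is introduced here only for illustration and denotes the number of elements touched in one pass:
\[
  4 \cdot 2^{20} \cdot 4\,\mathrm{B} = 16 \cdot 2^{20}\,\mathrm{B} = 16\,\mathrm{MB},
  \qquad
  N_{\mathrm{accessed}} = \frac{\texttt{csize}}{\texttt{stride}}
\]
Under these assumptions a \texttt{stride} of \texttt{csize / 2} touches only 2 elements per pass and a \texttt{stride} of \texttt{csize / 4} only 4, regardless of the array size.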
\section{Optimize Square Matrix-Matrix Multiplication \punkte{60}}


@ -1650,11 +1650,11 @@ LTb
LCb setrgbcolor
LCb setrgbcolor
3774 4829 M
[ [(Helvetica) 140.0 0.0 true true 0 (10-Core Intel\(R\) Xeon\(R\) CPU E3-1585L v5 )]
[ [(Helvetica) 140.0 0.0 true true 0 (6-Core Intel\(R\) Core\(R\) CPU i7-8750H )]
XYsave
[(Helvetica) 140.0 0.0 true true 0 ( )]
XYrestore
[(Helvetica) 140.0 0.0 true true 0 (3.00GHz Read+Write \(ns\) Versus Stride)]
[(Helvetica) 140.0 0.0 true true 0 (4.10GHz Read+Write \(ns\) Versus Stride)]
] -46.7 MCshow
/Helvetica findfont 140 scalefont setfont
LTb