hw1: work on ex2 report

This commit is contained in:
Claudio Maggioni 2022-10-05 10:03:17 +02:00
parent 8013c535e6
commit f6dd2a2d6b
3 changed files with 1183 additions and 10 deletions

View file

@ -1,10 +1,15 @@
\documentclass[unicode,11pt,a4paper,oneside,numbers=endperiod,openany]{scrartcl} \documentclass[unicode,11pt,a4paper,oneside,numbers=endperiod,openany]{scrartcl}
\input{assignment.sty} \input{assignment.sty}
\usepackage{float}
\usepackage{subcaption}
\usepackage{graphicx}
\usepackage{fancyvrb} \usepackage{fancyvrb}
\begin{document} \usepackage{tikz}
\begin{document}
\setassignment \setassignment
\setduedate{12.10.2022 (midnight)} \setduedate{12.10.2022 (midnight)}
@ -13,8 +18,8 @@ Maggioni}{Discussed with: ---}{Solution for Project 1}{}
\newline \newline
\assignmentpolicy \assignmentpolicy
In this project you will practice memory access optimization, performance-oriented programming, and OpenMP parallelizaton In this project you will practice memory access optimization,
on the ICS Cluster . performance-oriented programming, and OpenMP parallelizaton on the ICS Cluster.
\section{Explaining Memory Hierarchies \punkte{25}} \section{Explaining Memory Hierarchies \punkte{25}}
@ -92,13 +97,26 @@ index $2^{10}-1$, and finally index $2^{20}-1$.
\subsection{Analyzing Benchmark Results} \subsection{Analyzing Benchmark Results}
The \texttt{membench.c} benchmark results for my personal laptop (Macbook Pro \begin{figure}[t]
2018 with a Core i7-8750H CPU) and the cluster are shown below respectively: \begin{subfigure}{0.5\textwidth}
\includegraphics[width=\textwidth]{generic_macos.pdf}
\caption{Personal laptop}
\label{fig:mem:laptop}
\end{subfigure}
\begin{subfigure}{0.5\textwidth}
\includegraphics[width=\textwidth]{generic_cluster.pdf}
\caption{Cluster}
\label{fig:mem:cluster}
\end{subfigure}
\caption{Results of the \texttt{membench.c} benchmark for both my personal
laptop (in Figure \ref{fig:mem:laptop}) and the cluster (in Figure
\ref{fig:mem:cluster}).}
\label{fig:mem}
\end{figure}
\begin{center} The \texttt{membench.c} benchmark results for my personal laptop (Macbook Pro
\includegraphics[width=12cm]{generic_macos.pdf} 2018 with a Core i7-8750H CPU) and the cluster are shown in figure
\includegraphics[width=12cm]{generic_cluster.pdf} \ref{fig:mem}.
\end{center}
The memory access graph for the cluster's benchmark results shows that temporal The memory access graph for the cluster's benchmark results shows that temporal
locality is best for small array sizes and for small \texttt{stride} values. locality is best for small array sizes and for small \texttt{stride} values.
@ -112,8 +130,104 @@ for the largest strides of each size series shown in the graph).
\section{Optimize Square Matrix-Matrix Multiplication \punkte{60}} \section{Optimize Square Matrix-Matrix Multiplication \punkte{60}}
The file \texttt{matmult/dgemm-blocked.c} contains a C implementation of the
blocked matrix multiplication algorithm presented in the project. Other than
implementing the pseudocode, my implementation:
\section{Quality of the Report \punkte{15}} \begin{figure}[t]
\begin{center}
\begin{tikzpicture}
\fill[blue!60!white] (4,0) rectangle (5,-2);
\fill[blue!40!white] (4,-2) rectangle (5,-4);
\fill[blue!60!white] (0,-4) rectangle (2,-5);
\fill[blue!40!white] (2,-4) rectangle (4,-5);
\fill[green!40!white] (4,-4) rectangle (5,-5);
\draw[step=1,gray,very thin] (0,0) grid (5,-5);
\draw[step=2] (0,0) grid (5,-5);
\draw[step=5] (0,0) grid (5,-5);
\end{tikzpicture}
\end{center}
\caption{Result of the block division process of a square matrix of size 5
using a block size of 2. The 2-by-1 and 1-by-2 rectangular remainders are
shown in blue and the square matrix of remainder size (i.e. 1) is shown in
green.}
\label{fig:matrix}
\end{figure}
\begin{itemize}
\item Handles the edge cases related to the ``remainders'' in the matrix
block division, i.e. when the division between the size of the matrix
and the block size yields a remainder. Assuming only squared matrices
are multiplied through the algorithm (as in the test suite provided) the
block division could yield rectangular matrix blocks located in the last
rows and columns of each matrix, and the bottom-right corner of the
matrix will be contained in a square matrix block of the size of the
remainder. The result of this process is shown in Figure
\ref{fig:matrix};
\item Converts matrix A into row major format. As shown in Figure
\ref{fig:iter}, by having A in row major format and B in column major
format, iterations across matrix block in the inner most loop of the
algorithm (the one calling \textit{naivemm}) cache hits are maximised by
achieving space locality between the blocks used;
\item Caches the result of each innermost iteration into a temporary matrix
of block size before storing it into matrix C. This achieves better
space locality when \textit{naivemm} needs to store values in matrix C.
The block size temporary matrix has virtually no stride and thus cache
hits are maximised. The copy operation is implemented with bulk copy
\texttt{memcpy} calls.
\end{itemize}
\begin{figure}[t]
\begin{center}
\begin{tikzpicture}
\node[align=center] at (2.5,0.5) {Matrix A};
\fill[orange!10!white] (0,0) rectangle (2,-2);
\fill[orange!25!white] (2,0) rectangle (4,-2);
\fill[orange!40!white] (4,0) rectangle (5,-2);
\draw[step=1,gray,very thin] (0,0) grid (5,-5);
\draw[step=2,black,thick] (0,0) grid (5,-5);
\draw[step=5,black,thick] (0,0) grid (5,-5);
\draw[-to,step=1,red,very thick] (0.5,-0.5) -- (4.5,-0.5);
\draw[-to,step=1,red,very thick] (0.5,-1.5) -- (4.5,-1.5);
\draw[-to,step=1,red,very thick] (0.5,-2.5) -- (4.5,-2.5);
\draw[-to,step=1,red,very thick] (0.5,-3.5) -- (4.5,-3.5);
\draw[-to,step=1,red,very thick] (0.5,-4.5) -- (4.5,-4.5);
\node[align=center] at (8.5,0.5) {Matrix B};
\fill[orange!10!white] (6,0) rectangle (8,-2);
\fill[orange!25!white] (6,-2) rectangle (8,-4);
\fill[orange!40!white] (6,-4) rectangle (8,-5);
\draw[step=1,gray,very thin] (6,0) grid (11,-5);
\draw[step=2,black,thick] (6,0) grid (11,-5);
\draw[step=5,black,thick] (6,0) grid (11,-5);
\draw[black,thick] (11,0) -- (11,-5);
\draw[-to,step=1,red,very thick] (6.5,-0.5) -- (6.5,-4.5);
\draw[-to,step=1,red,very thick] (7.5,-0.5) -- (7.5,-4.5);
\draw[-to,step=1,red,very thick] (8.5,-0.5) -- (8.5,-4.5);
\draw[-to,step=1,red,very thick] (9.5,-0.5) -- (9.5,-4.5);
\draw[-to,step=1,red,very thick] (10.5,-0.5) -- (10.5,-4.5);
\end{tikzpicture}
\end{center}
\caption{Inner most loop iteration of the blocked GEMM algorithm across
matrices A and B. The red lines represent the ``majorness'' of each matrix
(A is converted by the algorithm in row-major form, while B is given and
used in column-major form). The shades of orange represent the blocks used
in each iteration.}
\label{fig:iter}
\end{figure}
The results of the matrix multiplication benchmark for the naive, blocked, and
BLAS implementations are shown in Figure \ref{fig:bench}.
\begin{figure}[t]
\includegraphics[width=\textwidth]{timing.pdf}
\caption{Results of the matrix multiplication benchmark for the naive,
blocked, and BLAS implementations}
\label{fig:bench}
\end{figure}
\end{document} \end{document}

1059
Project1/timing.pdf Normal file

File diff suppressed because it is too large Load diff