hw1: work on ex2 report
This commit is contained in:
parent
8013c535e6
commit
f6dd2a2d6b
3 changed files with 1183 additions and 10 deletions
Binary file not shown.
|
@ -1,10 +1,15 @@
|
||||||
\documentclass[unicode,11pt,a4paper,oneside,numbers=endperiod,openany]{scrartcl}
|
\documentclass[unicode,11pt,a4paper,oneside,numbers=endperiod,openany]{scrartcl}
|
||||||
|
|
||||||
\input{assignment.sty}
|
\input{assignment.sty}
|
||||||
|
\usepackage{float}
|
||||||
|
\usepackage{subcaption}
|
||||||
|
\usepackage{graphicx}
|
||||||
\usepackage{fancyvrb}
|
\usepackage{fancyvrb}
|
||||||
\begin{document}
|
\usepackage{tikz}
|
||||||
|
|
||||||
|
|
||||||
|
\begin{document}
|
||||||
|
|
||||||
\setassignment
|
\setassignment
|
||||||
\setduedate{12.10.2022 (midnight)}
|
\setduedate{12.10.2022 (midnight)}
|
||||||
|
|
||||||
|
@ -13,8 +18,8 @@ Maggioni}{Discussed with: ---}{Solution for Project 1}{}
|
||||||
\newline
|
\newline
|
||||||
|
|
||||||
\assignmentpolicy
|
\assignmentpolicy
|
||||||
In this project you will practice memory access optimization, performance-oriented programming, and OpenMP parallelizaton
|
In this project you will practice memory access optimization,
|
||||||
on the ICS Cluster .
|
performance-oriented programming, and OpenMP parallelizaton on the ICS Cluster.
|
||||||
|
|
||||||
\section{Explaining Memory Hierarchies \punkte{25}}
|
\section{Explaining Memory Hierarchies \punkte{25}}
|
||||||
|
|
||||||
|
@ -92,13 +97,26 @@ index $2^{10}-1$, and finally index $2^{20}-1$.
|
||||||
|
|
||||||
\subsection{Analyzing Benchmark Results}
|
\subsection{Analyzing Benchmark Results}
|
||||||
|
|
||||||
The \texttt{membench.c} benchmark results for my personal laptop (Macbook Pro
|
\begin{figure}[t]
|
||||||
2018 with a Core i7-8750H CPU) and the cluster are shown below respectively:
|
\begin{subfigure}{0.5\textwidth}
|
||||||
|
\includegraphics[width=\textwidth]{generic_macos.pdf}
|
||||||
|
\caption{Personal laptop}
|
||||||
|
\label{fig:mem:laptop}
|
||||||
|
\end{subfigure}
|
||||||
|
\begin{subfigure}{0.5\textwidth}
|
||||||
|
\includegraphics[width=\textwidth]{generic_cluster.pdf}
|
||||||
|
\caption{Cluster}
|
||||||
|
\label{fig:mem:cluster}
|
||||||
|
\end{subfigure}
|
||||||
|
\caption{Results of the \texttt{membench.c} benchmark for both my personal
|
||||||
|
laptop (in Figure \ref{fig:mem:laptop}) and the cluster (in Figure
|
||||||
|
\ref{fig:mem:cluster}).}
|
||||||
|
\label{fig:mem}
|
||||||
|
\end{figure}
|
||||||
|
|
||||||
\begin{center}
|
The \texttt{membench.c} benchmark results for my personal laptop (Macbook Pro
|
||||||
\includegraphics[width=12cm]{generic_macos.pdf}
|
2018 with a Core i7-8750H CPU) and the cluster are shown in figure
|
||||||
\includegraphics[width=12cm]{generic_cluster.pdf}
|
\ref{fig:mem}.
|
||||||
\end{center}
|
|
||||||
|
|
||||||
The memory access graph for the cluster's benchmark results shows that temporal
|
The memory access graph for the cluster's benchmark results shows that temporal
|
||||||
locality is best for small array sizes and for small \texttt{stride} values.
|
locality is best for small array sizes and for small \texttt{stride} values.
|
||||||
|
@ -112,8 +130,104 @@ for the largest strides of each size series shown in the graph).
|
||||||
|
|
||||||
\section{Optimize Square Matrix-Matrix Multiplication \punkte{60}}
|
\section{Optimize Square Matrix-Matrix Multiplication \punkte{60}}
|
||||||
|
|
||||||
|
The file \texttt{matmult/dgemm-blocked.c} contains a C implementation of the
|
||||||
|
blocked matrix multiplication algorithm presented in the project. Other than
|
||||||
|
implementing the pseudocode, my implementation:
|
||||||
|
|
||||||
\section{Quality of the Report \punkte{15}}
|
\begin{figure}[t]
|
||||||
|
\begin{center}
|
||||||
|
\begin{tikzpicture}
|
||||||
|
\fill[blue!60!white] (4,0) rectangle (5,-2);
|
||||||
|
\fill[blue!40!white] (4,-2) rectangle (5,-4);
|
||||||
|
\fill[blue!60!white] (0,-4) rectangle (2,-5);
|
||||||
|
\fill[blue!40!white] (2,-4) rectangle (4,-5);
|
||||||
|
\fill[green!40!white] (4,-4) rectangle (5,-5);
|
||||||
|
\draw[step=1,gray,very thin] (0,0) grid (5,-5);
|
||||||
|
\draw[step=2] (0,0) grid (5,-5);
|
||||||
|
\draw[step=5] (0,0) grid (5,-5);
|
||||||
|
\end{tikzpicture}
|
||||||
|
\end{center}
|
||||||
|
\caption{Result of the block division process of a square matrix of size 5
|
||||||
|
using a block size of 2. The 2-by-1 and 1-by-2 rectangular remainders are
|
||||||
|
shown in blue and the square matrix of remainder size (i.e. 1) is shown in
|
||||||
|
green.}
|
||||||
|
\label{fig:matrix}
|
||||||
|
\end{figure}
|
||||||
|
|
||||||
|
\begin{itemize}
|
||||||
|
\item Handles the edge cases related to the ``remainders'' in the matrix
|
||||||
|
block division, i.e. when the division between the size of the matrix
|
||||||
|
and the block size yields a remainder. Assuming only squared matrices
|
||||||
|
are multiplied through the algorithm (as in the test suite provided) the
|
||||||
|
block division could yield rectangular matrix blocks located in the last
|
||||||
|
rows and columns of each matrix, and the bottom-right corner of the
|
||||||
|
matrix will be contained in a square matrix block of the size of the
|
||||||
|
remainder. The result of this process is shown in Figure
|
||||||
|
\ref{fig:matrix};
|
||||||
|
\item Converts matrix A into row major format. As shown in Figure
|
||||||
|
\ref{fig:iter}, by having A in row major format and B in column major
|
||||||
|
format, iterations across matrix block in the inner most loop of the
|
||||||
|
algorithm (the one calling \textit{naivemm}) cache hits are maximised by
|
||||||
|
achieving space locality between the blocks used;
|
||||||
|
\item Caches the result of each innermost iteration into a temporary matrix
|
||||||
|
of block size before storing it into matrix C. This achieves better
|
||||||
|
space locality when \textit{naivemm} needs to store values in matrix C.
|
||||||
|
The block size temporary matrix has virtually no stride and thus cache
|
||||||
|
hits are maximised. The copy operation is implemented with bulk copy
|
||||||
|
\texttt{memcpy} calls.
|
||||||
|
\end{itemize}
|
||||||
|
|
||||||
|
\begin{figure}[t]
|
||||||
|
\begin{center}
|
||||||
|
\begin{tikzpicture}
|
||||||
|
\node[align=center] at (2.5,0.5) {Matrix A};
|
||||||
|
\fill[orange!10!white] (0,0) rectangle (2,-2);
|
||||||
|
\fill[orange!25!white] (2,0) rectangle (4,-2);
|
||||||
|
\fill[orange!40!white] (4,0) rectangle (5,-2);
|
||||||
|
|
||||||
|
\draw[step=1,gray,very thin] (0,0) grid (5,-5);
|
||||||
|
\draw[step=2,black,thick] (0,0) grid (5,-5);
|
||||||
|
\draw[step=5,black,thick] (0,0) grid (5,-5);
|
||||||
|
|
||||||
|
\draw[-to,step=1,red,very thick] (0.5,-0.5) -- (4.5,-0.5);
|
||||||
|
\draw[-to,step=1,red,very thick] (0.5,-1.5) -- (4.5,-1.5);
|
||||||
|
\draw[-to,step=1,red,very thick] (0.5,-2.5) -- (4.5,-2.5);
|
||||||
|
\draw[-to,step=1,red,very thick] (0.5,-3.5) -- (4.5,-3.5);
|
||||||
|
\draw[-to,step=1,red,very thick] (0.5,-4.5) -- (4.5,-4.5);
|
||||||
|
|
||||||
|
\node[align=center] at (8.5,0.5) {Matrix B};
|
||||||
|
\fill[orange!10!white] (6,0) rectangle (8,-2);
|
||||||
|
\fill[orange!25!white] (6,-2) rectangle (8,-4);
|
||||||
|
\fill[orange!40!white] (6,-4) rectangle (8,-5);
|
||||||
|
|
||||||
|
\draw[step=1,gray,very thin] (6,0) grid (11,-5);
|
||||||
|
\draw[step=2,black,thick] (6,0) grid (11,-5);
|
||||||
|
\draw[step=5,black,thick] (6,0) grid (11,-5);
|
||||||
|
\draw[black,thick] (11,0) -- (11,-5);
|
||||||
|
|
||||||
|
\draw[-to,step=1,red,very thick] (6.5,-0.5) -- (6.5,-4.5);
|
||||||
|
\draw[-to,step=1,red,very thick] (7.5,-0.5) -- (7.5,-4.5);
|
||||||
|
\draw[-to,step=1,red,very thick] (8.5,-0.5) -- (8.5,-4.5);
|
||||||
|
\draw[-to,step=1,red,very thick] (9.5,-0.5) -- (9.5,-4.5);
|
||||||
|
\draw[-to,step=1,red,very thick] (10.5,-0.5) -- (10.5,-4.5);
|
||||||
|
\end{tikzpicture}
|
||||||
|
\end{center}
|
||||||
|
\caption{Inner most loop iteration of the blocked GEMM algorithm across
|
||||||
|
matrices A and B. The red lines represent the ``majorness'' of each matrix
|
||||||
|
(A is converted by the algorithm in row-major form, while B is given and
|
||||||
|
used in column-major form). The shades of orange represent the blocks used
|
||||||
|
in each iteration.}
|
||||||
|
\label{fig:iter}
|
||||||
|
\end{figure}
|
||||||
|
|
||||||
|
The results of the matrix multiplication benchmark for the naive, blocked, and
|
||||||
|
BLAS implementations are shown in Figure \ref{fig:bench}.
|
||||||
|
|
||||||
|
\begin{figure}[t]
|
||||||
|
\includegraphics[width=\textwidth]{timing.pdf}
|
||||||
|
\caption{Results of the matrix multiplication benchmark for the naive,
|
||||||
|
blocked, and BLAS implementations}
|
||||||
|
\label{fig:bench}
|
||||||
|
\end{figure}
|
||||||
|
|
||||||
\end{document}
|
\end{document}
|
||||||
|
|
1059
Project1/timing.pdf
Normal file
1059
Project1/timing.pdf
Normal file
File diff suppressed because it is too large
Load diff
Reference in a new issue