ml/as2_Maggioni_Claudio/report_Maggioni_Claudio.tex


%----------------------------------------------------------------------------------------
%	Machine Learning Assignment Template
%----------------------------------------------------------------------------------------

\documentclass[11pt]{scrartcl}
\newcommand*\student[1]{\newcommand{\thestudent}{{#1}}}

%----------------------------------------------------------------------------------------
%	INSERT HERE YOUR NAME
%----------------------------------------------------------------------------------------

\student{Claudio Maggioni}

%----------------------------------------------------------------------------------------
%	PACKAGES AND OTHER DOCUMENT CONFIGURATIONS
%----------------------------------------------------------------------------------------

\usepackage[utf8]{inputenc} % Required for inputting international characters
\usepackage[T1]{fontenc} % Use 8-bit encoding
\usepackage[sc]{mathpazo}
\usepackage{caption, subcaption}
\usepackage[colorlinks=true]{hyperref}
\usepackage{inconsolata}

\usepackage[english]{babel} % English language hyphenation
\usepackage{amsmath, amsfonts} % Math packages
\usepackage{listings} % Code listings, with syntax highlighting
\usepackage{graphicx} % Required for inserting images
\graphicspath{{Figures/}{./}} % Specifies where to look for included images (trailing slash required)
\usepackage{float}

%----------------------------------------------------------------------------------------
%	DOCUMENT MARGINS
%----------------------------------------------------------------------------------------

\usepackage{geometry} % For page dimensions and margins
\geometry{
	paper=a4paper,
	top=2.5cm, % Top margin
	bottom=3cm, % Bottom margin
	left=3cm, % Left margin
	right=3cm, % Right margin
}
\setlength\parindent{0pt}

%----------------------------------------------------------------------------------------
%	SECTION TITLES
%----------------------------------------------------------------------------------------

\usepackage{sectsty}
\sectionfont{\vspace{6pt}\centering\normalfont\scshape}
\subsectionfont{\normalfont\bfseries} % \subsection{} styling
\subsubsectionfont{\normalfont\itshape} % \subsubsection{} styling
\paragraphfont{\normalfont\scshape} % \paragraph{} styling

%----------------------------------------------------------------------------------------
%	HEADERS AND FOOTERS
%----------------------------------------------------------------------------------------

\usepackage{scrlayer-scrpage}
\ofoot*{\pagemark} % Right footer
\ifoot*{\thestudent} % Left footer
\cfoot*{} % Centre footer

%----------------------------------------------------------------------------------------
%	TITLE SECTION
%----------------------------------------------------------------------------------------

\title{
	\normalfont\normalsize
	\textsc{Machine Learning\\%
	Universit\`a della Svizzera italiana}\\
	\vspace{25pt}
	\rule{\linewidth}{0.5pt}\\
	\vspace{20pt}
	{\huge Assignment 2}\\
	\vspace{12pt}
	\rule{\linewidth}{1pt}\\
	\vspace{12pt}
}

\author{\LARGE \thestudent}

\date{\normalsize\today}

\begin{document}

\section*{Note on model download from SWITCHdrive}

Due to space constraints, all the neural network models in this assigment were saved and uploaded on
SWITCHdrive. To download them, make sure you have \texttt{curl}, \texttt{gzip}, and \texttt{tar}
(all installed by default on MacOS) and then run:

\begin{verbatim}
./download.sh
\end{verbatim}

having as current working directory (pwd) the root assigment directory \\ \texttt{as2\_Maggioni\_Claudio}.

The script will download automatically the models.

Then, in order to run the models with the \texttt{run\_task*.py}, make sure you first \texttt{cd} in the
\texttt{deliverable} directory.

Should the script not work, please download the compressed T1 and T2 models from
\url{https://drive.switch.ch/index.php/s/F0ubFgS6PBy8UM5/download} and uncompress the downloaded \texttt{.tar.gz}
file in the \texttt{deliverable/} directory, and then make sure the directories \texttt{deliverable/nn\_task1} and \texttt{deliverable/nn\_task2}
contain 3 \texttt{.h5} files in total.

\maketitle

In this assignment you are asked to:

\begin{enumerate}
\item Implement a neural network to classify images from the \texttt{CIFAR10} dataset;
\item Fine-tune a pre-trained neural network to classify rock, paper, scissors hand gestures.
\end{enumerate}

Both requests are very similar to what we have seen during the labs. However, you are required to follow \textbf{exactly} the assignment's specifications.

%----------------------------------------------------------------------------------------
%	Task 1
%----------------------------------------------------------------------------------------

\section{Follow our recipe}

Implement a multi-class classifier to identify the subject of the images from \href{https://www.cs.toronto.edu/\%7Ekriz/cifar.html}{\texttt{CIFAR-10}} data set. To simply the problem, we restrict the classes to 3: \texttt{airplane}, \texttt{automobile} and \texttt{bird}.

\begin{enumerate}
\item Download and load \texttt{CIFAR-10} dataset using the following \href{https://www.tensorflow.org/api_docs/python/tf/keras/datasets/cifar10/load_data}{function}, and consider only the first three classes. Check \texttt{src/utils.py}, there is already a function for this!
\item Preprocess the data:
\begin{itemize}
\item Normalize each pixel of each channel so that the range is [0, 1];
\item Create one-hot encoding of the labels.
\end{itemize}
\item Build a neural network with the following architecture:
\begin{itemize}
\item Convolutional layer, with 8 filters of size 5$\times$5, stride of 1$\times$1, and ReLU activation;
\item Max pooling layer, with pooling size of 2$\times$2;
\item Convolutional layer, with 16 filters of size 3$\times$3, stride of 2$\times$2, and ReLU activation;
\item Average pooling layer, with pooling size of 2$\times$2;
\item Layer to convert the 2D feature maps to vectors (Flatten layer);
\item Dense layer with 8 neurons and tanh activation;
\item Dense output layer with softmax activation;
\end{itemize}
\item Train the model on the training set from point 1 for 500 epochs:
\begin{itemize}
\item Use the RMSprop optimization algorithm, with a learning rate of 0.003 and a batch size of 128;
\item Use categorical cross-entropy as a loss function;
\item Implement early stopping, monitoring the validation accuracy of the model with a patience of 10 epochs and use 20\% of the training data as validation set;
\item When early stopping kicks in, and the training procedure stops, restore the best model found during training.
\end{itemize}
\item Draw a plot with epochs on the $x$-axis and with two graphs: the train accuracy and the validation accuracy (remember to add a legend to distinguish the two graphs!).
\item Assess the performances of the network on the test set loaded in point 1, and provide an estimate of the classification accuracy that you expect on new and unseen images.
\item \textbf{Bonus} (Optional) Tune the learning rate and the number of neurons in the last dense hidden layer with a \textbf{grid search} to improve the performances (if feasible).
\begin{itemize}
\item Consider the following options for the two hyper-parameters (4 models in total):
\begin{itemize}
\item learning rate: [0.01, 0.0001]
\item number of neurons: [16, 64]
\end{itemize}
\item Keep all the other hyper-parameters as in point 3.
\item Perform a grid search on the chosen ranges based on hold-out cross-validation in the training set and identify the most promising hyper-parameter setup.
\item Compare the accuracy on the test set achieved by the most promising configuration with that of the model obtained in point 4. Are the accuracy levels statistically different?
\end{itemize}
\end{enumerate}

\subsection{Comment}

The network model was built and trained according to the given specification.

The performance on the given test set is of $0.3879$ loss and $85.63\%$ accuracy. In order to assess performance on new and unseen images
a statistical confidence interval is necessary. Since the accuracy is by construction a binomial measure (since an image can either be correctly
classified or not, and we repeat this Bernoulli process for each test set datapoint), we perform a binomial distribution confidence interval computation
for 95\% confidence. The code use to do this is found in the notebook \texttt{src/Assignment 2.ipynb} under the section \textit{Statistical tests on CIFAR classifier}.
We conclude stating that with 95\% confidence the accuracy for new and unseen images will fall between $\approx 84.29\%$ and $\approx 86.82\%$.

The training and validation accuracy curves for the network is shown below:

\begin{figure}[H]
\centering
	\resizebox{\textwidth}{!}{%
		\includegraphics{./t1_plot.png}}
\caption{Training and validation accuracy curves during fitting for the CIFAR10 classifier}
\end{figure}

\subsection{Bonus}

A plot of the validation loss and accuracy over training epochs for each of the configurations in the grid search are given below.
The plot is useful to understand that for both learning rates chosen, choosing 64 neurons will lead to overfitting, due to the validation
loss being higher than the training loss. This overfitting cannot be easily avoided with the adopted early stopping procedure, since
our implementation monitors validation loss and not validation accuracy.

The second observation that can be found from the plots is that a smaller learning rate increases the number of epochs needed to achieve convergence
but it also increases the model accuracy and decreases the model loss.

By running all the models on the test set, using the script \texttt{run\_task1\_bonus.py} in the \texttt{deliverable/} directory,
we obtain the following loss and accuracy:

\begin{center}
\begin{tabular}{cccc}
\textbf{Learn rate} & \textbf{\# Neurons} & \textbf{Loss} & \textbf{Accuracy} \\\hline
0.01 & 16 & 0.4597 & 0.8117 \\
0.01 & 64 & 0.4508 & 0.8290 \\
0.0001 & 16 & 0.4048 & 0.8437 \\
0.0001 & 64 & 0.4073 & 0.8343 \\
\end{tabular}
\end{center}

We choose the third configuration as the optimal one since it has highest accuracy (and lowest loss too). We then perform a T-test
between this model and the one built in the previous section using the script \texttt{src/t\_test\_bonus.py}.

The T-test gives a T-score of 1.374106, corresponding to a P-value of 0.169511, which makes us accept the null hypothesis therefore
concluding that there is no significant difference between this optimal model and the one trained in the previous section.

\begin{figure}[H]
\centering
	\resizebox{0.75\textwidth}{!}{%
		\includegraphics[width=0.25\textwidth]{./t1_bonus_0001_16.png}}
\caption{Training and validation accuracy curves during fitting for the CIFAR10 classifier (0.0001 learning rate, 16 neurons)}
\end{figure}
\begin{figure}[H]
\centering
	\resizebox{0.75\textwidth}{!}{%
		\includegraphics[width=0.25\textwidth]{./t1_bonus_0001_64.png}}
\caption{Training and validation accuracy curves during fitting for the CIFAR10 classifier (0.0001 learning rate, 64 neurons)}
\end{figure}
\begin{figure}[H]
\centering
	\resizebox{0.75\textwidth}{!}{%
		\includegraphics[width=0.25\textwidth]{./t1_bonus_01_16.png}}
\caption{Training and validation accuracy curves during fitting for the CIFAR10 classifier (0.01 learning rate, 16 neurons)}
\end{figure}
\begin{figure}[H]
\centering
	\resizebox{\textwidth}{!}{%
		\includegraphics{./t1_bonus_01_64.png}}
\caption{Training and validation accuracy curves during fitting for the CIFAR10 classifier (0.01 learning rate, 64 neurons)}
\end{figure}

%----------------------------------------------------------------------------------------
%	Task 2
%----------------------------------------------------------------------------------------
\newpage
\section{Transfer learning}

In this task, we will fine-tune the last layer of a pretrained model in order to build a classifier for the \emph{rock, paper, scissors dataset} that we acquired for the lab. The objective is to make use of the experience collected on a task to bootstrap the performances on a different task. We are going to use the \texttt{VGG16} network, pretrained on Imagenet to compete in the ILSVRC-2014 competition.\\

\texttt{VGG16} is very expensive to train from scratch, but luckily the VGG team publicly released the trained weights of the network, so that people could use it for transfer learning. As we discussed during classes, this can be achieved by \textbf{removing the last fully connected layers} form the pretrained model and by using the output of the convolutional layers (with freezed weights) as input to a \textbf{new fully connected network}. This last part of the model is then trained from scratch on the task of interest.

\begin{enumerate}
\item Use \texttt{keras} to download a pretrained version of the \texttt{vgg16} network. You can start from the snippet of code you find on the \href{https://github.com/marshka/ml-20-21/tree/main/assignment\_2}{repository} of the assignment.
\item Download and preprocess the rock, paper, scissor dataset that we collected for the lab. You find the functions to download and build the dataset in \texttt{src/utils.py}. Vgg16 provides a function to prepropress the input\\
\texttt{applications.vgg16.preprocess\_input}\\
You may decide to use it.
Use $224 \times 224$ as image dimension.
\item Add a hidden layer (use any number of units and the activation function that you want), then add an output layer suitable for the hand gesture classification problem.
\item Train with and without \textbf{data augmentation} and report the learning curves (train and validation accuracy) for both cases.
\begin{itemize}
\item Turn on the GPU environment on Colab, otherwise training will be slow.
\item Train for 50 epochs or until convergence.
\item Comment if using data augmentation led to an improvement or not.
\end{itemize}
\end{enumerate}

\subsection{Comment}

The built network in its dense part is composed by a 128-neuron ReLU-activated hidden layer and a 3-neuron softmax output layer.
The input to the network is at first resized to 224x224 size and then normalized according to VGG16 normalization factors
(refer to the function \texttt{process\_vgg16} in the \texttt{src/Assignment 2.ipynb} notebook for details on the normalization process).
Classification labels were first converted from string labels to numeric ones, (i.e. \texttt{'scissors'} = 0,
\texttt{'paper'} = 1, \texttt{'rock'} = 2), and then the numeric encoding was in turn converted to a one-hot encoding using the
\texttt{keras.utils.to\_categorical} function.

Both the data-augmented and non-augmented network were trained using the ADAM optimizer with $0.001$ learning rate for 50 epochs with
an early stopping procedure with 10 epochs patience.

Both models were saved and both can be run at the same time on the given test set by executing \texttt{deliverable/run\_task2.py}.

The training and validation accuracy curves for both the data-augmented and the not augmented networks are shown below:

\begin{figure}[H]
\centering
	\resizebox{\textwidth}{!}{%
		\includegraphics{./t2_noaug.png}}
\caption{Training and validation accuracy curves during fitting for the not data augmented VGG16 classifier}
\end{figure}

\begin{figure}[H]
\centering
	\resizebox{\textwidth}{!}{%
		\includegraphics{./t2_aug.png}}
\caption{Training and validation accuracy curves during fitting for the data augmented VGG16 classifier}
\end{figure}

\subsection{T-Test}

The findings shown below were computed using the script \texttt{src/t\_test.py}.

To compare the trained model with and without data augmentation, we perform a two-tailed Student T-test between the models.
The test report that the models have different accuracy with $99.999973\%$ confidence, and the model trained with data augmentation
has lower variance. Therefore, we conclude that the model trained with data augmentation is the statistically better model out of the two.

The student T-test of course is a valid argument only for this specific instance of the application of data augmentation. However,
in the general case we can say that performing data augmentation on the training and validation data is intuitively better in order to
assure the network is able to correctly identify rock, paper or scissors from all angles and zoom levels.

On the given test set, the model trained with data augmentation has $\approx 90.00\%$ accuracy while the model trained without data augmentation
has $\approx 77.33\%$ accuracy.

\end{document}