<!-- vim: set ts=2 sw=2 et tw=80: -->

---
header-includes:
- \usepackage[utf8]{inputenc}
- \usepackage[T1]{fontenc}
- \usepackage[sc]{mathpazo}
- \usepackage{caption, subcaption}
- \usepackage{hyperref}
- \usepackage[english]{babel}
- \usepackage{amsmath, amsfonts}
- \usepackage{listings}
- \usepackage{graphicx}
- \graphicspath{{Figures/}{./}}
- \usepackage{float}
- \usepackage{geometry}
- \geometry{paper=a4paper,top=2.5cm,bottom=3cm,left=3cm,right=3cm}
- \usepackage{sectsty}
- \sectionfont{\vspace{6pt}\centering\normalfont\scshape}
- \subsectionfont{\normalfont\bfseries}
- \subsubsectionfont{\normalfont\itshape}
- \paragraphfont{\normalfont\scshape}
- \usepackage{scrlayer-scrpage}
- \ofoot*{\pagemark}
- \ifoot*{Maggioni Claudio}
- \cfoot*{}
---
|
|
|
|
\title{
|
|
|
|
\normalfont\normalsize
|
|
|
|
\textsc{Machine Learning\\
|
|
|
|
Universit\`a della Svizzera italiana}\\
|
|
|
|
\vspace{25pt}
|
|
|
|
\rule{\linewidth}{0.5pt}\\
|
|
|
|
\vspace{20pt}
|
|
|
|
{\huge Assignment 1}\\
|
|
|
|
\vspace{12pt}
|
|
|
|
\rule{\linewidth}{1pt}\\
|
|
|
|
\vspace{12pt}
|
|
|
|
}
|
|
|
|
\author{\LARGE Maggioni Claudio}
|
|
|
|
\date{\normalsize\today}
|
|
|
|
\maketitle
|
|
|
|
|
|
|
|

The assignment is split into two parts: you are asked to solve a
regression problem, and answer some questions. You can use all the
books, material, and help you need. Bear in mind that the questions you
are asked are similar to those you may find in the final exam, and are
related to very important and fundamental machine learning concepts. As
such, sooner or later you will need to learn them to pass the course. We
will give you some feedback afterwards.

Note that this file is just meant as a template for the report, in
which we reported **part of** the assignment text for convenience. You
must always refer to the text in the README.md file as the assignment
requirements.

# Regression problem

This section should contain a detailed description of how you solved the
assignment, including all required statistical analyses of the models'
performance and a comparison between the linear regression and the model
of your choice. Limit the assignment to 2500 words (formulas, tables,
figures, etc., do not count as words) and do not include any code in the
report.

## Task 1

Use the family of models
$f(\mathbf{x}, \boldsymbol{\theta}) = \theta_0 + \theta_1 \cdot x_1 +
\theta_2 \cdot x_2 + \theta_3 \cdot x_1 \cdot x_2 + \theta_4 \cdot
\sin(x_1)$
to fit the data. Write in the report the formula of the model,
substituting the parameters $\theta_0, \ldots, \theta_4$ with the
estimates you have found:

$$f(\mathbf{x}, \boldsymbol{\theta}) = \_ + \_ \cdot x_1 + \_
\cdot x_2 + \_ \cdot x_1 \cdot x_2 + \_ \cdot \sin(x_1)$$

Evaluate the test performance of your model using the mean squared error
as performance measure.
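
Since the model is linear in the parameters, fitting this family reduces to
ordinary least squares on an augmented design matrix. Below is a minimal
sketch on synthetic stand-in data; the generating coefficients and noise
level are invented for illustration, as the real dataset comes from the
course repository:

```python
import numpy as np

# Synthetic stand-in for the assignment's dataset: the true coefficients
# and noise level below are invented for illustration only.
rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(-3.0, 3.0, n)
x2 = rng.uniform(-3.0, 3.0, n)
y = 1.0 + 0.5 * x1 - 2.0 * x2 + 0.3 * x1 * x2 + 1.5 * np.sin(x1) \
    + rng.normal(0.0, 0.1, n)

# The model is linear in theta, so fitting it is ordinary least squares
# on the design matrix [1, x1, x2, x1*x2, sin(x1)].
X = np.column_stack([np.ones(n), x1, x2, x1 * x2, np.sin(x1)])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Mean squared error of the fit (computed on the held-out test split in
# the real assignment).
mse = np.mean((X @ theta - y) ** 2)
```

In the real solution, the entries of `theta` would be substituted into the
formula above and the MSE computed on the test set.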

## Task 2

Consider any family of non-linear models of your choice to address the
above regression problem. Evaluate the test performance of your model
using the mean squared error as performance measure. Compare your model
with the linear regression of Task 1. Which one is **statistically**
better?
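
One common way to make the comparison statistical rather than anecdotal is
a paired t-test on the per-sample squared errors of the two models over the
same test set. A numpy-only sketch; the helper function and the toy error
values are illustrative, not part of the assignment:

```python
import numpy as np

def paired_t(errors_a, errors_b):
    """Paired t-statistic for per-sample squared errors of two models
    evaluated on the same test points; a large |t| means the mean
    difference is unlikely to be zero."""
    d = np.asarray(errors_a) - np.asarray(errors_b)
    return d.mean() / (d.std(ddof=1) / np.sqrt(d.size))

# Toy example: model B beats model A by about 0.01 MSE per sample.
rng = np.random.default_rng(1)
err_b = rng.uniform(0.0, 0.02, 100)
err_a = err_b + 0.01 + rng.normal(0.0, 0.005, 100)

t = paired_t(err_a, err_b)  # clearly positive: A's errors are larger
```

In practice one would compare `t` against a t-distribution with $n-1$
degrees of freedom (or simply use `scipy.stats.ttest_rel`); values far
above roughly 2 indicate a statistically significant difference.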

## Task 3 (Bonus)

In the [**Github repository of the
course**](https://github.com/marshka/ml-20-21), you will find a trained
Scikit-learn model that we built using the same dataset you are given.
This baseline model is able to achieve an MSE of **0.0194** when
evaluated on the test set. You will get extra points if the test
performance of your model is better (i.e., the MSE is lower) than ours.
Of course, you also have to tell us why you think that your model is
better.

# Questions

## Q1. Training versus Validation

1. **Explain the curves' behavior in each of the three highlighted
   sections of the figures, namely (a), (b), and (c).**

In the highlighted section (a), the expected test error, the observed
validation error, and the observed training error are all significantly
high and close together. All three errors decrease as the model complexity
increases. In (c), instead, we see a low training error but high
validation and expected test errors; the latter two increase with model
complexity while the training error plateaus. Finally, in (b), the test
and validation error curves reach their respective lowest points, while
the training error curve keeps decreasing as the model complexity
increases, albeit less steeply than in (a).

2. **Is any of the three sections associated with the concepts of
   overfitting and underfitting? If yes, explain it.**

Section (a) is associated with underfitting and section (c) is associated
with overfitting.

The behaviour in (a) is fairly easy to explain: since the model complexity
is insufficient to capture the behaviour of the training data, the model
is unable to provide accurate predictions, and thus all the MSEs we
observe are rather high. It is worth pointing out that the training error
curve is quite close to the validation and test error curves: this happens
because the model is both unable to accurately learn the training data and
unable to formulate accurate predictions on the validation and test data.

In (c), instead, the model complexity is higher than the intrinsic
complexity of the data to model, and this extra capacity ends up learning
the intrinsic noise of the data. This is of course not desirable, and the
dire consequences of this phenomenon can be seen in the significant
difference between the observed MSE on the training data and the MSEs on
the validation and test data. Since the model learns the noise of the
training data, it accurately predicts the noise fluctuations in the
training set; but since this noise carries no meaningful information for
fitting new datapoints, the model is unable to predict accurately on
validation and test datapoints, and the MSEs for those sets are therefore
high.

Finally, in (b) we observe fairly appropriate fitting. Since the model
complexity is of the same order of magnitude as the intrinsic complexity
of the data, the model is able to learn to accurately predict new data
without learning noise. Thus, both the validation and the test MSE curves
reach their lowest points in this region of the graph.

3. **Is there any evidence of high approximation risk? Why? If yes, in
   which of the below subfigures?**

Depending on the scale and magnitude of the x axis, there could be
significant approximation risk. This can be observed in subfigure (b),
namely in the difference in complexity between the model with the lowest
validation error and the optimal model (the model with the lowest expected
test error). The distance between the two lines indicates that the
currently chosen family of models (i.e., the currently chosen grey-box
model function, not the values of its hyperparameters) is not completely
adequate to model the process that generated the data. High approximation
risk would cause even a correctly fitted model to have high test error,
since the inherent structure of the chosen family of models would be
unable to capture the true behaviour of the data.

4. **Do you think that by further increasing the model complexity you
   will be able to bring the training error to zero?**

Yes, I think so. The model complexity could be increased up to the point
where the model is so complex that it can effectively memorize all x-y
pairs of the training data, turning the model function into a one-to-one
mapping between the inputs and outputs of the training set. Then the loss
on the training dataset would be exactly 0. This would of course mean that
an absurdly high amount of noise is learned as well, making the model
completely useless for predicting new datapoints.
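
A tiny illustration of this memorization effect: a polynomial whose degree
equals the number of training points minus one interpolates the training
set exactly, so the training MSE is numerically zero. The data below is
made up for the sake of the example:

```python
import numpy as np

# Eight made-up noisy training points.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 8)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.3, 8)

# A polynomial of degree n-1 passes exactly through n points with
# distinct x values, i.e. the model memorizes the training set.
coeffs = np.polyfit(x, y, deg=len(x) - 1)
train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)  # numerically zero
```

The fitted curve oscillates wildly between the points, which is exactly
the noise-memorization behaviour described above.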

5. **Do you think that by further increasing the model complexity you
   will be able to bring the structural risk to zero?**

No, I don't think so. To achieve zero structural risk we would need an
infinite training dataset covering the entire input domain. Increasing the
model's complexity would actually make the structural risk increase, due
to overfitting.

## Q2. Linear Regression

Comment and compare how the (a.) training error, (b.) test error and
(c.) coefficients would change in the following cases:

1. **$x_3$ is a normally distributed independent random variable
   $x_3 \sim \mathcal{N}(1, 2)$**

With this new variable, the coefficients $\theta_1$ and $\theta_2$ will
not change significantly in the new optimal model. Training and test error
behave similarly, although the training error may be higher in the first
iterations of the learning procedure. All these variations are due to the
fact that the new variable $x_3$ is completely independent of $x_1$ and
$x_2$, and consequently of $y$. Therefore, the model will "understand"
that $x_3$ carries no information at all and set $\theta_3$ to 0. This
effect would be achieved even more quickly by using Lasso instead of plain
linear regression, since Lasso tends to set parameters to exactly zero
when their linear regression optimum would already be close to 0.
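
This intuition is easy to check numerically: fitting ordinary least
squares on synthetic data (made up for this sketch) with an irrelevant
$x_3 \sim \mathcal{N}(1, 2)$ yields a coefficient for $x_3$ that is
indistinguishable from zero:

```python
import numpy as np

# Synthetic data: y depends only on x1 and x2; x3 is independent noise.
rng = np.random.default_rng(4)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = rng.normal(1.0, 2.0, size=n)        # N(1, 2), unrelated to y
y = 2.0 * x1 - 1.0 * x2 + rng.normal(0.0, 0.1, n)

# Ordinary least squares over [x1, x2, x3] (intercept omitted for brevity).
X = np.column_stack([x1, x2, x3])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
# theta[2], the coefficient of the irrelevant x3, is essentially zero.
```
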

2. **$x_3 = 2.5 \cdot x_1 + x_2$**

With this new variable, the coefficients would indeed change, but the test
and training errors would stay the same. Since $x_3$ is a linear
combination of $x_1$ and $x_2$, we can rewrite the model function in the
following way:

$$f(x, \theta) = \theta_1 x_1 + \theta_2 x_2 + \theta_3 (2.5 x_1 + x_2) =
(\theta_1 + 2.5 \theta_3) x_1 + (\theta_2 + \theta_3) x_2$$

This shows that even if the values of $\theta_1$ and $\theta_2$ change
when this term is introduced, the solution found through linear regression
is still effectively equivalent, in terms of predictions and MSE, to the
optimal model for the original family of models.
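
A quick numerical check of this equivalence, on synthetic data invented
for the example: appending the collinear column leaves the fitted values,
and hence the MSE, unchanged, because both design matrices span the same
column space. `np.linalg.lstsq` returns the minimum-norm solution, so the
rank-deficient case is still well defined:

```python
import numpy as np

# Made-up data where y depends linearly on x1 and x2.
rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 * x1 - 1.0 * x2 + rng.normal(0.0, 0.1, n)

X2 = np.column_stack([x1, x2])                 # original features
X3 = np.column_stack([x1, x2, 2.5 * x1 + x2])  # plus collinear x3

# lstsq returns the minimum-norm solution, so the rank-deficient X3
# still has a well-defined fit.
th2, *_ = np.linalg.lstsq(X2, y, rcond=None)
th3, *_ = np.linalg.lstsq(X3, y, rcond=None)

# Both fits are projections of y onto the same column space, so the
# training MSEs agree to machine precision.
mse2 = np.mean((X2 @ th2 - y) ** 2)
mse3 = np.mean((X3 @ th3 - y) ** 2)
```
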

3. **$x_3 = x_1 \cdot x_2$**

If the underlying process generating the data also depends on an $x_1
\cdot x_2$ term, then this additional input variable would change the
parameters and improve the training error; depending on whether the impact
of this quadratic term on the original data-generating process is small or
big, it would slightly or considerably improve the test error.

Essentially, this variable would add useful complexity to the model, which
may be beneficial if the model is underfitted with respect to the number
of variables in the linear regression function, or otherwise detrimental
if the model is already correctly fitted or overfitted.

## Q3. Classification

1. **Your boss asked you to solve the problem using a perceptron and now
   he's upset because you are getting poor results. How would you
   justify the poor performance of your perceptron classifier to your
   boss?**

The classification problem in the graph, according to the data points
shown, is quite similar to the XOR (exclusive-or) problem. Since Minsky
and Papert proved in 1969 that this problem is impossible to solve with a
perceptron model, that alone would be quite a justification to present to
my boss.

On a more general (and more serious) note, the perceptron model is unable
to solve the problem in the picture since a perceptron can solve only
linearly-separable classification problems, and even by a simple graphical
argument we would be unable to find a line separating the yellow and
purple dots to a decent approximation, simply due to the way the dots are
positioned.
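
The failure is easy to reproduce on the canonical XOR instance itself. The
sketch below (numpy only; the dataset and iteration budget are chosen for
illustration) runs the classic perceptron learning rule and shows that, no
matter how long it trains, at least one point stays misclassified:

```python
import numpy as np

# XOR dataset with labels in {-1, +1}: not linearly separable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# Classic perceptron learning rule.
w, b = np.zeros(2), 0.0
for _ in range(1000):
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:   # misclassified (or on the boundary)
            w += yi * xi
            b += yi

# No weight vector can separate XOR, so the error count never reaches 0.
pred = np.sign(X @ w + b)
errors = int(np.sum(pred != y))
```
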

2. **Would you expect to have better luck with a neural network with
   activation function $h(x) = - x \cdot e^{-2}$ for the hidden units?**

No. The activation function $h(x) = -x \cdot e^{-2}$ is still linear in
$x$ (it merely rescales its input by the constant $-e^{-2}$), so a network
of such units computes a composition of affine maps, which is itself an
affine map. The network therefore has exactly the same expressive power as
a single linear classifier, and since the data is not linearly separable
it would perform no better than the perceptron.
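
The collapse of a linear-activation network into a single affine map can
be verified directly; the layer sizes and random weights below are
arbitrary:

```python
import numpy as np

# The proposed "activation" is just multiplication by a constant.
h = lambda z: -z * np.exp(-2)

rng = np.random.default_rng(5)
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)   # hidden layer
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)   # output layer

def net(x):
    """Two-layer network with the linear 'activation' h."""
    return W2 @ h(W1 @ x + b1) + b2

# The equivalent single affine map: net(x) == W x + b for every x,
# since W2 (c (W1 x + b1)) + b2 = c W2 W1 x + c W2 b1 + b2.
c = -np.exp(-2)
W = c * (W2 @ W1)
b = c * (W2 @ b1) + b2

x = rng.normal(size=2)
```
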

3. **What are the main differences and similarities between the
   perceptron and the logistic regression neuron?**