midterm: done

Claudio Maggioni 2021-05-14 13:24:13 +02:00
parent 5026da324a
commit 27da90d89b
2 changed files with 128 additions and 41 deletions


@ -15,6 +15,16 @@ header-includes:
---
\maketitle
# Acknowledgements on group work
- Gianmarco De Vita suggested that I use MATLAB's equation solver for parts
of `dogleg.m`'s implementation.
- I have discussed my solutions for exercise 1.2 and exercise 3 with several
people, namely:
- Gianmarco De Vita
- Tommaso Rodolfo Masera
- Andrea Brites Marto
# Exercise 1

## Point 1
@ -61,6 +71,25 @@ converges to the minimizer in one iteration.
The right answer is choice (a), since the energy norm of the error indeed always
decreases monotonically.
The proof I will provide is independent of the given objective and of the given
number of iterations: it works for every choice of $A$ that is symmetric and
positive definite.

First of all, I will prove that $A$ is indeed SPD by computing its eigenvalues.
$$CP(A) = \det\left(\begin{bmatrix}2&-1&0\\-1&2&-1\\0&-1&2\end{bmatrix} - \lambda I \right) =
\det\left(\begin{bmatrix}2 - \lambda&-1&0\\-1&2 -
\lambda&-1\\0&-1&2-\lambda\end{bmatrix} \right) = -\lambda^3 + 6
\lambda^2 - 10\lambda + 4$$

$$CP(A) = 0 \Leftrightarrow \lambda = 2 \lor \lambda = 2 \pm \sqrt{2}$$
Therefore, we have 3 eigenvalues and they are all positive, so $A$ is positive
definite; it is clearly symmetric as well.
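As a quick numerical sanity check of this result (not part of the formal proof),
the eigenvalues and symmetry of $A$ can be verified in MATLAB:

```matlab
% Sanity check: the eigenvalues of A should be 2 - sqrt(2), 2, 2 + sqrt(2),
% i.e. all strictly positive, confirming that A is positive definite.
A = [2 -1 0; -1 2 -1; 0 -1 2];
disp(eig(A));          % prints approximately 0.5858, 2.0000, 3.4142
disp(issymmetric(A));  % prints 1 (logical true), confirming symmetry
```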
Now we switch to the general proof for the monotonicity.
To prove that this is true, we first consider a way to express any iterate $x_k$
as a function of the minimizer $x_s$ and of the missing iterations:
@ -148,10 +177,25 @@ monotonically decreases.
## Point 1
### (a) For which kind of minimization problems can the trust region method be used? What are the assumptions on the objective function?
The trust region method is an algorithm for unconstrained minimization. It
combines ideas from the gradient descent and Newton methods, and thus it accepts
essentially the same class of objective functions as those two methods.

Namely, the objective function $f(x)$ must be twice differentiable, so that a
quadratic model can be built around an arbitrary point. In addition, within the
scope of this course we assume that $f(x)$ is continuous up to its second
derivatives. This guarantees that the Hessian is symmetric (by the Schwarz
theorem), an assumption that significantly simplifies proofs related to the
method (such as Exercise 3 in this assignment).

Finally, like all the other unconstrained minimization methods covered in this
course, the trust region method is only able to find a local minimizer close to
the chosen starting point, and the computed minimizer is by no means guaranteed
to be a global one.
### (b) Write down the quadratic model around a current iterate xk and explain the meaning of each term.
@ -163,7 +207,7 @@ Here's an explaination of the meaning of each term:
(length); (length);
- $f$ is the energy function value at the current iterate, i.e. $f(x_k)$;
- $p$ is the trust region step, the solution of $\arg\min_p m(p)$ with $\|p\| <
\Delta$, i.e. the optimal step to take;
- $g$ is the gradient at the current iterate $x_k$, i.e. $\nabla f(x_k)$;
- $B$ is the Hessian at the current iterate $x_k$, i.e. $\nabla^2 f(x_k)$.
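Putting these terms together, the quadratic model has the standard form (using
the same symbols as in the list above):

$$m(p) = f + g^T p + \frac12 p^T B p$$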
@ -172,40 +216,54 @@ Here's an explaination of the meaning of each term:
The role of the trust region radius is to put an upper bound on the step length
in order to avoid "overly ambitious" steps, i.e. steps where the step length
would be considerably long while the quadratic model of the objective is of low
quality (i.e. the performance measure $\rho_k$ in the TR algorithm indicates a
significant energy difference between the true objective and the quadratic
model).

In layman's terms, the trust region radius makes the method switch between more
gradient-based and more quadratic-based steps according to the "confidence"
(measured in terms of $\rho_k$) in the computed quadratic model.
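As an illustration of this mechanism, here is a minimal MATLAB sketch of a
typical radius update driven by $\rho_k$ (the thresholds $1/4$ and $3/4$ and the
shrink/grow factors are common textbook choices, used here only for
illustration):

```matlab
% Minimal sketch of a trust region radius update based on rho_k.
% rho: performance measure; p: computed step;
% delta: current TR radius; delta_max: maximum allowed radius.
if rho < 1/4
    delta = delta / 4;                   % low-quality model: shrink the region
elseif rho > 3/4 && abs(norm(p) - delta) < 1e-12
    delta = min(2 * delta, delta_max);   % accurate model, step on the boundary: grow
end                                      % otherwise keep the radius unchanged
```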
### (d) Explain Cauchy point, sufficient decrease and Dogleg method, and the connection between them.
The Cauchy point and the Dogleg method are algorithms to compute iteration steps
that lie within the bounds of the trust region. They provide an approximate
solution to the minimization of the quadratic model inside the TR radius.

The Cauchy point is a method providing sufficient decrease (as per the Wolfe
conditions) by essentially performing a gradient descent step with a
particularly chosen step size, limited by the TR radius. However, since this
method basically does not exploit the quadratic component of the objective model
(the Hessian is only used as a term in the step length calculation), even though
it provides sufficient decrease and consequently convergence, it is rarely used
as a standalone method to compute iteration steps.

The Cauchy point is therefore often integrated in another method called Dogleg,
which uses the former algorithm in conjunction with a pure Newton step to
provide steps obtained by a blend of linear and quadratic information.

This blend is achieved by choosing the new iterate along a path made out of two
segments, namely the gradient descent step with optimal step size and a segment
pointing from that point towards the pure Newton step. The peculiar angle
between these two segments is the reason the method is nicknamed "Dogleg", since
the resulting path resembles a dog's leg.

In the Dogleg method, the Cauchy point is used in case the trust region is small
enough not to allow the "turn" on the second segment towards the Newton step.
Thanks to this property and to the use of the performance measure $\rho_k$ to
grow and shrink the TR radius, the Dogleg method performs well even with
inaccurate quadratic models. It therefore still satisfies sufficient decrease
and the Wolfe conditions while delivering superlinear convergence, compared to
the purely linear convergence of Cauchy point steps.
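To make the connection between the Cauchy point, the Newton step and the Dogleg
path concrete, here is a minimal MATLAB sketch of how a Dogleg step can be
selected (an illustrative simplification assuming $B$ is SPD, not a verbatim
excerpt of my `dogleg.m`):

```matlab
function p = dogleg_step(g, B, delta)
% Illustrative Dogleg step selection: blend the gradient step p_U and the
% Newton step p_B, constrained to a trust region of radius delta.
pB = -B \ g;                           % full Newton step
if norm(pB) <= delta
    p = pB;                            % Newton step already inside the region
    return;
end
pU = -(g' * g) / (g' * B * g) * g;     % gradient step with optimal step size
if norm(pU) >= delta
    p = -(delta / norm(g)) * g;        % Cauchy-like step on the TR boundary
    return;
end
% Otherwise find the point where the second segment crosses the boundary,
% i.e. tau - 1 in [0,1] such that ||pU + (tau - 1) * (pB - pU)|| = delta.
d = pB - pU;
a = d' * d;
b = 2 * (pU' * d);
c = (pU' * pU) - delta^2;
t = (-b + sqrt(b^2 - 4 * a * c)) / (2 * a);  % positive root of the quadratic
p = pU + t * d;                        % "dog leg" point on the TR boundary
end
```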
### (e) Write down the trust region ratio and explain its meaning.

$$\rho_k = \frac{f(x_k) - f(x_k + p_k)}{m_k(0) - m_k(p_k)}$$
The trust region ratio, or performance measure, $\rho_k$ measures the quality of
the quadratic model built around the current iterate $x_k$: it is the ratio
between the energy decrease from the old to the new iterate according to the
real energy function and the decrease predicted by the quadratic model around
$x_k$.
The ratio is used to test the adequacy of the current trust region radius. For
an inaccurate quadratic model, the predicted energy decrease would be
@ -220,7 +278,11 @@ since the model quality is good.
### (f) Does the energy decrease monotonically when Trust Region method is employed? Justify your answer.
In the trust region method the energy of the iterates does not always decrease
monotonically. This is due to the fact that the algorithm may actively reject a
step if the performance measure $\rho_k$ is less than a given constant $\eta$.
In this case the new iterate is equal to the old one: no step is taken and thus
the energy does not decrease but stays the same.
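A minimal sketch of this acceptance test (with `rho` the measure $\rho_k$, `eta`
the constant $\eta$ and `p` the candidate step; the variable names are only
illustrative):

```matlab
% Step acceptance in the TR iteration: the step is rejected whenever the
% performance measure rho_k falls below the constant eta.
if rho >= eta
    x = x + p;   % accept: move to the new iterate
end              % otherwise reject: x (and hence its energy) stays the same
```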
## Point 2
@ -380,20 +442,42 @@ Then the norm of the step $\tilde{p}$ clearly increases as $\tau$
increases. For the second criterion, we compute the quadratic model for a
generic $\tau \in [0,1]$:
$$m(\tilde{p}(\tau)) = f + g^T \tilde{p}(\tau) + \frac{1}{2}
\tilde{p}(\tau)^T B \tilde{p}(\tau) = f + \tau \, g^T p^U + \frac12
\tau^2 (p^U)^T B p^U$$

We then recall the definition of $p^U$:

$$p^U = -\frac{g^Tg}{g^TBg}g$$

and plug it into the expression:

$$= f - \tau \, g^T \frac{g^Tg}{g^TBg}g + \frac{1}{2} \tau^2
\left(\frac{g^Tg}{g^TBg}g\right)^T B \, \frac{g^Tg}{g^TBg}g$$
$$= f - \tau \frac{\|g\|^4}{g^TBg} + \frac{1}{2} \tau^2
\frac{\|g\|^4}{(g^TBg)^2} \, g^TBg$$
$$= f - \tau \frac{\|g\|^4}{g^TBg} + \frac{1}{2} \tau^2 \frac{\|g\|^4}{g^TBg}$$
$$= f + \left(\frac{1}{2} \tau^2 - \tau\right) \frac{\|g\|^4}{g^TBg}
= f + \left(\frac12 \tau^2 - \tau\right) z \quad \text{where } z =
\frac{\|g\|^4}{g^TBg}$$
We then compute the derivative of the model:
$$\frac{dm(\tilde{p}(\tau))}{d\tau} = z\tau - z = z(\tau - 1)$$
We know $z$ is positive since $g^T B g > 0$, as we assume $B$ to be positive
definite. Then, since $\tau \in [0,1]$, the derivative is always $\leq 0$, and
therefore we have proven that the quadratic model of the Dogleg step decreases
as $\tau$ increases.
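As a numerical sanity check of this derivation (using an arbitrary SPD matrix
$B$, gradient $g$ and energy value $f$ chosen purely for illustration):

```matlab
% Numerical check: m(p(tau)) should decrease monotonically for tau in [0,1].
B = [2 -1 0; -1 2 -1; 0 -1 2];        % an arbitrary SPD matrix
g = [1; 2; 3]; f = 0;                 % arbitrary gradient and energy value
pU = -(g' * g) / (g' * B * g) * g;    % gradient step with optimal step size
taus = linspace(0, 1, 11);
m = zeros(size(taus));
for i = 1:numel(taus)
    p = taus(i) * pU;                 % Dogleg path for tau in [0,1]
    m(i) = f + g' * p + 0.5 * (p' * B * p);
end
disp(all(diff(m) <= 0));              % prints 1: model values are non-increasing
```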
Now we show that the two claims also hold for $\tau \in [1,2]$. We define a
function $h(\alpha)$ (where $\alpha = \tau - 1$) with the same gradient "sign"
as $\|\tilde{p}(\tau)\|$, and we show that this function increases:
$$h(\alpha) = \frac12 \|\tilde{p}(1 + \alpha)\|^2 = \frac12 \|p^U + \alpha(p^B -
p^U)\|^2 = \frac12 \|p^U\|^2 + \frac12 \alpha^2 \|p^B - p^U\|^2 + \alpha (p^U)^T
(p^B - p^U)$$
@ -401,9 +485,10 @@ We now take the derivative of $h(\alpha)$ and we show it is always positive,
i.e. that $h(\alpha)$ always has a positive gradient and thus that it is
increasing w.r.t. $\alpha$:
$$h'(\alpha) = \alpha \|p^B - p^U\|^2 + (p^U)^T (p^B - p^U) \geq (p^U)^T (p^B -
p^U) = \frac{g^Tg}{g^TBg}g^T\left(- \frac{g^Tg}{g^TBg}g + B^{-1}g\right) =$$$$=
\|g\|^2 \frac{g^TB^{-1}g}{g^TBg}\left(1 -
\frac{\|g\|^4}{(g^TBg)(g^TB^{-1}g)}\right)$$
Since we know $B$ is symmetric and positive definite, $B^{-1}$ is as well.
Therefore, we know that the term outside of the parenthesis is always positive
@ -420,8 +505,9 @@ by proving all properties of such space:
- **Linearity w.r.t. the first argument:**
$\alpha {\langle x, y \rangle}_B + \beta {\langle z,
y \rangle}_B = \alpha \cdot x^TBy + \beta \cdot z^TBy = (\alpha x + \beta z)^TBy
= {\langle (\alpha x + \beta z), y \rangle}_B$;
- **Symmetry:**
@ -430,11 +516,12 @@ by proving all properties of such space:
- **Positive definiteness:**
${\langle x, x \rangle_B} = x^T B x > 0$ holds for all $x \neq 0$, since $B$ is
positive definite.
Since ${\langle x, y \rangle}_B$ is indeed an inner product, then:
$${\langle g, B^{-1} g \rangle}_B^2 \leq {\langle g, g \rangle}_B \, {\langle B^{-1}
g, B^{-1} g \rangle}_B$$
holds according to the Cauchy-Schwarz inequality. Now, if we expand each inner
@ -455,7 +542,7 @@ plugging the Dogleg step in the quadratic model:
$$\hat{h}(\alpha) = m(\tilde{p}(1+\alpha)) = f + g^T (p^U + \alpha (p^B - p^U)) +
\frac12 (p^U + \alpha (p^B - p^U))^T B (p^U + \alpha (p^B - p^U)) =$$$$=
f + g^T p^U + \alpha g^T (p^B - p^U) + \frac12 (p^U)^T B p^U + \frac12 \alpha (p^U)^T B
(p^B - p^U) + \frac12 \alpha (p^B - p^U)^T B p^U + \frac12 \alpha^2
(p^B - p^U)^T B (p^B - p^U)$$