\maketitle

# Acknowledgements on group work

- Gianmarco De Vita suggested the use of MATLAB's equation solver for parts of
  `dogleg.m`'s implementation.
- I have discussed my solutions for exercise 1.2 and exercise 3 with several
  people, namely:
    - Gianmarco De Vita
    - Tommaso Rodolfo Masera
    - Andrea Brites Marto

# Exercise 1

## Point 1

The right answer is choice (a), since the energy norm of the error indeed always
decreases monotonically.

The proof I will provide is independent of the given objective and the given
number of iterations, and it works for all choices of $A$ where $A$ is symmetric
and positive definite.

Therefore, first of all I will prove that $A$ is indeed SPD by computing its
eigenvalues.

$$CP(A) = \det\left(\begin{bmatrix}2&-1&0\\-1&2&-1\\0&-1&2\end{bmatrix} - \lambda I \right) =
\det\left(\begin{bmatrix}2-\lambda&-1&0\\-1&2-\lambda&-1\\0&-1&2-\lambda\end{bmatrix}\right) =
-\lambda^3 + 6\lambda^2 - 10\lambda + 4$$

$$CP(A) = 0 \Leftrightarrow \lambda = 2 \lor \lambda = 2 \pm \sqrt{2}$$

Therefore we have 3 eigenvalues and they are all positive, so $A$ is positive
definite, and it is clearly symmetric as well.
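
As a quick numerical cross-check (a small sketch, not required by the exercise),
the same eigenvalues can be computed in MATLAB:

```matlab
% Verify that A is symmetric positive definite by inspecting its eigenvalues.
A = [2 -1 0; -1 2 -1; 0 -1 2];
assert(isequal(A, A'));        % symmetry
lambda = eig(A);               % expected: 2 - sqrt(2), 2, 2 + sqrt(2)
disp(lambda');
assert(all(lambda > 0));       % all eigenvalues positive => A is SPD
```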

Now we switch to the general proof of monotonicity.

To prove this claim, we first consider a way to express any iterate $x_k$ as a
function of the minimizer $x_s$ and of the remaining iterations:

## Point 1

### (a) For which kind of minimization problems can the trust region method be used? What are the assumptions on the objective function?
The trust region method is an algorithm that can be used for unconstrained
minimization. It combines ideas from the gradient descent and Newton methods,
and thus it accepts essentially the same class of objective functions as those
two methods.

Namely, the objective function $f(x)$ is required to be twice differentiable,
so that a quadratic model can be built around an arbitrary point. In addition,
our assumptions w.r.t. the scope of this course require that $f(x)$ be
continuous up to the second derivatives. This is needed for the Hessian to be
symmetric (by Schwarz's theorem), an assumption that significantly simplifies
proofs related to the method (e.g. Exercise 3 in this assignment).

Finally, like all the other unconstrained minimization methods we covered in
this course, the trust region method is only able to find a local minimizer
close to the chosen starting point, and the computed minimizer is by no means
guaranteed to be a global minimizer.

### (b) Write down the quadratic model around a current iterate xk and explain the meaning of each term.

Here's an explanation of the meaning of each term:

- $\Delta$ is the trust region radius, i.e. the upper bound on the step size
  (length);
- $f$ is the energy function value at the current iterate, i.e. $f(x_k)$;
- $p$ is the trust region step, the solution of $\arg\min_p m(p)$ with $\|p\| <
  \Delta$, i.e. the optimal step to take;
- $g$ is the gradient at the current iterate $x_k$, i.e. $\nabla f(x_k)$;
- $B$ is the Hessian at the current iterate $x_k$, i.e. $\nabla^2 f(x_k)$.
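
As a minimal illustration (a sketch that assumes the model has the usual form
$m_k(p) = f + g^Tp + \frac12 p^TBp$, matching the terms above), the model can be
evaluated directly:

```matlab
% Quadratic model of the objective around the current iterate x_k.
% f: scalar value f(x_k), g: gradient at x_k, B: Hessian at x_k, p: step.
quad_model = @(f, g, B, p) f + g' * p + 0.5 * p' * B * p;

% Example usage with purely illustrative values.
B = [2 -1 0; -1 2 -1; 0 -1 2];
g = [1; 0; -1];
f = 3;
m_value = quad_model(f, g, B, [0.1; 0.2; -0.1]);
```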

### (c) What is the role of the trust region radius?

The role of the trust region radius is to put an upper bound on the step length
in order to avoid "overly ambitious" steps, i.e. steps where the step length is
considerable but the quadratic model of the objective is of low quality (i.e.
the performance measure $\rho_k$ in the TR algorithm indicates a significant
energy difference between the true objective and the quadratic model).

In layman's terms, the trust region radius makes the method switch between more
gradient-based and more quadratic-based steps according to the "confidence"
(measured in terms of $\rho_k$) in the computed quadratic model.
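
The following is a sketch of a typical radius-update rule driven by $\rho_k$
(the thresholds $\frac14$ and $\frac34$ and the growth factor 2 are common
textbook choices, assumed here only for illustration):

```matlab
% Update the trust region radius based on the performance measure rho.
% Delta: current radius, Delta_max: maximum allowed radius,
% p: step that was just evaluated, rho: agreement between objective and model.
function Delta = update_radius(Delta, Delta_max, p, rho)
    if rho < 1/4
        Delta = Delta / 4;                    % poor model quality: shrink the region
    elseif rho > 3/4 && norm(p) >= Delta - 1e-12
        Delta = min(2 * Delta, Delta_max);    % good quality on a full step: grow
    end                                       % otherwise keep Delta unchanged
end
```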
### (d) Explain Cauchy point, sufficient decrease and Dogleg method, and the connection between them.

The Cauchy point and Dogleg method are algorithms to compute iteration steps
that stay within the bounds of the trust region. They provide an approximate
solution to the minimization of the quadratic model inside the TR radius.

The Cauchy point is a method providing sufficient decrease (as per the Wolfe
conditions) by essentially performing a gradient descent step with a
particularly chosen step size limited by the TR radius. However, since this
method basically does not exploit the quadratic component of the objective
model (the Hessian is only used as a term in the step length calculation), even
if it provides sufficient decrease and consequently convergence, it is rarely
used as a standalone method to compute iteration steps.
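
A minimal sketch of the Cauchy point computation (following the standard
formula, with `B`, `g` and `Delta` as in the algorithms reported in Point 2):

```matlab
% Cauchy point: steepest descent direction scaled to stay inside the region.
function p = cauchy_point(B, g, Delta)
    gBg = g' * B * g;
    if gBg <= 0
        tau = 1;                                  % non-convex along -g: go to the boundary
    else
        tau = min(1, norm(g)^3 / (Delta * gBg));  % minimizer along -g, capped at the boundary
    end
    p = -tau * (Delta / norm(g)) * g;
end
```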

The Cauchy point is therefore often integrated into another method called
Dogleg, which uses the former algorithm in conjunction with a purely Newton
step to provide steps obtained by a blend of linear and quadratic information.

This blend is achieved by choosing the new iterate by searching along a path
made out of two segments, namely the gradient descent step with optimal step
size and a segment pointing from that point to the pure Newton step. The
peculiar angle between these two segments is the reason the method is nicknamed
"Dogleg", since the resulting path resembles a dog's leg.

In the Dogleg method, the Cauchy point is used in case the trust region is
small enough not to allow the "turn" onto the second segment towards the Newton
step. Thanks to this property and the use of the performance measure $\rho_k$
to grow and shrink the TR radius, the Dogleg method performs well even with
inaccurate quadratic models. Therefore, it still satisfies sufficient decrease
and the Wolfe conditions while delivering superlinear convergence, compared to
the purely linear convergence of Cauchy point steps.
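
A sketch of the standard Dogleg step selection (not necessarily identical to
`dogleg.m`), assuming $B$ is positive definite so that the Newton step
$p^B = -B^{-1}g$ exists:

```matlab
% Dogleg step: follow the steepest descent segment, then turn towards the
% Newton step, stopping where the path crosses the trust region boundary.
function p = dogleg_step(B, g, Delta)
    pU = -(g' * g) / (g' * B * g) * g;   % end of the steepest descent segment
    pB = -(B \ g);                       % full Newton step
    if norm(pB) <= Delta
        p = pB;                          % Newton step already inside the region
    elseif norm(pU) >= Delta
        p = (Delta / norm(pU)) * pU;     % first segment already reaches the boundary
    else
        % Find tau in [0,1] such that ||pU + tau*(pB - pU)|| = Delta.
        d = pB - pU;
        a = d' * d;
        b = 2 * (pU' * d);
        c = pU' * pU - Delta^2;
        tau = (-b + sqrt(b^2 - 4 * a * c)) / (2 * a);
        p = pU + tau * d;
    end
end
```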
### (e) Write down the trust region ratio and explain its meaning.
$$\rho_k = \frac{f(x_k) - f(x_k + p_k)}{m_k(0) - m_k(p_k)}$$

The trust region ratio, or performance measure, $\rho_k$ measures the quality
of the quadratic model built around the current iterate $x_k$: it is the ratio
between the energy decrease from the old iterate to the new one according to
the real energy function and the decrease predicted by the quadratic model
around $x_k$.
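
For illustration, $\rho_k$ can be computed directly from the objective and the
model (here `f` is a function handle for the objective and `m` one for the
model; both names are hypothetical and used only in this sketch):

```matlab
% Performance measure: actual energy reduction over predicted reduction.
rho = @(f, m, x, p) (f(x) - f(x + p)) / (m(zeros(size(p))) - m(p));
```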

The ratio is used to test the adequacy of the current trust region radius. For
an inaccurate quadratic model, the predicted energy decrease would be far from
the actual one, so the radius is shrunk; conversely, a ratio close to 1 allows
the radius to be kept or even enlarged, since the model quality is good.

### (f) Does the energy decrease monotonically when Trust Region method is employed? Justify your answer.
In the trust region method the energy of the iterates does not always decrease
monotonically. This is due to the fact that the algorithm can actively reject a
step if the performance measure $\rho_k$ is less than a given constant $\eta$.
In this case, the new iterate is equal to the old one: no step is taken and
thus the energy does not decrease but stays the same.
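
A tiny illustration of this behaviour, with hypothetical values for $\rho_k$
and $\eta$:

```matlab
% A rejected step (rho < eta) leaves the iterate, and hence the energy,
% unchanged between two consecutive iterations.
f = @(x) 0.5 * x' * [2 -1 0; -1 2 -1; 0 -1 2] * x;    % an arbitrary energy
eta = 0.1;
x = [1; 1; 1];  p = [0.5; -0.2; 0.1];  rho = 0.05;    % hypothetical values
if rho >= eta, x_next = x + p; else, x_next = x; end
assert(f(x_next) == f(x));                            % the energy stayed the same
```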

## Point 2

$x_{k+1} \gets x_k$\;
}
}
\caption{Trust region method}
\end{algorithm}
The Cauchy point algorithm is the following:
$p_k \gets -\tau \cdot \frac{\Delta_k}{\|g\|} \cdot g$\;
\Return{$p_k$}
\caption{Cauchy point}
\end{algorithm}
Finally, the Dogleg method algorithm is the following:

Then the norm of the step $\tilde{p}$ clearly increases as $\tau$ increases.
For the second criterion, we compute the quadratic model for a generic
$\tau \in [0,1]$:

$$m(\tilde{p}(\tau)) = f + g^T \tilde{p}(\tau) + \frac{1}{2}
\tilde{p}(\tau)^T B \tilde{p}(\tau) = f + \tau g^T p^U + \frac12
\tau^2 (p^U)^T B p^U$$
We then recall the definition of $p^U$:

$$p^U = -\frac{g^Tg}{g^TBg}g$$

and plug it into the expression:

$$= f - \tau g^T \frac{g^Tg}{g^TBg}g + \frac{1}{2} \tau^2
\left(\frac{g^Tg}{g^TBg}g\right)^T B \left(\frac{g^Tg}{g^TBg}g\right)$$
$$= f - \tau \cdot \frac{\| g \|^4}{g^TBg} + \frac{1}{2} \tau^2 \cdot
\frac{\| g \|^4}{(g^TBg)^2} \cdot g^TBg$$
$$= f - \tau \cdot \frac{\| g \|^4}{g^TBg} + \frac{1}{2} \tau^2 \cdot
\frac{\| g \|^4}{g^TBg}$$
$$= f + \left(\frac{1}{2} \tau^2 - \tau\right) \cdot \frac{\| g \|^4}{g^TBg}
= f + \left(\frac12\tau^2 - \tau\right) z, \; \text{where } z =
\frac{\| g \|^4}{g^TBg}$$
We then compute the derivative of the model:
$$\frac{dm(\tilde{p}(\tau))}{d\tau} = z\tau - z = z(\tau - 1)$$
We know $z$ is positive because $g^T B g > 0$, as we assume $B$ is positive
definite. Then, since $\tau \in [0,1]$, the derivative is always $\leq 0$ and
therefore we have proven that the quadratic model of the Dogleg step decreases
as $\tau$ increases.
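
This claim is easy to verify numerically as well; a small sketch, with an
arbitrary SPD matrix and gradient chosen only for illustration:

```matlab
% Check that m(p_tilde(tau)) is non-increasing on the first Dogleg segment.
B = [2 -1 0; -1 2 -1; 0 -1 2];            % an SPD matrix
g = [1; 2; 3];                            % an arbitrary gradient
f = 0;
pU = -(g' * g) / (g' * B * g) * g;        % unconstrained minimizer along -g
m = @(p) f + g' * p + 0.5 * p' * B * p;   % quadratic model
taus = linspace(0, 1, 101);
vals = arrayfun(@(t) m(t * pU), taus);
assert(all(diff(vals) <= 1e-12));         % model values never increase
```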

Now we show that the two claims hold also for $\tau \in [1,2]$. We define a
function $h(\alpha)$ (where $\alpha = \tau - 1$) with the same gradient "sign"
as $\|\tilde{p}(\tau)\|$ and we show that this function increases:

$$h(\alpha) = \frac12 \|\tilde{p}(1 + \alpha)\|^2 = \frac12 \|p^U + \alpha(p^B -
p^U)\|^2 = \frac12 \|p^U\|^2 + \frac12 \alpha^2 \|p^B - p^U\|^2 + \alpha (p^U)^T
(p^B - p^U)$$

We now take the derivative of $h(\alpha)$ and we show that it is always
positive, i.e. that $h(\alpha)$ always has a positive gradient and thus that it
is increasing w.r.t. $\alpha$:

$$h'(\alpha) = \alpha \|p^B - p^U\|^2 + (p^U)^T (p^B - p^U) \geq (p^U)^T (p^B -
p^U) = \frac{g^Tg}{g^TBg}g^T\left(- \frac{g^Tg}{g^TBg}g + B^{-1}g\right) =$$$$=
\|g\|^2 \frac{g^TB^{-1}g}{g^TBg}\left(1 -
\frac{\|g\|^4}{(g^TBg)(g^TB^{-1}g)}\right)$$

Since we know $B$ is symmetric and positive definite, then $B^{-1}$ is as well.
Therefore, the term outside of the parentheses is always positive, and it
remains to show that the term inside is nonnegative, i.e. that
$\|g\|^4 \leq (g^TBg)(g^TB^{-1}g)$.

We can prove that ${\langle x, y \rangle}_B = x^TBy$ defines an inner product
space by proving all properties of such space:

- **Linearity w.r.t. the first argument:**

    $\alpha {\langle x, y \rangle}_B + \beta {\langle z,
    y \rangle}_B = \alpha \cdot x^TBy + \beta \cdot z^TBy = (\alpha x + \beta
    z)^TBy = {\langle (\alpha x + \beta z), y \rangle}_B$;

- **Symmetry:**

    ${\langle x, y \rangle}_B = x^TBy = (x^TBy)^T = y^TB^Tx = y^TBx =
    {\langle y, x \rangle}_B$, using the symmetry of $B$;

- **Positive definiteness:**

    ${\langle x, x \rangle_B} = x^T B x > 0$ holds for all $x \neq 0$, since
    $B$ is positive definite.

Since ${\langle x, y \rangle}_B$ is indeed an inner product, then:

$${\langle g, B^{-1} g \rangle}_B^2 \leq {\langle g, g \rangle}_B \, {\langle B^{-1}
g, B^{-1} g \rangle}_B$$

holds according to the Cauchy-Schwarz inequality. Now, if we expand each inner
product, we obtain exactly $\|g\|^4 \leq (g^TBg)(g^TB^{-1}g)$, since
${\langle g, B^{-1} g \rangle}_B = g^Tg$, ${\langle g, g \rangle}_B = g^TBg$
and ${\langle B^{-1} g, B^{-1} g \rangle}_B = g^TB^{-1}g$. Therefore
$h'(\alpha) \geq 0$ and the norm of the Dogleg step increases on the second
segment as well.

Finally, we check that the quadratic model keeps decreasing on the second
segment by plugging the Dogleg step in the quadratic model:

$$\hat{h}(\alpha) = m(\tilde{p}(1+\alpha)) = f + g^T (p^U + \alpha (p^B - p^U)) +
\frac12 (p^U + \alpha (p^B - p^U))^T B (p^U + \alpha (p^B - p^U)) = $$$$ =
f + g^T p^U + \alpha g^T (p^B - p^U) + \frac12 (p^U)^T B p^U + \frac12 \alpha (p^U)^T B
(p^B - p^U) + \frac12 \alpha (p^B - p^U)^T B p^U + \frac12 \alpha^2
(p^B - p^U)^T B (p^B - p^U)$$
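
As a final sanity check of both claims (and of the Cauchy-Schwarz bound used
above), a quick numerical sketch with randomly generated SPD data, purely for
illustration:

```matlab
% Verify numerically that along the Dogleg path p_tilde(tau) the step norm
% increases and the model value decreases, and that the bound
% ||g||^4 <= (g'*B*g)*(g'*inv(B)*g) holds.
rng(0);
n = 5;
M = randn(n); B = M * M' + n * eye(n);    % a random SPD matrix
g = randn(n, 1);
f = 0;
pU = -(g' * g) / (g' * B * g) * g;
pB = -(B \ g);
ptilde = @(t) (t <= 1) * t * pU + (t > 1) * (pU + (t - 1) * (pB - pU));
m = @(p) f + g' * p + 0.5 * p' * B * p;
taus = linspace(0, 2, 201);
norms = arrayfun(@(t) norm(ptilde(t)), taus);
vals  = arrayfun(@(t) m(ptilde(t)), taus);
assert(all(diff(norms) >= -1e-10) && all(diff(vals) <= 1e-10));
assert(norm(g)^4 <= (g' * B * g) * (g' * (B \ g)) + 1e-10);
```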