midterm: done
parent 5026da324a
commit 27da90d89b
2 changed files with 128 additions and 41 deletions

@@ -15,6 +15,16 @@ header-includes:

---

\maketitle

# Acknowledgements on group work

- Gianmarco De Vita suggested the use of MATLAB's equation solver for parts
  of `dogleg.m`'s implementation.
- I have discussed my solutions for exercise 1.2 and exercise 3 with several
  people, namely:
  - Gianmarco De Vita
  - Tommaso Rodolfo Masera
  - Andrea Brites Marto

# Exercise 1

## Point 1

@@ -61,6 +71,25 @@ converges to the minimizer in one iteration.

The right answer is choice (a), since the energy norm of the error indeed always
decreases monotonically.

The proof I will provide is independent of the given objective
and the given number of iterations, and it works for all choices of $A$ where
$A$ is symmetric and positive definite (SPD).

Therefore, first of all I will prove that $A$ is indeed SPD by computing its
eigenvalues.

$$CP(A) = \det\left(\begin{bmatrix}2&-1&0\\-1&2&-1\\0&-1&2\end{bmatrix} - \lambda I \right) =
\det\left(\begin{bmatrix}2 - \lambda&-1&0\\-1&2 -
\lambda&-1\\0&-1&2-\lambda\end{bmatrix} \right) = -\lambda^3 + 6
\lambda^2 - 10\lambda + 4$$

$$CP(A) = 0 \Leftrightarrow \lambda = 2 \lor \lambda = 2 \pm \sqrt{2}$$

Therefore we have 3 eigenvalues and they are all positive, so $A$ is positive
definite, and it is clearly symmetric as well.
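
As a quick numerical cross-check of these eigenvalues (a side sketch in
Python/NumPy, not part of the submitted MATLAB code):

```python
import numpy as np

# The matrix A from the exercise.
A = np.array([[ 2., -1.,  0.],
              [-1.,  2., -1.],
              [ 0., -1.,  2.]])

# A is SPD iff it is symmetric and all its eigenvalues are strictly positive.
eigenvalues = np.linalg.eigvalsh(A)   # eigvalsh exploits the symmetry of A
print(eigenvalues)                    # approx [0.586, 2.0, 3.414] = 2 - sqrt(2), 2, 2 + sqrt(2)
assert np.allclose(A, A.T) and np.all(eigenvalues > 0)
```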

Now we switch to the general proof of the monotonicity.

To prove that this is true, we first consider a way to express any iterate $x_k$
as a function of the minimizer $x_s$ and of the remaining iterations:

@@ -148,10 +177,25 @@ monotonically decreases.

## Point 1

### (a) For which kind of minimization problems can the trust region method be used? What are the assumptions on the objective function?

The trust region method is an algorithm that can be used for unconstrained
minimization. The trust region method uses parts of the gradient descent and
Newton methods, and thus it accepts essentially the same class of objectives
that these two methods accept.

These constraints require the objective function $f(x)$ to be twice
differentiable in order to make building a quadratic model around an arbitrary
point possible. In addition, our assumptions w.r.t. the scope of this course
require that $f(x)$ should be continuous up to the second derivatives.
This is needed for the Hessian to be symmetric (by Schwarz's theorem),
which is an assumption that significantly simplifies proofs related to the
method (namely Exercise 3 in this assignment).

Finally, as with all the other unconstrained minimization methods we covered in
this course, the trust region method is only able to find a local minimizer
close to the chosen starting point, and the computed minimizer is by no means
guaranteed to be a global minimizer.

### (b) Write down the quadratic model around a current iterate xk and explain the meaning of each term.

@@ -163,7 +207,7 @@ Here's an explanation of the meaning of each term:

(length);
- $f$ is the energy function value at the current iterate, i.e. $f(x_k)$;
- $p$ is the trust region step, the solution of $\arg\min_p m(p)$ with $\|p\| <
\Delta$, i.e. the optimal step to take;
- $g$ is the gradient at the current iterate $x_k$, i.e. $\nabla f(x_k)$;
- $B$ is the Hessian at the current iterate $x_k$, i.e. $\nabla^2 f(x_k)$.

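For concreteness, a minimal Python/NumPy sketch that evaluates this model at a
candidate step (the values of `f_k`, `g`, `B` and `p` below are illustrative
assumptions, not data from the exercise):

```python
import numpy as np

def quadratic_model(f_k, g, B, p):
    """m(p) = f + g^T p + 1/2 p^T B p, the model of the objective around x_k."""
    return f_k + g @ p + 0.5 * p @ B @ p

# Illustrative values: f(x_k), gradient and Hessian at x_k, and a candidate step p.
f_k = 3.0
g = np.array([1.0, -2.0])
B = np.array([[2.0, 0.0],
              [0.0, 4.0]])
p = np.array([-0.5, 0.5])   # a candidate step inside the trust region
print(quadratic_model(f_k, g, B, p))
```
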
@@ -172,40 +216,54 @@ Here's an explanation of the meaning of each term:

The role of the trust region radius is to put an upper bound on the step length
in order to avoid "overly ambitious" steps, i.e. steps where the step length
is considerably long and the quadratic model of the objective is low-quality
(i.e. the performance measure $\rho_k$ in the TR algorithm indicates a significant
energy difference between the true objective and the quadratic model).

In layman's terms, the trust region radius makes the method switch between more
gradient-based and more quadratic-based steps according to the "confidence"
(measured in terms of $\rho_k$) in the computed quadratic model.

### (d) Explain Cauchy point, sufficient decrease and Dogleg method, and the connection between them.

The Cauchy point and Dogleg method are algorithms to compute iteration steps
that stay within the bounds of the trust region. They provide an approximate
solution to the minimization of the quadratic model inside the TR radius.

The Cauchy point is a method providing sufficient decrease (as per the Wolfe
conditions) by essentially performing a gradient descent step with a
particularly chosen step size limited by the TR radius. However, since this
method basically does not exploit the quadratic component of the objective model
(the Hessian is only used as a term in the step length calculation), even if it
provides sufficient decrease and consequently convergence, it is rarely used
as a standalone method to compute iteration steps.

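A minimal Python/NumPy sketch of such a Cauchy point step, following the usual
textbook formula $p^C = -\tau \frac{\Delta_k}{\|g\|} g$ with $\tau$ clipped at
$1$ (the function name and signature are my own illustration, not the submitted
`dogleg.m` code):

```python
import numpy as np

def cauchy_point(g, B, delta):
    """Cauchy point: steepest descent direction, step length capped by the TR radius delta."""
    g_norm = np.linalg.norm(g)
    gBg = g @ B @ g
    if gBg <= 0:
        tau = 1.0                                    # model unbounded along -g: go to the boundary
    else:
        tau = min(1.0, g_norm ** 3 / (delta * gBg))  # unconstrained minimizer along -g, clipped at 1
    return -tau * (delta / g_norm) * g
```
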
The Cauchy point is therefore often integrated in another method called Dogleg,
which uses the former algorithm in conjunction with a purely Newton step to
provide steps obtained by a blend of linear and quadratic information.

This blend is achieved by choosing the new iterate by searching along a path
made out of two segments, namely the gradient descent step with optimal step
size and a segment pointing from that point to the pure Newton step. The
peculiar angle between these two segments is the reason the method is nicknamed
"Dogleg", since the resulting path resembles a dog's leg.

In the Dogleg method, the Cauchy point is used in case the trust region is small
enough not to allow the "turn" onto the second segment towards the Newton step.
Thanks to this property and the use of the performance measure $\rho_k$ to grow
and shrink the TR radius, the Dogleg method performs well even with inaccurate
quadratic models. Therefore, it still satisfies sufficient decrease and the
Wolfe conditions while delivering superlinear convergence, compared to the
purely linear convergence of Cauchy point steps.

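A compact Python/NumPy sketch of the Dogleg step just described (again an
illustrative sketch with my own naming, assuming $B$ is SPD so that the Newton
step $p^B = -B^{-1}g$ exists):

```python
import numpy as np

def dogleg_step(g, B, delta):
    """Dogleg step: follow -g up to p_U, then turn towards the Newton step p_B,
    truncating the path at the trust region boundary of radius delta."""
    p_u = -(g @ g) / (g @ B @ g) * g      # end of the first (gradient) segment
    p_b = -np.linalg.solve(B, g)          # full Newton step
    if np.linalg.norm(p_b) <= delta:      # Newton step already inside the region
        return p_b
    if np.linalg.norm(p_u) >= delta:      # cannot even reach p_U: scaled gradient step
        return delta * p_u / np.linalg.norm(p_u)
    # Otherwise solve ||p_U + tau (p_B - p_U)|| = delta for tau in (0, 1].
    d = p_b - p_u
    a, b, c = d @ d, 2 * (p_u @ d), (p_u @ p_u) - delta ** 2
    tau = (-b + np.sqrt(b ** 2 - 4 * a * c)) / (2 * a)
    return p_u + tau * d
```
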
### (e) Write down the trust region ratio and explain its meaning.

$$\rho_k = \frac{f(x_k) - f(x_k + p_k)}{m_k(0) - m_k(p_k)}$$

The trust region ratio, or performance measure, $\rho_k$ measures the quality of
the quadratic model built around the current iterate $x_k$: it is the ratio
between the energy decrease from the old to the new iterate according to the
real energy function and the decrease predicted by the quadratic model around
$x_k$.

The ratio is used to test the adequacy of the current trust region radius. For
an inaccurate quadratic model, the predicted energy decrease would be

@@ -220,7 +278,11 @@ since the model quality is good.

### (f) Does the energy decrease monotonically when Trust Region method is employed? Justify your answer.

In the trust region method the energy of the iterates does not always decrease
monotonically. This is due to the fact that the algorithm can actively reject
a step if the performance measure $\rho_k$ is less than a given constant
$\eta$. In this case, the new iterate is equal to the old one, no step is taken,
and thus the energy does not decrease but stays the same.

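A small Python sketch of how $\rho_k$ drives both the accept/reject decision and
the radius update (the thresholds $0.25$, $0.75$ and the growth factor $2$
follow the usual textbook scheme and are assumptions here, not necessarily the
exact constants of my implementation):

```python
def trust_region_update(rho, step_norm, delta, delta_max, eta=0.1):
    """Update the TR radius from rho and decide whether to accept the step."""
    if rho < 0.25:
        delta = 0.25 * delta                       # poor model quality: shrink the region
    elif rho > 0.75 and abs(step_norm - delta) < 1e-12:
        delta = min(2.0 * delta, delta_max)        # good model, step on the boundary: grow it
    accept = rho > eta                             # when rho <= eta the step is rejected (x_{k+1} = x_k)
    return delta, accept
```
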
## Point 2

@@ -250,7 +312,7 @@ and $\eta \in [0, \frac14)$\;

$x_{k+1} \gets x_k$\;
}
}
\caption{Trust region method}
\end{algorithm}

The Cauchy point algorithm is the following:

@@ -266,7 +328,7 @@ Input $B$ (quadratic term), $g$ (linear term), $\Delta_k$\;

$p_k \gets -\tau \cdot \frac{\Delta_k}{\|g\|} \cdot g$\;
\Return{$p_k$}
\caption{Cauchy point}
\end{algorithm}

Finally, the Dogleg method algorithm is the following:

@@ -380,20 +442,42 @@ Then the norm of the step $\tilde{p}$ clearly increases as $\tau$

increases. For the second criterion, we compute the quadratic model for a
generic $\tau \in [0,1]$:

$$m(\tilde{p}(\tau)) = f + g^T \tilde{p}(\tau) + \frac{1}{2}
\tilde{p}(\tau)^T B \tilde{p}(\tau) = f + g^T \tau p^U + \frac12
\tau (p^U)^T B \tau p^U$$

We then recall the definition of $p^U$:

$$p^U = -\frac{g^Tg}{g^TBg}g$$

and plug it into the expression:

$$ = f + g^T \tau \left(- \frac{g^Tg}{g^TBg}g\right) + \frac{1}{2} \tau \left(-\frac{g^Tg}{g^TBg}g\right)^T B \, \tau \left(- \frac{g^Tg}{g^TBg}g\right) $$
$$ = f - \tau \cdot \frac{\| g \|^4}{g^TBg} + \frac{1}{2} \tau^2 \cdot \left(\frac{\|g\|^2}{g^T B g}\right) g^T B \, \frac{g^T g}{g^TBg}g $$
$$ = f - \tau \cdot \frac{\| g \|^4}{g^TBg} + \frac{1}{2} \tau^2 \cdot \frac{\| g \|^4}{(g^TBg)^2} \cdot g^TBg $$
$$ = f - \tau \cdot \frac{\| g \|^4}{g^TBg} + \frac{1}{2} \tau^2 \cdot \frac{\| g \|^4}{g^TBg} $$
$$ = f + \left(\frac{1}{2} \tau^2 - \tau\right) \cdot \frac{\| g \|^4}{g^TBg} = f + \left(\frac12 \tau^2 - \tau\right) z, \; \text{where } z = \frac{\| g \|^4}{g^TBg} $$

We then compute the derivative of the model:

$$\frac{dm(\tilde{p}(\tau))}{d\tau} = z\tau - z = z(\tau - 1)$$

We know $z$ is positive because $g^T B g > 0$, as we assume $B$ is positive
definite. Then, since $\tau \in [0,1]$, the derivative is always $\leq 0$ and
therefore we have proven that the quadratic model of the Dogleg step decreases
as $\tau$ increases.

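Both claims (the step norm grows and the model value decreases along the Dogleg
path) can also be sanity-checked numerically; a small Python/NumPy sketch with a
random SPD $B$ (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
B = M @ M.T + 4 * np.eye(4)                    # a random SPD matrix
g = rng.standard_normal(4)

p_u = -(g @ g) / (g @ B @ g) * g               # unconstrained minimizer along -g
p_b = -np.linalg.solve(B, g)                   # Newton step

def p_tilde(tau):
    return tau * p_u if tau <= 1 else p_u + (tau - 1) * (p_b - p_u)

def model(p):
    return g @ p + 0.5 * p @ B @ p             # m(p) - f; the constant f is irrelevant here

taus = np.linspace(0, 2, 201)
norms = [np.linalg.norm(p_tilde(t)) for t in taus]
values = [model(p_tilde(t)) for t in taus]
assert all(np.diff(norms) >= -1e-12)           # ||p~(tau)|| is non-decreasing
assert all(np.diff(values) <= 1e-12)           # m(p~(tau)) is non-increasing
```
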
Now we show that the two claims on gradients hold also for $\tau \in [1,2]$. We
define a function $h(\alpha)$ (where $\alpha = \tau - 1$) with the same gradient
"sign" as $\|\tilde{p}(\tau)\|$ and we show that this function increases:

$$h(\alpha) = \frac12 \|\tilde{p}(1 + \alpha)\|^2 = \frac12 \|p^U + \alpha(p^B -
p^U)\|^2 = \frac12 \|p^U\|^2 + \frac12 \alpha^2 \|p^B - p^U\|^2 + \alpha (p^U)^T
(p^B - p^U)$$

@@ -401,9 +485,10 @@ We now take the derivative of $h(\alpha)$ and we show it is always positive,

i.e. that $h(\alpha)$ always has a positive gradient and thus that it is
increasing w.r.t. $\alpha$:

$$h'(\alpha) = \alpha \|p^B - p^U\|^2 + (p^U)^T (p^B - p^U) \geq (p^U)^T (p^B -
p^U) = \frac{g^Tg}{g^TBg}g^T\left(- \frac{g^Tg}{g^TBg}g + B^{-1}g\right) =$$$$=
\|g\|^2 \frac{g^TB^{-1}g}{g^TBg}\left(1 -
\frac{\|g\|^4}{(g^TBg)(g^TB^{-1}g)}\right) $$

Since we know $B$ is symmetric and positive definite, then $B^{-1}$ is as well.
Therefore, we know that the term outside of the parenthesis is always positive

@@ -420,8 +505,9 @@ by proving all properties of such space:

- **Linearity w.r.t. the first argument:**

  $\alpha {\langle x, y \rangle}_B + \beta {\langle z,
  y \rangle}_B = \alpha \cdot x^TBy + \beta \cdot z^TBy = (\alpha x + \beta z)^TBy
  = {\langle (\alpha x + \beta z), y \rangle}_B$;

- **Symmetry:**

@@ -430,11 +516,12 @@ by proving all properties of such space:

- **Positive definiteness:**

  ${\langle x, x \rangle_B} = x^T B x > 0$ is true for all $x \neq 0$ since $B$
  is positive definite.

Since ${\langle x, y \rangle}_B$ is indeed an inner product, then:

$${\langle g, B^{-1} g \rangle}_B^2 \leq {\langle g, g \rangle}_B {\langle B^{-1}
g, B^{-1} g \rangle}_B$$

holds according to the Cauchy-Schwarz inequality. Now, if we expand each inner

@@ -455,7 +542,7 @@ plugging the Dogleg step in the quadratic model:

$$\hat{h}(\alpha) = m(\tilde{p}(1+\alpha)) = f + g^T (p^U + \alpha (p^B - p^U)) +
\frac12 (p^U + \alpha (p^B - p^U))^T B (p^U + \alpha (p^B - p^U)) = $$$$ =
f + g^T p^U + \alpha g^T (p^B - p^U) + \frac12 (p^U)^T B p^U + \frac12 \alpha (p^U)^T B
(p^B - p^U) + \frac12 \alpha (p^B - p^U)^T B p^U + \frac12 \alpha^2
(p^B - p^U)^T B (p^B - p^U)$$

Binary file not shown.