midterm: done 1

Claudio Maggioni (maggicl) 2021-05-03 11:18:45 +02:00
parent 0220738cc6
commit 87746f945f
7 changed files with 137 additions and 182 deletions


@@ -48,6 +48,13 @@ activated uses a fixed $\alpha=1$ despite the indications on the assignment
sheet. This was done in order to comply with the forum post on iCorsi found
here: \url{https://www.icorsi.ch/mod/forum/discuss.php?d=81144}.
Here is a plot of the Rosenbrock function in 3D, with our starting point $(0,0)$ in red
and the true minimizer $(1,1)$ in black:
\begin{center}
\resizebox{0.6\textwidth}{0.6\textwidth}{\includegraphics{rosenb.jpg}}
\end{center}
\subsection{Exercise 1.2}
Please consult the MATLAB implementation in the file \texttt{main.m} in section
@@ -131,6 +138,12 @@ zig-zagging pattern with iterates in the vicinity of the Newton + backtracking
iterates. Finally, GD without backtracking quickly degenerates, as can be
seen in the enlarged plot.
From the initial plot of the Rosenbrock function, we can see why the algorithms using
backtracking roughly follow the $(0,0)$--$(1,1)$ diagonal: since this region is effectively a
``valley'' in the 3D energy landscape, due to the nature of the backtracking methods and their
strict adherence to the Wolfe conditions these methods avoid wild ``climbs'' (unlike the Newton
method without backtracking) and instead pick iterates with either sufficient decrease or only
a modest increase.
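For reference, the backtracking rule referred to throughout this section has the
following shape (a minimal Armijo sufficient-decrease sketch with illustrative parameter
values, not necessarily identical to the implementation used for the results):
\begin{verbatim}
function alpha = backtracking(f, x, p, g, alpha0)
% Minimal Armijo backtracking sketch. f is a function handle taking a
% column vector, x the current iterate, p the search direction, g the
% gradient at x, alpha0 the initial step length.
rho = 0.5;   % step shrinking factor (illustrative value)
c   = 1e-4;  % sufficient-decrease constant (illustrative value)
alpha = alpha0;
while f(x + alpha * p) > f(x) + c * alpha * (g' * p)
    alpha = rho * alpha;
end
end
\end{verbatim}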
When looking at gradient norms and objective function
values (figure \ref{fig:gsppn}) over time, the degeneration of GD without
backtracking and the inefficiency of GD with backtracking can clearly be seen.
@@ -141,6 +154,16 @@ at gradient $\nabla f(x_1) \approx 450$ and objective value $f(x_1) \approx
100$, but both values quickly decrease to 0 at its second iteration,
achieving convergence.
What has been observed in this assignment matches the theory behind the methods.
Since the Newton method converges quadratically and uses second-order information (i.e.
the Hessian in the computation of the step direction), the number of iterations required
to find the minimizer when starting already close to it (as we do with $x_0 = [0;0]$) is
significantly smaller than for linearly convergent methods like gradient descent. However,
it must be said that for an objective with a high number of dimensions a single iteration
of a second-order method is significantly more costly than a single iteration of a
first-order method, due to the quadratically growing number of entries in the Hessian
matrix, which becomes harder and harder to compute as the number of dimensions increases.
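To make the cost difference concrete, the following sketch (an illustration only, not
the \texttt{Newton.m} / \texttt{GD.m} code used to produce the results) shows what a
single step of each method has to compute on the Rosenbrock function: the Newton step
builds the Hessian and solves a linear system, while the GD step only needs the gradient.
\begin{verbatim}
syms x y;
f = (1 - x)^2 + 100 * (y - x^2)^2;
g = matlabFunction(gradient(f, [x, y]));  % g(x, y) -> 2x1 gradient
H = matlabFunction(hessian(f, [x, y]));   % H(x, y) -> 2x2 Hessian
x0 = [0; 0];
gv = g(x0(1), x0(2));
Hm = H(x0(1), x0(2));
p_newton = -Hm \ gv;        % Newton: solve an n-by-n linear system
p_gd     = -gv;             % GD: only the gradient is needed
x1_newton = x0 + p_newton;  % alpha = 1, as fixed in the assignment
x1_gd     = x0 + p_gd;
\end{verbatim}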
\section{Exercise 2}
\subsection{Exercise 2.1}
@@ -212,4 +235,10 @@ the opposite direction of the minimizer. This behaviour can also be observed in
the plots in figure \ref{fig:4}, where several bumps and spikes are present in
the gradient norm plot and small plateaus can be found in the objective
function value plot.
Comparing these results with the theory behind BFGS, we can say that the results
obtained fall within what we expect from the theory. Since BFGS converges superlinearly
but not quadratically, its ``speed'' in terms of number of iterations falls between that
of linearly convergent methods (like GD) and quadratically convergent methods (like Newton).
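For completeness, the update that characterises BFGS is sketched below (illustrative
notation only; the \texttt{BGFS.m} implementation may organise the computation
differently):
\begin{verbatim}
function Hk1 = bfgs_update(Hk, s, y)
% One BFGS update of the inverse Hessian approximation (sketch).
% s = x_{k+1} - x_k,  y = grad f(x_{k+1}) - grad f(x_k),
% Hk = current inverse Hessian approximation (e.g. eye(n) initially).
rho = 1 / (y' * s);
I = eye(length(s));
Hk1 = (I - rho * (s * y')) * Hk * (I - rho * (y * s')) + rho * (s * s');
% The next search direction is then p = -Hk1 * grad f(x_{k+1}).
end
\end{verbatim}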
\end{document}


@@ -1,176 +0,0 @@
%% Homework 3 - Optimization Methods
% Author: Claudio Maggioni
%
% Note: exercises are not in the right order due to matlab constraints of
% functions inside of scripts.
clc
clear
close all
% Set to non-zero to generate LaTeX for graphs
enable_m2tikz = 0;
if enable_m2tikz
    addpath /home/claudio/git/matlab2tikz/src
else
    matlab2tikz = @(a,b,c) 0;
end
syms x y;
f = (1 - x)^2 + 100 * (y - x^2)^2;
global fl
fl = matlabFunction(f);
%% 1.3 - Newton and GD solutions and energy plot
[x1, xs1, gs1] = Newton(f, [0;0], 50000, 1e-6, true);
plot(xs1(1, :), xs1(2, :), 'Marker', '.');
fprintf("Newton backtracking: it=%d\n", size(xs1, 2)-1);
xlim([-0.01 1.01])
ylim([-0.01 1.01])
hold on;
[x2, xs2, gs2] = Newton(f, [0;0], 50000, 1e-6, false);
plot(xs2(1, :), xs2(2, :), 'Marker', '.');
fprintf("Newton: it=%d\n", size(xs2, 2)-1);
[x3, xs3, gs3] = GD(f, [0;0], 50000, 1e-6, true);
plot(xs3(1, :), xs3(2, :), 'Marker', '.');
fprintf("GD backtracking: it=%d\n", size(xs3, 2)-1);
[x4, xs4, gs4] = GD(f, [0;0], 50000, 1e-6, false);
plot(xs4(1, :), xs4(2, :), 'Marker', '.');
fprintf("GD: it=%d\n", size(xs4, 2)-1);
hold off;
legend('Newton + backtracking', 'Newton', 'GD + backtracking', 'GD (alpha=1)')
sgtitle("Iterations of Newton and Gradient descent methods over 2D energy landscape");
matlab2tikz('showInfo', false, './ex1-3.tex');
figure;
plot(xs4(1, :), xs4(2, :), 'Marker', '.');
legend('GD (alpha=1)')
sgtitle("Iterations of Newton and Gradient descent methods over 2D energy landscape");
matlab2tikz('showInfo', false, './ex1-3-large.tex');
%% 2.3 - BGFS solution and energy plot
figure;
[x5, xs5, gs5] = BGFS(f, [0;0], eye(2), 50000, 1e-6);
xlim([-0.01 1.01])
ylim([-0.01 1.01])
plot(xs5(1, :), xs5(2, :), 'Marker', '.');
fprintf("BGFS backtracking: it=%d\n", size(xs5, 2)-1);
sgtitle("Iterations of BGFS method over 2D energy landscape");
matlab2tikz('showInfo', false, './ex2-3.tex');
%% 1.4 - Newton and GD gradient norm log
figure;
semilogy(0:size(xs1, 2)-1, gs1, 'Marker', '.');
ylim([5e-10, 1e12]);
xlim([-1, 30]);
hold on
semilogy(0:size(xs2, 2)-1, gs2, 'Marker', '.');
semilogy(0:size(xs3, 2)-1, gs3, 'Marker', '.');
semilogy(0:size(xs4, 2)-1, gs4, 'Marker', '.');
hold off
legend('Newton + backtracking', 'Newton', 'GD + backtracking', 'GD (alpha=1)')
sgtitle("Gradient norm w.r.t. iteration number for Newton and GD methods");
matlab2tikz('showInfo', false, './ex1-4-grad.tex');
figure;
plot(0:size(xs1, 2)-1, gs1, 'Marker', '.');
ylim([5e-10, 25]);
xlim([-1, 30]);
hold on
plot(0:size(xs2, 2)-1, gs2, 'Marker', '.');
plot(0:size(xs3, 2)-1, gs3, 'Marker', '.');
plot(0:size(xs4, 2)-1, gs4, 'Marker', '.');
hold off
legend('Newton + backtracking', 'Newton', 'GD + backtracking', 'GD (alpha=1)')
sgtitle("Gradient norm w.r.t. iteration number for Newton and GD methods");
matlab2tikz('showInfo', false, './ex1-4-grad.tex');
figure;
semilogy(0:size(xs3, 2)-1, gs3, 'Marker', '.');
ylim([1e-7, 1e10]);
hold on
semilogy(0:size(xs4, 2)-1, gs4, 'Marker', '.');
hold off
legend('GD + backtracking', 'GD (alpha=1)')
sgtitle("Gradient norm w.r.t. iteration number for Newton and GD methods");
matlab2tikz('showInfo', false, './ex1-4-grad-large.tex');
figure;
ys1 = funvalues(xs1);
ys2 = funvalues(xs2);
ys3 = funvalues(xs3);
ys4 = funvalues(xs4);
ys5 = funvalues(xs5);
semilogy(0:size(xs1, 2)-1, ys1, 'Marker', '.');
ylim([5e-19, 1e12]);
xlim([-1, 30]);
hold on
semilogy(0:size(xs2, 2)-1, ys2, 'Marker', '.');
semilogy(0:size(xs3, 2)-1, ys3, 'Marker', '.');
semilogy(0:size(xs4, 2)-1, ys4, 'Marker', '.');
hold off
legend('Newton + backtracking', 'Newton', 'GD + backtracking', 'GD (alpha=1)')
sgtitle("Objective function value w.r.t. iteration number for Newton and GD methods");
matlab2tikz('showInfo', false, './ex1-4-ys.tex');
ys1 = funvalues(xs1);
ys2 = funvalues(xs2);
ys3 = funvalues(xs3);
ys4 = funvalues(xs4);
ys5 = funvalues(xs5);
semilogy(0:size(xs1, 2)-1, ys1, 'Marker', '.');
ylim([5e-19, 20]);
xlim([-1, 30]);
hold on
semilogy(0:size(xs2, 2)-1, ys2, 'Marker', '.');
semilogy(0:size(xs3, 2)-1, ys3, 'Marker', '.');
semilogy(0:size(xs4, 2)-1, ys4, 'Marker', '.');
hold off
legend('Newton + backtracking', 'Newton', 'GD + backtracking', 'GD (alpha=1)')
sgtitle("Objective function value w.r.t. iteration number for Newton and GD methods");
matlab2tikz('showInfo', false, './ex1-4-ys.tex');
figure;
semilogy(0:size(xs3, 2)-1, ys3, 'Marker', '.');
semilogy(0:size(xs4, 2)-1, ys4, 'Marker', '.');
legend('GD + backtracking', 'GD (alpha=1)')
sgtitle("Objective function value w.r.t. iteration number for Newton and GD methods");
matlab2tikz('showInfo', false, './ex1-4-ys-large.tex');
%% 2.4 - BGFS gradient norms plot
figure;
semilogy(0:size(xs5, 2)-1, gs5, 'Marker', '.');
sgtitle("Gradient norm w.r.t. iteration number for BGFS method");
matlab2tikz('showInfo', false, './ex2-4-grad.tex');
%% 2.4 - BGFS objective values plot
figure;
semilogy(0:size(xs5, 2)-1, ys5, 'Marker', '.');
sgtitle("Objective function value w.r.t. iteration number for BGFS methods");
matlab2tikz('showInfo', false, './ex2-4-ys.tex');
function ys = funvalues(xs)
    ys = zeros(1, size(xs, 2));
    global fl
    for i = 1:size(xs, 2)
        ys(i) = fl(xs(1,i), xs(2,i));
    end
end


@@ -6,7 +6,6 @@
clc
clear
close all
% Set to non-zero to generate LaTeX for graphs
enable_m2tikz = 0;
@@ -16,11 +15,31 @@ else
matlab2tikz = @(a,b,c) 0;
end
%% 1.1 - Rosenbrock function definition and surf plot
close all
syms x y;
f = (1 - x)^2 + 100 * (y - x^2)^2;
global fl
fl = matlabFunction(f);
xs = -0.2:0.01:1.2;
Z = zeros(size(xs, 2), size(xs, 2));
for i = 1:size(xs, 2)
    for j = 1:size(xs, 2)
        % surf(X, Y, Z) places Z(row, col) at (X(col), Y(row)),
        % so store f(xs(i), xs(j)) at Z(j, i)
        Z(j, i) = fl(xs(i), xs(j));
    end
end
surf(xs, xs, Z, 'EdgeColor', 'none', 'FaceAlpha', 0.4);
hold on
plot3([0, 1], [0, 1], [fl(0,0), fl(1,1)], '-r.');  % starting point (red) and segment to the minimizer
plot3(1, 1, fl(1,1), '-k.');                       % true minimizer (black)
hold off
figure;
figure;
%% 1.2 - Minimizing the Rosenbrock function
[x1, xs1, gs1] = Newton(f, [0;0], 50000, 1e-6, true);

Binary file not shown (new image, 56 KiB).


@@ -18,12 +18,12 @@ header-includes:
### Question (a)
As already covered in the course, the gradient of a standard quadratic form at a
point $x_0$ is equal to:
$$ \nabla f(x_0) = A x_0 - b $$
Plugging in the definition of $x_0$ and knowing that $\nabla f(x_m) = A x_m - b
= 0$ (according to the first necessary condition for a minimizer), we obtain:
$$ \nabla f(x_0) = A (x_m + v) - b = A x_m + A v - b = b + \lambda v - b =
\lambda v $$
@@ -52,7 +52,90 @@ $\alpha \neq \frac{1}{\lambda}$.
Therefore, since $x_1 = x_m$, we have proven SD
converges to the minimizer in one iteration.
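A quick numerical sanity check of this result (a sketch with an arbitrarily chosen SPD
matrix; any SPD $A$, eigenpair $(\lambda, v)$ and starting point $x_0 = x_m + v$ should
behave the same way):

```matlab
% Steepest descent with exact line search reaches the minimizer of
% f(x) = 1/2 x'Ax - b'x in one step when x0 = xm + v, v an eigenvector of A.
A = [4 1; 1 3];                    % arbitrary SPD matrix (illustrative)
xm = [1; 2];                       % chosen minimizer (illustrative)
b = A * xm;                        % so that A*xm = b
[V, ~] = eig(A);
v = V(:, 1);                       % an eigenvector of A
x0 = xm + v;
g = A * x0 - b;                    % gradient at x0, equal to lambda * v
alpha = (g' * g) / (g' * A * g);   % exact line search step (= 1/lambda)
x1 = x0 - alpha * g;
disp(norm(x1 - xm));               % ~0 up to rounding error
```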
## Point 2
The right answer is choice (a), since the energy norm of the error indeed always
decreases monotonically.
To prove that this is true, we first consider a way to express any iterate $x_k$
as a function of the minimizer $x_s$ and of the remaining iterations:
$$x_k = x_s + \sum_{i=k}^{N} \alpha_i A^i p_0$$
This formula makes use of the fact that the step directions in CG are all
A-orthogonal to each other, so the $k$-th search direction $p_k$ is equal to
$A^k p_0$, where $p_0 = -r_0$ and $r_0$ is the first residual.
Given this definition of the iterates, we can express the error $e_k$ after
iteration $k$ in a similar fashion:
$$e_k = x_k - x_s = \sum_{i=k}^{N} \alpha_i A^i p_0$$
We then recall the definition of the energy norm $\|e_k\|_A$:
$$\|e_k\|_A = \sqrt{\langle Ae_k, e_k \rangle}$$
We then want to show that $\|e_k\|_A = \|x_k - x_s\|_A > \|e_{k+1}\|_A$, which
in turn is equivalent to claiming that:
$$\langle Ae_k, e_k \rangle > \langle Ae_{k+1}, e_{k+1} \rangle$$
Knowing that the dot product is linear in either of its arguments, we pull
out the term related to the $k$-th step (i.e. the first term in the sum that
makes up $e_k$) from both arguments of $\langle Ae_k, e_k \rangle$,
obtaining the following:
$$\langle Ae_{k+1}, e_{k+1} \rangle + \langle \alpha_k A^{k+1} p_0, e_k \rangle
+ \langle Ae_{k+1},\alpha_k A^k p_0 \rangle > \langle Ae_{k+1}, e_{k+1}
\rangle$$
which in turn is equivalent to claiming that:
$$\langle \alpha_k A^{k+1} p_0, e_k \rangle
+ \langle Ae_{k+1},\alpha_k A^k p_0 \rangle > 0$$
From this expression we can collect the term $\alpha_k$, thanks to the linearity
of the dot product:
$$\alpha_k (\langle A^{k+1} p_0, e_k \rangle
+ \langle Ae_{k+1}, A^k p_0 \rangle) > 0$$
and we can further "ignore" the $\alpha_k$ term since we know that all
$\alpha_i$s are positive by definition:
$$\langle A^{k+1} p_0, e_k \rangle
+ \langle Ae_{k+1}, A^k p_0 \rangle > 0$$
Then, we convert the dot products into their equivalent vector-product form,
and we plug in the definitions of $e_k$ and $e_{k+1}$:
$$p_0^T (A^{k+1})^T (\sum_{i=k}^{N} \alpha_i A^i p_0) +
p_0^T (A^{k})^T A (\sum_{i=k+1}^{N} \alpha_i A^i p_0) > 0$$
We then distribute the products over the sums, thanks to the linearity of the
matrix-vector product:
$$\sum_{i=k}^N (p_0^T (A^{k+1})^T A^i p_0) \alpha_i + \sum_{i=k+1}^N
(p_0^T (A^{k})^T A^{i+1} p_0) \alpha_i > 0$$
We can then, as before, "ignore" all the $\alpha_i$ terms, since we know by
definition that they are all strictly positive. We also recall that $A$ is
assumed to be symmetric, so $A^T = A$. In the end, we have to show that these
two inequalities hold:
$$p_0^T A^{k+1+i} p_0 > 0 \; \forall i \in [k,N]$$
$$p_0^T A^{k+i+1} p_0 > 0 \; \forall i \in [k+1,N]$$
To show that these inequalities are indeed true, we recall that $A$ is symmetric
and positive definite. If a matrix $A$ is SPD, then $A^i$ for any positive
integer $i$ is also SPD[^1] (indeed, writing $A = Q \Lambda Q^T$ with $\Lambda$
diagonal and positive gives $A^i = Q \Lambda^i Q^T$, which is again SPD).
Therefore, both inequalities are trivially true by the definition of positive
definite matrices.
[^1]: source: [Wikipedia - Definite Matrix $\to$ Properties $\to$
Multiplication](
https://en.wikipedia.org/wiki/Definite_matrix#Multiplication)
Thanks to this, we have proven that the difference $\|e_k\|_A - \|e_{k+1}\|_A$
is indeed positive, and thus the energy norm of the error decreases
monotonically as $k$ increases.
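To illustrate this result numerically, the following sketch (an arbitrary SPD system and
a plain textbook CG loop, not the course implementation) prints the energy norm of the
error at every iteration, which decreases monotonically as proven above:

```matlab
% Monotone decrease of the energy norm ||e_k||_A of the error in CG.
n = 20;
R = randn(n);
A = R' * R + n * eye(n);        % arbitrary SPD matrix (illustrative)
b = randn(n, 1);
x_s = A \ b;                    % exact solution / minimizer

x = zeros(n, 1);                % x_0
r = b - A * x;                  % initial residual
p = r;                          % first search direction (minus the gradient at x_0)
for k = 1:n
    Ap = A * p;
    alpha = (r' * r) / (p' * Ap);
    x = x + alpha * p;
    r_new = r - alpha * Ap;
    beta = (r_new' * r_new) / (r' * r);
    p = r_new + beta * p;
    r = r_new;
    e = x - x_s;
    % abs() guards against tiny negative values due to roundoff
    fprintf('k = %2d, ||e_k||_A = %g\n', k, sqrt(abs(e' * A * e)));
    if norm(r) < 1e-12          % stop once (numerically) converged
        break
    end
end
```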