midterm: done 1
This commit is contained in: parent 0220738cc6, commit 87746f945f
7 changed files with 137 additions and 182 deletions
Binary file not shown.
@@ -48,6 +48,13 @@ activated uses a fixed $\alpha=1$ despite the indications on the assignment
sheet. This was done in order to comply with the forum post on iCorsi found
here: \url{https://www.icorsi.ch/mod/forum/discuss.php?d=81144}.

Here is a plot of the Rosenbrock function in 3D, with our starting point in red ($(0,0)$)
and the true minimizer in black ($(1,1)$):

\begin{center}
\resizebox{0.6\textwidth}{0.6\textwidth}{\includegraphics{rosenb.jpg}}
\end{center}

\subsection{Exercise 1.2}

Please consult the MATLAB implementation in the file \texttt{main.m} in section
@@ -131,6 +138,12 @@ zig-zagging pattern with iterates in the vicinity of the Newton + backtracking
iterates. Finally, GD without backtracking quickly degenerates, as can be
seen in the enlarged plot.

From the initial plot of the Rosenbrock function, we can see why the algorithms using
backtracking roughly follow the $(0,0) - (1,1)$ diagonal: this region is effectively a
``valley'' in the 3D energy landscape, and due to the nature of the backtracking methods and their
strict adherence to the Wolfe conditions these methods avoid wild ``climbs'' (unlike the Newton method without backtracking)
and instead find iterates with either a sufficient decrease or only a modest increase.
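
As a reference, the following is a minimal sketch of the kind of backtracking rule assumed
in this discussion (step halving until the Armijo sufficient decrease condition holds); the
function name and the constants \texttt{c} and \texttt{rho} are illustrative and not
necessarily those used in \texttt{main.m}:

\begin{verbatim}
function alpha = backtracking(f, g, x, p, c, rho)
% Backtracking line search: shrink alpha until the Armijo
% (sufficient decrease) condition is satisfied.
%   f   - objective handle, f(x) returns a scalar
%   g   - gradient of f at x (column vector)
%   x   - current iterate
%   p   - descent direction (e.g. -g for GD)
%   c   - sufficient decrease constant, e.g. 1e-4
%   rho - shrinking factor, e.g. 0.5
alpha = 1;
while f(x + alpha * p) > f(x) + c * alpha * (g' * p)
    alpha = rho * alpha;
end
end
\end{verbatim}

A plain backtracking search of this kind enforces only the sufficient decrease (first Wolfe)
condition; checking the curvature condition as well would additionally require evaluating the
gradient at each trial point.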

When looking at gradient norms and objective function
values (figure \ref{fig:gsppn}) over time, the degeneration of GD without
backtracking and the inefficiency of GD with backtracking can clearly be seen.
@@ -141,6 +154,16 @@ at gradient $\nabla f(x_1) \approx 450$ and objective value $f(x_1) \approx
100$, but quickly has both values decrease to 0 at its second iteration,
achieving convergence.

What has been observed in this assignment matches the theory behind the methods.
Since the Newton method has quadratic convergence and uses second-order information (i.e.
the Hessian in the step direction calculation), the number of iterations required
to find the minimizer when already close to it (as we are with $x_0 = [0;0]$) is significantly
lower than the number required by linearly convergent methods such as gradient descent. However, it must
be said that for an objective with a high number of dimensions a single iteration of a second-order
method is significantly more costly than a single iteration of a first-order method, due to the
quadratically growing number of entries in the Hessian matrix, which becomes increasingly expensive
to compute as the number of dimensions increases.
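
To illustrate this trade-off, here is a minimal sketch (not part of \texttt{main.m}; the step
size \texttt{alpha} is an arbitrary illustrative value) of the per-iteration work of the two
methods on the Rosenbrock function: the gradient step only needs $\nabla f(x)$, while the
Newton step also builds the Hessian and solves a linear system with it:

\begin{verbatim}
% Sketch: one GD step vs. one Newton step on the Rosenbrock function.
syms x y;
f = (1 - x)^2 + 100 * (y - x^2)^2;
g = matlabFunction(gradient(f, [x; y]));   % gradient handle
H = matlabFunction(hessian(f, [x; y]));    % Hessian handle

x0 = [0; 0];
alpha = 0.01;                              % illustrative fixed step size
x_gd     = x0 - alpha * g(x0(1), x0(2));                % O(n) update
x_newton = x0 - H(x0(1), x0(2)) \ g(x0(1), x0(2));      % O(n^3) solve
\end{verbatim}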

\section{Exercise 2}

\subsection{Exercise 2.1}
@@ -212,4 +235,10 @@ the opposite direction of the minimizer. This behaviour can also be observed in
the plots in figure \ref{fig:4}, where several bumps and spikes are present in
the gradient norm plot and small plateaus can be found in the objective
function value plot.

Comparing these results with the theory behind BFGS, we can say that the results we have
obtained fall within what we expect from the theory. Since BFGS is a superlinearly but not
quadratically convergent method, its ``speed'' in terms of number of iterations falls between
that of linear methods (like GD) and that of quadratic methods (like Newton).
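
For reference, the core of BFGS is the update of the inverse Hessian approximation $H_k$;
the following is a minimal sketch of one such update using the standard formula, which is
not necessarily the exact code inside our \texttt{BGFS} implementation:

\begin{verbatim}
function [H, p] = bfgs_update(H, s, y, grad)
% One BFGS update of the inverse Hessian approximation H, followed
% by the computation of the next search direction p.
%   s    = x_{k+1} - x_k
%   y    = grad f(x_{k+1}) - grad f(x_k)
%   grad = grad f(x_{k+1})
% H stays symmetric positive definite as long as s' * y > 0,
% which is guaranteed by the Wolfe conditions.
rho = 1 / (y' * s);
I   = eye(length(s));
H   = (I - rho * (s * y')) * H * (I - rho * (y * s')) + rho * (s * s');
p   = -H * grad;
end
\end{verbatim}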
\end{document}
@@ -1,176 +0,0 @@
%% Homework 3 - Optimization Methods
% Author: Claudio Maggioni
%
% Note: exercises are not in the right order due to matlab constraints of
% functions inside of scripts.

clc
clear
close all

% Set to non-zero to generate LaTeX for graphs
enable_m2tikz = 0;
if enable_m2tikz
    addpath /home/claudio/git/matlab2tikz/src
else
    matlab2tikz = @(a,b,c) 0;
end

syms x y;
f = (1 - x)^2 + 100 * (y - x^2)^2;
global fl
fl = matlabFunction(f);

%% 1.3 - Newton and GD solutions and energy plot

[x1, xs1, gs1] = Newton(f, [0;0], 50000, 1e-6, true);
plot(xs1(1, :), xs1(2, :), 'Marker', '.');
fprintf("Newton backtracking: it=%d\n", size(xs1, 2)-1);
xlim([-0.01 1.01])
ylim([-0.01 1.01])
hold on;
[x2, xs2, gs2] = Newton(f, [0;0], 50000, 1e-6, false);
plot(xs2(1, :), xs2(2, :), 'Marker', '.');
fprintf("Newton: it=%d\n", size(xs2, 2)-1);

[x3, xs3, gs3] = GD(f, [0;0], 50000, 1e-6, true);
plot(xs3(1, :), xs3(2, :), 'Marker', '.');
fprintf("GD backtracking: it=%d\n", size(xs3, 2)-1);

[x4, xs4, gs4] = GD(f, [0;0], 50000, 1e-6, false);
plot(xs4(1, :), xs4(2, :), 'Marker', '.');
fprintf("GD: it=%d\n", size(xs4, 2)-1);
hold off;
legend('Newton + backtracking', 'Newton', 'GD + backtracking', 'GD (alpha=1)')
sgtitle("Iterations of Newton and Gradient descent methods over 2D energy landscape");
matlab2tikz('showInfo', false, './ex1-3.tex');

figure;
plot(xs4(1, :), xs4(2, :), 'Marker', '.');
legend('GD (alpha=1)')
sgtitle("Iterations of Newton and Gradient descent methods over 2D energy landscape");
matlab2tikz('showInfo', false, './ex1-3-large.tex');
%% 2.3 - BGFS solution and energy plot

figure;
[x5, xs5, gs5] = BGFS(f, [0;0], eye(2), 50000, 1e-6);
plot(xs5(1, :), xs5(2, :), 'Marker', '.');
% set axis limits after plotting so they are not reset by plot()
xlim([-0.01 1.01])
ylim([-0.01 1.01])
fprintf("BGFS backtracking: it=%d\n", size(xs5, 2)-1);

sgtitle("Iterations of BGFS method over 2D energy landscape");
matlab2tikz('showInfo', false, './ex2-3.tex');
%% 1.4 - Newton and GD gradient norm log

figure;
semilogy(0:size(xs1, 2)-1, gs1, 'Marker', '.');
ylim([5e-10, 1e12]);
xlim([-1, 30]);
hold on
semilogy(0:size(xs2, 2)-1, gs2, 'Marker', '.');
semilogy(0:size(xs3, 2)-1, gs3, 'Marker', '.');
semilogy(0:size(xs4, 2)-1, gs4, 'Marker', '.');
hold off

legend('Newton + backtracking', 'Newton', 'GD + backtracking', 'GD (alpha=1)')
sgtitle("Gradient norm w.r.t. iteration number for Newton and GD methods");
matlab2tikz('showInfo', false, './ex1-4-grad.tex');

figure;
plot(0:size(xs1, 2)-1, gs1, 'Marker', '.');
ylim([5e-10, 25]);
xlim([-1, 30]);
hold on
plot(0:size(xs2, 2)-1, gs2, 'Marker', '.');
plot(0:size(xs3, 2)-1, gs3, 'Marker', '.');
plot(0:size(xs4, 2)-1, gs4, 'Marker', '.');
hold off

legend('Newton + backtracking', 'Newton', 'GD + backtracking', 'GD (alpha=1)')
sgtitle("Gradient norm w.r.t. iteration number for Newton and GD methods");
matlab2tikz('showInfo', false, './ex1-4-grad.tex');

figure;
semilogy(0:size(xs3, 2)-1, gs3, 'Marker', '.');
ylim([1e-7, 1e10]);
hold on
semilogy(0:size(xs4, 2)-1, gs4, 'Marker', '.');
hold off

legend('GD + backtracking', 'GD (alpha=1)')
sgtitle("Gradient norm w.r.t. iteration number for Newton and GD methods");
matlab2tikz('showInfo', false, './ex1-4-grad-large.tex');

figure;
ys1 = funvalues(xs1);
ys2 = funvalues(xs2);
ys3 = funvalues(xs3);
ys4 = funvalues(xs4);
ys5 = funvalues(xs5);
semilogy(0:size(xs1, 2)-1, ys1, 'Marker', '.');
ylim([5e-19, 1e12]);
xlim([-1, 30]);
hold on
semilogy(0:size(xs2, 2)-1, ys2, 'Marker', '.');
semilogy(0:size(xs3, 2)-1, ys3, 'Marker', '.');
semilogy(0:size(xs4, 2)-1, ys4, 'Marker', '.');
hold off

legend('Newton + backtracking', 'Newton', 'GD + backtracking', 'GD (alpha=1)')
sgtitle("Objective function value w.r.t. iteration number for Newton and GD methods");
matlab2tikz('showInfo', false, './ex1-4-ys.tex');

figure;
ys1 = funvalues(xs1);
ys2 = funvalues(xs2);
ys3 = funvalues(xs3);
ys4 = funvalues(xs4);
ys5 = funvalues(xs5);
semilogy(0:size(xs1, 2)-1, ys1, 'Marker', '.');
ylim([5e-19, 20]);
xlim([-1, 30]);
hold on
semilogy(0:size(xs2, 2)-1, ys2, 'Marker', '.');
semilogy(0:size(xs3, 2)-1, ys3, 'Marker', '.');
semilogy(0:size(xs4, 2)-1, ys4, 'Marker', '.');
hold off

legend('Newton + backtracking', 'Newton', 'GD + backtracking', 'GD (alpha=1)')
sgtitle("Objective function value w.r.t. iteration number for Newton and GD methods");
matlab2tikz('showInfo', false, './ex1-4-ys.tex');

figure;
semilogy(0:size(xs3, 2)-1, ys3, 'Marker', '.');
hold on
semilogy(0:size(xs4, 2)-1, ys4, 'Marker', '.');
hold off

legend('GD + backtracking', 'GD (alpha=1)')
sgtitle("Objective function value w.r.t. iteration number for Newton and GD methods");
matlab2tikz('showInfo', false, './ex1-4-ys-large.tex');
%% 2.4 - BGFS gradient norms plot

figure;
semilogy(0:size(xs5, 2)-1, gs5, 'Marker', '.');

sgtitle("Gradient norm w.r.t. iteration number for BGFS method");
matlab2tikz('showInfo', false, './ex2-4-grad.tex');

%% 2.4 - BGFS objective values plot

figure;
semilogy(0:size(xs5, 2)-1, ys5, 'Marker', '.');

sgtitle("Objective function value w.r.t. iteration number for BGFS method");
matlab2tikz('showInfo', false, './ex2-4-ys.tex');

function ys = funvalues(xs)
    % Evaluate the Rosenbrock function handle fl at every iterate (column of xs)
    ys = zeros(1, size(xs, 2));
    global fl
    for i = 1:size(xs, 2)
        ys(i) = fl(xs(1,i), xs(2,i));
    end
end
@@ -6,7 +6,6 @@
clc
clear
close all

% Set to non-zero to generate LaTeX for graphs
enable_m2tikz = 0;
@@ -16,11 +15,31 @@ else
    matlab2tikz = @(a,b,c) 0;
end

%% 1.1 - Rosenbrock function definition and surf plot
close all
syms x y;
f = (1 - x)^2 + 100 * (y - x^2)^2;
global fl
fl = matlabFunction(f);

xs = -0.2:0.01:1.2;
Z = zeros(size(xs,2), size(xs,2));

for x = 1:size(xs, 2)
    for y = 1:size(xs, 2)
        % rows of Z index the y coordinate so that surf(xs, xs, Z)
        % shows f(x, y) with the correct orientation
        Z(y, x) = fl(xs(x), xs(y));
    end
end

surf(xs, xs, Z, 'EdgeColor', 'none', 'FaceAlpha', 0.4);
hold on
plot3([0,1],[0,1],[fl(0,0),fl(1,1)],'-r.');
plot3([1],[1],[fl(1,1)],'-k.');
hold off

figure;


%% 1.2 - Minimizing the Rosenbrock function

[x1, xs1, gs1] = Newton(f, [0;0], 50000, 1e-6, true);
BIN  Claudio_Maggioni_3/rosenb.jpg (new file, 56 KiB)
Binary file not shown.
@@ -18,12 +18,12 @@ header-includes:
### Question (a)

As already covered in the course, the gradient of a standard quadratic form at a
point $x_0$ is equal to:

$$ \nabla f(x_0) = A x_0 - b $$
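
For completeness, the quadratic form assumed here (with $A$ symmetric), from which the
gradient identity above follows by direct differentiation, is:

$$ f(x) = \frac{1}{2} x^T A x - b^T x, \qquad \nabla f(x) = A x - b $$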

Plugging in the definition of $x_0$ and knowing that $\nabla f(x_m) = A x_m - b
= 0$ (according to the first necessary condition for a minimizer), we obtain:

$$ \nabla f(x_0) = A (x_m + v) - b = A x_m + A v - b = b + \lambda v - b =
\lambda v $$
@@ -52,7 +52,90 @@ $\alpha \neq \frac{1}{\lambda}$.
Therefore, since $x_1 = x_m$, we have proven SD
converges to the minimizer in one iteration.
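
As a quick numerical sanity check (a sketch, not required by the exercise; the matrix `A`
and vector `b` below are arbitrary example values), one can verify both the identity
$\nabla f(x_0) = \lambda v$ and the one-iteration convergence of SD with an exact line search:

```matlab
% Sanity check: gradient at xm + v is lambda*v, and one exact-line-search
% SD step from x0 = xm + v recovers the minimizer xm.
A  = [3 1; 1 2];            % small SPD example matrix (assumed)
b  = [1; 1];
xm = A \ b;                 % minimizer of f(x) = 0.5*x'*A*x - b'*x
[V, D] = eig(A);
v = V(:, 1); lambda = D(1, 1);
x0 = xm + v;
g0 = A * x0 - b;            % equals lambda * v (up to round-off)
alpha = (g0' * g0) / (g0' * A * g0);   % exact line search step = 1/lambda
x1 = x0 - alpha * g0;       % equals xm (up to round-off)
disp(norm(g0 - lambda * v));
disp(norm(x1 - xm));
```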

## Point 2

The right answer is choice (a), since the energy norm of the error indeed always
decreases monotonically.

To prove that this is true, we first consider a way to express any iterate $x_k$
as a function of the minimizer $x_s$ and of the remaining iterations:

$$x_k = x_s + \sum_{i=k}^{N} \alpha_i A^i p_0$$

This formula makes use of the fact that the step directions in CG are all
A-orthogonal to each other, so the k-th search direction $p_k$ is equal to
$A^k p_0$, where $p_0 = -r_0$ and $r_0$ is the first residual.

Given that definition of the iterates, we are able to express the error $e_k$ after
iteration $k$ in a similar fashion:

$$e_k = x_k - x_s = \sum_{i=k}^{N} \alpha_i A^i p_0$$

We then recall the definition of the energy norm $\|e_k\|_A$:

$$\|e_k\|_A = \sqrt{\langle Ae_k, e_k \rangle}$$

We then want to show that $\|e_k\|_A = \|x_k - x_s\|_A > \|e_{k+1}\|_A$, which
in turn is equivalent to claiming that:

$$\langle Ae_k, e_k \rangle > \langle Ae_{k+1}, e_{k+1} \rangle$$

Knowing that the dot product is linear w.r.t. either of its arguments, we pull
out the sum term related to the $k$-th step (i.e. the first term in the sum that
makes up $e_k$) from both arguments of $\langle Ae_k, e_k \rangle$,
obtaining the following:

$$\langle Ae_{k+1}, e_{k+1} \rangle + \langle \alpha_k A^{k+1} p_0, e_k \rangle
+ \langle Ae_{k+1},\alpha_k A^k p_0 \rangle > \langle Ae_{k+1}, e_{k+1}
\rangle$$

which in turn is equivalent to claiming that:

$$\langle \alpha_k A^{k+1} p_0, e_k \rangle
+ \langle Ae_{k+1},\alpha_k A^k p_0 \rangle > 0$$

From this expression we can collect the term $\alpha_k$ thanks to the linearity of the
dot product:

$$\alpha_k (\langle A^{k+1} p_0, e_k \rangle
+ \langle Ae_{k+1}, A^k p_0 \rangle) > 0$$

and we can further "ignore" the $\alpha_k$ term since we know that all
$\alpha_i$s are positive by definition:

$$\langle A^{k+1} p_0, e_k \rangle
+ \langle Ae_{k+1}, A^k p_0 \rangle > 0$$

Then, we convert the dot products into their equivalent vector-vector product
form, and we plug in the definitions of $e_k$ and $e_{k+1}$:

$$p_0^T (A^{k+1})^T (\sum_{i=k}^{N} \alpha_i A^i p_0) +
p_0^T (A^{k})^T (\sum_{i=k+1}^{N} \alpha_i A^i p_0) > 0$$

We then pull the sum out of the products, thanks to the associativity of the vector
products:

$$\sum_{i=k}^N (p_0^T (A^{k+1})^T A^i p_0) \alpha_i + \sum_{i=k+1}^N
(p_0^T (A^{k})^T A^i p_0) \alpha_i > 0$$

We then, as before, can "ignore" all $\alpha_i$ terms since we know by
definition that they are all strictly positive. We also recall that $A$ is assumed to be
symmetric, so $A^T = A$. In the end we have to show that these two
inequalities are true:

$$p_0^T A^{k+1+i} p_0 > 0 \; \forall i \in [k,N]$$
$$p_0^T A^{k+i} p_0 > 0 \; \forall i \in [k+1,N]$$

To show these inequalities are indeed true, we recall that $A$ is symmetric and
positive definite. We then consider that if a matrix $A$ is SPD, then $A^i$ for
any positive $i$ is also SPD[^1]. Therefore, both inequalities are trivially
true due to the definition of positive definite matrices.

[^1]: source: [Wikipedia - Definite Matrix $\to$ Properties $\to$
Multiplication](
https://en.wikipedia.org/wiki/Definite_matrix#Multiplication)

Thanks to this we have indeed proven that the difference $\|e_k\|_A - \|e_{k+1}\|_A$
is positive, and thus, as $k$ increases, the energy norm of the error
monotonically decreases.
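
As a numerical sanity check (a sketch, not part of the exercise; the problem size and the
SPD matrix below are arbitrary), a basic CG run on a random SPD system shows that the energy
norm of the error never increases:

```matlab
% Verify the monotone decrease of the energy norm of the CG error.
rng(0);
n  = 50;
M  = randn(n);
A  = M' * M + n * eye(n);   % SPD matrix (assumed example)
b  = randn(n, 1);
xs = A \ b;                 % exact solution / minimizer

x = zeros(n, 1);
r = b - A * x;              % initial residual
p = r;                      % first search direction
enorm = zeros(1, n + 1);
enorm(1) = sqrt((x - xs)' * A * (x - xs));
for k = 1:n
    alpha = (r' * r) / (p' * A * p);
    x = x + alpha * p;
    rnew = r - alpha * (A * p);
    beta = (rnew' * rnew) / (r' * r);
    p = rnew + beta * p;
    r = rnew;
    enorm(k + 1) = sqrt((x - xs)' * A * (x - xs));
end
assert(all(diff(enorm) <= 1e-10));   % energy norm never increases
```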
Binary file not shown.