diff --git a/Claudio_Maggioni_3/Claudio_Maggioni_3.pdf b/Claudio_Maggioni_3/Claudio_Maggioni_3.pdf
index b5c8385..b3e92fa 100644
Binary files a/Claudio_Maggioni_3/Claudio_Maggioni_3.pdf and b/Claudio_Maggioni_3/Claudio_Maggioni_3.pdf differ
diff --git a/Claudio_Maggioni_3/Claudio_Maggioni_3.tex b/Claudio_Maggioni_3/Claudio_Maggioni_3.tex
index ee90c74..9d738e4 100755
--- a/Claudio_Maggioni_3/Claudio_Maggioni_3.tex
+++ b/Claudio_Maggioni_3/Claudio_Maggioni_3.tex
@@ -48,6 +48,13 @@ activated uses a fixed $\alpha=1$ despite the indications on the assignment
 sheet. This was done in order to comply with the forum post on iCorsi found
 here: \url{https://www.icorsi.ch/mod/forum/discuss.php?d=81144}.
 
+Here is a 3D plot of the Rosenbrock function, with our starting point $(0,0)$ shown in red
+and the true minimizer $(1,1)$ shown in black:
+
+\begin{center}
+  \resizebox{0.6\textwidth}{0.6\textwidth}{\includegraphics{rosenb.jpg}}
+\end{center}
+
 \subsection{Exercise 1.2}
 
 Please consult the MATLAB implementation in the file \texttt{main.m} in section
@@ -131,6 +138,12 @@ zig-zagging pattern with iterates in the vicinity of the Newton + backtracking
 iterates. Finally, GD without backtracking quickly degenerates as it can be
 seen by the enlarged plot.
 
+The initial plot of the Rosenbrock function shows why the algorithms using backtracking
+roughly follow the $(0,0) - (1,1)$ diagonal: this region is effectively a ``valley'' in the
+3D energy landscape and, thanks to their strict adherence to the Wolfe conditions, the
+backtracking methods avoid wild ``climbs'' (unlike the Newton method without backtracking)
+and instead pick iterates with either a sufficient decrease or only a modest increase.
+
 When looking at gradient norms and objective function values (figure
 \ref{fig:gsppn}) over time, the degeneration of GD without backtracking and
 the inefficiency of GD with backtracking can clearly be seen.
@@ -141,6 +154,16 @@ at gradient $\nabla f(x_1) \approx 450$ and objective value $f(x_1) \approx
 100$, but quickly has both values decrease to 0 for its second iteration
 achieving convergence.
 
+What has been observed in this assignment matches the theory behind the methods.
+Since the Newton method converges quadratically and uses second-order information
+(i.e. the Hessian in the computation of the step direction), the number of iterations
+required to find the minimizer when starting close to it (as we do with $x_0 = [0;0]$)
+is significantly smaller than the number required by linearly convergent methods such
+as gradient descent. However, for an objective with a high number of dimensions a
+single Newton iteration is significantly more costly than a single gradient descent
+iteration, since the number of entries in the Hessian grows quadratically with the
+dimension, making it increasingly expensive to compute.
+
 \section{Exercise 2}
 
 \subsection{Exercise 2.1}
@@ -212,4 +235,10 @@ the opposite direction of the minimizer.
 This behaviour can also be observed in the plots in figure \ref{fig:4}, where
 several bumps and spikes are present in the gradient norm plot and small
 plateaus can be found in the objective function value plot.
+
+Comparing these results with the theory behind BFGS, what we have observed falls within
+what the theory predicts. Since BFGS converges superlinearly but not quadratically, its
+``speed'' in terms of number of iterations falls between that of linearly convergent
+methods (like GD) and that of quadratically convergent methods (like Newton).
+
 \end{document}
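The report above appeals to the sufficient decrease (first Wolfe) condition enforced by backtracking. As a purely illustrative sketch of that strategy, and not the implementation actually used by the repository's Newton, GD and BGFS routines, a minimal Armijo backtracking line search in MATLAB could look as follows (the names backtracking_sketch, alpha0, rho and c1 are assumptions for this sketch, not names taken from the code):

% Minimal Armijo backtracking line search (illustrative sketch only).
% f is a function handle for the objective, g the gradient of f at x
% (a column vector), x the current iterate and p a descent direction.
function alpha = backtracking_sketch(f, g, x, p, alpha0, rho, c1)
    alpha = alpha0;
    % Shrink the step until the sufficient decrease condition holds.
    while f(x + alpha * p) > f(x) + c1 * alpha * (g' * p)
        alpha = rho * alpha;   % e.g. rho = 0.5 halves the step each time
    end
end

With alpha0 = 1, this accepts the full Newton or gradient step whenever that step already provides sufficient decrease, which is consistent with the behaviour described in the report.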
diff --git a/Claudio_Maggioni_3/main.asv b/Claudio_Maggioni_3/main.asv
deleted file mode 100644
index dc80f97..0000000
--- a/Claudio_Maggioni_3/main.asv
+++ /dev/null
@@ -1,176 +0,0 @@
-%% Homework 3 - Optimization Methods
-% Author: Claudio Maggioni
-%
-% Note: exercises are not in the right order due to matlab constraints of
-% functions inside of scripts.
-
-clc
-clear
-close all
-
-% Set to non-zero to generate LaTeX for graphs
-enable_m2tikz = 0;
-if enable_m2tikz
-    addpath /home/claudio/git/matlab2tikz/src
-else
-    matlab2tikz = @(a,b,c) 0;
-end
-
-syms x y;
-f = (1 - x)^2 + 100 * (y - x^2)^2;
-global fl
-fl = matlabFunction(f);
-
-%% 1.3 - Newton and GD solutions and energy plot
-
-[x1, xs1, gs1] = Newton(f, [0;0], 50000, 1e-6, true);
-plot(xs1(1, :), xs1(2, :), 'Marker', '.');
-fprintf("Newton backtracking: it=%d\n", size(xs1, 2)-1);
-xlim([-0.01 1.01])
-ylim([-0.01 1.01])
-hold on;
-[x2, xs2, gs2] = Newton(f, [0;0], 50000, 1e-6, false);
-plot(xs2(1, :), xs2(2, :), 'Marker', '.');
-fprintf("Newton: it=%d\n", size(xs2, 2)-1);
-
-[x3, xs3, gs3] = GD(f, [0;0], 50000, 1e-6, true);
-plot(xs3(1, :), xs3(2, :), 'Marker', '.');
-fprintf("GD backtracking: it=%d\n", size(xs3, 2)-1);
-
-[x4, xs4, gs4] = GD(f, [0;0], 50000, 1e-6, false);
-plot(xs4(1, :), xs4(2, :), 'Marker', '.');
-fprintf("GD: it=%d\n", size(xs4, 2)-1);
-hold off;
-legend('Newton + backtracking', 'Newton', 'GD + backtracking', 'GD (alpha=1)')
-sgtitle("Iterations of Newton and Gradient descent methods over 2D energy landscape");
-matlab2tikz('showInfo', false, './ex1-3.tex');
-
-figure;
-plot(xs4(1, :), xs4(2, :), 'Marker', '.');
-legend('GD (alpha=1)')
-sgtitle("Iterations of Newton and Gradient descent methods over 2D energy landscape");
-matlab2tikz('showInfo', false, './ex1-3-large.tex');
-
-%% 2.3 - BGFS solution and energy plot
-
-figure;
-[x5, xs5, gs5] = BGFS(f, [0;0], eye(2), 50000, 1e-6);
-xlim([-0.01 1.01])
-ylim([-0.01 1.01])
-plot(xs5(1, :), xs5(2, :), 'Marker', '.');
-fprintf("BGFS backtracking: it=%d\n", size(xs5, 2)-1);
-
-sgtitle("Iterations of BGFS method over 2D energy landscape");
-matlab2tikz('showInfo', false, './ex2-3.tex');
-
-%% 1.4 - Newton and GD gradient norm log
-
-figure;
-semilogy(0:size(xs1, 2)-1, gs1, 'Marker', '.');
-ylim([5e-10, 1e12]);
-xlim([-1, 30]);
-hold on
-semilogy(0:size(xs2, 2)-1, gs2, 'Marker', '.');
-semilogy(0:size(xs3, 2)-1, gs3, 'Marker', '.');
-semilogy(0:size(xs4, 2)-1, gs4, 'Marker', '.');
-hold off
-
-legend('Newton + backtracking', 'Newton', 'GD + backtracking', 'GD (alpha=1)')
-sgtitle("Gradient norm w.r.t. iteration number for Newton and GD methods");
-matlab2tikz('showInfo', false, './ex1-4-grad.tex');
-
-
-figure;
-plot(0:size(xs1, 2)-1, gs1, 'Marker', '.');
-ylim([5e-10, 25]);
-xlim([-1, 30]);
-hold on
-plot(0:size(xs2, 2)-1, gs2, 'Marker', '.');
-plot(0:size(xs3, 2)-1, gs3, 'Marker', '.');
-plot(0:size(xs4, 2)-1, gs4, 'Marker', '.');
-hold off
-
-legend('Newton + backtracking', 'Newton', 'GD + backtracking', 'GD (alpha=1)')
-sgtitle("Gradient norm w.r.t. 
iteration number for Newton and GD methods"); -matlab2tikz('showInfo', false, './ex1-4-grad.tex'); - -figure; -semilogy(0:size(xs3, 2)-1, gs3, 'Marker', '.'); -ylim([1e-7, 1e10]); -hold on -semilogy(0:size(xs4, 2)-1, gs4, 'Marker', '.'); -hold off - -legend('GD + backtracking', 'GD (alpha=1)') -sgtitle("Gradient norm w.r.t. iteration number for Newton and GD methods"); -matlab2tikz('showInfo', false, './ex1-4-grad-large.tex'); - -figure; -ys1 = funvalues(xs1); -ys2 = funvalues(xs2); -ys3 = funvalues(xs3); -ys4 = funvalues(xs4); -ys5 = funvalues(xs5); -semilogy(0:size(xs1, 2)-1, ys1, 'Marker', '.'); -ylim([5e-19, 1e12]); -xlim([-1, 30]); -hold on -semilogy(0:size(xs2, 2)-1, ys2, 'Marker', '.'); -semilogy(0:size(xs3, 2)-1, ys3, 'Marker', '.'); -semilogy(0:size(xs4, 2)-1, ys4, 'Marker', '.'); -hold off - -legend('Newton + backtracking', 'Newton', 'GD + backtracking', 'GD (alpha=1)') -sgtitle("Objective function value w.r.t. iteration number for Newton and GD methods"); -matlab2tikz('showInfo', false, './ex1-4-ys.tex'); - -ys1 = funvalues(xs1); -ys2 = funvalues(xs2); -ys3 = funvalues(xs3); -ys4 = funvalues(xs4); -ys5 = funvalues(xs5); -semilogy(0:size(xs1, 2)-1, ys1, 'Marker', '.'); -ylim([5e-19, 20]); -xlim([-1, 30]); -hold on -semilogy(0:size(xs2, 2)-1, ys2, 'Marker', '.'); -semilogy(0:size(xs3, 2)-1, ys3, 'Marker', '.'); -semilogy(0:size(xs4, 2)-1, ys4, 'Marker', '.'); -hold off - -legend('Newton + backtracking', 'Newton', 'GD + backtracking', 'GD (alpha=1)') -sgtitle("Objective function value w.r.t. iteration number for Newton and GD methods"); -matlab2tikz('showInfo', false, './ex1-4-ys.tex'); - -figure; - -semilogy(0:size(xs3, 2)-1, ys3, 'Marker', '.'); -semilogy(0:size(xs4, 2)-1, ys4, 'Marker', '.'); - -legend('GD + backtracking', 'GD (alpha=1)') -sgtitle("Objective function value w.r.t. iteration number for Newton and GD methods"); -matlab2tikz('showInfo', false, './ex1-4-ys-large.tex'); - -%% 2.4 - BGFS gradient norms plot - -figure; -semilogy(0:size(xs5, 2)-1, gs5, 'Marker', '.'); - -sgtitle("Gradient norm w.r.t. iteration number for BGFS method"); -matlab2tikz('showInfo', false, './ex2-4-grad.tex'); - -%% 2.4 - BGFS objective values plot - -figure; -semilogy(0:size(xs5, 2)-1, ys5, 'Marker', '.'); - -sgtitle("Objective function value w.r.t. 
iteration number for BGFS methods");
-matlab2tikz('showInfo', false, './ex2-4-ys.tex');
-
-function ys = funvalues(xs)
-    ys = zeros(1, size(xs, 2));
-    global fl
-    for i = 1:size(xs, 2)
-        ys(i) = fl(xs(1,i), xs(2,i));
-    end
-end
diff --git a/Claudio_Maggioni_3/main.m b/Claudio_Maggioni_3/main.m
index 3a4273c..47e48d9 100644
--- a/Claudio_Maggioni_3/main.m
+++ b/Claudio_Maggioni_3/main.m
@@ -6,7 +6,6 @@
 
 clc
 clear
-close all
 
 % Set to non-zero to generate LaTeX for graphs
 enable_m2tikz = 0;
@@ -15,12 +14,32 @@ if enable_m2tikz
 else
     matlab2tikz = @(a,b,c) 0;
 end
-
+
+%% 1.1 - Rosenbrock function definition and surf plot
+close all
 syms x y;
 f = (1 - x)^2 + 100 * (y - x^2)^2;
 global fl
 fl = matlabFunction(f);
 
+xs = -0.2:0.01:1.2;
+Z = zeros(size(xs,2), size(xs,2));
+
+% rows of Z correspond to the y coordinate, columns to x (surf convention)
+for x = 1:size(xs, 2)
+    for y = 1:size(xs, 2)
+        Z(y,x) = fl(xs(x), xs(y));
+    end
+end
+
+surf(xs, xs, Z, 'EdgeColor', 'none', 'FaceAlpha', 0.4);
+hold on
+plot3([0,1],[0,1],[fl(0,0),fl(1,1)],'-r.');
+plot3([1],[1],[fl(1,1)],'-k.');
+hold off
+
+figure;
+
+
 %% 1.2 - Minimizing the Rosenbrock function
 
 [x1, xs1, gs1] = Newton(f, [0;0], 50000, 1e-6, true);
diff --git a/Claudio_Maggioni_3/rosenb.jpg b/Claudio_Maggioni_3/rosenb.jpg
new file mode 100644
index 0000000..2cfa59b
Binary files /dev/null and b/Claudio_Maggioni_3/rosenb.jpg differ
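A side note on the surface added to main.m above: since matlabFunction returns a vectorized handle, the same grid of values (the one rendered to rosenb.jpg) can also be built without explicit loops. This is only an equivalent sketch, not what is committed in main.m:

% Loop-free construction of the same surface (illustrative alternative).
[X, Y] = meshgrid(xs, xs);   % X varies along columns, Y along rows
Z = fl(X, Y);                % fl from matlabFunction evaluates elementwise
surf(xs, xs, Z, 'EdgeColor', 'none', 'FaceAlpha', 0.4);

With surf(xs, xs, Z), rows of Z index the y coordinate and columns the x coordinate, which is why the loop version above fills Z(y,x).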
diff --git a/Claudio_Maggioni_midterm/Claudio_Maggioni_midterm.md b/Claudio_Maggioni_midterm/Claudio_Maggioni_midterm.md
index c7b3d7c..2df1872 100644
--- a/Claudio_Maggioni_midterm/Claudio_Maggioni_midterm.md
+++ b/Claudio_Maggioni_midterm/Claudio_Maggioni_midterm.md
@@ -18,12 +18,12 @@ header-includes:
 ### Question (a)
 
 As already covered in the course, the gradient of a standard quadratic form at a
-point $ x_0$ is equal to:
+point $x_0$ is equal to:
 
 $$ \nabla f(x_0) = A x_0 - b $$
 
-Plugging in the definition of $x_0$ and knowing that $\nabla f(x_m) = A x_m - b = 0$
-(according to the first necessary condition for a minimizer), we obtain:
+Plugging in the definition of $x_0$ and knowing that $\nabla f(x_m) = A x_m - b
+= 0$ (according to the first necessary condition for a minimizer), we obtain:
 
 $$ \nabla f(x_0) = A (x_m + v) - b = A x_m + A v - b = b + \lambda v - b = \lambda v $$
 
@@ -52,7 +52,90 @@ $\alpha \neq \frac{1}{\lambda}$.
 
 Therefore, since $x_1 = x_m$, we have proven SD converges to the minimizer in
 one iteration.
 
-### Point 2
+## Point 2
 
 The right answer is choice (a), since the energy norm of the error indeed
 always decreases monotonically.
+
+To prove that this is true, we first consider a way to express any iterate $x_k$
+as a function of the minimizer $x_s$ and of the remaining iterations:
+
+$$x_k = x_s + \sum_{i=k}^{N} \alpha_i A^i p_0$$
+
+This formula makes use of the fact that the step directions in CG are all
+A-orthogonal to each other, so the $k$-th search direction $p_k$ is equal to
+$A^k p_0$, where $p_0 = -r_0$ and $r_0$ is the first residual.
+
+Given this definition of the iterates, we can express the error $e_k$ after
+iteration $k$ in a similar fashion:
+
+$$e_k = x_k - x_s = \sum_{i=k}^{N} \alpha_i A^i p_0$$
+
+We then recall the definition of the energy norm $\|e_k\|_A$:
+
+$$\|e_k\|_A = \sqrt{\langle Ae_k, e_k \rangle}$$
+
+We want to show that $\|e_k\|_A = \|x_k - x_s\|_A > \|e_{k+1}\|_A$, which
+in turn is equivalent to claiming that:
+
+$$\langle Ae_k, e_k \rangle > \langle Ae_{k+1}, e_{k+1} \rangle$$
+
+Knowing that the dot product is linear in each of its arguments, we pull the
+term related to the $k$-th step (i.e. the first term in the sum that makes up
+$e_k$) out of both arguments of $\langle Ae_k, e_k \rangle$, obtaining the
+following:
+
+$$\langle Ae_{k+1}, e_{k+1} \rangle + \langle \alpha_k A^{k+1} p_0, e_k \rangle
++ \langle Ae_{k+1}, \alpha_k A^k p_0 \rangle > \langle Ae_{k+1}, e_{k+1}
+\rangle$$
+
+which in turn is equivalent to claiming that:
+
+$$\langle \alpha_k A^{k+1} p_0, e_k \rangle
++ \langle Ae_{k+1}, \alpha_k A^k p_0 \rangle > 0$$
+
+From this expression we can collect the factor $\alpha_k$ thanks to the
+linearity of the dot product:
+
+$$\alpha_k (\langle A^{k+1} p_0, e_k \rangle
++ \langle Ae_{k+1}, A^k p_0 \rangle) > 0$$
+
+and we can further "ignore" the factor $\alpha_k$, since we know that all the
+$\alpha_i$ are positive by definition:
+
+$$\langle A^{k+1} p_0, e_k \rangle
++ \langle Ae_{k+1}, A^k p_0 \rangle > 0$$
+
+Then, we rewrite the dot products in their equivalent vector-vector product
+form, and we plug in the definitions of $e_k$ and $e_{k+1}$:
+
+$$p_0^T (A^{k+1})^T (\sum_{i=k}^{N} \alpha_i A^i p_0) +
+p_0^T (A^{k})^T (\sum_{i=k+1}^{N} \alpha_i A^i p_0) > 0$$
+
+We then move the sums outside of the products, thanks to the distributivity of
+the matrix-vector product:
+
+$$\sum_{i=k}^N (p_0^T (A^{k+1})^T A^i p_0) \alpha_i + \sum_{i=k+1}^N
+(p_0^T (A^{k})^T A^i p_0) \alpha_i > 0$$
+
+As before, we can "ignore" all the $\alpha_i$ factors, since we know by
+definition that they are all strictly positive. We also recall that $A$ is
+assumed to be symmetric, so $A^T = A$. In the end, it is sufficient to show
+that these two inequalities hold:
+
+$$p_0^T A^{k+1+i} p_0 > 0 \; \forall i \in [k,N]$$
+$$p_0^T A^{k+i} p_0 > 0 \; \forall i \in [k+1,N]$$
+
+To show that these inequalities are indeed true, we recall that $A$ is
+symmetric and positive definite, and that if a matrix $A$ is SPD then $A^i$ is
+also SPD for any positive integer $i$[^1]. Therefore, both inequalities hold by
+the definition of a positive definite matrix.
+
+[^1]: source: [Wikipedia - Definite Matrix $\to$ Properties $\to$
+  Multiplication](
+https://en.wikipedia.org/wiki/Definite_matrix#Multiplication)
+
+Thanks to this we have proven that the difference $\|e_k\|_A - \|e_{k+1}\|_A$
+is indeed positive, and thus that the energy norm of the error decreases
+monotonically as $k$ increases.
diff --git a/Claudio_Maggioni_midterm/Claudio_Maggioni_midterm.pdf b/Claudio_Maggioni_midterm/Claudio_Maggioni_midterm.pdf
index b262bee..4c89418 100644
Binary files a/Claudio_Maggioni_midterm/Claudio_Maggioni_midterm.pdf and b/Claudio_Maggioni_midterm/Claudio_Maggioni_midterm.pdf differ
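As a numerical illustration of the monotone decrease of the energy norm argued in Claudio_Maggioni_midterm.md above, a small self-contained MATLAB experiment along the following lines could be used. The matrix size, random seed and iteration count are arbitrary choices, and this script is not part of the submitted code:

% Sanity check: the energy norm of the CG error decreases monotonically.
% All problem data here is made up for illustration purposes.
n = 50;
rng(1);                        % fixed seed for reproducibility
M = randn(n);
A = M' * M + n * eye(n);       % symmetric positive definite by construction
b = randn(n, 1);
xstar = A \ b;                 % minimizer of 0.5*x'*A*x - b'*x
x = zeros(n, 1);
r = b - A * x;                 % initial residual
p = r;                         % first search direction
for k = 1:20
    Ap = A * p;
    alpha = (r' * r) / (p' * Ap);
    x = x + alpha * p;
    rnew = r - alpha * Ap;
    beta = (rnew' * rnew) / (r' * r);
    p = rnew + beta * p;
    r = rnew;
    e = x - xstar;
    fprintf('it %2d: energy norm of error = %.3e\n', k, sqrt(e' * A * e));
end

The printed values should decrease at every iteration, matching choice (a) in the midterm question.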