Merge branch 'master' of tea.maggioni.xyz:maggicl/bachelorThesis

report
2021-06-17 16:02:59 +02:00 · 2021-06-17 15:59:53 +02:00
2 changed files with 95 additions and 70 deletions
--- a/report/Claudio_Maggioni_report.pdf
+++ b/report/Claudio_Maggioni_report.pdf
--- a/report/Claudio_Maggioni_report.tex
+++ b/report/Claudio_Maggioni_report.tex
@@ -89,7 +89,20 @@ old analysis to understand even better the causes of failures and how to prevent
 them. Additionally, this report provides an overview of the data engineering
 techniques used to perform the queries and analyses on the 2019 traces.

-\section{State of the art}
+\subsection{Outline}
+The report is structured as follows. Section~\ref{sec2} contains information about the
+current state of the art for Google Borg cluster traces. Section~\ref{sec3}
+provides an overview, including technical background information, of the data to
+analyze and its storage format. Section~\ref{sec4} discusses the
+project requirements and the data science methods used to perform the analysis.
+Section~\ref{sec5}, Section~\ref{sec6} and Section~\ref{sec7} show the results
+obtained while analyzing, respectively, the performance input of
+unsuccessful executions, the patterns of task and job events, and the potential
+causes of unsuccessful executions. Finally, Section~\ref{sec8} contains the
+conclusions.
+
+\section{State of the art}\label{sec2}
+
 \begin{figure}[t]
 \begin{center}
 \begin{tabular}{cc}
@@ -142,7 +155,7 @@ machines.

 \input{figures/machine_configs}

-\section{Background information}
+\section{Background information}\label{sec3}

  \textit{Borg} is Google's own cluster management software able to run
  thousands of different jobs. Among the various cluster management services it
@@ -275,7 +288,7 @@ science technologies like Apache Spark were used to achieve efficient
 and parallelized computations. This approach is discussed with further
 detail in the following section.

-\section{Project Requirements and Analysis Methodology}
+\section{Project Requirements and Analysis Methodology}\label{sec4}

 The aim of this project is to repeat the analysis performed in 2015 on the
 dataset Google has released in 2019 in order to find similarities and
@@ -460,7 +473,7 @@ computing slowdown values given the previously computed execution attempt time
 deltas. Finally, the mean of the computed slowdown values is computed resulting
 in the clear and coincise tables found in Figure~\ref{fig:taskslowdown}.

-\section{Analysis: Performance Input of Unsuccessful Executions}
+\section{Analysis: Performance Input of Unsuccessful Executions}\label{sec5}

 Our first investigation focuses on replicating the analysis done by the paper of
 Ros\'a et al.\ paper\cite{dsn-paper} regarding usage of machine time
@@ -667,7 +680,7 @@ With more than 98\% of both CPU and memory resources used by
 non-successful tasks, it is clear the spatial resource waste is high in the 2019
 traces.

-\section{Analysis: Patterns of Task and Job Events}
+\section{Analysis: Patterns of Task and Job Events}\label{sec6}

 This section aims to use some of the tecniques used in section IV of
 the Ros\'a et al.\ paper\cite{dsn-paper} to find patterns and interpendencies
@@ -801,84 +814,96 @@ one. For some clusters (namely B, C, and  D), the mean number of \texttt{FAIL} a
 \texttt{KILL} task events for \texttt{FINISH}ed jobs is almost the same.
 Additionally, it is noteworthy that cluster A has no \texttt{EVICT}ed jobs.

-% \section{Analysis: Potential Causes of Unsuccessful Executions}
+\section{Analysis: Potential Causes of Unsuccessful Executions}\label{sec7}

-% The aim of this section is to analyze several task-level and job-level
-% parameters in order to find correlations with the success of an execution. By
-% using the tecniques used in Section V of the Rosa\' et al.\
-% paper\cite{dsn-paper} we analyze
-% task events' metadata, the use of CPU and Memory resources at the task level,
-% and job metadata respectively in Section~\ref{fig7-section},
-% Section~\ref{fig8-section} and Section~\ref{fig9-section}.
+This section re-applies the techniques used in Section V of the Ros\'a et al.\
+paper\cite{dsn-paper} to find patterns and interdependencies
+between task and job events by gathering statistics on those events. In
+particular, Section~\ref{tabIII-section} explores how the success of a
+task is correlated with its own event patterns, which
+Section~\ref{figV-section} explores further by computing task success
+probabilities based on the number of task termination events of a specific type.
+Finally, Section~\ref{tabIV-section} aims to find similar correlations, but at
+the job level.

-% \subsection{Event rates vs.\ task priority, event execution time, and machine
-% concurrency.}\label{fig7-section}
+\section{Analysis: Potential Causes of Unsuccessful Executions}

-% \input{figures/figure_7}
+The aim of this section is to analyze several task-level and job-level
+parameters in order to find correlations with the success of an execution. By
+using the techniques used in Section V of the Ros\'a et al.\
+paper\cite{dsn-paper}, we analyze
+task events' metadata, the use of CPU and memory resources at the task level,
+and job metadata respectively in Section~\ref{fig7-section},
+Section~\ref{fig8-section} and Section~\ref{fig9-section}.

-% Refer to figures \ref{fig:figureVII-a}, \ref{fig:figureVII-b}, and
-% \ref{fig:figureVII-c}.
+\subsection{Event rates vs.\ task priority, event execution time, and machine
+concurrency.}\label{fig7-section}

-% \textbf{Observations}:
+\input{figures/figure_7}

-% \begin{itemize}
-% \item
-%   No smooth curves in this figure either, unlike 2011 traces
-% \item
-%   The behaviour of curves for 7a (priority) is almost the opposite of
-%   2011, i.e. in-between priorities have higher kill rates while
-%   priorities at the extremum have lower kill rates. This could also be
-%   due bt the inherent distribution of job terminations;
-% \item
-%   Event execution time curves are quite different than 2011, here it
-%   seems there is a good correlation between short task execution times
-%   and finish event rates, instead of the U shape curve in 2015 DSN
-% \item
-%   In figure \ref{fig:figureVII-b} cluster behaviour seems quite uniform
-% \item
-%   Machine concurrency seems to play little role in the event termination
-%   distribution, as for all concurrency factors the kill rate is at 90\%.
-% \end{itemize}
+Refer to figures \ref{fig:figureVII-a}, \ref{fig:figureVII-b}, and
+\ref{fig:figureVII-c}.

-% \subsection{Event Rates vs. Requested Resources, Resource Reservation, and
-% Resource Utilization}\label{fig8-section}
-% \input{figures/figure_8}
+\textbf{Observations}:

-% Refer to Figure~\ref{fig:figureVIII-a}, Figure~\ref{fig:figureVIII-a-csts}
-% Figure~\ref{fig:figureVIII-b}, Figure~\ref{fig:figureVIII-b-csts}
-% Figure~\ref{fig:figureVIII-c}, Figure~\ref{fig:figureVIII-c-csts}
-% Figure~\ref{fig:figureVIII-d}, Figure~\ref{fig:figureVIII-d-csts}
-% Figure~\ref{fig:figureVIII-e}, Figure~\ref{fig:figureVIII-e-csts}
-% Figure~\ref{fig:figureVIII-f}, and Figure~\ref{fig:figureVIII-f-csts}.
+\begin{itemize}
+\item
+  No smooth curves in this figure either, unlike in the 2011 traces;
+\item
+  The behaviour of the curves in 7a (priority) is almost the opposite of
+  2011, i.e.\ in-between priorities have higher kill rates while
+  priorities at the extremes have lower kill rates. This could also be
+  due to the inherent distribution of job terminations;
+\item
+  Event execution time curves are quite different from 2011: here there
+  seems to be a good correlation between short task execution times
+  and finish event rates, instead of the U-shaped curve in the 2015 DSN paper;
+\item
+  In Figure~\ref{fig:figureVII-b} cluster behaviour seems quite uniform;
+\item
+  Machine concurrency seems to play little role in the event termination
+  distribution, as for all concurrency factors the kill rate is at 90\%.
+\end{itemize}

-% \subsection{Job Rates vs. Job Size, Job Execution Time, and Machine Locality
-% }\label{fig9-section}
-% \input{figures/figure_9}
+\subsection{Event Rates vs.\ Requested Resources, Resource Reservation, and
+Resource Utilization}\label{fig8-section}
+\input{figures/figure_8}

-% Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
-% \ref{fig:figureIX-c}.
+Refer to Figure~\ref{fig:figureVIII-a}, Figure~\ref{fig:figureVIII-a-csts},
+Figure~\ref{fig:figureVIII-b}, Figure~\ref{fig:figureVIII-b-csts},
+Figure~\ref{fig:figureVIII-c}, Figure~\ref{fig:figureVIII-c-csts},
+Figure~\ref{fig:figureVIII-d}, Figure~\ref{fig:figureVIII-d-csts},
+Figure~\ref{fig:figureVIII-e}, Figure~\ref{fig:figureVIII-e-csts},
+Figure~\ref{fig:figureVIII-f}, and Figure~\ref{fig:figureVIII-f-csts}.

-% \textbf{Observations}:
+\subsection{Job Rates vs.\ Job Size, Job Execution Time, and Machine Locality
+}\label{fig9-section}
+\input{figures/figure_9}

-% \begin{itemize}
-% \item
-%   Behaviour between cluster varies a lot
-% \item
-%   There are no ``smooth'' gradients in the various curves unlike in the
-%   2011 traces
-% \item
-%   Killed jobs have higher event rates in general, and overall dominate
-%   all event rates measures
-% \item
-%   There still seems to be a correlation between short execution job
-%   times and successfull final termination, and likewise for kills and
-%   higher job terminations
-% \item
-%   Across all clusters, a machine locality factor of 1 seems to lead to
-%   the highest success event rate
-% \end{itemize}
+Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
+\ref{fig:figureIX-c}.

-\section{Conclusions, Future Work and Possible Developments}
+\textbf{Observations}:
+
+\begin{itemize}
+\item
+  Behaviour varies considerably between clusters;
+\item
+  There are no ``smooth'' gradients in the various curves, unlike in the
+  2011 traces;
+\item
+  Killed jobs have higher event rates in general, and overall dominate
+  all event rate measures;
+\item
+  There still seems to be a correlation between short job execution
+  times and successful final termination, and likewise for kills and
+  higher job terminations;
+\item
+  Across all clusters, a machine locality factor of 1 seems to lead to
+  the highest success event rate.
+\end{itemize}
+
+\section{Conclusions, Future Work and Possible Developments}\label{sec8}
 \textbf{TBD}

 \newpage
Author	SHA1	Message	Date
Claudio Maggioni	d1ae92f239	Merge branch 'master' of tea.maggioni.xyz:maggicl/bachelorThesis	2021-06-17 16:02:59 +02:00
Claudio Maggioni	b58c2aaa52	report	2021-06-17 15:59:53 +02:00