Claudio Maggioni 2021-07-12 21:28:33 +02:00
parent 3885afeb4d
commit f8375ddeb9
3 changed files with 23 additions and 21 deletions

Binary file not shown.

@ -37,7 +37,7 @@
\advisor[Universit\`a della Svizzera Italiana,
Switzerland]{Prof.}{Walter}{Binder}
\assistant[Universit\`a della Svizzera Italiana,
Switzerland]{Dr.}{Andrea}{Ros\'a}
Switzerland]{Dr.}{Andrea}{Ros\`a}
\end{committee}
\abstract{The thesis aims at comparing two different traces coming from large
@ -65,7 +65,7 @@ avoid wasting resources and avoid failures.
In 2011 Google released a month long data trace of their own cluster management
system~\cite{google-marso-11} \textit{Borg}, containing a lot of data regarding
scheduling, priority management, and failures of a real production workload.
This data was the foundation of the 2015 Ros\'a et al.\ paper
This data was the foundation of the 2015 Ros\`a et al.\ paper
\textit{Understanding the Dark Side of Big Data Clusters: An Analysis beyond
Failures}~\cite{dsn-paper}, which in its many conclusions highlighted the need
for better cluster management, given the high number of failures found in
@ -116,7 +116,7 @@ exploiting the power of parallel computing, following most of the time a
MapReduce-like structure.
%\subsection{Contribution}
This project aims to repeat the analysis performed in the 2015 DSN Ros\'a et al.\
This project aims to repeat the analysis performed in the 2015 DSN Ros\`a et al.\
paper~\cite{dsn-paper} to highlight similarities and differences in Google Borg
workload and the behaviour and patterns of executions within it. Thanks to this
analysis, we aim to understand even better the causes of failures and how to
@ -207,7 +207,7 @@ bugs~\cite{9}~\cite{10}~\cite{11}~\cite{12}.
However, the community has not yet performed any research on the new Borg
traces analysing unsuccessful executions, their possible causes, and the
relationships between tasks and jobs. Therefore, the only current research in
this field is this very report, providing an update to the 2015 Ros\'a et
this field is this very report, providing an update to the 2015 Ros\`a et
al.\ paper~\cite{dsn-paper} focusing on the new trace.
\section{Background}\label{sec3}
@ -517,7 +517,7 @@ task termination counts. After the task events are sorted, the script iterates
over the events in chronological order, storing each execution attempt time and
registering all execution termination types by checking the event type field.
The task termination is then equal to the last execution termination type,
following the definition originally given in the 2015 Ros\'a et al.\ DSN paper.
following the definition originally given in the 2015 Ros\`a et al.\ DSN paper.
If the task termination is determined to be unsuccessful, the tally counter of
task terminations for the matching task property is increased. Otherwise, all
@ -533,7 +533,7 @@ in the clear and concise tables found in Figure~\ref{fig:taskslowdown}.
\section{Analysis: Performance Input of Unsuccessful Executions}\label{sec5}
Our first investigation focuses on replicating the analysis done in the
Ros\'a et al.\ paper~\cite{dsn-paper} regarding usage of machine time
Ros\`a et al.\ paper~\cite{dsn-paper} regarding usage of machine time
and resources.
In this section we perform several analyses focusing on how machine time and
@ -639,7 +639,7 @@ Refer to Figure~\ref{fig:taskslowdown} for a comparison between the 2011 and
means are computed on a cluster-by-cluster basis for 2019 data in
Figure~\ref{fig:taskslowdown-csts}.
In 2015 Ros\'a et al.~\cite{dsn-paper} measured mean task slowdown per each task
In 2015 Ros\`a et al.~\cite{dsn-paper} measured mean task slowdown per each task
priority value, which at the time were numeric values between 0 and 11. However,
in 2019 traces, task priorities are given as a numeric value between 0 and 500.
Therefore, to allow an easier comparison, mean task slowdown values are computed
@ -740,7 +740,7 @@ traces.
\section{Analysis: Patterns of Task and Job Events}\label{sec6}
This section aims to use some of the techniques used in Section IV of
the Ros\'a et al.\ paper~\cite{dsn-paper} to find patterns and interdependencies
the Ros\`a et al.\ paper~\cite{dsn-paper} to find patterns and interdependencies
between task and job events by gathering event statistics at those events. In
particular, Section~\ref{tabIII-section} explores how the success of a
task is inter-correlated with its own event patterns, which
@ -873,15 +873,16 @@ Additionally, it is noteworthy that cluster A has no \texttt{EVICT}ed jobs.
\section{Analysis: Potential Causes of Unsuccessful Executions}\label{sec7}
This section re-applies the techniques used in Section V of the Ros\'a et al.\
paper~\cite{dsn-paper} to find patterns and interdependencies
between task and job events by gathering event statistics at those events. In
particular, Section~\ref{tabIII-section} explores how the success of a
task is inter-correlated with its own event patterns, which
Section~\ref{figV-section} explores even further by computing task success
probabilities based on the number of task termination events of a specific type.
Finally, Section~\ref{tabIV-section} aims to find similar correlations, but at
the job level.
This section re-applies the techniques used in Section V of the Ros\`a et al.\
paper~\cite{dsn-paper} to find causes for unsuccessful events related to
task-level parameters (analyzed in Section~\ref{fig7-section}),
usage of machine resources by tasks (analyzed in Section~\ref{fig8-section}),
and job-level parameters (analyzed in Section~\ref{fig9-section}). In all the
analyses we use the ``event rate'' metric, which represents the relative
percentage of termination type events over a certain task/job parameter
configuration. We compute this metric for all the possible terminations (i.e.\
\texttt{EVICT}, \texttt{FAIL}, \texttt{FINISH} and \texttt{KILL}) in order to
find correlations with the several trace parameters.
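As a minimal sketch of this metric (the notation is an assumption introduced
here for illustration and is not taken from the original paper), let $N_t(c)$
be the number of termination events of type $t$ observed for a given task/job
parameter configuration $c$. One possible reading of the event rate is then
\[
  \mathit{rate}_t(c) = 100 \cdot \frac{N_t(c)}{\sum_{t' \in T} N_{t'}(c)},
  \qquad
  T = \{\texttt{EVICT}, \texttt{FAIL}, \texttt{FINISH}, \texttt{KILL}\},
\]
so that, under this reading, the four rates for any fixed $c$ sum to 100\%.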
\subsection{Task Event Rates vs.\ Task Priority, Event Execution Time, and
Machine Concurrency.}\label{fig7-section} \input{figures/figure_7}
@ -911,7 +912,7 @@ From this analysis we can make the following observations:
Figure~\ref{fig:figureVII-b-csts}) for the 2019 traces
are quite different from the 2011 ones: here there
seems to be a good correlation between short task execution times
and finish event rates, instead of the ``U shape'' curve found in the Ros\'a
and finish event rates, instead of the ``U shape'' curve found in the Ros\`a
et al.\ 2015 DSN paper~\cite{dsn-paper};
\item
The behaviour among different clusters for the event execution time

@ -229,7 +229,7 @@
{\newpage }
{\textwidth 5cm}
%%% put ToC, LoF, LoT and Index entries in the ToC use of \phantomsection is required for dealing with the hyperref package and depends on the nohyper option
%%% put ToC, LoF, LoT and Index entries in the ToC use of \phantomsection is required for dealing with the hyperref package and depends on the nohyper option
%%% other useful packages
@ -241,7 +241,8 @@
\RequirePackage{amsmath}
%%% switch on hyperref support
\ifthenelse{\boolean{@hypermode}}{%
\RequirePackage[unicode,plainpages=false,pdfpagelabels,breaklinks]{hyperref}
\RequirePackage[svgnames]{xcolor}
\RequirePackage[colorlinks=true,linkcolor=Maroon,allcolors=Maroon,unicode,plainpages=false,pdfpagelabels,breaklinks]{hyperref}
\RequirePackage[all]{hypcap}
}{}
@ -256,7 +257,7 @@
\textsf{Advisor's approval}{}
(\DTLforeach*[\DTLiseq{\type}{r}]{committee}%
{\actitle=title,\first=first,\last=last,\type=type}{%
\DTLiffirstrow{}{, }\textsf{\print@blank{\actitle}\first \ \last}, \textsf{Dr. Andrea Ros\'a}):%
\DTLiffirstrow{}{, }\textsf{\print@blank{\actitle}\first \ \last}, \textsf{Dr. Andrea Ros\`a}):%
\hspace{4cm}
& \textsf{Date: }
}