@@ -37,7 +37,7 @@
 \advisor[Universit\`a della Svizzera Italiana,
 Switzerland]{Prof.}{Walter}{Binder}
 \assistant[Universit\`a della Svizzera Italiana,
-Switzerland]{Dr.}{Andrea}{Ros\'a}
+Switzerland]{Dr.}{Andrea}{Ros\`a}
 \end{committee}
 
 \abstract{The thesis aims at comparing two different traces coming from large
@@ -65,7 +65,7 @@ avoid wasting resources and avoid failures.
 In 2011 Google released a month-long data trace of their own cluster management
 system~\cite{google-marso-11} \textit{Borg}, containing a lot of data regarding
 scheduling, priority management, and failures of a real production workload.
-This data was the foundation of the 2015 Ros\'a et al.\ paper
+This data was the foundation of the 2015 Ros\`a et al.\ paper
 \textit{Understanding the Dark Side of Big Data Clusters: An Analysis beyond
 Failures}~\cite{dsn-paper}, which in its many conclusions highlighted the need
 for better cluster management, given the high number of failures found in
@@ -116,7 +116,7 @@ exploiting the power of parallel computing, following most of the time a
 MapReduce-like structure.
 
 %\subsection{Contribution}
-This project aims to repeat the analysis performed in the 2015 DSN Ros\'a et al.\
+This project aims to repeat the analysis performed in the 2015 DSN Ros\`a et al.\
 paper~\cite{dsn-paper} to highlight similarities and differences in Google Borg
 workload and the behaviour and patterns of executions within it. Thanks to this
 analysis, we aim to understand even better the causes of failures and how to
@@ -207,7 +207,7 @@ bugs~\cite{9}~\cite{10}~\cite{11}~\cite{12}.
 However, the community has not yet performed any research on the new Borg
 traces analysing unsuccessful executions, their possible causes, and the
 relationships between tasks and jobs. Therefore, the only current research in
-this field is this very report, providing an update to the 2015 Ros\'a et
+this field is this very report, providing an update to the 2015 Ros\`a et
 al.\ paper~\cite{dsn-paper} focusing on the new trace.
 
 \section{Background}\label{sec3}
@@ -517,7 +517,7 @@ task termination counts. After the task events are sorted, the script iterates
 over the events in chronological order, storing each execution attempt time and
 registering all execution termination types by checking the event type field.
 The task termination is then equal to the last execution termination type,
-following the definition originally given in the 2015 Ros\'a et al.\ DSN paper.
+following the definition originally given in the 2015 Ros\`a et al.\ DSN paper.
 
 If the task termination is determined to be unsuccessful, the tally counter of
 task terminations for the matching task property is increased. Otherwise, all
@@ -533,7 +533,7 @@ in the clear and concise tables found in Figure~\ref{fig:taskslowdown}.
 \section{Analysis: Performance Input of Unsuccessful Executions}\label{sec5}
 
 Our first investigation focuses on replicating the analysis done by the
-Ros\'a et al.\ paper~\cite{dsn-paper} regarding usage of machine time
+Ros\`a et al.\ paper~\cite{dsn-paper} regarding usage of machine time
 and resources.
 
 In this section we perform several analyses focusing on how machine time and
@@ -639,7 +639,7 @@ Refer to Figure~\ref{fig:taskslowdown} for a comparison between the 2011 and
 means are computed on a cluster-by-cluster basis for 2019 data in
 Figure~\ref{fig:taskslowdown-csts}.
 
-In 2015 Ros\'a et al.~\cite{dsn-paper} measured mean task slowdown for each task
+In 2015 Ros\`a et al.~\cite{dsn-paper} measured mean task slowdown for each task
 priority value, which at the time were numeric values between 0 and 11. However,
 in the 2019 traces, task priorities are given as a numeric value between 0 and 500.
 Therefore, to allow an easier comparison, mean task slowdown values are computed
@@ -740,7 +740,7 @@ traces.
 \section{Analysis: Patterns of Task and Job Events}\label{sec6}
 
 This section aims to use some of the techniques used in Section IV of
-the Ros\'a et al.\ paper~\cite{dsn-paper} to find patterns and interdependencies
+the Ros\`a et al.\ paper~\cite{dsn-paper} to find patterns and interdependencies
 between task and job events by gathering event statistics at those events. In
 particular, Section~\ref{tabIII-section} explores how the success of a
 task is inter-correlated with its own event patterns, which
@@ -873,15 +873,16 @@ Additionally, it is noteworthy that cluster A has no \texttt{EVICT}ed jobs.
 
 \section{Analysis: Potential Causes of Unsuccessful Executions}\label{sec7}
 
-This section re-applies the techniques used in Section V of the Ros\'a et al.\
-paper~\cite{dsn-paper} to find patterns and interdependencies
-between task and job events by gathering event statistics at those events. In
-particular, Section~\ref{tabIII-section} explores how the success of a
-task is inter-correlated with its own event patterns, which
-Section~\ref{figV-section} explores even further by computing task success
-probabilities based on the number of task termination events of a specific type.
-Finally, Section~\ref{tabIV-section} aims to find similar correlations, but at
-the job level.
+This section re-applies the techniques used in Section V of the Ros\`a et al.\
+paper~\cite{dsn-paper} to find causes for unsuccessful events related to
+task-level parameters (analyzed in Section~\ref{fig7-section}),
+usage of machine resources by tasks (analyzed in Section~\ref{fig8-section}),
+and job-level parameters (analyzed in Section~\ref{fig9-section}). In all the
+analyses we use the ``event rate'' metric, which represents the relative
+percentage of termination type events over a certain task/job parameter
+configuration. We compute this metric for all the possible terminations (i.e.\
+\texttt{EVICT}, \texttt{FAIL}, \texttt{FINISH} and \texttt{KILL}) in order to
+find correlations with the several trace parameters.
 
 \subsection{Task Event Rates vs.\ Task Priority, Event Execution Time, and
 Machine Concurrency.}\label{fig7-section} \input{figures/figure_7}
@@ -911,7 +912,7 @@ From this analysis we can make the following observations:
 Figure~\ref{fig:figureVII-b-csts}) for the 2019 traces
 are quite different from the 2011 ones: here it
 seems there is a good correlation between short task execution times
-and finish event rates, instead of the ``U shape'' curve found in the Ros\'a
+and finish event rates, instead of the ``U shape'' curve found in the Ros\`a
 et al.\ 2015 DSN paper~\cite{dsn-paper};
 \item
 The behaviour among different clusters for the event execution time