report
parent
3885afeb4d
commit
f8375ddeb9
3 changed files with 23 additions and 21 deletions
Binary file not shown.
@@ -37,7 +37,7 @@
 \advisor[Universit\`a della Svizzera Italiana,
 Switzerland]{Prof.}{Walter}{Binder}
 \assistant[Universit\`a della Svizzera Italiana,
-Switzerland]{Dr.}{Andrea}{Ros\'a}
+Switzerland]{Dr.}{Andrea}{Ros\`a}
 \end{committee}
 
 \abstract{The thesis aims at comparing two different traces coming from large
@@ -65,7 +65,7 @@ avoid wasting resources and avoid failures.
 In 2011 Google released a month long data trace of their own cluster management
 system~\cite{google-marso-11} \textit{Borg}, containing a lot of data regarding
 scheduling, priority management, and failures of a real production workload.
-This data was the foundation of the 2015 Ros\'a et al.\ paper
+This data was the foundation of the 2015 Ros\`a et al.\ paper
 \textit{Understanding the Dark Side of Big Data Clusters: An Analysis beyond
 Failures}~\cite{dsn-paper}, which in its many conclusions highlighted the need
 for better cluster management highlighting the high amount of failures found in
@@ -116,7 +116,7 @@ exploiting the power of parallel computing, following most of the time a
 MapReduce-like structure.
 
 %\subsection{Contribution}
-This project aims to repeat the analysis performed in the 2015 DSN Ros\'a et al.\
+This project aims to repeat the analysis performed in the 2015 DSN Ros\`a et al.\
 paper~\cite{dsn-paper} to highlight similarities and differences in Google Borg
 workload and the behaviour and patterns of executions within it. Thanks to this
 analysis, we aim to understand even better the causes of failures and how to
@@ -207,7 +207,7 @@ bugs~\cite{9}~\cite{10}~\cite{11}~\cite{12}.
 However, the community has not yet performed any research on the new Borg
 traces analysing unsuccessful executions, their possible causes, and the
 relationships between tasks and jobs. Therefore, the only current research in
-this field is this very report, providing an update to the 2015 Ros\'a et
+this field is this very report, providing an update to the 2015 Ros\`a et
 al.\ paper~\cite{dsn-paper} focusing on the new trace.
 
 \section{Background}\label{sec3}
@@ -517,7 +517,7 @@ task termination counts. After the task events are sorted, the script iterates
 over the events in chronological order, storing each execution attempt time and
 registering all execution termination types by checking the event type field.
 The task termination is then equal to the last execution termination type,
-following the definition originally given in the 2015 Ros\'a et al.\ DSN paper.
+following the definition originally given in the 2015 Ros\`a et al.\ DSN paper.
 
 If the task termination is determined to be unsuccessful, the tally counter of
 task terminations for the matching task property is increased. Otherwise, all
@@ -533,7 +533,7 @@ in the clear and concise tables found in Figure~\ref{fig:taskslowdown}.
 \section{Analysis: Performance Input of Unsuccessful Executions}\label{sec5}
 
 Our first investigation focuses on replicating the analysis done in the
-Ros\'a et al.\ paper~\cite{dsn-paper} regarding usage of machine time
+Ros\`a et al.\ paper~\cite{dsn-paper} regarding usage of machine time
 and resources.
 
 In this section we perform several analyses focusing on how machine time and
@@ -639,7 +639,7 @@ Refer to Figure~\ref{fig:taskslowdown} for a comparison between the 2011 and
 means are computed on a cluster-by-cluster basis for 2019 data in
 Figure~\ref{fig:taskslowdown-csts}.
 
-In 2015 Ros\'a et al.~\cite{dsn-paper} measured mean task slowdown for each task
+In 2015 Ros\`a et al.~\cite{dsn-paper} measured mean task slowdown for each task
 priority value, which at the time were numeric values between 0 and 11. However,
 in the 2019 traces, task priorities are given as a numeric value between 0 and 500.
 Therefore, to allow an easier comparison, mean task slowdown values are computed
@@ -740,7 +740,7 @@ traces.
 \section{Analysis: Patterns of Task and Job Events}\label{sec6}
 
 This section aims to use some of the techniques used in Section IV of
-the Ros\'a et al.\ paper~\cite{dsn-paper} to find patterns and interdependencies
+the Ros\`a et al.\ paper~\cite{dsn-paper} to find patterns and interdependencies
 between task and job events by gathering event statistics at those events. In
 particular, Section~\ref{tabIII-section} explores how the success of a
 task is inter-correlated with its own event patterns, which
@@ -873,15 +873,16 @@ Additionally, it is noteworthy that cluster A has no \texttt{EVICT}ed jobs.
 
 \section{Analysis: Potential Causes of Unsuccessful Executions}\label{sec7}
 
-This section re-applies the techniques used in Section V of the Ros\'a et al.\
-paper~\cite{dsn-paper} to find patterns and interdependencies
-between task and job events by gathering event statistics at those events. In
-particular, Section~\ref{tabIII-section} explores how tasks of the success of a
-task is inter-correlated with its own event patterns, which
-Section~\ref{figV-section} explores even further by computing task success
-probabilities based on the number of task termination events of a specific type.
-Finally, Section~\ref{tabIV-section} aims to find similar correlations, but at
-the job level.
+This section re-applies the techniques used in Section V of the Ros\`a et al.\
+paper~\cite{dsn-paper} to find causes for unsuccessful events related to
+task-level parameters (analyzed in Section~\ref{fig7-section}),
+usage of machine resources by tasks (analyzed in Section~\ref{fig8-section}),
+and job-level parameters (analyzed in Section~\ref{fig9-section}). In all the
+analyses we use the ``event rate'' metric, which represents the relative
+percentage of termination type events over a certain task/job parameter
+configuration. We compute this metric for all the possible terminations (i.e.\
+\texttt{EVICT}, \texttt{FAIL}, \texttt{FINISH} and \texttt{KILL}) in order to
+find correlations with the several trace parameters.
 
 \subsection{Task Event Rates vs.\ Task Priority, Event Execution Time, and
 Machine Concurrency.}\label{fig7-section} \input{figures/figure_7}
@@ -911,7 +912,7 @@ From this analysis we can make the following observations:
 Figure~\ref{fig:figureVII-b-csts}) for the 2019 traces
 are quite different from the 2011 ones: here it
 seems there is a good correlation between short task execution times
-and finish event rates, instead of the ``U shape'' curve found in the Ros\'a
+and finish event rates, instead of the ``U shape'' curve found in the Ros\`a
 et al.\ 2015 DSN paper~\cite{dsn-paper};
 \item
 The behaviour among different clusters for the event execution time

@@ -229,7 +229,7 @@
 {\newpage }
 {\textwidth 5cm}
 
-%%% put ToC, LoF, LoT and Index entries in the ToC use of \phantomsection is required for dealing with the hyperref package and depends on the nohyper option
+%%% put ToC, LoF, LoT and Index entries in the ToC use of \phantomsection is required for dealing with the ryperref package and depends on the nohyper option
 
 %%% other useful packages
 
@@ -241,7 +241,8 @@
 \RequirePackage{amsmath}
 %%% switch on hyperref support
 \ifthenelse{\boolean{@hypermode}}{%
-\RequirePackage[unicode,plainpages=false,pdfpagelabels,breaklinks]{hyperref}
+\RequirePackage[svgnames]{xcolor}
+\RequirePackage[colorlinks=true,linkcolor=Maroon,allcolors=Maroon,unicode,plainpages=false,pdfpagelabels,breaklinks]{hyperref}
 \RequirePackage[all]{hypcap}
 
 }{}
@@ -256,7 +257,7 @@
 \textsf{Advisor's approval}{}
 (\DTLforeach*[\DTLiseq{\type}{r}]{committee}%
 {\actitle=title,\first=first,\last=last,\type=type}{%
-\DTLiffirstrow{}{, }\textsf{\print@blank{\actitle}\first \ \last}, \textsf{Dr. Andrea Ros\'a}):%
+\DTLiffirstrow{}{, }\textsf{\print@blank{\actitle}\first \ \last}, \textsf{Dr. Andrea Ros\`a}):%
 \hspace{4cm}
 & \textsf{Date: }
 }