report
parent 3885afeb4d
commit f8375ddeb9
3 changed files with 23 additions and 21 deletions
Binary file not shown.
@@ -37,7 +37,7 @@
 \advisor[Universit\`a della Svizzera Italiana,
 Switzerland]{Prof.}{Walter}{Binder}
 \assistant[Universit\`a della Svizzera Italiana,
-Switzerland]{Dr.}{Andrea}{Ros\'a}
+Switzerland]{Dr.}{Andrea}{Ros\`a}
 \end{committee}
 
 \abstract{The thesis aims at comparing two different traces coming from large
@@ -65,7 +65,7 @@ avoid wasting resources and avoid failures.
 In 2011 Google released a month long data trace of their own cluster management
 system~\cite{google-marso-11} \textit{Borg}, containing a lot of data regarding
 scheduling, priority management, and failures of a real production workload.
-This data was the foundation of the 2015 Ros\'a et al.\ paper
+This data was the foundation of the 2015 Ros\`a et al.\ paper
 \textit{Understanding the Dark Side of Big Data Clusters: An Analysis beyond
 Failures}~\cite{dsn-paper}, which in its many conclusions highlighted the need
 for better cluster management highlighting the high amount of failures found in
@@ -116,7 +116,7 @@ exploiting the power of parallel computing, following most of the time a
 MapReduce-like structure.
 
 %\subsection{Contribution}
-This project aims to repeat the analysis performed in 2015 DSN Ros\'a et al.\
+This project aims to repeat the analysis performed in 2015 DSN Ros\`a et al.\
 paper~\cite{dsn-paper} to highlight similarities and differences in Google Borg
 workload and the behaviour and patterns of executions within it. Thanks to this
 analysis, we aim to understand even better the causes of failures and how to
@@ -207,7 +207,7 @@ bugs~\cite{9}~\cite{10}~\cite{11}~\cite{12}.
 However, the community has not yet performed any research on the new Borg
 traces analysing unsuccessful executions, their possible causes, and the
 relationships between tasks and jobs. Therefore, the only current research in
-this field is this very report, providing and update to the the 2015 Ros\'a et
+this field is this very report, providing and update to the the 2015 Ros\`a et
 al.\ paper~\cite{dsn-paper} focusing on the new trace.
 
 \section{Background}\label{sec3}
@@ -517,7 +517,7 @@ task termination counts. After the task events are sorted, the script iterates
 over the events in chronological order, storing each execution attempt time and
 registering all execution termination types by checking the event type field.
 The task termination is then equal to the last execution termination type,
-following the definition originally given in the 2015 Ros\'a et al. DSN paper.
+following the definition originally given in the 2015 Ros\`a et al. DSN paper.
 
 If the task termination is determined to be unsuccessful, the tally counter of
 task terminations for the matching task property is increased. Otherwise, all
@@ -533,7 +533,7 @@ in the clear and coincise tables found in Figure~\ref{fig:taskslowdown}.
 \section{Analysis: Performance Input of Unsuccessful Executions}\label{sec5}
 
 Our first investigation focuses on replicating the analysis done by the paper of
-Ros\'a et al.\ paper~\cite{dsn-paper} regarding usage of machine time
+Ros\`a et al.\ paper~\cite{dsn-paper} regarding usage of machine time
 and resources.
 
 In this section we perform several analyses focusing on how machine time and
@@ -639,7 +639,7 @@ Refer to Figure~\ref{fig:taskslowdown} for a comparison between the 2011 and
 means are computed on a cluster-by-cluster basis for 2019 data in
 Figure~\ref{fig:taskslowdown-csts}.
 
-In 2015 Ros\'a et al.~\cite{dsn-paper} measured mean task slowdown per each task
+In 2015 Ros\`a et al.~\cite{dsn-paper} measured mean task slowdown per each task
 priority value, which at the time were numeric values between 0 and 11. However,
 in 2019 traces, task priorities are given as a numeric value between 0 and 500.
 Therefore, to allow an easier comparison, mean task slowdown values are computed
@@ -740,7 +740,7 @@ traces.
 \section{Analysis: Patterns of Task and Job Events}\label{sec6}
 
 This section aims to use some of the tecniques used in section IV of
-the Ros\'a et al.\ paper~\cite{dsn-paper} to find patterns and interpendencies
+the Ros\`a et al.\ paper~\cite{dsn-paper} to find patterns and interpendencies
 between task and job events by gathering event statistics at those events. In
 particular, Section~\ref{tabIII-section} explores how the success of a
 task is inter-correlated with its own event patterns, which
@@ -873,15 +873,16 @@ Additionally, it is noteworthy that cluster A has no \texttt{EVICT}ed jobs.
 
 \section{Analysis: Potential Causes of Unsuccessful Executions}\label{sec7}
 
-This section re-applies the tecniques used in Section V of the Ros\'a et al.\
-paper~\cite{dsn-paper} to find patterns and interpendencies
-between task and job events by gathering event statistics at those events. In
-particular, Section~\ref{tabIII-section} explores how tasks of the success of a
-task is inter-correlated with its own event patterns, which
-Section~\ref{figV-section} explores even further by computing task success
-probabilities based on the number of task termination events of a specific type.
-Finally, Section~\ref{tabIV-section} aims to find similar correlations, but at
-the job level.
+This section re-applies the tecniques used in Section V of the Ros\`a et al.\
+paper~\cite{dsn-paper} to find causes for unsuccessful events related to
+task-level parameters (analyzed in Section~\ref{fig7-section}),
+usage of machine resources by tasks (analyzed in Section~\ref{fig8-section}),
+and job-level parameters (analyzed in Section~\ref{fig9-section}). In all the
+analyses we use the ``event rate'' metric, which represents the relative
+percentage of termination type events over a certain task/job parameter
+configuration. We compute this metric for all the possible terminations (i.e.\
+\texttt{EVICT}, \texttt{FAIL}, \texttt{FINISH} and \texttt{KILL}) in order to
+find correlations with the several trace parameters.
 
 \subsection{Task Event Rates vs.\ Task Priority, Event Execution Time, and
 Machine Concurrency.}\label{fig7-section} \input{figures/figure_7}
@@ -911,7 +912,7 @@ From this analysis we can make the following observations:
 Figure~\ref{fig:figureVII-b-csts}) for the 2019 traces
 are quite different than 2011 ones, here it
 seems there is a good correlation between short task execution times
-and finish event rates, instead of the ``U shape'' curve found in the Ros\'a
+and finish event rates, instead of the ``U shape'' curve found in the Ros\`a
 et al.\ 2015 DSN paper~\cite{dsn-paper};
 \item
 The behaviour among different clusters for the event execution time
@@ -229,7 +229,7 @@
 {\newpage }
 {\textwidth 5cm}
 
-%%% put ToC, LoF, LoT and Index entries in the ToC use of \phantomsection is required for dealing with the hyperref package and depends on the nohyper option
+%%% put ToC, LoF, LoT and Index entries in the ToC use of \phantomsection is required for dealing with the ryperref package and depends on the nohyper option
 
 %%% other useful packages
 
@@ -241,7 +241,8 @@
 \RequirePackage{amsmath}
 %%% switch on hyperref support
 \ifthenelse{\boolean{@hypermode}}{%
-\RequirePackage[unicode,plainpages=false,pdfpagelabels,breaklinks]{hyperref}
+\RequirePackage[svgnames]{xcolor}
+\RequirePackage[colorlinks=true,linkcolor=Maroon,allcolors=Maroon,unicode,plainpages=false,pdfpagelabels,breaklinks]{hyperref}
 \RequirePackage[all]{hypcap}
 
 }{}
@@ -256,7 +257,7 @@
 \textsf{Advisor's approval}{}
 (\DTLforeach*[\DTLiseq{\type}{r}]{committee}%
 {\actitle=title,\first=first,\last=last,\type=type}{%
-\DTLiffirstrow{}{, }\textsf{\print@blank{\actitle}\first \ \last}, \textsf{Dr. Andrea Ros\'a}):%
+\DTLiffirstrow{}{, }\textsf{\print@blank{\actitle}\first \ \last}, \textsf{Dr. Andrea Ros\`a}):%
 \hspace{4cm}
 & \textsf{Date: }
 }