Claudio Maggioni 2021-07-12 21:28:33 +02:00
parent 3885afeb4d
commit f8375ddeb9
3 changed files with 23 additions and 21 deletions

Binary file not shown.


@@ -37,7 +37,7 @@
 \advisor[Universit\`a della Svizzera Italiana,
 Switzerland]{Prof.}{Walter}{Binder}
 \assistant[Universit\`a della Svizzera Italiana,
-Switzerland]{Dr.}{Andrea}{Ros\'a}
+Switzerland]{Dr.}{Andrea}{Ros\`a}
 \end{committee}
 \abstract{The thesis aims at comparing two different traces coming from large
@@ -65,7 +65,7 @@ avoid wasting resources and avoid failures.
 In 2011 Google released a month long data trace of their own cluster management
 system~\cite{google-marso-11} \textit{Borg}, containing a lot of data regarding
 scheduling, priority management, and failures of a real production workload.
-This data was the foundation of the 2015 Ros\'a et al.\ paper
+This data was the foundation of the 2015 Ros\`a et al.\ paper
 \textit{Understanding the Dark Side of Big Data Clusters: An Analysis beyond
 Failures}~\cite{dsn-paper}, which in its many conclusions highlighted the need
 for better cluster management highlighting the high amount of failures found in
@@ -116,7 +116,7 @@ exploiting the power of parallel computing, following most of the time a
 MapReduce-like structure.
 %\subsection{Contribution}
-This project aims to repeat the analysis performed in 2015 DSN Ros\'a et al.\
+This project aims to repeat the analysis performed in 2015 DSN Ros\`a et al.\
 paper~\cite{dsn-paper} to highlight similarities and differences in Google Borg
 workload and the behaviour and patterns of executions within it. Thanks to this
 analysis, we aim to understand even better the causes of failures and how to
@@ -207,7 +207,7 @@ bugs~\cite{9}~\cite{10}~\cite{11}~\cite{12}.
 However, the community has not yet performed any research on the new Borg
 traces analysing unsuccessful executions, their possible causes, and the
 relationships between tasks and jobs. Therefore, the only current research in
-this field is this very report, providing and update to the the 2015 Ros\'a et
+this field is this very report, providing and update to the the 2015 Ros\`a et
 al.\ paper~\cite{dsn-paper} focusing on the new trace.
 \section{Background}\label{sec3}
@@ -517,7 +517,7 @@ task termination counts. After the task events are sorted, the script iterates
 over the events in chronological order, storing each execution attempt time and
 registering all execution termination types by checking the event type field.
 The task termination is then equal to the last execution termination type,
-following the definition originally given in the 2015 Ros\'a et al. DSN paper.
+following the definition originally given in the 2015 Ros\`a et al. DSN paper.
 If the task termination is determined to be unsuccessful, the tally counter of
 task terminations for the matching task property is increased. Otherwise, all
@@ -533,7 +533,7 @@ in the clear and coincise tables found in Figure~\ref{fig:taskslowdown}.
 \section{Analysis: Performance Input of Unsuccessful Executions}\label{sec5}
 Our first investigation focuses on replicating the analysis done by the paper of
-Ros\'a et al.\ paper~\cite{dsn-paper} regarding usage of machine time
+Ros\`a et al.\ paper~\cite{dsn-paper} regarding usage of machine time
 and resources.
 In this section we perform several analyses focusing on how machine time and
@@ -639,7 +639,7 @@ Refer to Figure~\ref{fig:taskslowdown} for a comparison between the 2011 and
 means are computed on a cluster-by-cluster basis for 2019 data in
 Figure~\ref{fig:taskslowdown-csts}.
-In 2015 Ros\'a et al.~\cite{dsn-paper} measured mean task slowdown per each task
+In 2015 Ros\`a et al.~\cite{dsn-paper} measured mean task slowdown per each task
 priority value, which at the time were numeric values between 0 and 11. However,
 in 2019 traces, task priorities are given as a numeric value between 0 and 500.
 Therefore, to allow an easier comparison, mean task slowdown values are computed
@@ -740,7 +740,7 @@ traces.
 \section{Analysis: Patterns of Task and Job Events}\label{sec6}
 This section aims to use some of the tecniques used in section IV of
-the Ros\'a et al.\ paper~\cite{dsn-paper} to find patterns and interpendencies
+the Ros\`a et al.\ paper~\cite{dsn-paper} to find patterns and interpendencies
 between task and job events by gathering event statistics at those events. In
 particular, Section~\ref{tabIII-section} explores how the success of a
 task is inter-correlated with its own event patterns, which
@@ -873,15 +873,16 @@ Additionally, it is noteworthy that cluster A has no \texttt{EVICT}ed jobs.
 \section{Analysis: Potential Causes of Unsuccessful Executions}\label{sec7}
-This section re-applies the tecniques used in Section V of the Ros\'a et al.\
-paper~\cite{dsn-paper} to find patterns and interpendencies
-between task and job events by gathering event statistics at those events. In
-particular, Section~\ref{tabIII-section} explores how tasks of the success of a
-task is inter-correlated with its own event patterns, which
-Section~\ref{figV-section} explores even further by computing task success
-probabilities based on the number of task termination events of a specific type.
-Finally, Section~\ref{tabIV-section} aims to find similar correlations, but at
-the job level.
+This section re-applies the tecniques used in Section V of the Ros\`a et al.\
+paper~\cite{dsn-paper} to find causes for unsuccessful events related to
+task-level parameters (analyzed in Section~\ref{fig7-section}),
+usage of machine resources by tasks (analyzed in Section~\ref{fig8-section}),
+and job-level parameters (analyzed in Section~\ref{fig9-section}). In all the
+analyses we use the ``event rate'' metric, which represents the relative
+percentage of termination type events over a certain task/job parameter
+configuration. We compute this metric for all the possible terminations (i.e.\
+\texttt{EVICT}, \texttt{FAIL}, \texttt{FINISH} and \texttt{KILL}) in order to
+find correlations with the several trace parameters.
 \subsection{Task Event Rates vs.\ Task Priority, Event Execution Time, and
 Machine Concurrency.}\label{fig7-section} \input{figures/figure_7}
@@ -911,7 +912,7 @@ From this analysis we can make the following observations:
 Figure~\ref{fig:figureVII-b-csts}) for the 2019 traces
 are quite different than 2011 ones, here it
 seems there is a good correlation between short task execution times
-and finish event rates, instead of the ``U shape'' curve found in the Ros\'a
+and finish event rates, instead of the ``U shape'' curve found in the Ros\`a
 et al.\ 2015 DSN paper~\cite{dsn-paper};
 \item
 The behaviour among different clusters for the event execution time


@@ -229,7 +229,7 @@
 {\newpage }
 {\textwidth 5cm}
-%%% put ToC, LoF, LoT and Index entries in the ToC use of \phantomsection is required for dealing with the hyperref package and depends on the nohyper option
+%%% put ToC, LoF, LoT and Index entries in the ToC use of \phantomsection is required for dealing with the ryperref package and depends on the nohyper option
 %%% other useful packages
@@ -241,7 +241,8 @@
 \RequirePackage{amsmath}
 %%% switch on hyperref support
 \ifthenelse{\boolean{@hypermode}}{%
-\RequirePackage[unicode,plainpages=false,pdfpagelabels,breaklinks]{hyperref}
+\RequirePackage[svgnames]{xcolor}
+\RequirePackage[colorlinks=true,linkcolor=Maroon,allcolors=Maroon,unicode,plainpages=false,pdfpagelabels,breaklinks]{hyperref}
 \RequirePackage[all]{hypcap}
 }{}
@@ -256,7 +257,7 @@
 \textsf{Advisor's approval}{}
 (\DTLforeach*[\DTLiseq{\type}{r}]{committee}%
 {\actitle=title,\first=first,\last=last,\type=type}{%
-\DTLiffirstrow{}{, }\textsf{\print@blank{\actitle}\first \ \last}, \textsf{Dr. Andrea Ros\'a}):%
+\DTLiffirstrow{}{, }\textsf{\print@blank{\actitle}\first \ \last}, \textsf{Dr. Andrea Ros\`a}):%
 \hspace{4cm}
 & \textsf{Date: }
 }