Compare commits

...

2 Commits

2 changed files with 95 additions and 70 deletions

Binary file not shown.

@ -89,7 +89,20 @@ old analysis to understand even better the causes of failures and how to prevent
them. Additionally, this report provides an overview of the data engineering
techniques used to perform the queries and analyses on the 2019 traces.
\section{State of the art}
\subsection{Outline}
The report is structured as follows. Section~\ref{sec2} contains information about the
current state of the art for Google Borg cluster traces. Section~\ref{sec3}
provides an overview including technical background information on the data to
analyze and its storage format. Section~\ref{sec4} discusses the
project requirements and the data science methods used to perform the analysis.
Section~\ref{sec5}, Section~\ref{sec6} and Section~\ref{sec7} show the results
obtained while analyzing, respectively, the performance input of
unsuccessful executions, the patterns of task and job events, and the potential
causes of unsuccessful executions. Finally, Section~\ref{sec8} contains the
conclusions.
\section{State of the art}\label{sec2}
\begin{figure}[t]
\begin{center}
\begin{tabular}{cc}
@ -142,7 +155,7 @@ machines.
\input{figures/machine_configs}
\section{Background information}
\section{Background information}\label{sec3}
\textit{Borg} is Google's own cluster management software able to run
thousands of different jobs. Among the various cluster management services it
@ -275,7 +288,7 @@ science technologies like Apache Spark were used to achieve efficient
and parallelized computations. This approach is discussed with further
detail in the following section.
\section{Project Requirements and Analysis Methodology}
\section{Project Requirements and Analysis Methodology}\label{sec4}
The aim of this project is to repeat the analysis performed in 2015 on the
dataset Google has released in 2019 in order to find similarities and
@ -460,7 +473,7 @@ computing slowdown values given the previously computed execution attempt time
deltas. Finally, the mean of the slowdown values is taken, resulting
in the clear and concise tables found in Figure~\ref{fig:taskslowdown}.
\section{Analysis: Performance Input of Unsuccessful Executions}
\section{Analysis: Performance Input of Unsuccessful Executions}\label{sec5}
Our first investigation focuses on replicating the analysis done in the
Ros\'a et al.\ paper\cite{dsn-paper} regarding usage of machine time
@ -667,7 +680,7 @@ With more than 98\% of both CPU and memory resources used by
non-successful tasks, it is clear the spatial resource waste is high in the 2019
traces.
\section{Analysis: Patterns of Task and Job Events}
\section{Analysis: Patterns of Task and Job Events}\label{sec6}
This section aims to use some of the techniques used in Section IV of
the Ros\'a et al.\ paper\cite{dsn-paper} to find patterns and interdependencies
@ -801,84 +814,96 @@ one. For some clusters (namely B, C, and D), the mean number of \texttt{FAIL} a
\texttt{KILL} task events for \texttt{FINISH}ed jobs is almost the same.
Additionally, it is noteworthy that cluster A has no \texttt{EVICT}ed jobs.
% \section{Analysis: Potential Causes of Unsuccessful Executions}
\section{Analysis: Potential Causes of Unsuccessful Executions}\label{sec7}
% The aim of this section is to analyze several task-level and job-level
% parameters in order to find correlations with the success of an execution. By
% using the techniques used in Section V of the Ros\'a et al.\
% paper\cite{dsn-paper} we analyze
% task events' metadata, the use of CPU and Memory resources at the task level,
% and job metadata respectively in Section~\ref{fig7-section},
% Section~\ref{fig8-section} and Section~\ref{fig9-section}.
This section re-applies the techniques used in Section V of the Ros\'a et al.\
paper\cite{dsn-paper} to find patterns and interdependencies
between task and job events by gathering event statistics at those events. In
particular, Section~\ref{tabIII-section} explores how the success of a
task is correlated with its own event patterns, which
Section~\ref{figV-section} explores even further by computing task success
probabilities based on the number of task termination events of a specific type.
Finally, Section~\ref{tabIV-section} aims to find similar correlations, but at
the job level.
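As a sketch of the kind of estimate meant here (the notation below is our own
assumption and is not taken from the Ros\'a et al.\ paper), let $c_e(t)$ be the
number of termination events of type $e$ recorded for task $t$. The success
probability of tasks with exactly $n$ such events can then be estimated
empirically as
\[
  \hat{P}(\mathrm{success} \mid e, n) =
  \frac{\#\{t : c_e(t) = n,\ t \mbox{ ends with } \texttt{FINISH}\}}
       {\#\{t : c_e(t) = n\}},
\]
where we assume that a task counts as successful when its final termination
event is \texttt{FINISH}.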
% \subsection{Event rates vs.\ task priority, event execution time, and machine
% concurrency.}\label{fig7-section}
\section{Analysis: Potential Causes of Unsuccessful Executions}
% \input{figures/figure_7}
The aim of this section is to analyze several task-level and job-level
parameters in order to find correlations with the success of an execution. By
using the techniques used in Section V of the Ros\'a et al.\
paper\cite{dsn-paper} we analyze
task events' metadata, the use of CPU and Memory resources at the task level,
and job metadata respectively in Section~\ref{fig7-section},
Section~\ref{fig8-section} and Section~\ref{fig9-section}.
% Refer to figures \ref{fig:figureVII-a}, \ref{fig:figureVII-b}, and
% \ref{fig:figureVII-c}.
\subsection{Event rates vs.\ task priority, event execution time, and machine
concurrency.}\label{fig7-section}
% \textbf{Observations}:
\input{figures/figure_7}
% \begin{itemize}
% \item
% No smooth curves in this figure either, unlike in the 2011 traces;
% \item
% The behaviour of curves for 7a (priority) is almost the opposite of
% 2011, i.e.\ in-between priorities have higher kill rates while
% priorities at the extremes have lower kill rates. This could also be
% due to the inherent distribution of job terminations;
% \item
% Event execution time curves are quite different from 2011: here
% there seems to be a good correlation between short task execution times
% and finish event rates, instead of the U-shaped curve in the 2015 DSN paper;
% \item
% In Figure~\ref{fig:figureVII-b} cluster behaviour seems quite uniform;
% \item
% Machine concurrency seems to play little role in the event termination
% distribution, as for all concurrency factors the kill rate is at 90\%.
% \end{itemize}
Refer to figures \ref{fig:figureVII-a}, \ref{fig:figureVII-b}, and
\ref{fig:figureVII-c}.
% \subsection{Event Rates vs.\ Requested Resources, Resource Reservation, and
% Resource Utilization}\label{fig8-section}
% \input{figures/figure_8}
\textbf{Observations}:
% Refer to Figure~\ref{fig:figureVIII-a}, Figure~\ref{fig:figureVIII-a-csts},
% Figure~\ref{fig:figureVIII-b}, Figure~\ref{fig:figureVIII-b-csts},
% Figure~\ref{fig:figureVIII-c}, Figure~\ref{fig:figureVIII-c-csts},
% Figure~\ref{fig:figureVIII-d}, Figure~\ref{fig:figureVIII-d-csts},
% Figure~\ref{fig:figureVIII-e}, Figure~\ref{fig:figureVIII-e-csts},
% Figure~\ref{fig:figureVIII-f}, and Figure~\ref{fig:figureVIII-f-csts}.
\begin{itemize}
\item
No smooth curves in this figure either, unlike in the 2011 traces;
\item
The behaviour of curves for 7a (priority) is almost the opposite of
2011, i.e.\ in-between priorities have higher kill rates while
priorities at the extremes have lower kill rates. This could also be
due to the inherent distribution of job terminations;
\item
Event execution time curves are quite different from 2011: here
there seems to be a good correlation between short task execution times
and finish event rates, instead of the U-shaped curve in the 2015 DSN paper;
\item
In Figure~\ref{fig:figureVII-b} cluster behaviour seems quite uniform;
\item
Machine concurrency seems to play little role in the event termination
distribution, as for all concurrency factors the kill rate is at 90\%.
\end{itemize}
% \subsection{Job Rates vs.\ Job Size, Job Execution Time, and Machine Locality
% }\label{fig9-section}
% \input{figures/figure_9}
\subsection{Event Rates vs.\ Requested Resources, Resource Reservation, and
Resource Utilization}\label{fig8-section}
\input{figures/figure_8}
% Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
% \ref{fig:figureIX-c}.
Refer to Figure~\ref{fig:figureVIII-a}, Figure~\ref{fig:figureVIII-a-csts},
Figure~\ref{fig:figureVIII-b}, Figure~\ref{fig:figureVIII-b-csts},
Figure~\ref{fig:figureVIII-c}, Figure~\ref{fig:figureVIII-c-csts},
Figure~\ref{fig:figureVIII-d}, Figure~\ref{fig:figureVIII-d-csts},
Figure~\ref{fig:figureVIII-e}, Figure~\ref{fig:figureVIII-e-csts},
Figure~\ref{fig:figureVIII-f}, and Figure~\ref{fig:figureVIII-f-csts}.
% \textbf{Observations}:
\subsection{Job Rates vs.\ Job Size, Job Execution Time, and Machine Locality
}\label{fig9-section}
\input{figures/figure_9}
% \begin{itemize}
% \item
% Behaviour varies a lot between clusters
% \item
% There are no ``smooth'' gradients in the various curves unlike in the
% 2011 traces
% \item
% Killed jobs have higher event rates in general, and overall dominate
% all event rate measures
% \item
% There still seems to be a correlation between short job execution
% times and successful final termination, and likewise for kills and
% higher job terminations
% \item
% Across all clusters, a machine locality factor of 1 seems to lead to
% the highest success event rate
% \end{itemize}
Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
\ref{fig:figureIX-c}.
\section{Conclusions, Future Work and Possible Developments}
\textbf{Observations}:
\begin{itemize}
\item
Behaviour varies a lot between clusters
\item
There are no ``smooth'' gradients in the various curves unlike in the
2011 traces
\item
Killed jobs have higher event rates in general, and overall dominate
all event rate measures
\item
There still seems to be a correlation between short job execution
times and successful final termination, and likewise for kills and
higher job terminations
\item
Across all clusters, a machine locality factor of 1 seems to lead to
the highest success event rate
\end{itemize}
\section{Conclusions, Future Work and Possible Developments}\label{sec8}
\textbf{TBD}
\newpage