Compare commits
2 Commits
158fb32041
...
d1ae92f239
Author | SHA1 | Date | |
---|---|---|---|
d1ae92f239 | |||
b58c2aaa52 |
Binary file not shown.
|
@ -89,7 +89,20 @@ old analysis to understand even better the causes of failures and how to prevent
|
||||||
them. Additionally, this report provides an overview of the data engineering
|
them. Additionally, this report provides an overview of the data engineering
|
||||||
techniques used to perform the queries and analyses on the 2019 traces.
|
techniques used to perform the queries and analyses on the 2019 traces.
|
||||||
|
|
||||||
\section{State of the art}
|
\subsection{Outline}
|
||||||
|
The report is structured as follows. Section~\ref{sec2} contains information about the
|
||||||
|
current state of the art for Google Borg cluster traces. Section~\ref{sec3}
|
||||||
|
provides an overview including technical background information on the data to
|
||||||
|
analyze and its storage format. Section~\ref{sec4} discusses the
|
||||||
|
project requirements and the data science methods used to perform the analysis.
|
||||||
|
Section~\ref{sec5}, Section~\ref{sec6} and Section~\ref{sec7} show the results
|
||||||
|
obtained while analyzing, respectively, the performance input of
|
||||||
|
unsuccessful executions, the patterns of task and job events, and the potential
|
||||||
|
causes of unsuccessful executions. Finally, Section~\ref{sec8} contains the
|
||||||
|
conclusions.
|
||||||
|
|
||||||
|
\section{State of the art}\label{sec2}
|
||||||
|
|
||||||
\begin{figure}[t]
|
\begin{figure}[t]
|
||||||
\begin{center}
|
\begin{center}
|
||||||
\begin{tabular}{cc}
|
\begin{tabular}{cc}
|
||||||
|
@ -142,7 +155,7 @@ machines.
|
||||||
|
|
||||||
\input{figures/machine_configs}
|
\input{figures/machine_configs}
|
||||||
|
|
||||||
\section{Background information}
|
\section{Background information}\label{sec3}
|
||||||
|
|
||||||
\textit{Borg} is Google's own cluster management software able to run
|
\textit{Borg} is Google's own cluster management software able to run
|
||||||
thousands of different jobs. Among the various cluster management services it
|
thousands of different jobs. Among the various cluster management services it
|
||||||
|
@ -275,7 +288,7 @@ science technologies like Apache Spark were used to achieve efficient
|
||||||
and parallelized computations. This approach is discussed with further
|
and parallelized computations. This approach is discussed with further
|
||||||
detail in the following section.
|
detail in the following section.
|
||||||
|
|
||||||
\section{Project Requirements and Analysis Methodology}
|
\section{Project Requirements and Analysis Methodology}\label{sec4}
|
||||||
|
|
||||||
The aim of this project is to repeat the analysis performed in 2015 on the
|
The aim of this project is to repeat the analysis performed in 2015 on the
|
||||||
dataset Google has released in 2019 in order to find similarities and
|
dataset Google has released in 2019 in order to find similarities and
|
||||||
|
@ -460,7 +473,7 @@ computing slowdown values given the previously computed execution attempt time
|
||||||
deltas. Finally, the mean of the computed slowdown values is computed resulting
|
deltas. Finally, the mean of the computed slowdown values is computed resulting
|
||||||
in the clear and concise tables found in Figure~\ref{fig:taskslowdown}.
|
in the clear and concise tables found in Figure~\ref{fig:taskslowdown}.
|
||||||
|
|
||||||
\section{Analysis: Performance Input of Unsuccessful Executions}
|
\section{Analysis: Performance Input of Unsuccessful Executions}\label{sec5}
|
||||||
|
|
||||||
Our first investigation focuses on replicating the analysis done in the
|
Our first investigation focuses on replicating the analysis done in the
|
||||||
Ros\'a et al.\ paper\cite{dsn-paper} regarding usage of machine time
|
Ros\'a et al.\ paper\cite{dsn-paper} regarding usage of machine time
|
||||||
|
@ -667,7 +680,7 @@ With more than 98\% of both CPU and memory resources used by
|
||||||
non-successful tasks, it is clear the spatial resource waste is high in the 2019
|
non-successful tasks, it is clear the spatial resource waste is high in the 2019
|
||||||
traces.
|
traces.
|
||||||
|
|
||||||
\section{Analysis: Patterns of Task and Job Events}
|
\section{Analysis: Patterns of Task and Job Events}\label{sec6}
|
||||||
|
|
||||||
This section aims to use some of the techniques used in section IV of
|
This section aims to use some of the techniques used in section IV of
|
||||||
the Ros\'a et al.\ paper\cite{dsn-paper} to find patterns and interdependencies
|
the Ros\'a et al.\ paper\cite{dsn-paper} to find patterns and interdependencies
|
||||||
|
@ -801,84 +814,96 @@ one. For some clusters (namely B, C, and D), the mean number of \texttt{FAIL} a
|
||||||
\texttt{KILL} task events for \texttt{FINISH}ed jobs is almost the same.
|
\texttt{KILL} task events for \texttt{FINISH}ed jobs is almost the same.
|
||||||
Additionally, it is noteworthy that cluster A has no \texttt{EVICT}ed jobs.
|
Additionally, it is noteworthy that cluster A has no \texttt{EVICT}ed jobs.
|
||||||
|
|
||||||
% \section{Analysis: Potential Causes of Unsuccessful Executions}
|
\section{Analysis: Potential Causes of Unsuccessful Executions}\label{sec7}
|
||||||
|
|
||||||
% The aim of this section is to analyze several task-level and job-level
|
This section re-applies the techniques used in section V of the Ros\'a et al.\
|
||||||
% parameters in order to find correlations with the success of an execution. By
|
paper\cite{dsn-paper} to find patterns and interdependencies
|
||||||
% using the techniques used in Section V of the Ros\'a et al.\
|
between task and job events by gathering event statistics at those events. In
|
||||||
% paper\cite{dsn-paper} we analyze
|
particular, Section~\ref{tabIII-section} explores how the success of a
|
||||||
% task events' metadata, the use of CPU and Memory resources at the task level,
|
task is inter-correlated with its own event patterns, which
|
||||||
% and job metadata respectively in Section~\ref{fig7-section},
|
Section~\ref{figV-section} explores even further by computing task success
|
||||||
% Section~\ref{fig8-section} and Section~\ref{fig9-section}.
|
probabilities based on the number of task termination events of a specific type.
|
||||||
|
Finally, Section~\ref{tabIV-section} aims to find similar correlations, but at
|
||||||
|
the job level.
|
||||||
|
|
||||||
% \subsection{Event rates vs.\ task priority, event execution time, and machine
|
\section{Analysis: Potential Causes of Unsuccessful Executions}
|
||||||
% concurrency.}\label{fig7-section}
|
|
||||||
|
|
||||||
% \input{figures/figure_7}
|
The aim of this section is to analyze several task-level and job-level
|
||||||
|
parameters in order to find correlations with the success of an execution. By
|
||||||
|
using the techniques used in Section V of the Ros\'a et al.\
|
||||||
|
paper\cite{dsn-paper} we analyze
|
||||||
|
task events' metadata, the use of CPU and Memory resources at the task level,
|
||||||
|
and job metadata respectively in Section~\ref{fig7-section},
|
||||||
|
Section~\ref{fig8-section} and Section~\ref{fig9-section}.
|
||||||
|
|
||||||
% Refer to figures \ref{fig:figureVII-a}, \ref{fig:figureVII-b}, and
|
\subsection{Event rates vs.\ task priority, event execution time, and machine
|
||||||
% \ref{fig:figureVII-c}.
|
concurrency.}\label{fig7-section}
|
||||||
|
|
||||||
% \textbf{Observations}:
|
\input{figures/figure_7}
|
||||||
|
|
||||||
% \begin{itemize}
|
Refer to figures \ref{fig:figureVII-a}, \ref{fig:figureVII-b}, and
|
||||||
% \item
|
\ref{fig:figureVII-c}.
|
||||||
% No smooth curves in this figure either, unlike 2011 traces
|
|
||||||
% \item
|
|
||||||
% The behaviour of curves for 7a (priority) is almost the opposite of
|
|
||||||
% 2011, i.e. in-between priorities have higher kill rates while
|
|
||||||
% priorities at the extremum have lower kill rates. This could also be
|
|
||||||
% due to the inherent distribution of job terminations;
|
|
||||||
% \item
|
|
||||||
% Event execution time curves are quite different than 2011, here it
|
|
||||||
% seems there is a good correlation between short task execution times
|
|
||||||
% and finish event rates, instead of the U shape curve in 2015 DSN
|
|
||||||
% \item
|
|
||||||
% In figure \ref{fig:figureVII-b} cluster behaviour seems quite uniform
|
|
||||||
% \item
|
|
||||||
% Machine concurrency seems to play little role in the event termination
|
|
||||||
% distribution, as for all concurrency factors the kill rate is at 90\%.
|
|
||||||
% \end{itemize}
|
|
||||||
|
|
||||||
% \subsection{Event Rates vs. Requested Resources, Resource Reservation, and
|
\textbf{Observations}:
|
||||||
% Resource Utilization}\label{fig8-section}
|
|
||||||
% \input{figures/figure_8}
|
|
||||||
|
|
||||||
% Refer to Figure~\ref{fig:figureVIII-a}, Figure~\ref{fig:figureVIII-a-csts}
|
\begin{itemize}
|
||||||
% Figure~\ref{fig:figureVIII-b}, Figure~\ref{fig:figureVIII-b-csts}
|
\item
|
||||||
% Figure~\ref{fig:figureVIII-c}, Figure~\ref{fig:figureVIII-c-csts}
|
No smooth curves in this figure either, unlike 2011 traces
|
||||||
% Figure~\ref{fig:figureVIII-d}, Figure~\ref{fig:figureVIII-d-csts}
|
\item
|
||||||
% Figure~\ref{fig:figureVIII-e}, Figure~\ref{fig:figureVIII-e-csts}
|
The behaviour of curves for 7a (priority) is almost the opposite of
|
||||||
% Figure~\ref{fig:figureVIII-f}, and Figure~\ref{fig:figureVIII-f-csts}.
|
2011, i.e. in-between priorities have higher kill rates while
|
||||||
|
priorities at the extremum have lower kill rates. This could also be
|
||||||
|
due to the inherent distribution of job terminations;
|
||||||
|
\item
|
||||||
|
Event execution time curves are quite different than 2011, here it
|
||||||
|
seems there is a good correlation between short task execution times
|
||||||
|
and finish event rates, instead of the U shape curve in 2015 DSN
|
||||||
|
\item
|
||||||
|
In figure \ref{fig:figureVII-b} cluster behaviour seems quite uniform
|
||||||
|
\item
|
||||||
|
Machine concurrency seems to play little role in the event termination
|
||||||
|
distribution, as for all concurrency factors the kill rate is at 90\%.
|
||||||
|
\end{itemize}
|
||||||
|
|
||||||
% \subsection{Job Rates vs. Job Size, Job Execution Time, and Machine Locality
|
\subsection{Event Rates vs. Requested Resources, Resource Reservation, and
|
||||||
% }\label{fig9-section}
|
Resource Utilization}\label{fig8-section}
|
||||||
% \input{figures/figure_9}
|
\input{figures/figure_8}
|
||||||
|
|
||||||
% Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
|
Refer to Figure~\ref{fig:figureVIII-a}, Figure~\ref{fig:figureVIII-a-csts}
|
||||||
% \ref{fig:figureIX-c}.
|
Figure~\ref{fig:figureVIII-b}, Figure~\ref{fig:figureVIII-b-csts}
|
||||||
|
Figure~\ref{fig:figureVIII-c}, Figure~\ref{fig:figureVIII-c-csts}
|
||||||
|
Figure~\ref{fig:figureVIII-d}, Figure~\ref{fig:figureVIII-d-csts}
|
||||||
|
Figure~\ref{fig:figureVIII-e}, Figure~\ref{fig:figureVIII-e-csts}
|
||||||
|
Figure~\ref{fig:figureVIII-f}, and Figure~\ref{fig:figureVIII-f-csts}.
|
||||||
|
|
||||||
% \textbf{Observations}:
|
\subsection{Job Rates vs. Job Size, Job Execution Time, and Machine Locality
|
||||||
|
}\label{fig9-section}
|
||||||
|
\input{figures/figure_9}
|
||||||
|
|
||||||
% \begin{itemize}
|
Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
|
||||||
% \item
|
\ref{fig:figureIX-c}.
|
||||||
% Behaviour between cluster varies a lot
|
|
||||||
% \item
|
|
||||||
% There are no ``smooth'' gradients in the various curves unlike in the
|
|
||||||
% 2011 traces
|
|
||||||
% \item
|
|
||||||
% Killed jobs have higher event rates in general, and overall dominate
|
|
||||||
% all event rates measures
|
|
||||||
% \item
|
|
||||||
% There still seems to be a correlation between short execution job
|
|
||||||
% times and successful final termination, and likewise for kills and
|
|
||||||
% higher job terminations
|
|
||||||
% \item
|
|
||||||
% Across all clusters, a machine locality factor of 1 seems to lead to
|
|
||||||
% the highest success event rate
|
|
||||||
% \end{itemize}
|
|
||||||
|
|
||||||
\section{Conclusions, Future Work and Possible Developments}
|
\textbf{Observations}:
|
||||||
|
|
||||||
|
\begin{itemize}
|
||||||
|
\item
|
||||||
|
Behaviour between cluster varies a lot
|
||||||
|
\item
|
||||||
|
There are no ``smooth'' gradients in the various curves unlike in the
|
||||||
|
2011 traces
|
||||||
|
\item
|
||||||
|
Killed jobs have higher event rates in general, and overall dominate
|
||||||
|
all event rates measures
|
||||||
|
\item
|
||||||
|
There still seems to be a correlation between short execution job
|
||||||
|
times and successful final termination, and likewise for kills and
|
||||||
|
higher job terminations
|
||||||
|
\item
|
||||||
|
Across all clusters, a machine locality factor of 1 seems to lead to
|
||||||
|
the highest success event rate
|
||||||
|
\end{itemize}
|
||||||
|
|
||||||
|
\section{Conclusions, Future Work and Possible Developments}\label{sec8}
|
||||||
\textbf{TBD}
|
\textbf{TBD}
|
||||||
|
|
||||||
\newpage
|
\newpage
|
||||||
|
|