Compare commits
2 Commits
158fb32041
...
d1ae92f239
Author | SHA1 | Date | |
---|---|---|---|
d1ae92f239 | |||
b58c2aaa52 |
Binary file not shown.
|
@ -89,7 +89,20 @@ old analysis to understand even better the causes of failures and how to prevent
|
|||
them. Additionally, this report provides an overview of the data engineering
|
||||
techniques used to perform the queries and analyses on the 2019 traces.
|
||||
|
||||
\section{State of the art}
|
||||
\subsection{Outline}
|
||||
The report is structured as follows. Section~\ref{sec2} contains information about the
|
||||
current state of the art for Google Borg cluster traces. Section~\ref{sec3}
|
||||
provides an overview including technical background information on the data to
|
||||
analyze and its storage format. Section~\ref{sec4} discusses the
|
||||
project requirements and the data science methods used to perform the analysis.
|
||||
Section~\ref{sec5}, Section~\ref{sec6} and Section~\ref{sec7} show the results
|
||||
obtained while analyzing, respectively, the performance input of
|
||||
unsuccessful executions, the patterns of task and job events, and the potential
|
||||
causes of unsuccessful executions. Finally, Section~\ref{sec8} contains the
|
||||
conclusions.
|
||||
|
||||
\section{State of the art}\label{sec2}
|
||||
|
||||
\begin{figure}[t]
|
||||
\begin{center}
|
||||
\begin{tabular}{cc}
|
||||
|
@ -142,7 +155,7 @@ machines.
|
|||
|
||||
\input{figures/machine_configs}
|
||||
|
||||
\section{Background information}
|
||||
\section{Background information}\label{sec3}
|
||||
|
||||
\textit{Borg} is Google's own cluster management software able to run
|
||||
thousands of different jobs. Among the various cluster management services it
|
||||
|
@ -275,7 +288,7 @@ science technologies like Apache Spark were used to achieve efficient
|
|||
and parallelized computations. This approach is discussed with further
|
||||
detail in the following section.
|
||||
|
||||
\section{Project Requirements and Analysis Methodology}
|
||||
\section{Project Requirements and Analysis Methodology}\label{sec4}
|
||||
|
||||
The aim of this project is to repeat the analysis performed in 2015 on the
|
||||
dataset Google has released in 2019 in order to find similarities and
|
||||
|
@ -460,7 +473,7 @@ computing slowdown values given the previously computed execution attempt time
|
|||
deltas. Finally, the mean of the slowdown values is computed, resulting
|
||||
in the clear and concise tables found in Figure~\ref{fig:taskslowdown}.
|
||||
|
||||
\section{Analysis: Performance Input of Unsuccessful Executions}
|
||||
\section{Analysis: Performance Input of Unsuccessful Executions}\label{sec5}
|
||||
|
||||
Our first investigation focuses on replicating the analysis done by the paper of
|
||||
Ros\'a et al.~\cite{dsn-paper} regarding usage of machine time
|
||||
|
@ -667,7 +680,7 @@ With more than 98\% of both CPU and memory resources used by
|
|||
non-successful tasks, it is clear the spatial resource waste is high in the 2019
|
||||
traces.
|
||||
|
||||
\section{Analysis: Patterns of Task and Job Events}
|
||||
\section{Analysis: Patterns of Task and Job Events}\label{sec6}
|
||||
|
||||
This section aims to use some of the techniques used in Section IV of
|
||||
the Ros\'a et al.\ paper\cite{dsn-paper} to find patterns and interdependencies
|
||||
|
@ -801,84 +814,96 @@ one. For some clusters (namely B, C, and D), the mean number of \texttt{FAIL} a
|
|||
\texttt{KILL} task events for \texttt{FINISH}ed jobs is almost the same.
|
||||
Additionally, it is noteworthy that cluster A has no \texttt{EVICT}ed jobs.
|
||||
|
||||
% \section{Analysis: Potential Causes of Unsuccessful Executions}
|
||||
\section{Analysis: Potential Causes of Unsuccessful Executions}\label{sec7}
|
||||
|
||||
% The aim of this section is to analyze several task-level and job-level
|
||||
% parameters in order to find correlations with the success of an execution. By
|
||||
% using the techniques used in Section V of the Ros\'a et al.\
|
||||
% paper\cite{dsn-paper} we analyze
|
||||
% task events' metadata, the use of CPU and Memory resources at the task level,
|
||||
% and job metadata respectively in Section~\ref{fig7-section},
|
||||
% Section~\ref{fig8-section} and Section~\ref{fig9-section}.
|
||||
This section re-applies the techniques used in Section V of the Ros\'a et al.\
|
||||
paper\cite{dsn-paper} to find patterns and interdependencies
|
||||
between task and job events by gathering event statistics at those events. In
|
||||
particular, Section~\ref{tabIII-section} explores how the success of a
|
||||
task is inter-correlated with its own event patterns, which
|
||||
Section~\ref{figV-section} explores even further by computing task success
|
||||
probabilities based on the number of task termination events of a specific type.
|
||||
Finally, Section~\ref{tabIV-section} aims to find similar correlations, but at
|
||||
the job level.
|
||||
|
||||
% \subsection{Event rates vs.\ task priority, event execution time, and machine
|
||||
% concurrency.}\label{fig7-section}
|
||||
\section{Analysis: Potential Causes of Unsuccessful Executions}
|
||||
|
||||
% \input{figures/figure_7}
|
||||
The aim of this section is to analyze several task-level and job-level
|
||||
parameters in order to find correlations with the success of an execution. By
|
||||
using the techniques used in Section V of the Ros\'a et al.\
|
||||
paper\cite{dsn-paper} we analyze
|
||||
task events' metadata, the use of CPU and Memory resources at the task level,
|
||||
and job metadata respectively in Section~\ref{fig7-section},
|
||||
Section~\ref{fig8-section} and Section~\ref{fig9-section}.
|
||||
|
||||
% Refer to figures \ref{fig:figureVII-a}, \ref{fig:figureVII-b}, and
|
||||
% \ref{fig:figureVII-c}.
|
||||
\subsection{Event rates vs.\ task priority, event execution time, and machine
|
||||
concurrency.}\label{fig7-section}
|
||||
|
||||
% \textbf{Observations}:
|
||||
\input{figures/figure_7}
|
||||
|
||||
% \begin{itemize}
|
||||
% \item
|
||||
% No smooth curves in this figure either, unlike 2011 traces
|
||||
% \item
|
||||
% The behaviour of curves for 7a (priority) is almost the opposite of
|
||||
% 2011, i.e. in-between priorities have higher kill rates while
|
||||
% priorities at the extremum have lower kill rates. This could also be
|
||||
% due to the inherent distribution of job terminations;
|
||||
% \item
|
||||
% Event execution time curves are quite different than 2011, here it
|
||||
% seems there is a good correlation between short task execution times
|
||||
% and finish event rates, instead of the U shape curve in 2015 DSN
|
||||
% \item
|
||||
% In figure \ref{fig:figureVII-b} cluster behaviour seems quite uniform
|
||||
% \item
|
||||
% Machine concurrency seems to play little role in the event termination
|
||||
% distribution, as for all concurrency factors the kill rate is at 90\%.
|
||||
% \end{itemize}
|
||||
Refer to figures \ref{fig:figureVII-a}, \ref{fig:figureVII-b}, and
|
||||
\ref{fig:figureVII-c}.
|
||||
|
||||
% \subsection{Event Rates vs. Requested Resources, Resource Reservation, and
|
||||
% Resource Utilization}\label{fig8-section}
|
||||
% \input{figures/figure_8}
|
||||
\textbf{Observations}:
|
||||
|
||||
% Refer to Figure~\ref{fig:figureVIII-a}, Figure~\ref{fig:figureVIII-a-csts}
|
||||
% Figure~\ref{fig:figureVIII-b}, Figure~\ref{fig:figureVIII-b-csts}
|
||||
% Figure~\ref{fig:figureVIII-c}, Figure~\ref{fig:figureVIII-c-csts}
|
||||
% Figure~\ref{fig:figureVIII-d}, Figure~\ref{fig:figureVIII-d-csts}
|
||||
% Figure~\ref{fig:figureVIII-e}, Figure~\ref{fig:figureVIII-e-csts}
|
||||
% Figure~\ref{fig:figureVIII-f}, and Figure~\ref{fig:figureVIII-f-csts}.
|
||||
\begin{itemize}
|
||||
\item
|
||||
There are no smooth curves in this figure either, unlike in the 2011 traces
|
||||
\item
|
||||
The behaviour of curves for 7a (priority) is almost the opposite of
|
||||
2011, i.e. in-between priorities have higher kill rates while
|
||||
priorities at the extremes have lower kill rates. This could also be
|
||||
due to the inherent distribution of job terminations;
|
||||
\item
|
||||
Event execution time curves are quite different from 2011: here it
|
||||
seems there is a good correlation between short task execution times
|
||||
and finish event rates, instead of the U-shaped curve in the 2015 DSN paper
|
||||
\item
|
||||
In Figure~\ref{fig:figureVII-b}, cluster behaviour seems quite uniform
|
||||
\item
|
||||
Machine concurrency seems to play little role in the event termination
|
||||
distribution, as for all concurrency factors the kill rate is at 90\%.
|
||||
\end{itemize}
|
||||
|
||||
% \subsection{Job Rates vs. Job Size, Job Execution Time, and Machine Locality
|
||||
% }\label{fig9-section}
|
||||
% \input{figures/figure_9}
|
||||
\subsection{Event Rates vs. Requested Resources, Resource Reservation, and
|
||||
Resource Utilization}\label{fig8-section}
|
||||
\input{figures/figure_8}
|
||||
|
||||
% Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
|
||||
% \ref{fig:figureIX-c}.
|
||||
Refer to Figure~\ref{fig:figureVIII-a}, Figure~\ref{fig:figureVIII-a-csts},
|
||||
Figure~\ref{fig:figureVIII-b}, Figure~\ref{fig:figureVIII-b-csts},
|
||||
Figure~\ref{fig:figureVIII-c}, Figure~\ref{fig:figureVIII-c-csts},
|
||||
Figure~\ref{fig:figureVIII-d}, Figure~\ref{fig:figureVIII-d-csts},
|
||||
Figure~\ref{fig:figureVIII-e}, Figure~\ref{fig:figureVIII-e-csts},
|
||||
Figure~\ref{fig:figureVIII-f}, and Figure~\ref{fig:figureVIII-f-csts}.
|
||||
|
||||
% \textbf{Observations}:
|
||||
\subsection{Job Rates vs. Job Size, Job Execution Time, and Machine Locality
|
||||
}\label{fig9-section}
|
||||
\input{figures/figure_9}
|
||||
|
||||
% \begin{itemize}
|
||||
% \item
|
||||
% Behaviour between cluster varies a lot
|
||||
% \item
|
||||
% There are no ``smooth'' gradients in the various curves unlike in the
|
||||
% 2011 traces
|
||||
% \item
|
||||
% Killed jobs have higher event rates in general, and overall dominate
|
||||
% all event rates measures
|
||||
% \item
|
||||
% There still seems to be a correlation between short execution job
|
||||
% times and successful final termination, and likewise for kills and
|
||||
% higher job terminations
|
||||
% \item
|
||||
% Across all clusters, a machine locality factor of 1 seems to lead to
|
||||
% the highest success event rate
|
||||
% \end{itemize}
|
||||
Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
|
||||
\ref{fig:figureIX-c}.
|
||||
|
||||
\section{Conclusions, Future Work and Possible Developments}
|
||||
\textbf{Observations}:
|
||||
|
||||
\begin{itemize}
|
||||
\item
|
||||
Behaviour varies considerably between clusters
|
||||
\item
|
||||
There are no ``smooth'' gradients in the various curves unlike in the
|
||||
2011 traces
|
||||
\item
|
||||
Killed jobs have higher event rates in general, and overall dominate
|
||||
all event rate measures
|
||||
\item
|
||||
There still seems to be a correlation between short execution job
|
||||
times and successful final termination, and likewise for kills and
|
||||
higher job terminations
|
||||
\item
|
||||
Across all clusters, a machine locality factor of 1 seems to lead to
|
||||
the highest success event rate
|
||||
\end{itemize}
|
||||
|
||||
\section{Conclusions, Future Work and Possible Developments}\label{sec8}
|
||||
\textbf{TBD}
|
||||
|
||||
\newpage
|
||||
|
|