introduction done
parent 2752ad249f
commit a05bd53fe6
2 changed files with 61 additions and 32 deletions
Binary file not shown.
@@ -76,37 +76,75 @@ In 2019 Google released an updated version of the \textit{Borg} cluster

traces\cite{google-marso-19}, not only containing data from a far bigger
workload due to improvements in computational technology, but also providing
data from 8 different \textit{Borg} cells from datacenters located all over the
world. These new traces are therefore about 100 times larger than the old
traces, weighing approximately 8~TiB in terms of storage space (when compressed
and stored in JSONL format)\cite{google-drive-marso}. Analyzing them requires a
considerable amount of computational power and the implementation of special
data engineering techniques.

\subsection{Motivation}

Even a glance at some of the spatial and temporal analyses performed on the
Google Borg traces in this report makes it evident that unsuccessful executions
play a major role in the waste of resources in clusterized computations. For
example, Figure~\ref{fig:machinetimewaste-rel} shows the distribution of
machine time over ``tasks'' (i.e.\ executables running in Borg) with different
termination ``states'', of which \texttt{FINISH} is the only successful one. For
the 2011 Borg traces, more than half of the machine time is invested in
carrying out non-successful executions, i.e.\ in executing programs that would
eventually ``crash'' and potentially not lead to useful results\footnote{This
is only a speculation, since both the 2011 and the 2019 traces provide only a
``black box'' view of the Borg cluster system. Neither the accompanying
papers for the two traces\cite{google-marso-11}\cite{google-marso-19} nor the
documentation for the 2019 traces\cite{google-drive-marso} ever mentions whether
non-successful tasks produce any useful result.}. The 2019 subplot paints an
even darker picture, with less than 5\% of machine time used for successful
computation.
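As a reference for how such a distribution can be computed, the following is a
minimal sketch, assuming PySpark and an illustrative per-task table with start
time, end time and termination state columns (the column names, and the states
other than \texttt{FINISH}, are illustrative rather than the actual trace
schema):

\begin{verbatim}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder \
    .appName("machine-time-sketch").getOrCreate()

# Toy stand-in: one row per task execution, with
# start/end timestamps and the termination state.
tasks = spark.createDataFrame(
    [(0, 100, "FINISH"), (0, 400, "FAIL"),
     (100, 150, "KILL")],
    ["start", "end", "state"])

# Machine time per termination state: sum of the
# execution durations, grouped by final state.
machine_time = (tasks
    .withColumn("duration",
                F.col("end") - F.col("start"))
    .groupBy("state")
    .agg(F.sum("duration").alias("machine_time")))
machine_time.show()
\end{verbatim}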

This project aims to repeat the analysis performed in 2015 to highlight the
similarities and differences in workload this decade brought, and to expand the
old analysis in order to better understand the causes of failures and how to
prevent them. Additionally, this report provides an overview of the data
engineering techniques used to perform the queries and analyses on the 2019
traces. Given that even a major player in big data computation like Google
struggles to allocate computational resources efficiently, the impact of
execution failures is indeed significant and worthy of study. Given also the
significance and data richness of both trace packages, the analysis performed
in this report can be of interest for understanding the behaviour of failures
in similar clusterized systems, and could potentially be used to build
predictive models that mitigate or eliminate the resource impact of
unsuccessful executions.

\subsection{Challenges}

Given that the new 2019 Google Borg cluster traces are about 100 times larger
than the 2011 ones, and given that the entire compressed traces package has a
non-trivial size (weighing approximately 8~TiB\cite{google-drive-marso}), the
computations required to perform the analysis we illustrate in this report
cannot be performed with classical data science techniques. A considerable
amount of computational power was needed to carry out the computations,
involving at their peak 3 dedicated Apache Spark servers over the span of
3 months. Additionally, the analysis scripts were written to exploit the power
of parallel computing, most of the time following a MapReduce-like structure.
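As a minimal sketch of this MapReduce-like structure, the following assumes
PySpark and uses illustrative file and field names (the actual trace schema is
documented with the 2019 traces\cite{google-drive-marso}); it counts events per
termination type over the compressed JSONL files:

\begin{verbatim}
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("borg-trace-sketch").getOrCreate()

# Spark reads gzip-compressed JSONL transparently,
# one JSON object per line. The file pattern and
# the "type" field below are illustrative.
lines = spark.sparkContext.textFile(
    "instance_events-*.json.gz")

# Map phase: one (termination type, 1) pair per event.
pairs = lines.map(
    lambda l: (json.loads(l).get("type"), 1))

# Reduce phase: count events per termination type.
counts = pairs.reduceByKey(lambda a, b: a + b)
print(counts.collect())
\end{verbatim}

The map phase extracts a key from each record and the reduce phase aggregates
per key; most of our queries follow this shape.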

\subsection{Contribution}

This project aims to repeat the analysis performed in the 2015 DSN paper by
Ros\`a et al.\cite{dsn-paper} to highlight similarities and differences in the
Google Borg workload and in the behaviour and patterns of executions within it.
Thanks to this analysis, we aim to understand even better the causes of
failures and how to prevent them. Additionally, given the technical challenge
this analysis posed, the report aims to provide an overview of some basic data
engineering techniques for big data applications.
\subsection{Outline}

The report is structured as follows. Section~\ref{sec2} contains information
about the current state of the art for Google Borg cluster traces.
Section~\ref{sec3} provides an overview of the data to analyze and its storage
format, including technical background information. Section~\ref{sec4}
discusses the project requirements and the data science methods used to perform
the analysis. Section~\ref{sec5}, Section~\ref{sec6} and Section~\ref{sec7}
show the results obtained while analyzing, respectively, the performance impact
of unsuccessful executions, the patterns of task and job events, and the
potential causes of unsuccessful executions. Finally, Section~\ref{sec8}
contains the conclusions.

\section{State of the art}\label{sec2}

\begin{figure}[t]
\begin{center}
\begin{tabular}{cc}
\toprule
\textbf{Cluster} & \textbf{Timezone} \\ \midrule
A & America/New York \\
B & America/Chicago \\
C & America/New York \\

@@ -115,6 +153,7 @@ E & Europe/Helsinki \\
F & America/Chicago \\
G & Asia/Singapore \\
H & Europe/Brussels \\
\bottomrule
\end{tabular}
\end{center}
\caption{Approximate geographical location obtained from the datacenter's

@@ -826,16 +865,6 @@ probabilities based on the number of task termination events of a specific type.

Finally, Section~\ref{tabIV-section} aims to find similar correlations, but at
the job level.

\section{Analysis: Potential Causes of Unsuccessful Executions}

The aim of this section is to analyze several task-level and job-level
parameters in order to find correlations with the success of an execution. By
using the techniques of Section V of the Ros\`a et al.\ paper\cite{dsn-paper},
we analyze task events' metadata, the use of CPU and memory resources at the
task level, and job metadata, respectively in Section~\ref{fig7-section},
Section~\ref{fig8-section} and Section~\ref{fig9-section}.
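As an illustration of the kind of correlation query used here, the following
minimal sketch, assuming PySpark and illustrative column names, computes the
share of each termination type within every priority class:

\begin{verbatim}
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder \
    .appName("event-rates-sketch").getOrCreate()

# Toy stand-in for the task event table; the real
# analysis reads the parsed trace DataFrames.
task_events = spark.createDataFrame(
    [(0, "FINISH"), (0, "FAIL"),
     (0, "FAIL"), (103, "KILL")],
    ["priority", "type"])

per_priority = Window.partitionBy("priority")
rates = (task_events
    .groupBy("priority", "type")
    .agg(F.count("*").alias("n"))
    # rate = share of each termination type
    # within its priority class
    .withColumn("rate",
        F.col("n") / F.sum("n").over(per_priority)))
rates.show()
\end{verbatim}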

\subsection{Event rates vs.\ task priority, event execution time, and machine
concurrency}\label{fig7-section}

@@ -907,7 +936,7 @@ Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and

\textbf{TBD}

\newpage
\printbibliography%

\end{document}

% vim: set ts=2 sw=2 et tw=80: