introduction done

This commit is contained in:
Claudio Maggioni 2021-06-17 17:07:13 +02:00
parent d1ae92f239
commit d02d46d4bc
2 changed files with 61 additions and 32 deletions

@ -76,37 +76,75 @@ In 2019 Google released an updated version of the \textit{Borg} cluster
traces\cite{google-marso-19}, not only containing data from a much larger
workload, owing to improvements in computational technology, but also providing
data from 8 different \textit{Borg} cells from datacenters located all over the
world.
\subsection{Motivation}
Even by glancing at some of the spatial and temporal analyses performed on the
Google Borg traces in this report, it is evident that unsuccessful executions
play a major role in the waste of resources in clusterized computations. For
example, Figure~\ref{fig:machinetimewaste-rel} shows the distribution of
machine time over ``tasks'' (i.e.\ executables running in Borg) with different
termination ``states'', of which \texttt{FINISH} is the only successful one. In
the 2011 Borg traces, more than half of the machine time is spent carrying out
non-successful executions, i.e.\ executing programs that eventually ``crash''
and potentially do not lead to useful results\footnote{This is only a
speculation, since both the 2011 and the 2019 traces only provide a ``black
box'' view of the Borg cluster system. Neither the accompanying papers for the
two traces\cite{google-marso-11}\cite{google-marso-19} nor the documentation
for the 2019 traces\cite{google-drive-marso} mention whether non-successful
tasks produce any useful result.}. The 2019 subplot paints an even darker
picture, with less than 5\% of machine time used for successful computation.
Given that even a major player in big data computation like Google struggles to
allocate computational resources efficiently, the impact of execution failures
is indeed significant and worthy of study. Given also the significance and data
richness of both trace packages, the analysis performed in this report can be
of interest for understanding the behaviour of failures in similar clusterized
systems, and could potentially be used to build predictive models to mitigate
or eliminate the resource impact of unsuccessful executions.
\subsection{Challenges}
Given that the new 2019 Google Borg cluster traces are about 100 times larger
than the 2011 ones, and given that the entire traces package has a non-trivial
size (approximately 8 TiB when compressed and stored in JSONL
format\cite{google-drive-marso}), the computations required to perform the
analysis we illustrate in this report cannot be performed with classical data
science techniques. A considerable amount of computational power was needed to
carry out the computations, involving at their peak 3 dedicated Apache Spark
servers over the span of 3 months. Additionally, the analysis scripts were
written to exploit parallel computing, most of the time following a
MapReduce-like structure.
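
As a purely illustrative example of this structure (a minimal sketch, not the
actual analysis code; the file path and the \texttt{type} and \texttt{duration}
field names are assumptions rather than the exact trace schema), a PySpark job
summing machine time per task termination state could look as follows:
\begin{verbatim}
import json
from operator import add

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("borg-sketch").getOrCreate()
sc = spark.sparkContext

# Map phase: parse each JSONL line and emit a
# (termination_state, machine_time) pair. The "type" and "duration"
# field names are illustrative, not the actual trace schema.
pairs = (sc.textFile("instance_events-*.json.gz")
           .map(json.loads)
           .map(lambda r: (r["type"], r["duration"])))

# Reduce phase: sum machine time per termination state.
for state, seconds in pairs.reduceByKey(add).collect():
    print(state, seconds)

spark.stop()
\end{verbatim}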
\subsection{Contribution}
This project aims to repeat the analysis performed in the 2015 DSN paper by
Ros\'a et al.\cite{dsn-paper} to highlight similarities and differences in the
Google Borg workload and in the behaviour and patterns of the executions within
it. Through this analysis, we aim to better understand the causes of failures
and how to prevent them. Additionally, given the technical challenge this
analysis posed, the report aims to provide an overview of some basic data
engineering techniques for big data applications.
\subsection{Outline}
The report is structured as follows. Section~\ref{sec2} contains information
about the current state of the art for Google Borg cluster traces.
Section~\ref{sec3} provides an overview of the data to analyze and its storage
format, including technical background information. Section~\ref{sec4}
discusses the project requirements and the data science methods used to
perform the analysis. Section~\ref{sec5}, Section~\ref{sec6} and
Section~\ref{sec7} present the results obtained while analyzing, respectively,
the performance impact of unsuccessful executions, the patterns of task and job
events, and the potential causes of unsuccessful executions. Finally,
Section~\ref{sec8} contains the conclusions.
\section{State of the art}\label{sec2}
\begin{figure}[t]
\begin{center}
\begin{tabular}{cc}
\toprule
\textbf{Cluster} & \textbf{Timezone} \\ \midrule
A & America/New York \\
B & America/Chicago \\
C & America/New York \\
@ -115,6 +153,7 @@ E & Europe/Helsinki \\
F & America/Chicago \\
G & Asia/Singapore \\
H & Europe/Brussels \\
\bottomrule
\end{tabular}
\end{center}
\caption{Approximate geographical location obtained from the datacenter's
@ -826,16 +865,6 @@ probabilities based on the number of task termination events of a specific type.
Finally, Section~\ref{tabIV-section} aims to find similar correlations, but at
the job level.
\section{Analysis: Potential Causes of Unsuccessful Executions}
The aim of this section is to analyze several task-level and job-level
parameters in order to find correlations with the success of an execution.
Using the techniques of Section V of the Ros\'a et al.\ paper\cite{dsn-paper},
we analyze task events' metadata, the use of CPU and memory resources at the
task level, and job metadata, respectively in Section~\ref{fig7-section},
Section~\ref{fig8-section} and Section~\ref{fig9-section}.
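
As a hypothetical sketch of the kind of per-parameter breakdown computed in
these sections (the Parquet export and the \texttt{priority} and \texttt{type}
column names are assumptions, not the exact trace schema), the rate of each
termination event type within each task priority group could be derived as
follows:
\begin{verbatim}
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("event-rates-sketch").getOrCreate()

# One row per task termination event; the Parquet path and the
# "priority"/"type" column names are assumptions for illustration.
events = spark.read.parquet("task_events.parquet")

# Count events per (priority, termination type), then normalize the
# counts within each priority group to obtain per-priority event rates.
counts = events.groupBy("priority", "type").count()
per_priority = Window.partitionBy("priority")
rates = counts.withColumn(
    "rate", F.col("count") / F.sum("count").over(per_priority))

rates.orderBy("priority", "type").show()

spark.stop()
\end{verbatim}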
\subsection{Event rates vs.\ task priority, event execution time, and machine
concurrency}\label{fig7-section}
@ -907,7 +936,7 @@ Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
\textbf{TBD}
\newpage
\printbibliography%
\end{document}
% vim: set ts=2 sw=2 et tw=80: