introduction done
parent 2752ad249f
commit a05bd53fe6
2 changed files with 61 additions and 32 deletions
Binary file not shown.
@@ -76,37 +76,75 @@ In 2019 Google released an updated version of the \textit{Borg} cluster

traces\cite{google-marso-19}, not only containing data from a far bigger
workload due to improvements in computational technology, but also providing
data from 8 different \textit{Borg} cells from datacenters located all over the
world. These new traces are therefore about 100 times larger than the old
traces, weighing approximately 8~TiB in terms of storage space (when compressed
and stored in JSONL format)\cite{google-drive-marso}. Analyzing them requires a
considerable amount of computational power and the implementation of special
data engineering techniques.

\subsection{Motivation}

Even a glance at some of the spatial and temporal analyses performed on the
Google Borg traces in this report makes it evident that unsuccessful executions
play a major role in the waste of resources in clusterized computations. For
example, Figure~\ref{fig:machinetimewaste-rel} shows the distribution of
machine time over ``tasks'' (i.e.\ executables running in Borg) with different
termination ``states'', of which \texttt{FINISH} is the only successful one. For
the 2011 Borg traces, more than half of the machine time is invested in
carrying out non-successful executions, i.e.\ in executing programs that would
eventually ``crash'' and potentially not lead to useful results\footnote{This
is only a speculation, since both the 2011 and the 2019 traces provide only a
``black box'' view of the Borg cluster system. Neither the accompanying
papers for the two traces\cite{google-marso-11}\cite{google-marso-19} nor the
documentation for the 2019 traces\cite{google-drive-marso} ever mentions whether
non-successful tasks produce any useful result.}. The 2019 subplot paints an
even darker picture, with less than 5\% of machine time used for successful
computation.
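As a reference for how such a distribution can be computed, the following is a
minimal sketch, assuming PySpark and an illustrative per-task table with start
time, end time and termination state columns (the column names, and the states
other than \texttt{FINISH}, are illustrative rather than the actual trace
schema):

\begin{verbatim}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder \
    .appName("machine-time-sketch").getOrCreate()

# Toy stand-in: one row per task execution, with
# start/end timestamps and the termination state.
tasks = spark.createDataFrame(
    [(0, 100, "FINISH"), (0, 400, "FAIL"),
     (100, 150, "KILL")],
    ["start", "end", "state"])

# Machine time per termination state: sum of the
# execution durations, grouped by final state.
machine_time = (tasks
    .withColumn("duration",
                F.col("end") - F.col("start"))
    .groupBy("state")
    .agg(F.sum("duration").alias("machine_time")))
machine_time.show()
\end{verbatim}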

This project aims to repeat the analysis performed in 2015 to highlight the
similarities and differences in workload this decade brought, and to expand the
old analysis in order to better understand the causes of failures and how to
prevent them. Additionally, this report provides an overview of the data
engineering techniques used to perform the queries and analyses on the 2019
traces. Given that even a major player in big data computation like Google
struggles to allocate computational resources efficiently, the impact of
execution failures is indeed significant and worthy of study. Given also the
significance and data richness of both trace packages, the analysis performed
in this report can be of interest for understanding the behaviour of failures
in similar clusterized systems, and could potentially be used to build
predictive models that mitigate or eliminate the resource impact of
unsuccessful executions.

\subsection{Challenges}

Given that the new 2019 Google Borg cluster traces are about 100 times larger
than the 2011 ones, and given that the entire compressed traces package has a
non-trivial size (weighing approximately 8~TiB\cite{google-drive-marso}), the
computations required to perform the analysis we illustrate in this report
cannot be performed with classical data science techniques. A considerable
amount of computational power was needed to carry out the computations,
involving at their peak 3 dedicated Apache Spark servers over the span of
3 months. Additionally, the analysis scripts were written to exploit the power
of parallel computing, most of the time following a MapReduce-like structure.
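As a minimal sketch of this MapReduce-like structure, the following assumes
PySpark and uses illustrative file and field names (the actual trace schema is
documented with the 2019 traces\cite{google-drive-marso}); it counts events per
termination type over the compressed JSONL files:

\begin{verbatim}
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("borg-trace-sketch").getOrCreate()

# Spark reads gzip-compressed JSONL transparently,
# one JSON object per line. The file pattern and
# the "type" field below are illustrative.
lines = spark.sparkContext.textFile(
    "instance_events-*.json.gz")

# Map phase: one (termination type, 1) pair per event.
pairs = lines.map(
    lambda l: (json.loads(l).get("type"), 1))

# Reduce phase: count events per termination type.
counts = pairs.reduceByKey(lambda a, b: a + b)
print(counts.collect())
\end{verbatim}

The map phase extracts a key from each record and the reduce phase aggregates
per key; most of our queries follow this shape.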

\subsection{Contribution}

This project aims to repeat the analysis performed in the 2015 DSN paper by
Ros\`a et al.\cite{dsn-paper} to highlight similarities and differences in the
Google Borg workload and in the behaviour and patterns of executions within it.
Thanks to this analysis, we aim to understand even better the causes of
failures and how to prevent them. Additionally, given the technical challenge
this analysis posed, the report aims to provide an overview of some basic data
engineering techniques for big data applications.
\subsection{Outline}

The report is structured as follows. Section~\ref{sec2} contains information
about the current state of the art for Google Borg cluster traces.
Section~\ref{sec3} provides an overview of the data to analyze and its storage
format, including technical background information. Section~\ref{sec4}
discusses the project requirements and the data science methods used to perform
the analysis. Section~\ref{sec5}, Section~\ref{sec6} and Section~\ref{sec7}
show the results obtained while analyzing, respectively, the performance impact
of unsuccessful executions, the patterns of task and job events, and the
potential causes of unsuccessful executions. Finally, Section~\ref{sec8}
contains the conclusions.

\section{State of the art}\label{sec2}

\begin{figure}[t]
\begin{center}
\begin{tabular}{cc}
\toprule
\textbf{Cluster} & \textbf{Timezone} \\ \midrule
A & America/New York \\
B & America/Chicago \\
C & America/New York \\

@@ -115,6 +153,7 @@ E & Europe/Helsinki \\
F & America/Chicago \\
G & Asia/Singapore \\
H & Europe/Brussels \\
\bottomrule
\end{tabular}
\end{center}
\caption{Approximate geographical location obtained from the datacenter's

@@ -826,16 +865,6 @@ probabilities based on the number of task termination events of a specific type.

Finally, Section~\ref{tabIV-section} aims to find similar correlations, but at
the job level.

\section{Analysis: Potential Causes of Unsuccessful Executions}

The aim of this section is to analyze several task-level and job-level
parameters in order to find correlations with the success of an execution. By
using the techniques of Section V of the Ros\`a et al.\ paper\cite{dsn-paper},
we analyze task events' metadata, the use of CPU and memory resources at the
task level, and job metadata, respectively in Section~\ref{fig7-section},
Section~\ref{fig8-section} and Section~\ref{fig9-section}.
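As an illustration of the kind of correlation query used here, the following
minimal sketch, assuming PySpark and illustrative column names, computes the
share of each termination type within every priority class:

\begin{verbatim}
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder \
    .appName("event-rates-sketch").getOrCreate()

# Toy stand-in for the task event table; the real
# analysis reads the parsed trace DataFrames.
task_events = spark.createDataFrame(
    [(0, "FINISH"), (0, "FAIL"),
     (0, "FAIL"), (103, "KILL")],
    ["priority", "type"])

per_priority = Window.partitionBy("priority")
rates = (task_events
    .groupBy("priority", "type")
    .agg(F.count("*").alias("n"))
    # rate = share of each termination type
    # within its priority class
    .withColumn("rate",
        F.col("n") / F.sum("n").over(per_priority)))
rates.show()
\end{verbatim}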

\subsection{Event rates vs.\ task priority, event execution time, and machine
concurrency}\label{fig7-section}

@@ -907,7 +936,7 @@ Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and

\textbf{TBD}

\newpage
\printbibliography%

\end{document}

% vim: set ts=2 sw=2 et tw=80: