introduction done
parent 2752ad249f, commit a05bd53fe6
2 changed files with 61 additions and 32 deletions
Binary file not shown.
@@ -76,37 +76,75 @@ In 2019 Google released an updated version of the \textit{Borg} cluster
traces\cite{google-marso-19}, not only containing data from a far bigger
workload due to improvements in computational technology, but also providing
data from 8 different \textit{Borg} cells from datacenters located all over the
world.

\subsection{Motivation}

Even a glance at some of the spatial and temporal analyses performed on the
Google Borg traces in this report makes it evident that unsuccessful executions
play a major role in the waste of resources in clusterized computations. For
example, Figure~\ref{fig:machinetimewaste-rel} shows the distribution of
machine time over ``tasks'' (i.e.\ executables running in Borg) with different
termination ``states'', of which \texttt{FINISH} is the only successful one. For
the 2011 Borg traces, more than half of the machine time is invested in
carrying out unsuccessful executions, i.e.\ executing programs that eventually
``crash'' and potentially do not lead to useful results\footnote{This is only
speculation, since both the 2011 and the 2019 traces provide only a ``black
box'' view of the Borg cluster system. Neither the accompanying papers for the
two traces\cite{google-marso-11}\cite{google-marso-19} nor the documentation
for the 2019 traces\cite{google-drive-marso} ever mentions whether
unsuccessful tasks produce any useful result.}. The 2019 subplot paints an
even darker picture, with less than 5\% of machine time used for successful
computation.
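
As an illustration of the kind of aggregation behind
Figure~\ref{fig:machinetimewaste-rel}, the following PySpark sketch computes
the total machine time spent in each termination state. It is a minimal,
hypothetical sketch and not the actual analysis script: the input path, the
field names (\texttt{time}, \texttt{type}, \texttt{collection\_id},
\texttt{instance\_index}) and the numeric encoding of event types are
assumptions about the trace schema.

\begin{verbatim}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("machine-time-per-state")
    .getOrCreate())

SCHEDULE = 3                       # assumed event-type encoding
TERM = {4: "EVICT", 5: "FAIL", 6: "FINISH",
        7: "KILL", 8: "LOST"}

# Hypothetical path to the instance-level event records.
events = spark.read.json("instance_events/*.json.gz")

def durations(task_events):
    # Pair each SCHEDULE event with the next termination
    # event of the same task, yielding (state, elapsed time).
    start = None
    for e in sorted(task_events, key=lambda e: e["time"]):
        if e["type"] == SCHEDULE:
            start = e["time"]
        elif e["type"] in TERM and start is not None:
            yield (TERM[e["type"]], e["time"] - start)
            start = None

per_state = (events.rdd
    .map(lambda e: ((e["collection_id"],
                     e["instance_index"]), e))
    .groupByKey()                  # all events of one task
    .flatMap(lambda kv: durations(kv[1]))
    .reduceByKey(lambda a, b: a + b))

print(per_state.collect())
\end{verbatim}

Normalizing these per-state totals by the overall machine time yields the
relative distribution shown in the figure.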

Given that even a major player in big data computation like Google struggles
to allocate computational resources efficiently, the impact of execution
failures is significant and worthy of study. Given also the significance and
data richness of both trace packages, the analysis performed in this report
can be of interest for understanding the behaviour of failures in similar
clusterized systems, and could potentially be used to build predictive models
that mitigate or eliminate the resource impact of unsuccessful executions.

\subsection{Challenges}

Given that the new 2019 Google Borg cluster traces are about 100 times larger
than the 2011 ones, and given that the entire compressed traces package has a
non-trivial size of approximately 8 TiB\cite{google-drive-marso}, the
computations required to perform the analysis we illustrate in this report
cannot be performed with classical data science techniques. A considerable
amount of computational power was needed to carry out the computations,
involving at their peak 3 dedicated Apache Spark servers over the span of 3
months. Additionally, the analysis scripts were written to exploit parallel
computing, following most of the time a MapReduce-like structure.
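
A script in this style first connects to the dedicated Spark cluster and then
streams the compressed JSONL files through a map and a reduce stage, so that
no single worker ever has to materialize the full 8 TiB. The snippet below is
a sketch under assumed names: the master URL, executor sizing, file path and
\texttt{type} column are placeholders, not the real deployment values.

\begin{verbatim}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("borg-trace-analysis")
    .master("spark://spark-master:7077")  # placeholder URL
    .config("spark.executor.memory", "32g")
    .getOrCreate())

# Spark decompresses gzipped JSONL transparently and assigns
# files to partitions, so the dataset is processed in chunks
# that fit in memory instead of being loaded all at once.
events = spark.read.json("traces/instance_events-*.json.gz")

# MapReduce-like structure: map each record to a key-value
# pair, then reduce by key across the cluster.
counts = (events.rdd
    .map(lambda e: (e["type"], 1))
    .reduceByKey(lambda a, b: a + b)
    .collect())
print(counts)
\end{verbatim}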

\subsection{Contribution}

This project aims to repeat the analysis performed in the 2015 DSN paper by
Ros\`a et al.\cite{dsn-paper} to highlight similarities and differences in the
Google Borg workload and in the behaviour and patterns of the executions
within it. Thanks to this analysis, we aim to better understand the causes of
failures and how to prevent them. Additionally, given the technical challenge
this analysis posed, the report aims to provide an overview of some basic
data engineering techniques for big data applications.

\subsection{Outline}

The report is structured as follows. Section~\ref{sec2} contains information
about the current state of the art for the Google Borg cluster traces.
Section~\ref{sec3} provides an overview, including technical background
information, of the data to analyze and its storage format. Section~\ref{sec4}
discusses the project requirements and the data science methods used to
perform the analysis. Section~\ref{sec5}, Section~\ref{sec6} and
Section~\ref{sec7} show the results obtained in the analysis of, respectively,
the performance impact of unsuccessful executions, the patterns of task and
job events, and the potential causes of unsuccessful executions. Finally,
Section~\ref{sec8} contains the conclusions.


\section{State of the art}\label{sec2}

\begin{figure}[t]
\begin{center}
\begin{tabular}{cc}
\toprule
\textbf{Cluster} & \textbf{Timezone} \\ \midrule
A & America/New York \\
B & America/Chicago \\
C & America/New York \\
@@ -115,6 +153,7 @@ E & Europe/Helsinki \\
F & America/Chicago \\
G & Asia/Singapore \\
H & Europe/Brussels \\
\bottomrule
\end{tabular}
\end{center}
\caption{Approximate geographical location obtained from the datacenter's
@@ -826,16 +865,6 @@ probabilities based on the number of task termination events of a specific type.
Finally, Section~\ref{tabIV-section} aims to find similar correlations, but at
the job level.

\subsection{Event rates vs.\ task priority, event execution time, and machine
concurrency.}\label{fig7-section}

@@ -907,7 +936,7 @@ Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
\textbf{TBD}

\newpage
\printbibliography%

\end{document}

% vim: set ts=2 sw=2 et tw=80: