diff --git a/report/Claudio_Maggioni_report.pdf b/report/Claudio_Maggioni_report.pdf
index b0c4be14..e847ef0c 100644
Binary files a/report/Claudio_Maggioni_report.pdf and b/report/Claudio_Maggioni_report.pdf differ
diff --git a/report/Claudio_Maggioni_report.tex b/report/Claudio_Maggioni_report.tex
index 15947841..fd63a3d2 100644
--- a/report/Claudio_Maggioni_report.tex
+++ b/report/Claudio_Maggioni_report.tex
@@ -76,37 +76,75 @@
 In 2019 Google released an updated version of the \textit{Borg} cluster
 traces\cite{google-marso-19}, not only containing data from a far bigger
 workload due to improvements in computational technology, but also providing
 data from 8 different \textit{Borg} cells from datacenters located all over the
-world. These new traces are therefore about 100 times larger than the old
-traces, weighing in terms of storage spaces approximately 8TiB (when compressed
-and stored in JSONL format)\cite{google-drive-marso}, requiring a considerable
-amount of computational power to analyze them and the implementation of special
-data engineering techniques for analysis of the data.
+world.
+
+\subsection{Motivation}
+Even a glance at some of the spatial and temporal analyses performed on the
+Google Borg traces in this report makes it evident that unsuccessful executions
+play a major role in the waste of resources in clustered computations. For
+example, Figure~\ref{fig:machinetimewaste-rel} shows the distribution of
+machine time over ``tasks'' (i.e.\ executables running in Borg) with different
+termination ``states'', of which \texttt{FINISH} is the only successful one. In
+the 2011 Borg traces, more than half of the machine time is spent carrying out
+non-successful executions, i.e.\ executing programs that eventually ``crash''
+and potentially produce no useful results\footnote{This is only speculation,
+since both the 2011 and the 2019 traces provide only a ``black box'' view of
+the Borg cluster system. Neither the accompanying papers for the two
+traces\cite{google-marso-11}\cite{google-marso-19} nor the documentation for
+the 2019 traces\cite{google-drive-marso} ever mentions whether non-successful
+tasks produce any useful result.}. The 2019 subplot paints an even darker
+picture, with less than 5\% of machine time used for successful computation.
-This project aims to repeat the analysis performed in 2015 to highlight
-similarities and differences in workload this decade brought, and expanding the
-old analysis to understand even better the causes of failures and how to prevent
-them. Additionally, this report provides an overview of the data engineering
-techniques used to perform the queries and analyses on the 2019 traces.
+
+Given that even a major player in big data computation like Google struggles to
+allocate computational resources efficiently, the impact of execution failures
+is significant and worthy of study. Given also the significance and data
+richness of both trace packages, the analysis performed in this report can be
+of interest for understanding the behaviour of failures in similar clustered
+systems, and could potentially be used to build predictive models that mitigate
+or eliminate the resource impact of unsuccessful executions.
+
+\subsection{Challenges}
+Given that the new 2019 Google Borg cluster traces are about 100 times larger
+than the 2011 ones, and that the entire compressed traces package has a
+non-trivial size (approximately 8 TiB\cite{google-drive-marso}), the
+computations required for the analysis we illustrate in this report cannot be
+performed with classical data science techniques. A considerable amount of
+computational power was needed, involving at its peak 3 dedicated Apache Spark
+servers over a span of 3 months. Additionally, the analysis scripts were
+written to exploit parallel computing, most of the time following a
+MapReduce-like structure.
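+
+To give a concrete idea of this structure, below is a minimal sketch of how an
+analysis such as the machine time distribution by termination state of
+Figure~\ref{fig:machinetimewaste-rel} might be expressed in PySpark; the input
+path and the column names are hypothetical:
+
+\begin{verbatim}
+from pyspark.sql import SparkSession
+
+spark = (SparkSession.builder
+         .appName("machine-time-by-state")
+         .getOrCreate())
+
+# Read the compressed JSONL trace files
+# (path and column names are hypothetical).
+events = spark.read.json("/traces/task_events/*.json.gz")
+
+# Map: emit one (termination state, machine time) pair per
+# task record. Reduce: sum machine time per termination state.
+totals = (events.rdd
+          .map(lambda r: (r["term_state"], r["machine_time"]))
+          .reduceByKey(lambda a, b: a + b)
+          .collect())
+\end{verbatim}
+
+Spark distributes both the map and the reduce phases across the cluster, which
+is what makes this kind of computation tractable at the scale of the 2019
+traces.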
+
+\subsection{Contribution}
+This project aims to repeat the analysis performed in the 2015 DSN paper by
+Ros\'a et al.\cite{dsn-paper} to highlight similarities and differences in the
+Google Borg workload and in the behaviour and patterns of executions within it.
+Through this analysis, we aim to better understand the causes of failures and
+how to prevent them. Additionally, given the technical challenge this analysis
+posed, the report aims to provide an overview of some basic data engineering
+techniques for big data applications.
 
 \subsection{Outline}
-The report is structured as follows. Section~\ref{sec2} contains information about the
-current state of the art for Google Borg cluster traces. Section~\ref{sec3}
-provides an overview including technical background information on the data to
-analyze and its storage format. Section~\ref{sec4} will discuss about the
-project requirements and the data science methods used to perform the analysis.
-Section~\ref{sec5}, Section~\ref{sec6} and Section~\ref{sec7} show the result
-obtained while analyzing, respectively the performance input of
-unsuccessful executions, the patterns of task and job events, and the potential
-causes of unsuccessful executions. Finally, Section~\ref{sec8} contains the
-conclusions.
+The report is structured as follows. Section~\ref{sec2} contains information
+about the current state of the art for Google Borg cluster traces.
+Section~\ref{sec3} provides an overview including technical background
+information on the data to analyze and its storage format. Section~\ref{sec4}
+discusses the project requirements and the data science methods used to perform
+the analysis. Section~\ref{sec5}, Section~\ref{sec6} and Section~\ref{sec7}
+show the results obtained while analyzing, respectively, the performance impact
+of unsuccessful executions, the patterns of task and job events, and the
+potential causes of unsuccessful executions. Finally, Section~\ref{sec8}
+contains the conclusions.
+
 \section{State of the art}\label{sec2}
 
 \begin{figure}[t]
 \begin{center}
 \begin{tabular}{cc}
-\textbf{Cluster} & \textbf{Timezone} \\ \hline
+\toprule
+\textbf{Cluster} & \textbf{Timezone} \\ \midrule
 A & America/New York \\
 B & America/Chicago \\
 C & America/New York \\
@@ -115,6 +153,7 @@ E & Europe/Helsinki \\
 F & America/Chicago \\
 G & Asia/Singapore \\
 H & Europe/Brussels \\
+\bottomrule
 \end{tabular}
 \end{center}
 \caption{Approximate geographical location obtained from the datacenter's
@@ -826,16 +865,6 @@
 probabilities based on the number of task termination events of a specific
 type. Finally, Section~\ref{tabIV-section} aims to find similar correlations,
 but at the job level.
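+
+The event-rate analyses that follow reduce to the same normalization pattern.
+As a rough sketch, assuming a DataFrame \texttt{events} of task termination
+events with hypothetical columns \texttt{priority} and \texttt{type}, the
+empirical conditional probability $P(\mathrm{type} \mid \mathrm{priority})$
+can be computed as:
+
+\begin{verbatim}
+from pyspark.sql import functions as F
+
+# Count termination events per (priority, type), then
+# normalize by the per-priority total to obtain the
+# empirical conditional probability P(type | priority).
+counts = events.groupBy("priority", "type").count()
+totals = (counts.groupBy("priority")
+          .agg(F.sum("count").alias("total")))
+probs = (counts.join(totals, "priority")
+         .withColumn("p", F.col("count") / F.col("total")))
+\end{verbatim}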
-\section{Analysis: Potential Causes of Unsuccessful Executions}
-
-The aim of this section is to analyze several task-level and job-level
-parameters in order to find correlations with the success of an execution. By
-using the tecniques used in Section V of the Rosa\' et al.\
-paper\cite{dsn-paper} we analyze
-task events' metadata, the use of CPU and Memory resources at the task level,
-and job metadata respectively in Section~\ref{fig7-section},
-Section~\ref{fig8-section} and Section~\ref{fig9-section}.
-
 \subsection{Event rates vs.\ task priority, event execution time, and
 machine concurrency.}\label{fig7-section}
 
@@ -907,7 +936,7 @@
 Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
 \textbf{TBD}
 
 \newpage
-\printbibliography
+\printbibliography%
 
 \end{document}
 
 % vim: set ts=2 sw=2 et tw=80: