diff --git a/report/Claudio_Maggioni_report.pdf b/report/Claudio_Maggioni_report.pdf
index 959ac14f..5c31684e 100644
Binary files a/report/Claudio_Maggioni_report.pdf and b/report/Claudio_Maggioni_report.pdf differ
diff --git a/report/Claudio_Maggioni_report.tex b/report/Claudio_Maggioni_report.tex
index becac66d..6059715e 100644
--- a/report/Claudio_Maggioni_report.tex
+++ b/report/Claudio_Maggioni_report.tex
@@ -97,6 +97,20 @@ old analysis to understand even better the causes of failures and how to prevent
 them. Additionally, this report will provide an overview on the data engineering
 techniques used to perform the queries and analyses on the 2019 traces.
 
+\section{State of the art}
+
+\textbf{TBD (introduce only 2015 dsn paper)}
+
+In 2015, Dr.~Andrea Rosà et al.\ published a
+research paper titled \textit{Understanding the Dark Side of Big Data Clusters:
+An Analysis beyond Failures}\cite{vino-paper} in which they performed several
+analysis on unsuccessful executions in the Google's 2011 Borg cluster traces
+with the aim of identifying their resource waste, their impacts on the
+performance of the application, and any causes that may lie behind such
+failures. The salient conclusion of that research is that actually lots of
+computations performed by Google would eventually end in failure, then leading
+to large amounts of computational power being wasted.
+
 \section{Background information}
 
 \textit{Borg} is Google's own cluster management software able to run
@@ -131,33 +145,12 @@ In general events can be of two kinds, there are events that are relative to the
 status of the schedule, and there are other events that are relative to the
 status of a task itself.
 
-% \subsection{Rosà et al.~2015 DSN paper}
-
-In 2015, Dr.~Andrea Rosà, Lydia Y. Chen and Prof.~Walter Binder published a
-research paper titled \textit{Understanding the Dark Side of Big Data Clusters:
-An Analysis beyond Failures}\cite{vino-paper} in which they performed several
-analysis on unsuccessful executions in the Google's 2011 Borg cluster traces
-with the aim of identifying their resource waste, their impacts on the
-performance of the application, and any causes that may lie behind such
-failures. The salient conclusion of that research is that actually lots of
-computations performed by Google would eventually end in failure, then leading
-to large amounts of computational power being wasted.
-
 \begin{figure}[h]
 \begin{center}
 \begin{tabular}{p{3cm}p{12cm}}
 \toprule
 \textbf{Type code} & \textbf{Description} \\
 \midrule
-% SUGGERIMENTO, NON CANCELLARE MAI, A MENO CHE NON SONO COSE COMPLETAMENTE
-% INUTILI, IN MOLTI CASI VA BENE COMMENTARE, INTANTO NON INFLUISCONO CON LA
-% COMPILAZIONE.
-% \texttt{QUEUE} & The job or task was marked not eligible for scheduling
-% by Borg's scheduler, and thus Borg will move the job/task in a long
-% wait queue\\
-% \texttt{SUBMIT} & The job or task was submitted to Borg for execution\\
-% \texttt{ENABLE} & The job or task became eligible for scheduling\\
-% \texttt{SCHEDULE} & The job or task's execution started\\
 \texttt{EVICT} & The job or task was terminated in order to free
 computational resources for an higher priority job\\
 \texttt{FAIL} & The job or task terminated its execution unsuccesfully
@@ -165,15 +158,6 @@ to large amounts of computational power being wasted.
 \texttt{FINISH} & The job or task terminated succesfully\\
 \texttt{KILL} & The job or task terminated its execution because of a manual
 request to stop it\\
-% \texttt{LOST} & It is assumed a job or task is has been terminated, but
-% due to missing data there is insufficent information to identify when
-% or how\\
-% \texttt{UPDATE\_PENDING} & The metadata (scheduling class, resource
-% requirements, \ldots) of the job/task was updated while the job was
-% waiting to be scheduled\\
-% \texttt{UPDATE\_RUNNING} & The metadata (scheduling class, resource
-% requirements, \ldots) of the job/task was updated while the job was in
-% execution\\
 \bottomrule
 \end{tabular}
 \end{center}
@@ -259,12 +243,9 @@ science technologies like Apache Spark were used to achieve efficient and
 parallelized computations. This approach is discussed with further detail in the
 following section.
 
-\hypertarget{project-requirements-and-analysis}{%
-\section{Project requirements and
-analysis}\label{project-requirements-and-analysis}}
+\section{Project Requirements and Analysis Methodology}
 
-\textbf{TBD} (describe our objective with this analysis in detail)
-The aim of this thesis is to repeat the analysis performed in 2015 on the
+The aim of this project is to repeat the analysis performed in 2015 on the
 dataset Google has released in 2019 in order to find similarities and
 differences with the previous analysis, and ultimately find whether
 computational power is indeed wasted in this new workload as well. The 2019 data
@@ -272,10 +253,6 @@ comes from 8 Borg cells spanning 8 different datacenters located in different
 geographical positions, all focused on computational oriented workloads. The
 data collection time span matches the entire month of May 2019.
 
-
-\hypertarget{analysis-methodology}{%
-\section{Analysis methodology}\label{analysis-methodology}}
-
 Due to the inherent complexity in analyzing traces of this size, novel
 bleeding-edge data engineering tecniques were adopted to performed the required
 computations. We used the framework Apache Spark to perform efficient and
@@ -461,6 +438,11 @@ the perspective of single tasks as well as jobs. We then compare the results
 from the 2019 traces to the ones that were obtained in 2015 to understand the
 workload evolution inside Borg between 2011 and 2019.
 
+We discover that the spatial and temporal impact of unsuccessful
+executions is very significant, more than in the 2011 traces. In particular,
+resource usage is overall dominated by tasks with a final \texttt{KILL}
+termination event.
+
 \subsection{Temporal Impact: Machine Time Waste}
 \input{figures/machine_time_waste}
 
@@ -669,11 +651,6 @@ Refer to figure \ref{fig:tableIII}.
 traces
 \end{itemize}
 
-\hypertarget{probability-of-task-successful-termination-given-its-unsuccesful-events}{%
-\subsection{Probability of task successful termination given its
-unsuccesful
-events}\label{probability-of-task-successful-termination-given-its-unsuccesful-events}}
-
 \subsection{Conditional Probability of Task Success}
 \input{figures/figure_5}
 
@@ -692,9 +669,6 @@ Refer to figure \ref{fig:figureV}.
 lot for small \# evts differences. This may be due to an uneven
 distribution of \# evts in the traces.
 \end{itemize}
-\hypertarget{correlation-between-task-events-metadata-and-task-termination}{%
-\subsection{Correlation between task events' metadata and task
-termination}\label{correlation-between-task-events-metadata-and-task-termination}}
 
 \section{Analysis: Potential Causes of Unsuccessful Executions}
 
@@ -729,10 +703,15 @@ Refer to figures \ref{fig:figureVII-a}, \ref{fig:figureVII-b}, and
 
 \subsection{Event Rates vs. Requested Resources, Resource Reservation, and
 Resource Utilization}
-
-\subsection{Figure 8 tbd}
 \input{figures/figure_8}
 
+Refer to figure~\ref{fig:figureVIII-a}, figure~\ref{fig:figureVIII-a-csts}
+figure~\ref{fig:figureVIII-b}, figure~\ref{fig:figureVIII-b-csts}
+figure~\ref{fig:figureVIII-c}, figure~\ref{fig:figureVIII-c-csts}
+figure~\ref{fig:figureVIII-d}, figure~\ref{fig:figureVIII-d-csts}
+figure~\ref{fig:figureVIII-e}, figure~\ref{fig:figureVIII-e-csts}
+figure~\ref{fig:figureVIII-f}, and figure~\ref{fig:figureVIII-f-csts}.
+
 \subsection{Job Rates vs. Job Size, Job Execution Time, and Machine Locality}
 \input{figures/figure_9}
 
@@ -759,44 +738,10 @@ Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
 the highest success event rate
 \end{itemize}
 
-\hypertarget{mean-number-of-tasks-and-event-distribution-per-task-type}{%
-\subsection{Mean number of tasks and event distribution per task
-type}\label{mean-number-of-tasks-and-event-distribution-per-task-type}}
-
-
-\hypertarget{potential-causes-of-unsuccesful-executions}{%
-\subsection{Potential causes of unsuccesful
-executions}\label{potential-causes-of-unsuccesful-executions}}
-
-\textbf{TBD}
-
-\hypertarget{implementation-issues-analysis-limitations}{%
-\section{Implementation issues -- Analysis
-limitations}\label{implementation-issues-analysis-limitations}}
-
-\hypertarget{discussion-on-unknown-fields}{%
-\subsection{Discussion on unknown
-fields}\label{discussion-on-unknown-fields}}
-
-\textbf{TBD}
-
-\hypertarget{limitation-on-computation-resources-required-for-the-analysis}{%
-\subsection{Limitation on computation resources required for the
-analysis}\label{limitation-on-computation-resources-required-for-the-analysis}}
-
-\textbf{TBD}
-
-\hypertarget{other-limitations}{%
-\subsection{Other limitations \ldots{}}\label{other-limitations}}
-
-\textbf{TBD}
-
-\hypertarget{conclusions-and-future-work-or-possible-developments}{%
-\section{Conclusions and future work or possible
-developments}\label{conclusions-and-future-work-or-possible-developments}}
-
+\section{Conclusions, Future Work and Possible Developments}
 \textbf{TBD}
 
+\newpage
 \printbibliography
 
 \end{document}