From b58c2aaa5261ed98d22edc182eee5f942f23d46a Mon Sep 17 00:00:00 2001
From: Claudio Maggioni
Date: Thu, 17 Jun 2021 15:59:53 +0200
Subject: [PATCH] report

---
 report/Claudio_Maggioni_report.tex | 60 ++++++++++++++++++++++++++----
 1 file changed, 52 insertions(+), 8 deletions(-)

diff --git a/report/Claudio_Maggioni_report.tex b/report/Claudio_Maggioni_report.tex
index 91b99344..87cbeecf 100644
--- a/report/Claudio_Maggioni_report.tex
+++ b/report/Claudio_Maggioni_report.tex
@@ -97,7 +97,19 @@ old analysis to understand even better the causes of failures and how to
 prevent them. Additionally, this report will provide an overview on the data
 engineering techniques used to perform the queries and analyses on the 2019
 traces.
 
-\section{State of the art}
+\subsection{Outline}
+The report is structured as follows. Section~\ref{sec2} contains information
+about the current state of the art for Google Borg cluster traces.
+Section~\ref{sec3} provides an overview of the data to analyze and its storage
+format, including technical background information. Section~\ref{sec4}
+discusses the project requirements and the data science methods used to
+perform the analysis. Section~\ref{sec5}, Section~\ref{sec6} and
+Section~\ref{sec7} show the results obtained by analyzing, respectively, the
+performance impact of unsuccessful executions, the patterns of task and job
+events, and the potential causes of unsuccessful executions. Finally,
+Section~\ref{sec8} contains the conclusions.
+
+\section{State of the art}\label{sec2}
 
 \textbf{TBD (introduce only 2015 dsn paper)}
 
@@ -111,7 +123,7 @@ failures. The salient conclusion of that research is that actually lots of
 computations performed by Google eventually end in failure, leading to
 large amounts of computational power being wasted.
 
-\section{Background information}
+\section{Background information}\label{sec3}
 
 \textit{Borg} is Google's own cluster management software able to run
 thousands of different jobs. Among the various cluster management services it
@@ -243,7 +255,7 @@ science technologies like Apache Spark were used to achieve efficient and
 parallelized computations. This approach is discussed with further detail in
 the following section.
 
-\section{Project Requirements and Analysis Methodology}
+\section{Project Requirements and Analysis Methodology}\label{sec4}
 
 The aim of this project is to repeat the analysis performed in 2015 on the
 dataset Google has released in 2019 in order to find similarities and
@@ -426,7 +438,7 @@ computing slowdown values given the previously computed execution attempt
 time deltas. Finally, the mean of the computed slowdown values is computed,
 resulting in the clear and concise tables found in figure~\ref{fig:taskslowdown}.
 
-\section{Analysis: Performance Input of Unsuccessful Executions}
+\section{Analysis: Performance Impact of Unsuccessful Executions}\label{sec5}
 
 Our first investigation focuses on replicating the methodologies used in the
 2015 DSN Ros\'a et al.\ paper\cite{dsn-paper} regarding usage of machine time
@@ -614,7 +626,7 @@ With more than 98\% of both CPU and memory resources used by non-successful
 tasks, it is clear the spatial resource waste is high in the 2019 traces.
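+
+To make the derivation of these shares concrete, the following minimal
+PySpark sketch computes the fraction of CPU and memory time consumed by
+non-successful tasks. The input file and column names
+(\texttt{task\_usage.parquet}, \texttt{cpu\_time}, \texttt{mem\_time},
+\texttt{last\_event}) are assumptions made for illustration, not the actual
+trace schema:
+
+\begin{verbatim}
+# Minimal sketch: share of CPU/memory time spent on tasks whose
+# final event is not FINISH. The input layout is assumed.
+from pyspark.sql import SparkSession
+import pyspark.sql.functions as F
+
+spark = SparkSession.builder.appName("spatial-waste").getOrCreate()
+usage = spark.read.parquet("task_usage.parquet")  # hypothetical input
+
+total = usage.agg(F.sum("cpu_time").alias("cpu"),
+                  F.sum("mem_time").alias("mem")).first()
+waste = (usage.where(F.col("last_event") != "FINISH")
+              .agg(F.sum("cpu_time").alias("cpu"),
+                   F.sum("mem_time").alias("mem")).first())
+
+print(f"non-successful CPU share: {waste['cpu'] / total['cpu']:.2%}")
+print(f"non-successful memory share: {waste['mem'] / total['mem']:.2%}")
+\end{verbatim}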
 
-\section{Analysis: Patterns of Task and Job Events}
+\section{Analysis: Patterns of Task and Job Events}\label{sec6}
 
 This section aims to use some of the techniques used in section IV of the
 Ros\'a et al.\ paper\cite{dsn-paper} to find patterns and interdependencies
@@ -626,7 +638,6 @@ probabilities based on the number of task termination events of a specific
 type. Finally, Section~\ref{tabIV-section} aims to find similar correlations,
 but at the job level.
-
 The results found that the 2019 traces seldom show the same patterns in terms
 of task events and job/task distributions, in particular highlighting again the
 overall non-trivial impact of \texttt{KILL} events, no matter the task and job
@@ -749,7 +760,40 @@ one. For some clusters (namely B, C, and D), the mean number of \texttt{FAIL} and
 \texttt{KILL} task events for \texttt{FINISH}ed jobs is almost the same.
 Additionally, it is noteworthy that cluster A has no \texttt{EVICT}ed jobs.
 
-\section{Analysis: Potential Causes of Unsuccessful Executions}
+\section{Analysis: Potential Causes of Unsuccessful Executions}\label{sec7}
+
+In this section, we search for the root causes of different unsuccessful jobs
+and events, and derive their implications on system design. Our analysis
+resorts to a black-box approach due to the limited information available on
+the system. We consider two levels of statistics, i.e., events vs.\ jobs,
+where the former directly impacts spatial and temporal waste, whereas the
+latter is directly correlated to the performance perceived by users. For the
+event analysis, we focus on task priority, event execution time, machine
+concurrency, and requested resources. Moreover, to see the impact of resource
+efficiency on task executions, we correlate events with resource reservation
+and utilization on machines. As for the job analysis, we study the job size,
+machine locality, and job execution time.
+
+In the following analysis, we present how different event/job types occur with
+respect to different ranges of attributes. For each type $i$, we compute the
+metric of event (job) rate, defined as the number of type $i$ events (jobs)
+divided by the total number of events (jobs). Event/job rates are computed for
+each range of attributes. For example, one can compute the eviction rate for
+priorities in the range $[0,1]$ as the number of eviction events that involved
+priorities in $[0,1]$ divided by the total number of events for priorities in
+$[0,1]$. One can also view event/job rates as the probability that events/jobs
+end with certain types of outcomes.
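+
+To make this metric concrete, the following minimal PySpark sketch computes
+per-priority eviction rates. The input file and column names
+(\texttt{task\_events.parquet}, \texttt{priority}, \texttt{type}) are
+assumptions made for illustration, not the actual trace schema:
+
+\begin{verbatim}
+# Minimal sketch: eviction rate per priority value.
+# The input layout is assumed, not the real schema.
+from pyspark.sql import SparkSession
+import pyspark.sql.functions as F
+
+spark = SparkSession.builder.appName("event-rates").getOrCreate()
+events = spark.read.parquet("task_events.parquet")  # hypothetical input
+
+evict_rates = (events.groupBy("priority")
+               .agg((F.sum(F.when(F.col("type") == "EVICT", 1)
+                           .otherwise(0)) / F.count("*"))
+                    .alias("evict_rate"))
+               .orderBy("priority"))
+evict_rates.show()
+\end{verbatim}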
 
 \subsection{Event rates vs.
 task priority, event execution time, and machine concurrency.}
@@ -817,7 +861,7 @@ Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
   the highest success event rate
 \end{itemize}
 
-\section{Conclusions, Future Work and Possible Developments}
+\section{Conclusions, Future Work and Possible Developments}\label{sec8}
 
 \textbf{TBD}
 
 \newpage