report
This commit is contained in: parent cd33279754, commit b8f93d2da2
1 changed file with 52 additions and 8 deletions
@@ -97,7 +97,19 @@ old analysis to understand even better the causes of failures and how to prevent
them. Additionally, this report will provide an overview of the data engineering
techniques used to perform the queries and analyses on the 2019 traces.
\subsection{Outline}

The report is structured as follows. Section~\ref{sec2} contains information about the
current state of the art for Google Borg cluster traces. Section~\ref{sec3}
provides an overview of the data to analyze and its storage format, including
technical background information. Section~\ref{sec4} discusses the
project requirements and the data science methods used to perform the analysis.
Section~\ref{sec5}, Section~\ref{sec6} and Section~\ref{sec7} show the results
obtained while analyzing, respectively, the performance impact of
unsuccessful executions, the patterns of task and job events, and the potential
causes of unsuccessful executions. Finally, Section~\ref{sec8} contains the
conclusions.

\section{State of the art}\label{sec2}

\textbf{TBD (introduce only the 2015 DSN paper)}

@@ -111,7 +123,7 @@ failures. The salient conclusion of that research is that lots of
computations performed by Google would eventually end in failure, leading
to large amounts of computational power being wasted.

\section{Background information}\label{sec3}

\textit{Borg} is Google's own cluster management software able to run
thousands of different jobs. Among the various cluster management services it

@@ -243,7 +255,7 @@ science technologies like Apache Spark were used to achieve efficient
and parallelized computations. This approach is discussed in further
detail in the following section.

\section{Project Requirements and Analysis Methodology}\label{sec4}

The aim of this project is to repeat the analysis performed in 2015 on the
dataset Google released in 2019 in order to find similarities and

@@ -426,7 +438,7 @@ computing slowdown values given the previously computed execution attempt time
deltas. Finally, the mean of the computed slowdown values is computed, resulting
in the clear and concise tables found in figure~\ref{fig:taskslowdown}.
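The slowdown computation just described can be sketched in plain Python. This is a minimal illustration rather than the actual implementation: it assumes slowdown is the total time spent across all of a task's execution attempts divided by the time of the final, successful attempt, and the hypothetical list-based representation stands in for the Spark pipeline used in practice.

```python
# Minimal sketch of the slowdown computation, assuming
#   slowdown = (total time over all execution attempts)
#            / (time of the last, successful attempt).
# `attempt_deltas` is a hypothetical stand-in for the per-task lists of
# execution attempt time deltas computed earlier in the pipeline.

def task_slowdown(attempt_deltas):
    """Slowdown of one task: total time of all attempts divided by the
    time of the final (successful) attempt."""
    total = sum(attempt_deltas)
    last_successful = attempt_deltas[-1]
    return total / last_successful

def mean_slowdown(tasks):
    """Mean slowdown over a collection of tasks."""
    slowdowns = [task_slowdown(deltas) for deltas in tasks]
    return sum(slowdowns) / len(slowdowns)

# Example: a task needing two failed 10s attempts before a 20s success
# has slowdown (10 + 10 + 20) / 20 = 2.0.
```

Under this definition a slowdown of 1 means no time was lost to failed attempts; the mean over all tasks is what the tables in figure~\ref{fig:taskslowdown} summarize.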

\section{Analysis: Performance Impact of Unsuccessful Executions}\label{sec5}

Our first investigation focuses on replicating the methodologies used in the
2015 DSN Ros\'a et al.\ paper~\cite{dsn-paper} regarding usage of machine time

@@ -614,7 +626,7 @@ With more than 98\% of both CPU and memory resources used by
non-successful tasks, it is clear that the spatial resource waste is high in the 2019
traces.
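As a concrete reading of the figure quoted above, the share of resources attributed to non-successful tasks can be sketched as follows. The tuple layout and the toy numbers are purely illustrative; the real shares come from aggregating per-task resource usage over the trace tables with Spark.

```python
# Minimal sketch of the spatial-waste share: the fraction of total CPU and
# memory consumed by tasks that did not terminate with FINISH.
# Task records are hypothetical (termination, cpu_usage, mem_usage) tuples.

def wasted_share(tasks):
    """Return (cpu_share, mem_share) used by non-FINISHed tasks."""
    total_cpu = sum(cpu for _, cpu, _ in tasks)
    total_mem = sum(mem for _, _, mem in tasks)
    bad = [(cpu, mem) for term, cpu, mem in tasks if term != "FINISH"]
    bad_cpu = sum(cpu for cpu, _ in bad)
    bad_mem = sum(mem for _, mem in bad)
    return bad_cpu / total_cpu, bad_mem / total_mem

# Toy example where 98% of both resources go to unsuccessful tasks.
tasks = [("FINISH", 1.0, 2.0), ("KILL", 30.0, 60.0), ("FAIL", 19.0, 38.0)]
cpu_share, mem_share = wasted_share(tasks)  # 0.98, 0.98
```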

\section{Analysis: Patterns of Task and Job Events}\label{sec6}

This section aims to use some of the techniques used in section IV of
the Ros\'a et al.\ paper~\cite{dsn-paper} to find patterns and interdependencies

@@ -626,7 +638,6 @@ probabilities based on the number of task termination events of a specific type.
Finally, Section~\ref{tabIV-section} aims to find similar correlations, but at
the job level.

The results found in the 2019 traces seldom show the same patterns in terms
of task events and job/task distributions, in particular highlighting again the
overall non-trivial impact of \texttt{KILL} events, no matter the task and job

@@ -749,7 +760,40 @@ one. For some clusters (namely B, C, and D), the mean number of \texttt{FAIL} and
\texttt{KILL} task events for \texttt{FINISH}ed jobs is almost the same.
Additionally, it is noteworthy that cluster A has no \texttt{EVICT}ed jobs.

\section{Analysis: Potential Causes of Unsuccessful Executions}\label{sec7}
In this section, we search for the root causes of different unsuccessful jobs
and events, and derive their implications on system design. Our analysis resorts
to a black-box approach due to the limited information available on the system.
We consider two levels of statistics, i.e., events vs.\ jobs, where the former
directly impacts spatial and temporal waste, whereas the latter is directly
correlated to the performance perceived by users. For the event analysis, we
focus on task priority, event execution time, machine concurrency, and requested
resources. Moreover, to see the impact of resource efficiency on task
executions, we correlate events with resource reservation and utilization on
machines. As for the job analysis, we study the job size, machine locality, and
job execution time.

In the following analysis, we present how different event/job types happen, with
respect to different ranges of attributes. For each type $i$, we compute the
metric of event (job) rate, defined as the number of type $i$ events (jobs)
divided by the total number of events (jobs). Event/job rates are computed for
each range of attributes. For example, one can compute the eviction rate for
priorities in the range $[0,1]$ as the number of eviction events that involved
priorities $[0,1]$ divided by the total number of events for priorities $[0,1]$.
One can also view event/job rates as the probability that events/jobs end with
certain types of outcomes.
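The event-rate metric defined above can be sketched in plain Python. The event representation and field layout here are hypothetical; the actual analysis computes these rates with Spark queries over the full traces.

```python
from collections import Counter

# Minimal sketch of the event-rate metric: for a given attribute range,
# the number of type-i events divided by the total number of events in
# that range. Each event is modeled as a hypothetical (type, priority)
# pair for illustration.

def event_rate(events, event_type, prio_range):
    """Fraction of events of `event_type` among all events whose
    priority falls in the inclusive range `prio_range`."""
    lo, hi = prio_range
    in_range = [etype for etype, prio in events if lo <= prio <= hi]
    if not in_range:
        return 0.0
    counts = Counter(in_range)
    return counts[event_type] / len(in_range)

# Example: eviction rate for priorities in [0, 1].
events = [("EVICT", 0), ("FINISH", 1), ("EVICT", 1), ("KILL", 5)]
rate = event_rate(events, "EVICT", (0, 1))  # 2 of the 3 in-range events
```

Read as a probability, `rate` is the chance that an event involving a priority in the given range ends as an eviction.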

\subsection{Event rates vs.\ task priority, event execution time, and machine concurrency}

@@ -817,7 +861,7 @@ Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
the highest success event rate
\end{itemize}

\section{Conclusions, Future Work and Possible Developments}\label{sec8}
\textbf{TBD}

\newpage