Claudio Maggioni 2021-06-17 15:59:53 +02:00
parent 2329b3f801
commit b58c2aaa52
1 changed files with 52 additions and 8 deletions


@ -97,7 +97,19 @@ old analysis to understand even better the causes of failures and how to prevent
them. Additionally, this report will provide an overview on the data engineering
techniques used to perform the queries and analyses on the 2019 traces.
\section{State of the art}
\subsection{Outline}
The report is structured as follows. Section~\ref{sec2} contains information about the
current state of the art for Google Borg cluster traces. Section~\ref{sec3}
provides an overview of the data to analyze and its storage format, including
technical background information. Section~\ref{sec4} discusses the
project requirements and the data science methods used to perform the analysis.
Section~\ref{sec5}, Section~\ref{sec6} and Section~\ref{sec7} show the results
obtained while analyzing, respectively, the performance input of
unsuccessful executions, the patterns of task and job events, and the potential
causes of unsuccessful executions. Finally, Section~\ref{sec8} contains the
conclusions.
\section{State of the art}\label{sec2}
\textbf{TBD (introduce only 2015 dsn paper)}
@ -111,7 +123,7 @@ failures. The salient conclusion of that research is that many
computations performed by Google would eventually end in failure, leading
to large amounts of computational power being wasted.
\section{Background information}
\section{Background information}\label{sec3}
\textit{Borg} is Google's own cluster management software able to run
thousands of different jobs. Among the various cluster management services it
@ -243,7 +255,7 @@ science technologies like Apache Spark were used to achieve efficient
and parallelized computations. This approach is discussed with further
detail in the following section.
\section{Project Requirements and Analysis Methodology}
\section{Project Requirements and Analysis Methodology}\label{sec4}
The aim of this project is to repeat the analysis performed in 2015 on the
dataset Google has released in 2019 in order to find similarities and
@ -426,7 +438,7 @@ computing slowdown values given the previously computed execution attempt time
deltas. Finally, the mean of the slowdown values is computed, resulting
in the clear and concise tables found in figure~\ref{fig:taskslowdown}.
\section{Analysis: Performance Input of Unsuccessful Executions}
\section{Analysis: Performance Input of Unsuccessful Executions}\label{sec5}
Our first investigation focuses on replicating the methodologies used in the
2015 DSN Ros\'a et al.\ paper\cite{dsn-paper} regarding usage of machine time
@ -614,7 +626,7 @@ With more than 98\% of both CPU and memory resources used by
non-successful tasks, it is clear the spatial resource waste is high in the 2019
traces.
\section{Analysis: Patterns of Task and Job Events}
\section{Analysis: Patterns of Task and Job Events}\label{sec6}
This section applies some of the techniques used in Section IV of
the Ros\'a et al.\ paper\cite{dsn-paper} to find patterns and interdependencies
@ -626,7 +638,6 @@ probabilities based on the number of task termination events of a specific type.
Finally, Section~\ref{tabIV-section} aims to find similar correlations, but at
the job level.
The results found in the 2019 traces seldom show the same patterns in terms
of task events and job/task distributions, in particular highlighting again the
overall non-trivial impact of \texttt{KILL} events, no matter the task and job
@ -749,7 +760,40 @@ one. For some clusters (namely B, C, and D), the mean number of \texttt{FAIL} a
\texttt{KILL} task events for \texttt{FINISH}ed jobs is almost the same.
Additionally, it is noteworthy that cluster A has no \texttt{EVICT}ed jobs.
\section{Analysis: Potential Causes of Unsuccessful Executions}
\section{Analysis: Potential Causes of Unsuccessful Executions}\label{sec7}
This section re-applies the techniques used in Section V of the Ros\'a et al.\
paper\cite{dsn-paper} to find patterns and interdependencies
between task and job events by gathering statistics on those events. In
particular, Section~\ref{tabIII-section} explores how the success of a
task is correlated with its own event patterns, which
Section~\ref{figV-section} explores even further by computing task success
probabilities based on the number of task termination events of a specific type.
Finally, Section~\ref{tabIV-section} aims to find similar correlations, but at
the job level.
In this section, we search for the root causes of different unsuccessful jobs
and events, and derive their implications on system design. Our analysis resorts
to a black-box approach due to the limited information available on the system.
We consider two levels of statistics, i.e., events vs. jobs, where the former
directly impacts spatial and temporal waste, whereas the latter is directly
correlated to the performance perceived by users. For the event analysis, we
focus on task priority, event execution time, machine concurrency, and requested
resources. Moreover, to see the impact of resource efficiency on task
executions, we correlate events with resource reservation and utilization on
machines. As for the job analysis, we study the job size, machine locality, and
job execution time.
In the following analysis, we present how often different event/job types occur
across different ranges of attributes. For each type $i$, we compute the
metric of event (job) rate, defined as the number of type $i$ events (jobs)
divided by the total number of events (jobs). Event/job rates are computed for
each range of attributes. For example, one can compute the eviction rate for
priorities in the range $[0,1]$ as the number of eviction events that involved
priorities $[0,1]$ divided by the total number of events for priorities $[0,1]$.
One can also view event/job rates as the probability that events/jobs end with
certain types of outcomes.
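As a sketch of this definition in symbols (the notation is introduced here
purely for illustration and is not taken from the original analysis), let
$N_i(r)$ denote the number of events (jobs) of type $i$ whose attribute value
falls in range $r$, and let $N(r)$ denote the total number of events (jobs) in
that range. The rate of type $i$ in range $r$ can then be written as
\[
  R_i(r) = \frac{N_i(r)}{N(r)}, \qquad N(r) = \sum_{j} N_j(r).
\]
Assuming every event (job) is assigned exactly one type, the rates within a
range sum to one, $\sum_i R_i(r) = 1$, which is consistent with reading them as
outcome probabilities. As a purely hypothetical numeric illustration, if $20$
out of $1000$ events recorded for priorities in $[0,1]$ were evictions, the
eviction rate for that range would be $20/1000 = 0.02$.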
\subsection{Event rates vs. task priority, event execution time, and machine
concurrency.}
@ -817,7 +861,7 @@ Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
the highest success event rate
\end{itemize}
\section{Conclusions, Future Work and Possible Developments}
\section{Conclusions, Future Work and Possible Developments}\label{sec8}
\textbf{TBD}
\newpage