report
This commit is contained in: parent cd33279754, commit b8f93d2da2
1 changed file with 52 additions and 8 deletions
@@ -97,7 +97,19 @@ old analysis to understand even better the causes of failures and how to prevent
them. Additionally, this report will provide an overview of the data engineering
techniques used to perform the queries and analyses on the 2019 traces.
\subsection{Outline}

The report is structured as follows. Section~\ref{sec2} contains information about the
current state of the art for Google Borg cluster traces. Section~\ref{sec3}
provides an overview of the data to analyze and its storage format, including
technical background information. Section~\ref{sec4} discusses the
project requirements and the data science methods used to perform the analysis.
Section~\ref{sec5}, Section~\ref{sec6} and Section~\ref{sec7} show the results
obtained while analyzing, respectively, the performance impact of
unsuccessful executions, the patterns of task and job events, and the potential
causes of unsuccessful executions. Finally, Section~\ref{sec8} contains the
conclusions.

\section{State of the art}\label{sec2}

\textbf{TBD (introduce only the 2015 DSN paper)}

@@ -111,7 +123,7 @@ failures. The salient conclusion of that research is that lots of
computations performed by Google would eventually end in failure, leading
to large amounts of computational power being wasted.

\section{Background information}\label{sec3}

\textit{Borg} is Google's own cluster management software able to run
thousands of different jobs. Among the various cluster management services it

@@ -243,7 +255,7 @@ science technologies like Apache Spark were used to achieve efficient
and parallelized computations. This approach is discussed in further
detail in the following section.

\section{Project Requirements and Analysis Methodology}\label{sec4}

The aim of this project is to repeat the analysis performed in 2015 on the
dataset Google released in 2019 in order to find similarities and

@@ -426,7 +438,7 @@ computing slowdown values given the previously computed execution attempt time
deltas. Finally, the mean of the computed slowdown values is computed, resulting
in the clear and concise tables found in figure~\ref{fig:taskslowdown}.
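The slowdown computation just described can be sketched in plain Python. This is a minimal illustration rather than the actual implementation: it assumes slowdown is the total time spent across all of a task's execution attempts divided by the time of the final, successful attempt, and the hypothetical list-based representation stands in for the Spark pipeline used in practice.

```python
# Minimal sketch of the slowdown computation, assuming
#   slowdown = (total time over all execution attempts)
#            / (time of the last, successful attempt).
# `attempt_deltas` is a hypothetical stand-in for the per-task lists of
# execution attempt time deltas computed earlier in the pipeline.

def task_slowdown(attempt_deltas):
    """Slowdown of one task: total time of all attempts divided by the
    time of the final (successful) attempt."""
    total = sum(attempt_deltas)
    last_successful = attempt_deltas[-1]
    return total / last_successful

def mean_slowdown(tasks):
    """Mean slowdown over a collection of tasks."""
    slowdowns = [task_slowdown(deltas) for deltas in tasks]
    return sum(slowdowns) / len(slowdowns)

# Example: a task needing two failed 10s attempts before a 20s success
# has slowdown (10 + 10 + 20) / 20 = 2.0.
```

Under this definition a slowdown of 1 means no time was lost to failed attempts; the mean over all tasks is what the tables in figure~\ref{fig:taskslowdown} summarize.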

\section{Analysis: Performance Impact of Unsuccessful Executions}\label{sec5}

Our first investigation focuses on replicating the methodologies used in the
2015 DSN Ros\'a et al.\ paper~\cite{dsn-paper} regarding usage of machine time

@@ -614,7 +626,7 @@ With more than 98\% of both CPU and memory resources used by
non-successful tasks, it is clear that the spatial resource waste is high in the 2019
traces.
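As a concrete reading of the figure quoted above, the share of resources attributed to non-successful tasks can be sketched as follows. The tuple layout and the toy numbers are purely illustrative; the real shares come from aggregating per-task resource usage over the trace tables with Spark.

```python
# Minimal sketch of the spatial-waste share: the fraction of total CPU and
# memory consumed by tasks that did not terminate with FINISH.
# Task records are hypothetical (termination, cpu_usage, mem_usage) tuples.

def wasted_share(tasks):
    """Return (cpu_share, mem_share) used by non-FINISHed tasks."""
    total_cpu = sum(cpu for _, cpu, _ in tasks)
    total_mem = sum(mem for _, _, mem in tasks)
    bad = [(cpu, mem) for term, cpu, mem in tasks if term != "FINISH"]
    bad_cpu = sum(cpu for cpu, _ in bad)
    bad_mem = sum(mem for _, mem in bad)
    return bad_cpu / total_cpu, bad_mem / total_mem

# Toy example where 98% of both resources go to unsuccessful tasks.
tasks = [("FINISH", 1.0, 2.0), ("KILL", 30.0, 60.0), ("FAIL", 19.0, 38.0)]
cpu_share, mem_share = wasted_share(tasks)  # 0.98, 0.98
```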

\section{Analysis: Patterns of Task and Job Events}\label{sec6}

This section aims to use some of the techniques used in section IV of
the Ros\'a et al.\ paper~\cite{dsn-paper} to find patterns and interdependencies

@@ -626,7 +638,6 @@ probabilities based on the number of task termination events of a specific type.
Finally, Section~\ref{tabIV-section} aims to find similar correlations, but at
the job level.

The results found in the 2019 traces seldom show the same patterns in terms
of task events and job/task distributions, in particular highlighting again the
overall non-trivial impact of \texttt{KILL} events, no matter the task and job

@@ -749,7 +760,40 @@ one. For some clusters (namely B, C, and D), the mean number of \texttt{FAIL} and
\texttt{KILL} task events for \texttt{FINISH}ed jobs is almost the same.
Additionally, it is noteworthy that cluster A has no \texttt{EVICT}ed jobs.

\section{Analysis: Potential Causes of Unsuccessful Executions}\label{sec7}
In this section, we search for the root causes of different unsuccessful jobs
and events, and derive their implications on system design. Our analysis resorts
to a black-box approach due to the limited information available on the system.
We consider two levels of statistics, i.e., events vs.\ jobs, where the former
directly impacts spatial and temporal waste, whereas the latter is directly
correlated to the performance perceived by users. For the event analysis, we
focus on task priority, event execution time, machine concurrency, and requested
resources. Moreover, to see the impact of resource efficiency on task
executions, we correlate events with resource reservation and utilization on
machines. As for the job analysis, we study the job size, machine locality, and
job execution time.

In the following analysis, we present how different event/job types happen, with
respect to different ranges of attributes. For each type $i$, we compute the
metric of event (job) rate, defined as the number of type $i$ events (jobs)
divided by the total number of events (jobs). Event/job rates are computed for
each range of attributes. For example, one can compute the eviction rate for
priorities in the range $[0,1]$ as the number of eviction events that involved
priorities $[0,1]$ divided by the total number of events for priorities $[0,1]$.
One can also view event/job rates as the probability that events/jobs end with
certain types of outcomes.
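The event-rate metric defined above can be sketched in plain Python. The event representation and field layout here are hypothetical; the actual analysis computes these rates with Spark queries over the full traces.

```python
from collections import Counter

# Minimal sketch of the event-rate metric: for a given attribute range,
# the number of type-i events divided by the total number of events in
# that range. Each event is modeled as a hypothetical (type, priority)
# pair for illustration.

def event_rate(events, event_type, prio_range):
    """Fraction of events of `event_type` among all events whose
    priority falls in the inclusive range `prio_range`."""
    lo, hi = prio_range
    in_range = [etype for etype, prio in events if lo <= prio <= hi]
    if not in_range:
        return 0.0
    counts = Counter(in_range)
    return counts[event_type] / len(in_range)

# Example: eviction rate for priorities in [0, 1].
events = [("EVICT", 0), ("FINISH", 1), ("EVICT", 1), ("KILL", 5)]
rate = event_rate(events, "EVICT", (0, 1))  # 2 of the 3 in-range events
```

Read as a probability, `rate` is the chance that an event involving a priority in the given range ends as an eviction.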

\subsection{Event rates vs.\ task priority, event execution time, and machine concurrency}

@@ -817,7 +861,7 @@ Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
the highest success event rate
\end{itemize}

\section{Conclusions, Future Work and Possible Developments}\label{sec8}
\textbf{TBD}

\newpage