report
commit b8f93d2da2, parent cd33279754
1 changed file with 52 additions and 8 deletions
@@ -97,7 +97,19 @@ old analysis to understand even better the causes of failures and how to prevent
them. Additionally, this report will provide an overview of the data engineering
techniques used to perform the queries and analyses on the 2019 traces.

\subsection{Outline}

The report is structured as follows. Section~\ref{sec2} contains information about the
current state of the art for Google Borg cluster traces. Section~\ref{sec3}
provides an overview, including technical background information, on the data to
analyze and its storage format. Section~\ref{sec4} discusses the
project requirements and the data science methods used to perform the analysis.
Section~\ref{sec5}, Section~\ref{sec6} and Section~\ref{sec7} show the results
obtained while analyzing, respectively, the performance impact of
unsuccessful executions, the patterns of task and job events, and the potential
causes of unsuccessful executions. Finally, Section~\ref{sec8} contains the
conclusions.

\section{State of the art}\label{sec2}

\textbf{TBD (introduce only the 2015 DSN paper)}

@@ -111,7 +123,7 @@ failures. The salient conclusion of that research is that actually lots of
computations performed by Google would eventually end in failure, leading
to large amounts of computational power being wasted.

\section{Background information}\label{sec3}

\textit{Borg} is Google's own cluster management software able to run
thousands of different jobs. Among the various cluster management services it

@@ -243,7 +255,7 @@ science technologies like Apache Spark were used to achieve efficient
and parallelized computations. This approach is discussed in further
detail in the following section.
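
As a purely illustrative aside, the following is a minimal sketch of what such a
parallelized query could look like with PySpark; the use of PySpark here, the
storage path, the file names and the column names are assumptions made for
illustration and are not taken verbatim from the project code.

\begin{verbatim}
# Minimal sketch (assumptions: PySpark, JSON trace files, column names).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("borg-2019-traces").getOrCreate()

# Hypothetical location of one cluster's instance event table.
events = spark.read.json("gs://clusterdata_2019_a/instance_events*.json.gz")

# Example of a query that Spark parallelizes across the cluster:
# count task events per termination type.
events.groupBy("type").count().show()
\end{verbatim}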

\section{Project Requirements and Analysis Methodology}\label{sec4}

The aim of this project is to repeat the analysis performed in 2015 on the
dataset Google released in 2019 in order to find similarities and

@@ -426,7 +438,7 @@ computing slowdown values given the previously computed execution attempt time
deltas. Finally, the mean of the computed slowdown values is computed, resulting
in the clear and concise tables found in figure~\ref{fig:taskslowdown}.
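
As a hedged illustration of this step, the sketch below assumes that the slowdown
of a task is the ratio between the total time spent over all of its execution
attempts and the time of its final, successful attempt; the function name and the
input format are introduced here only to clarify the computation and do not
reflect the actual project code.

\begin{verbatim}
# Minimal sketch (assumption): slowdown of a task = total time spent
# across all execution attempts / time of the final, successful attempt.
def mean_slowdown(attempts):
    """attempts: dict mapping a task id to the list of its attempt
    durations in seconds, the last entry being the successful one."""
    slowdowns = []
    for durations in attempts.values():
        if not durations or durations[-1] == 0:
            continue  # skip tasks without a usable final attempt
        slowdowns.append(sum(durations) / durations[-1])
    return sum(slowdowns) / len(slowdowns) if slowdowns else float("nan")

# Hypothetical usage:
# mean_slowdown({"t1": [120.0, 95.0], "t2": [60.0]})  # -> ~1.63
\end{verbatim}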

\section{Analysis: Performance Impact of Unsuccessful Executions}\label{sec5}

Our first investigation focuses on replicating the methodologies used in the
2015 DSN Ros\'a et al.\ paper\cite{dsn-paper} regarding usage of machine time

@@ -614,7 +626,7 @@ With more than 98\% of both CPU and memory resources used by
non-successful tasks, it is clear that the spatial resource waste is high in the 2019
traces.

\section{Analysis: Patterns of Task and Job Events}\label{sec6}

This section aims to use some of the techniques used in section IV of
the Ros\'a et al.\ paper\cite{dsn-paper} to find patterns and interdependencies

@@ -626,7 +638,6 @@ probabilities based on the number of task termination events of a specific type.
Finally, Section~\ref{tabIV-section} aims to find similar correlations, but at
the job level.

The results found in the 2019 traces seldom show the same patterns in terms
of task events and job/task distributions, in particular highlighting again the
overall non-trivial impact of \texttt{KILL} events, no matter the task and job

@@ -749,7 +760,40 @@ one. For some clusters (namely B, C, and D), the mean number of \texttt{FAIL} a
\texttt{KILL} task events for \texttt{FINISH}ed jobs is almost the same.
Additionally, it is noteworthy that cluster A has no \texttt{EVICT}ed jobs.

\section{Analysis: Potential Causes of Unsuccessful Executions}\label{sec7}

This section re-applies the techniques used in section V of the Ros\'a et al.\
paper\cite{dsn-paper} to find patterns and interdependencies
between task and job events by gathering statistics on those events. In
particular, Section~\ref{tabIII-section} explores how the success of a
task is correlated with its own event patterns, while
Section~\ref{figV-section} explores this even further by computing task success
probabilities based on the number of task termination events of a specific type.
Finally, Section~\ref{tabIV-section} aims to find similar correlations, but at
the job level.

In this section, we search for the root causes of different unsuccessful jobs
and events, and derive their implications on system design. Our analysis resorts
to a black-box approach due to the limited information available on the system.
We consider two levels of statistics, i.e., events vs. jobs, where the former
directly impacts spatial and temporal waste, whereas the latter is directly
correlated to the performance perceived by users. For the event analysis, we
focus on task priority, event execution time, machine concurrency, and requested
resources. Moreover, to see the impact of resource efficiency on task
executions, we correlate events with resource reservation and utilization on
machines. As for the job analysis, we study the job size, machine locality, and
job execution time.

In the following analysis, we present how different event/job types happen with
respect to different ranges of attributes. For each type $i$, we compute the
metric of event (job) rate, defined as the number of type $i$ events (jobs)
divided by the total number of events (jobs). Event/job rates are computed for
each range of attributes. For example, one can compute the eviction rate for
priorities in the range $[0,1]$ as the number of eviction events that involved
priorities in $[0,1]$ divided by the total number of events for priorities in $[0,1]$.
One can also view event/job rates as the probability that events/jobs end with
certain types of outcomes.
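
Stated as a formula (the notation $N_i(r)$, i.e.\ the number of type~$i$ events
or jobs whose attribute value falls in range~$r$, is introduced here only for
clarity), the rate metric reads

\[
  \mathit{rate}_i(r) = \frac{N_i(r)}{\sum_{j} N_j(r)}.
\]

By construction, for a fixed attribute range the rates of all event (job) types
sum to one.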

\subsection{Event rates vs. task priority, event execution time, and machine
concurrency.}

@@ -817,7 +861,7 @@ Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
the highest success event rate
\end{itemize}

\section{Conclusions, Future Work and Possible Developments}\label{sec8}
\textbf{TBD}

\newpage