Switzerland]{Prof.}{Walter}{Binder}
Switzerland]{Dr.}{Andrea}{Ros\'a}
\end{committee}

\abstract{The thesis aims at comparing two different traces coming from large
datacenters, focusing in particular on unsuccessful executions of jobs and
tasks submitted by users. The objective of this thesis is to compare the
resource waste caused by unsuccessful executions, their impact on application
performance, and their root causes. We show the strong negative impact on
CPU and RAM usage and on task slowdown. We analyze patterns of
unsuccessful jobs and tasks, focusing on their interdependency.
Moreover, we uncover their root causes by inspecting key workload and
system attributes such as machine locality and concurrency level.}

\newpage

\section{Introduction} In today's world there is an ever-growing demand for
efficient, large-scale computations. The rising trend of ``big data'' puts the
need for efficient management of large-scale parallelized computing at an
all-time high. This fact also increases the demand for research in the field of
distributed systems, in particular in how to schedule computations effectively,
avoid wasting resources and avoid failures.

In 2011 Google released a month-long data trace of their own cluster management
system~\cite{google-marso-11} \textit{Borg}, containing a lot of data regarding
scheduling, priority management, and failures of a real production workload.
This data was the foundation of the 2015 Ros\'a et al.\ paper
\textit{Understanding the Dark Side of Big Data Clusters: An Analysis beyond
Failures}~\cite{dsn-paper}, which in its many conclusions highlighted the need
for better cluster management, given the high number of failures found in
the traces.

In 2019 Google released an updated version of the \textit{Borg} cluster
traces~\cite{google-marso-19}, not only containing data from a far bigger
workload due to improvements in computational technology, but also providing
data from 8 different \textit{Borg} cells from datacenters located all over the
world.

in carrying out non-successful executions, i.e.\ executing programs that would
eventually ``crash'' and potentially not lead to useful results\footnote{This
is only a speculation, since both the 2011 and the 2019 traces only provide a
``black box'' view of the Borg cluster system. Neither the accompanying
papers for both traces~\cite{google-marso-11}~\cite{google-marso-19} nor the
documentation for the 2019 traces~\cite{google-drive-marso} ever mentions if
non-successful tasks produce any useful result.}. The 2019 subplot paints an
even darker picture, with less than 5\% of machine time used for successful
computation.

models to mitigate or erase the resource impact of unsuccessful executions.

%\subsection{Challenges}
Given that the new 2019 Google Borg cluster traces are about 100 times larger
than the 2011 ones, and given that the entire compressed traces package has a
non-trivial size (weighing approximately 8 TiB~\cite{google-drive-marso}), the
computations required to perform the analysis we illustrate in this report
cannot be performed with classical data science techniques. A
considerable amount of computational power was needed to carry out the
computations, which we organized in a
MapReduce-like structure.

%\subsection{Contribution}
This project aims to repeat the analysis performed in the 2015 DSN Ros\'a et
al.\ paper~\cite{dsn-paper} to highlight similarities and differences in the
Google Borg workload and the behaviour and patterns of executions within it.
Thanks to this analysis, we aim to better understand the causes of failures and
how to prevent them. Additionally, given the technical challenge this analysis
posed, we also describe the methods used to carry it out.

The report is structured as follows. Section~\ref{sec2} contains information
about the current state of the art for Google Borg cluster traces.
Section~\ref{sec3} provides an overview including technical background
information on the data to analyze and its storage format. Section~\ref{sec4}
discusses the project requirements and the data science methods used to
perform the analysis. Section~\ref{sec5}, Section~\ref{sec6} and
Section~\ref{sec7} show the results obtained while analyzing, respectively, the
performance impact of unsuccessful executions, the patterns of task and job
events, and the potential causes of unsuccessful executions. Finally,
Section~\ref{sec8} concludes.

\section{State of the Art}\label{sec2}

H & Europe/Brussels \\

timezone of each cluster in the 2019 Google Borg traces.}\label{fig:clusters}
\end{figure}

In 2015, Ros\'a et al.\ published a
research paper titled \textit{Understanding the Dark Side of Big Data Clusters:
An Analysis beyond Failures}~\cite{dsn-paper} in which they performed several
analyses on unsuccessful executions in Google's 2011 Borg cluster traces
with the aim of identifying their resource waste, their impact on the
performance of the application, and any causes that may lie behind such
failures. The salient conclusion of that research is that a large share of the
computations performed by Google eventually end in failure, leading
to large amounts of computational power being wasted.

However, with the release of the new 2019 traces~\cite{google-marso-19},
the results and conclusions found by that paper could potentially be outdated
in the current large-scale computing world.
The new traces not only provide updated data on Borg's
workload, but also data from 8 different clusters,
from now on referred to as ``Cluster A'' to ``Cluster H''.
The geographical
location of each cluster can be consulted in Figure~\ref{fig:clusters}. The
information in that table was provided by the 2019 traces
documentation~\cite{google-drive-marso}.

The new 2019 traces provide richer data even on a cluster-by-cluster basis. For
example, the amount and variety of server configurations per cluster increased,
as shown in
Figure~\ref{fig:machineconfigs-csts} on a cluster-by-cluster basis.

\input{figures/machine_configs}

There are two main works covering the new data, one being the paper
\textit{Borg: The Next Generation}~\cite{google-marso-19}, which compares the
overall features of the trace with the 2011
one~\cite{google-marso-11}~\cite{github-marso}, and one covering the features
and performance of \textit{Autopilot}~\cite{james-muratore}, a software that
provides autoscaling features in Borg. The new traces have also been analyzed
from the execution priority perspective~\cite{down-under}, as well as from a
cluster-by-cluster comparison~\cite{golf-course} given the multi-cluster nature
of the new traces.

Other studies have been performed in similar big-data systems focusing on the
failure of hardware components and software
bugs~\cite{9}~\cite{10}~\cite{11}~\cite{12}.

However, the community has not yet performed any research on the new Borg
traces analysing unsuccessful executions, their possible causes, and the
relationships between tasks and jobs. Therefore, the only current research in
this field is this very report, providing an update to the 2015 Ros\'a et
al.\ paper~\cite{dsn-paper} focusing on the new trace.

\section{Background}\label{sec3}

\textit{Borg} is Google's own cluster management software able to run
thousands of different jobs. Among the various cluster management services it
provides, the main ones are: job queuing, scheduling, allocation, and
deallocation due to higher priority computations.

The core structure of Borg is a cell, a set of
machines usually all within the same cluster, whose work is allocated by the
same cluster-management system and hence a cell is handled as a unit. Each
cell may run a large computational workload that is submitted to Borg. Such
workload is called a ``job'', which outlines the computations that a user wants
to run and is made up of several ``tasks''. A task is an executable program,
consisting of multiple processes, which has to be run on a single machine.
Those tasks may be run sequentially or in parallel, and the condition for a
job's successful termination is nontrivial.

Both task and job lifecycles are represented by several events, which are
encoded and stored in the trace as rows of various tables. Among the
information events provide, the field ``type'' provides information on the
execution status of the job or task. We focus only on events whose ``types''
indicate a termination, i.e.\ the end of a task or job's execution.
These termination event types are illustrated in Figure~\ref{fig:eventTypes}.
We then define an unsuccessful execution to be an execution characterized by a
termination event of type \texttt{EVICT}, \texttt{FAIL} or \texttt{KILL}.
Conversely, a successful execution is characterized by a \texttt{FINISH}
termination event.
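
As a minimal sketch of this classification (the termination type names come
from the traces, while the Python encoding is our own, illustrative choice):

\begin{verbatim}
# Sketch: classifying an execution by its termination event type. EVICT,
# FAIL and KILL are unsuccessful; FINISH is successful, as defined above.
TERMINATION_TYPES = {"EVICT", "FAIL", "FINISH", "KILL"}

def is_unsuccessful(termination_type: str) -> bool:
    assert termination_type in TERMINATION_TYPES
    return termination_type in {"EVICT", "FAIL", "KILL"}
\end{verbatim}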

\subsection{Traces}

Google\label{fig:eventTypes}}
\end{figure}

\hypertarget{traces-contents}{%
\subsection{Trace Contents}\label{traces-contents}}

The traces provided by Google contain mainly a collection of job and
task events spanning a month of execution of the 8 different clusters.

used by Google that combines several parameters like number of
processors and cores, clock frequency, and architecture (i.e.~ISA).

\hypertarget{overview-of-traces-format}{%
\subsection{Trace Format}\label{overview-of-traces-format}}

The traces have a collective size of approximately 8 TiB and are stored
in a Gzip-compressed JSONL (JSON lines) format, which means that each

The scope of this thesis focuses on the tables

\hypertarget{remark-on-traces-size}{%
\subsection{Remark on Trace Size}\label{remark-on-traces-size}}

While the 2011 Google Borg traces were relatively small, with a total
size in the order of the tens of gigabytes, the 2019 traces are quite

parallel Map-Reduce computations. In this section, we discuss the technical
details behind our approach.

\hypertarget{introduction-on-apache-spark}{%
\subsection{Apache Spark}\label{introduction-on-apache-spark}}

Apache Spark is a unified analytics engine for large-scale data
processing. In layman's terms, Spark is really useful to parallelize

Spark has very powerful native Python bindings in the form of the
\emph{PySpark} API, which were used to implement the various queries.

\hypertarget{query-architecture}{%
\subsection{Query Architecture}\label{query-architecture}}

\subsubsection{Overview}

Finally, a reduce operation is applied to either
further aggregate those computed properties or to generate an aggregated data
structure for storage purposes.
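
A minimal sketch of this map-then-reduce structure in PySpark follows; the
shard path and the \texttt{type} field are illustrative placeholders rather
than the actual query code:

\begin{verbatim}
# Sketch of the map-reduce query structure described above: parse each
# JSONL record, map it to the property of interest, then reduce into an
# aggregated result. Path and field names are illustrative.
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-sketch").getOrCreate()

# textFile() transparently decompresses the Gzip shards.
lines = spark.sparkContext.textFile("collection_events/*.json.gz")

counts = (lines
          .map(json.loads)                    # parse: JSONL line -> dict
          .map(lambda r: (r.get("type"), 1))  # map: extract the property
          .reduceByKey(lambda a, b: a + b))   # reduce: aggregate per type
print(counts.collect())
\end{verbatim}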

\subsubsection{Parsing Table Files}

As stated before, table ``files'' are composed of several Gzip-compressed
shards of JSONL record data. The specification for the types and constraints
of each record is outlined by Google in the form of a protobuffer specification
file found in the trace release
package~\cite{google-proto-marso}. This file was used as
the oracle specification and was a critical reference for writing the query
code that checks, parses and carefully sanitizes the various JSONL records
prior to actual computations.

these values the key-value pair in the JSON object is outright omitted. When
reading the traces in Apache Spark it is therefore necessary to check for this
possibility and insert back the omitted record attributes.
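
A sketch of this sanitization step follows, assuming a hypothetical
\texttt{DEFAULTS} mapping; in practice, the defaults are dictated by the
protobuffer specification:

\begin{verbatim}
# Sketch: restoring attributes omitted from a JSONL record because they
# held their protobuf default value. DEFAULTS is illustrative and would be
# derived from the protobuffer specification in practice.
import gzip
import json

DEFAULTS = {"type": 0, "priority": 0}  # hypothetical fields and defaults

def parse_record(line: str) -> dict:
    record = json.loads(line)
    for field, default in DEFAULTS.items():
        record.setdefault(field, default)  # re-insert omitted defaults
    return record

with gzip.open("shard-00000.json.gz", "rt") as f:  # hypothetical shard
    records = [parse_record(line) for line in f]
\end{verbatim}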

\subsubsection{The Queries}

Most queries use only two or three fields in each trace record, while the
original table records are often made of a couple of dozen fields. In order to

and performing the desired computation on the obtained chronological event log.
Sometimes intermediate results are saved in Spark's Parquet format, so that
they can be computed once and then reused by later queries.
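
A sketch of this field-pruning and caching approach, with illustrative column
and path names:

\begin{verbatim}
# Sketch: pruning each record to the two or three fields a query needs,
# then saving the intermediate result as Parquet for reuse. Column and
# path names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("prune-sketch").getOrCreate()

events = spark.read.json("collection_events/*.json.gz")
pruned = events.select("collection_id", "time", "type")
pruned.write.mode("overwrite").parquet("intermediate/events.parquet")

# Later queries start from the much smaller Parquet copy.
cached = spark.read.parquet("intermediate/events.parquet")
\end{verbatim}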

\subsection{Query Script Design and the \textit{Task Slowdown} Script}

In this section we aim to show the general complexity behind the implementations
of query scripts by explaining in detail some sampled scripts to better
appreciate their behaviour.

One example of an analysis script with average complexity and a pretty
straightforward structure is the pair of scripts \texttt{task\_slowdown.py} and
\texttt{task\_slowdown\_table.py} used to compute the ``task slowdown'' tables

in the clear and concise tables found in Figure~\ref{fig:taskslowdown}.
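
As a hedged sketch of what such a computation involves, assuming the 2015
paper's definition of slowdown (the ratio between the total response time,
unsuccessful executions included, and the response time of the final,
successful execution) and a hypothetical input layout:

\begin{verbatim}
# Sketch: mean task slowdown per priority tier, assuming the 2015 paper's
# definition of slowdown. Input layout is hypothetical: one
# (priority_tier, [execution_durations...]) pair per task, where the last
# duration belongs to the successful execution.
from collections import defaultdict

def task_slowdown(durations: list[float]) -> float:
    return sum(durations) / durations[-1]

def mean_slowdown_per_tier(tasks: list[tuple[str, list[float]]]) -> dict:
    acc = defaultdict(list)
    for tier, durations in tasks:
        acc[tier].append(task_slowdown(durations))
    return {tier: sum(v) / len(v) for tier, v in acc.items()}

# Example: a task succeeding on the first try (slowdown 1) and one that
# wasted two failed attempts before finishing (slowdown 4).
print(mean_slowdown_per_tier([("Production", [10.0]),
                              ("Free", [5.0, 7.0, 4.0])]))
\end{verbatim}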

\section{Analysis: Performance Impact of Unsuccessful Executions}\label{sec5}

Our first investigation focuses on replicating the analysis done in the
Ros\'a et al.\ paper~\cite{dsn-paper} regarding usage of machine time
and resources.

In this section we perform several analyses focusing on how machine time and

Refer to Figure~\ref{fig:taskslowdown} for a comparison between the 2011 and
2019 data; the same
means are computed on a cluster-by-cluster basis for 2019 data in
Figure~\ref{fig:taskslowdown-csts}.

In 2015 Ros\'a et al.~\cite{dsn-paper} measured mean task slowdown for each
task priority value; at the time, priorities were numeric values between 0 and
11. However, in the 2019 traces, task priorities are given as a numeric value
between 0 and 500. Therefore, to allow an easier comparison, mean task slowdown
values are computed by task priority tier over the 2019 data. Priority tiers
are semantically relevant priority ranges defined by Tirmazi et al.\ in the
2020 paper~\cite{google-marso-19} that introduced the 2019 traces. Equivalent
priority tiers are also provided next to the 2011 priority values in the table
covering the 2015 analysis.
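
A sketch of this mapping follows; the tier boundaries reflect our reading of
Tirmazi et al.~\cite{google-marso-19} and should be verified against the paper:

\begin{verbatim}
# Sketch: mapping a raw 2019 priority value (0-500) to its semantic tier.
# Boundaries follow our reading of Tirmazi et al. and should be
# double-checked against the paper before reuse.
def priority_tier(priority: int) -> str:
    if priority <= 99:
        return "Free"
    if priority <= 115:
        return "Best-effort Batch"
    if priority <= 119:
        return "Mid"
    if priority <= 359:
        return "Production"
    return "Monitoring"
\end{verbatim}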

tier respectively, and Cluster D has 12.04 mean slowdown in its ``Free'' tier.

In this analysis we aim to understand how physical resources of machines
in the Borg cluster are used to complete tasks. In particular, we compare how
CPU and memory resource allocation and usage are distributed among tasks based
on their termination
type.

\section{Analysis: Patterns of Task and Job Events}\label{sec6}

This section aims to use some of the techniques used in Section IV of
the Ros\'a et al.\ paper~\cite{dsn-paper} to find patterns and interdependencies
between task and job events by gathering statistics on those events. In
particular, Section~\ref{tabIII-section} explores how the success of a
task is inter-correlated with its own event patterns, which
Section~\ref{figV-section} explores even further by computing task success
probabilities based on the number of task termination events of a specific type.

traces is shown in Figure~\ref{fig:tableIII}. Additionally, a cluster-by-cluster
breakdown of the same data for the 2019 traces is shown in
Figure~\ref{fig:tableIII-csts}.

Each table from these figures shows the mean and the 95-th percentile of the
number of termination events per task, broken down by task termination type. In
addition, the table shows the mean number of \texttt{EVICT}, \texttt{FAIL},
\texttt{FINISH}, and \texttt{KILL} events for each task termination type.

overall data from the 2019 ones, and in Figure~\ref{fig:figureV-csts}, as a
cluster-by-cluster breakdown of the same data for the 2019 traces.

In Figure~\ref{fig:figureV} the 2011 and 2019 plots differ in their x-axis:
for 2011 data conditional probabilities are computed for a maximum event count
of 30, while for 2019 data they are computed for up to 50 events of a specific
kind. Nevertheless, another quite striking difference between the two plots can
be seen: while 2011 data has relatively smooth decreasing curves for all event
types, the curves in the 2019 data almost immediately plateau with no
significant change easily observed after 5 events of any kind.
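
As an illustration of how such conditional probabilities can be estimated,
here is a sketch that conditions on observing at least $k$ events of a given
type; the exact conditioning used in the figures may differ:

\begin{verbatim}
# Sketch: estimating P(success | at least k termination events of a given
# type). Input layout is illustrative: one (final_termination,
# per_type_event_counts) pair per task.
def success_probability(tasks, event_type: str, k: int) -> float:
    selected = [final for final, counts in tasks
                if counts.get(event_type, 0) >= k]
    finished = sum(1 for final in selected if final == "FINISH")
    return finished / len(selected) if selected else 0.0

tasks = [("FINISH", {"EVICT": 2}), ("KILL", {"EVICT": 3}),
         ("FINISH", {"FAIL": 1})]
print(success_probability(tasks, "EVICT", 2))  # -> 0.5
\end{verbatim}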

Additionally, it is noteworthy that cluster A has no \texttt{EVICT}ed jobs.

\section{Analysis: Potential Causes of Unsuccessful Executions}\label{sec7}

This section re-applies the techniques used in Section V of the Ros\'a et al.\
paper~\cite{dsn-paper} to find patterns and interdependencies
between task and job events. In particular,
Section~\ref{fig7-section} explores how task event rates vary with task
priority, event execution time, and machine concurrency;
Section~\ref{fig8-section} relates task event rates to requested, reserved,
and utilized resources. Finally, Section~\ref{fig9-section} aims to find
similar correlations at the job level, considering job size, job execution
time, and machine locality.

\subsection{Task Event Rates vs.\ Task Priority, Event Execution Time, and
Machine Concurrency}\label{fig7-section}

\input{figures/figure_7}

This analysis shows event rates (i.e.\ the relative percentage of termination
type events) over different configurations of task-level parameters.
Figure~\ref{fig:figureVII-a} and Figure~\ref{fig:figureVII-a-csts} show the
distribution of event rates over the various task priority tiers.
Figure~\ref{fig:figureVII-b} and Figure~\ref{fig:figureVII-b-csts} show the
distribution of event rates over the total event execution time. Finally,
Figure~\ref{fig:figureVII-c} and Figure~\ref{fig:figureVII-c-csts} show the
distribution of event rates over the metric of machine concurrency, defined as
the number of tasks co-executing on the machine at the moment the
termination event is recorded.
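
As a small illustration of this metric, a sketch under our own interval
representation of task executions (not the trace's actual format):

\begin{verbatim}
# Sketch: machine concurrency as defined above, i.e. the number of tasks
# co-executing on a machine at the instant a termination event is
# recorded. Tasks are modelled as (start, end) intervals on one machine.
def machine_concurrency(intervals: list[tuple[float, float]],
                        event_time: float) -> int:
    return sum(1 for start, end in intervals if start <= event_time < end)

# Example: three tasks on a machine, two still running at t=12.
print(machine_concurrency([(0, 10), (5, 20), (11, 30)], 12.0))  # -> 2
\end{verbatim}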

From this analysis we can make the following observations:

\begin{itemize}
\item
The behaviour of the curves in the task priority distributions
(in Figure~\ref{fig:figureVII-a} and Figure~\ref{fig:figureVII-a-csts})
for the 2019 traces is almost the opposite of the
2011 ones, i.e.\ in-between priorities have higher kill rates while
priorities at the extremum have lower kill rates. This could also be
due to the inherent distribution of job terminations;
\item
The event execution time curves (in Figure~\ref{fig:figureVII-b} and
Figure~\ref{fig:figureVII-b-csts}) for the 2019 traces
are quite different from the 2011 ones: here there
seems to be a good correlation between short task execution times
and finish event rates, instead of the ``U shape'' curve found in the Ros\'a
et al.\ 2015 DSN paper~\cite{dsn-paper};
\item
The behaviour among different clusters for the event execution time
distributions in Figure~\ref{fig:figureVII-b-csts} seems quite uniform;
\item
The machine concurrency metric, for which a distribution of event rates is
computed in Figure~\ref{fig:figureVII-c} and
Figure~\ref{fig:figureVII-c-csts}, seems to play little role in the event
termination distribution, as for all concurrency factors the \texttt{KILL}
event rate is around 90\% with little fluctuation.
\end{itemize}

\subsection{Task Event Rates vs.\ Requested Resources, Resource Reservation, and
Resource Utilization}\label{fig8-section}

\input{figures/figure_8}

This analysis is concerned with the distribution of event rates over several
resource-related parameters.
Figure~\ref{fig:figureVIII-a} and Figure~\ref{fig:figureVIII-a-csts} show the
distribution of task event rates w.r.t.\ the amount of CPU the task has
requested, while Figure~\ref{fig:figureVIII-b} and
Figure~\ref{fig:figureVIII-b-csts} show task event rates vs.\ requested memory.
Figure~\ref{fig:figureVIII-c} and Figure~\ref{fig:figureVIII-c-csts} show the
distribution of task event rates w.r.t.\ the amount of CPU that has been
collectively requested on the machine where the task is running, while
Figure~\ref{fig:figureVIII-d} and Figure~\ref{fig:figureVIII-d-csts} show a
similar distribution but for memory. Finally, Figure~\ref{fig:figureVIII-e} and
Figure~\ref{fig:figureVIII-e-csts} show the distribution of task event rates
w.r.t.\ the amount of CPU the task has actually utilized, while
Figure~\ref{fig:figureVIII-f} and Figure~\ref{fig:figureVIII-f-csts} show task
event rates vs.\ used memory.

From this analysis we can make the following observations:

\begin{itemize}
\item In the 2019 trace, the amount of requested CPU resources seems to have
little effect on the task termination, as can be evinced in
Figure~\ref{fig:figureVIII-a} and Figure~\ref{fig:figureVIII-a-csts}.
Instead, the task event rate distributions w.r.t.\ the amount of requested
memory (Figure~\ref{fig:figureVIII-b} and Figure~\ref{fig:figureVIII-b-csts})
show no discernible pattern;
\item
Overall, a significant increment in the killed event rate can be observed. It
seems to dominate all event rate measures;
\item
Among all clusters in Figure~\ref{fig:figureVIII-a-csts} the dominance of the
killed event rate can be noted. In 2011, a more dominant behaviour of the
success event rate curve was observed;
\item
For each analysed distribution, clusters do not show a common behaviour of the
curves. Some are similar, but they are generally distinguishable;
\item
In Figure~\ref{fig:figureVIII-e} it can be seen that while a drastic decrease
of the killed event rate curve is observed as the CPU utilization increases,
the success event rate does not increase much.
\end{itemize}

\subsection{Job Event Rates vs.\ Job Size, Job Execution Time, and Machine
Locality}\label{fig9-section}

\input{figures/figure_9}

This analysis shows job event rates (i.e.\ the relative percentage of
termination type events) over different configurations of job size, job
execution time and machine locality.

Figure~\ref{fig:figureIX-a} and Figure~\ref{fig:figureIX-a-csts} provide
the plots of job event rates versus the job
size. Job size is defined as the number of tasks belonging to the job.
Figure~\ref{fig:figureIX-b} and Figure~\ref{fig:figureIX-b-csts} provide
the plots of the job event rates versus
execution time.
Figure~\ref{fig:figureIX-c} and Figure~\ref{fig:figureIX-c-csts} provide
the plots of the job event
rates versus machine locality.
Machine locality is defined as the ratio between the number
of machines used to execute the tasks inside the job and the job size.
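
Written as a formula, with $M(j)$ denoting the set of machines used by job $j$
and $|j|$ its size, the definition above reads:
\[
  \mathrm{locality}(j) = \frac{\lvert M(j) \rvert}{\lvert j \rvert},
\]
so that, for example, a 10-task job spread over 10 distinct machines has
locality $1$, while the same job packed onto a single machine has locality
$0.1$.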

By analysing these plots, we can make the following observations:

\begin{itemize}
\item
Significant variations can be noted in the behaviour of the curves
between clusters;
\item
There are no smooth gradients in the various curves, unlike in the
2011 traces;
\item
Killed jobs have higher event rates in general, and overall dominate
all event rate measures. As can be seen in Figure~\ref{fig:figureIX-a}, a
higher number of tasks (i.e., a higher job size) seems to be correlated with
a higher killed event rate in 2019 than in 2011. In
Figure~\ref{fig:figureIX-b}, we observe the best success event rate for a
job execution time of 4--10 minutes, while in 2011, it seemed that the finish
event rate increases along with the job execution time;
\item
There still seems to be a strong correlation between short job execution
times and successful final termination, and likewise between kills and
longer job execution times. Especially for these two curves, in most cases
also across clusters, their behaviour suggests a mirrored trend;
\item
As can be seen in Figure~\ref{fig:figureIX-c}, across all clusters, a machine
locality factor of 1 seems to lead to the highest success event rate, while
in 2011 the same machine locality factor led to the lowest success event
rate.
\end{itemize}

\section{Conclusions, Limitations and Future Work}\label{sec8}

In this report we analyzed the Google Borg 2019 traces and compared them with
their 2011 counterpart from the perspective of unsuccessful executions, their
impact on resources and their causes. We discovered that the impact of
unsuccessful executions (especially of \texttt{KILL}ed tasks and jobs) in the
new traces is still very relevant in terms of machine time and resources, even
more so than in 2011. We also discovered that unsuccessful job and task event
patterns still play a major role in the overall execution success of Borg jobs
and tasks. We finally discovered that unsuccessful job and task event rates
dominate the overall landscape of Borg's own logs, even when grouping tasks and
jobs by parameters such as priority, resource request, reservation and
utilization, and machine locality.

We can then conclude that the performed analyses show many clear trends
regarding the correlation of execution success with several parameters and
metadata. These trends can potentially be exploited to build better scheduling
algorithms and new predictive models that could understand if an execution has
a high probability of failure based on its own properties and metadata. The
creation of such models could allow for computational resources to be saved and
used to either increase the throughput of higher priority workloads or to allow
for a larger workload altogether.

The biggest limitation and threat to validity posed to this project is the
relative lack of information provided by Google on the true meaning of
unsuccessful terminations. Indeed, given the ``black box'' nature of the traces
and the scarcity of information in the traces
documentation~\cite{google-drive-marso}, it is not clear if unsuccessful
executions yield any useful computation result or not. Our assumption in this
report is that unsuccessful jobs and tasks do not produce any result and are
therefore just burdens on machine time and resources, but should this assumption
be incorrect then the interpretation of the analyses might change.

Given the significant computational time invested in obtaining the results shown
in this report and due to time and resource limitations, some of the analyses
were not completed on all clusters. Our future work will focus on finishing
these analyses, computing results for the missing clusters and obtaining an
overall picture of the 2019 Google Borg cluster traces w.r.t.\ failures and
their causes.