report
This commit is contained in: parent f8045b560c, commit d886f1a417
10 changed files with 267 additions and 198 deletions
BIN report.zip Normal file
Binary file not shown.
@ -40,13 +40,13 @@ Switzerland]{Prof.}{Walter}{Binder}
Switzerland]{Dr.}{Andrea}{Ros\'a}
\end{committee}

\abstract{The thesis aims to compare two different traces coming from large
datacenters, focusing in particular on unsuccessful executions of jobs and
tasks submitted by users. The objective of this thesis is to compare the
resource waste caused by unsuccessful executions, their impact on application
performance, and their root causes. We show their strong negative impact on
CPU and RAM usage and on task slowdown. We analyze patterns of
unsuccessful jobs and tasks, focusing on their interdependency.
Moreover, we uncover their root causes by inspecting key workload and
system attributes such as machine locality and concurrency level.}

@ -56,24 +56,23 @@ system attributes such as machine locality and concurrency level.}
\newpage

\section{Introduction} In today's world there is an ever-growing demand for
efficient, large-scale computations. The rising trend of ``big data'' puts the
need for efficient management of large-scale parallelized computing at an
all-time high. This fact also increases the demand for research in the field of
distributed systems, in particular on how to schedule computations effectively,
avoid wasting resources, and avoid failures.

In 2011 Google released a month-long data trace of their own cluster management
system~\cite{google-marso-11}, \textit{Borg}, containing extensive data on
scheduling, priority management, and failures of a real production workload.
This data was the foundation of the 2015 Ros\'a et al.\ paper
\textit{Understanding the Dark Side of Big Data Clusters: An Analysis beyond
Failures}~\cite{dsn-paper}, which, among its many conclusions, highlighted the
need for better cluster management given the high number of failures found in
the traces.

In 2019 Google released an updated version of the \textit{Borg} cluster
traces~\cite{google-marso-19}, not only containing data from a far bigger
workload due to improvements in computational technology, but also providing
data from 8 different \textit{Borg} cells from datacenters located all over the
world.
@ -90,8 +89,8 @@ in carrying out non-successful executions, i.e.\ executing programs that would
eventually ``crash'' and potentially not leading to useful results\footnote{This
is only a speculation, since both the 2011 and the 2019 traces only provide a
``black box'' view of the Borg cluster system. Neither the accompanying
papers for the two traces~\cite{google-marso-11,google-marso-19} nor the
documentation for the 2019 traces~\cite{google-drive-marso} mentions whether
non-successful tasks produce any useful result.}. The 2019 subplot paints an
even darker picture, with less than 5\% of machine time used for successful
computation.
@ -107,7 +106,7 @@ models to mitigate or erase the resource impact of unsuccessful executions.
%\subsection{Challenges}
Given that the new 2019 Google Borg cluster traces are about 100 times larger
than the 2011 ones, and given that the entire compressed traces package has a
non-trivial size (weighing approximately 8 TiB~\cite{google-drive-marso}), the
computations required to perform the analysis we illustrate in this report
cannot be performed with classical data science techniques. A
considerable amount of computational power was needed to carry out the
@ -118,7 +117,7 @@ MapReduce-like structure.

%\subsection{Contribution}
This project aims to repeat the analysis performed in the 2015 DSN paper by
Ros\'a et al.~\cite{dsn-paper} to highlight similarities and differences in the
Google Borg workload and the behaviour and patterns of executions within it.
Thanks to this analysis, we aim to better understand the causes of failures
and how to prevent them. Additionally, given the technical challenge this
analysis posed,
@ -130,12 +129,12 @@ The report is structured as follows. Section~\ref{sec2} contains information
about the current state of the art for Google Borg cluster traces.
Section~\ref{sec3} provides an overview including technical background
information on the data to analyze and its storage format. Section~\ref{sec4}
discusses the project requirements and the data science methods used to
perform the analysis. Section~\ref{sec5}, Section~\ref{sec6} and
Section~\ref{sec7} show the results obtained while analyzing, respectively, the
performance impact of unsuccessful executions, the patterns of task and job
events, and the potential causes of unsuccessful executions. Finally,
Section~\ref{sec8} concludes.

\section{State of the Art}\label{sec2}

@ -159,17 +158,17 @@ H & Europe/Brussels \\
timezone of each cluster in the 2019 Google Borg traces.}\label{fig:clusters}
\end{figure}

In 2015, Ros\'a et al.\ published a
research paper titled \textit{Understanding the Dark Side of Big Data Clusters:
An Analysis beyond Failures}~\cite{dsn-paper} in which they performed several
analyses on unsuccessful executions in Google's 2011 Borg cluster traces
with the aim of identifying their resource waste, their impact on
application performance, and any causes that may lie behind such
failures. The salient conclusion of that research is that a large fraction of
the computations performed by Google eventually end in failure, leading
to large amounts of computational power being wasted.

However, with the release of the new 2019 traces~\cite{google-marso-19},
the results and conclusions of that paper could potentially be outdated
in the current large-scale computing world.
The new traces not only provide updated data on Borg's
@ -180,7 +179,7 @@ from now on referred as ``Cluster A'' to ``Cluster H''.
The geographical
location of each cluster can be consulted in Figure~\ref{fig:clusters}. The
information in that table was provided by the 2019 traces
documentation~\cite{google-drive-marso}.

The new 2019 traces provide richer data even on a cluster-by-cluster basis. For
example, the amount and variety of server configurations per cluster increased
@ -191,27 +190,27 @@ Figure~\ref{fig:machineconfigs-csts} on a cluster-by-cluster basis.

\input{figures/machine_configs}

There are two main works covering the new data, one being the paper
\textit{Borg: The Next Generation}~\cite{google-marso-19}, which compares the
overall features of the trace with the 2011
one~\cite{google-marso-11,github-marso}, and one covering the features
and performance of \textit{Autopilot}~\cite{james-muratore}, a software that
provides autoscaling features in Borg. The new traces have also been analyzed
from the execution priority perspective~\cite{down-under}, as well as from a
cluster-by-cluster comparison~\cite{golf-course} given the multi-cluster nature
of the new traces.

Other studies have been performed on similar big-data systems focusing on the
failure of hardware components and software
bugs~\cite{9,10,11,12}.

However, the community has not yet performed any research on the new Borg
traces analysing unsuccessful executions, their possible causes, and the
relationships between tasks and jobs. Therefore, the only current research in
this field is this very report, providing an update to the 2015 Ros\'a et
al.\ paper~\cite{dsn-paper} focusing on the new trace.

\section{Background}\label{sec3}

\textit{Borg} is Google's own cluster management software able to run
thousands of different jobs. Among the various cluster management services it
@ -227,11 +226,17 @@ paper\cite{dsn-paper}.
consisting of multiple processes, which has to be run on a single machine.
Those tasks may be run sequentially or in parallel, and the condition for a
job's successful termination is nontrivial.
Both task and job lifecycles are represented by several events, which are
encoded and stored in the trace as rows of various tables. Among the
information events provide, the field ``type'' provides information on the
execution status of the job or task. We focus only on events whose ``type''
indicates a termination, i.e.\ the end of a task or job's execution.
These termination event types are illustrated in Figure~\ref{fig:eventTypes}.
We then define an unsuccessful execution to be an execution characterized by a
termination event of type \texttt{EVICT}, \texttt{FAIL} or \texttt{KILL}.
Conversely, a successful execution is characterized by a \texttt{FINISH}
termination event.

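As a minimal illustration of this definition, the Python sketch below classifies
a termination event by its ``type'' field. The record and field names are
hypothetical placeholders and not the exact trace schema, which is handled by
our query scripts.

\begin{verbatim}
# Termination event types as defined above.
UNSUCCESSFUL_TYPES = {"EVICT", "FAIL", "KILL"}
SUCCESSFUL_TYPES = {"FINISH"}

def is_unsuccessful(event):
    """True if the termination event marks an unsuccessful execution."""
    return event.get("type") in UNSUCCESSFUL_TYPES

# Hypothetical termination event record:
print(is_unsuccessful({"collection_id": 1, "instance_index": 0,
                       "type": "KILL"}))   # True
\end{verbatim}
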
\subsection{Traces}

@ -276,7 +281,7 @@ Google\label{fig:eventTypes}}
\end{figure}

\hypertarget{traces-contents}{%
\subsection{Trace Contents}\label{traces-contents}}

The traces provided by Google contain mainly a collection of job and
task events spanning a month of execution of the 8 different clusters.
@ -301,8 +306,7 @@ used by Google that combines several parameters like number of
processors and cores, clock frequency, and architecture (i.e.~ISA).

\hypertarget{overview-of-traces-format}{%
\subsection{Trace Format}\label{overview-of-traces-format}}

The traces have a collective size of approximately 8 TiB and are stored
in a Gzip-compressed JSONL (JSON lines) format, which means that each
@ -328,7 +332,7 @@ The scope of this thesis focuses on the tables

\hypertarget{remark-on-traces-size}{%
\subsection{Remark on Trace Size}\label{remark-on-traces-size}}

While the 2011 Google Borg traces were relatively small, with a total
size in the order of the tens of gigabytes, the 2019 traces are quite
@ -361,8 +365,7 @@ parallel Map-Reduce computations. In this section, we discuss the technical
details behind our approach.

\hypertarget{introduction-on-apache-spark}{%
\subsection{Apache Spark}\label{introduction-on-apache-spark}}

Apache Spark is a unified analytics engine for large-scale data
processing. In layman's terms, Spark is especially useful to parallelize
@ -386,7 +389,7 @@ Spark has very powerful native Python bindings in the form of the
\emph{PySpark} API, which were used to implement the various queries.

\hypertarget{query-architecture}{%
\subsection{Query Architecture}\label{query-architecture}}

\subsubsection{Overview}

@ -406,13 +409,13 @@ Finally, a reduce operation is applied to either
further aggregate those computed properties or to generate an aggregated data
structure for storage purposes.

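To make this structure concrete, the following is a minimal PySpark sketch of
such a query; the file path, field names and the final aggregation are
hypothetical placeholders rather than the exact ones used by our scripts, and
only the read--filter--map--reduce shape is meant to be representative.

\begin{verbatim}
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("example-query").getOrCreate()

# Read one table: Spark transparently decompresses the Gzip-ed JSONL shards.
events = spark.read.json("instance_events/*.json.gz")

# Keep only the fields this query needs (placeholder names).
events = events.select("collection_id", "instance_index", "type")

# Filter: keep only termination events.
terminations = events.filter(
    F.col("type").isin("EVICT", "FAIL", "FINISH", "KILL"))

# Map + reduce: count termination events per type and store the result.
counts = terminations.groupBy("type").count()
counts.write.mode("overwrite").parquet("out/termination_counts.parquet")
\end{verbatim}
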
\subsubsection{Parsing Table Files}

As stated before, table ``files'' are composed of several Gzip-compressed
shards of JSONL record data. The specification for the types and constraints
of each record is outlined by Google in the form of a protobuffer specification
file found in the trace release
package~\cite{google-proto-marso}. This file was used as
the oracle specification and was a critical reference for writing the query
code that checks, parses and carefully sanitizes the various JSONL records
prior to actual computations.
@ -424,7 +427,7 @@ these values the key-value pair in the JSON object is outright omitted. When
reading the traces in Apache Spark, it is therefore necessary to check for this
possibility and insert back the omitted record attributes.

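A minimal sketch of this default-restoration step is shown below; the field
names and default values are hypothetical placeholders, while the real ones are
taken from the protobuffer specification~\cite{google-proto-marso}.

\begin{verbatim}
# Hypothetical defaults; the real ones come from the protobuffer spec.
DEFAULTS = {"priority": 0, "alloc_collection_id": 0, "resource_request": None}

def with_defaults(record):
    """Insert back attributes omitted because they hold the default value,
    so that downstream query code can rely on their presence."""
    restored = dict(record)
    for field, default in DEFAULTS.items():
        restored.setdefault(field, default)
    return restored

# Typical use: records.map(with_defaults) on a parsed RDD, or the
# equivalent DataFrame.fillna({...}) when working with Spark DataFrames.
\end{verbatim}
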
\subsubsection{The Queries}

Most queries use only two or three fields in each trace record, while the
original table records are often made of a couple of dozen fields. In order to
@ -452,14 +455,12 @@ and performing the desired computation on the obtained chronological event log.
Intermediate results are sometimes saved in Spark's Parquet format, so that
they can be computed once and then reused by later queries.

\subsection{Query Script Design and the \textit{Task Slowdown} Script}

In this section we aim to show the general complexity behind the implementations
of query scripts by explaining in detail some sampled scripts to better
appreciate their behaviour.

One example of an analysis script with average complexity and a fairly
straightforward structure is the pair of scripts \texttt{task\_slowdown.py} and
\texttt{task\_slowdown\_table.py} used to compute the ``task slowdown'' tables
@ -532,7 +533,7 @@ in the clear and coincise tables found in Figure~\ref{fig:taskslowdown}.
\section{Analysis: Performance Impact of Unsuccessful Executions}\label{sec5}

Our first investigation focuses on replicating the analysis done in the
Ros\'a et al.\ paper~\cite{dsn-paper} regarding usage of machine time
and resources.

In this section we perform several analyses focusing on how machine time and
@ -638,13 +639,13 @@ Refer to Figure~\ref{fig:taskslowdown} for a comparison between the 2011 and
means are computed on a cluster-by-cluster basis for 2019 data in
Figure~\ref{fig:taskslowdown-csts}.

In 2015 Ros\'a et al.~\cite{dsn-paper} measured mean task slowdown for each task
priority value, which at the time was an integer between 0 and 11. However,
in the 2019 traces, task priorities are given as a numeric value between 0 and
500. Therefore, to allow an easier comparison, mean task slowdown values are
computed by task priority tier over the 2019 data. Priority tiers are
semantically relevant priority ranges defined in the 2020 Tirmazi et al.\
paper~\cite{google-marso-19} that introduced the 2019 traces. Equivalent
priority tiers are also provided next to the 2011 priority values in the table
covering the 2015 analysis.

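As an illustration of this grouping, the sketch below computes a per-tier mean
with PySpark. It assumes a DataFrame with one row per task and a
\texttt{slowdown} column already computed by \texttt{task\_slowdown.py}; both
the column names and the \texttt{priority\_to\_tier} helper (whose real
boundaries are those of Tirmazi et al.~\cite{google-marso-19}) are hypothetical.

\begin{verbatim}
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def priority_to_tier(priority):
    # Placeholder mapping: only the Free tier boundary is shown here; the
    # real tier ranges are those defined by Tirmazi et al.
    return "free" if priority <= 99 else "other"

tier_udf = F.udf(priority_to_tier, StringType())

def mean_slowdown_by_tier(task_slowdowns):
    """task_slowdowns: one row per task with `priority` and `slowdown`."""
    return (task_slowdowns
            .withColumn("tier", tier_udf(F.col("priority")))
            .groupBy("tier")
            .agg(F.mean("slowdown").alias("mean_slowdown")))
\end{verbatim}
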
@ -687,7 +688,7 @@ tier respectively, and Cluster D has 12.04 mean slowdown in its ``Free'' tier.

In this analysis we aim to understand how physical resources of machines
in the Borg cluster are used to complete tasks. In particular, we compare how
CPU and memory resource allocation and usage are distributed among tasks based
on their termination type.

@ -739,9 +740,9 @@ traces.
\section{Analysis: Patterns of Task and Job Events}\label{sec6}

This section aims to use some of the techniques used in Section IV of
the Ros\'a et al.\ paper~\cite{dsn-paper} to find patterns and interdependencies
between task and job events by gathering statistics on those events. In
particular, Section~\ref{tabIII-section} explores how the success of a
task is correlated with its own event patterns, which
Section~\ref{figV-section} explores even further by computing task success
probabilities based on the number of task termination events of a specific type.

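For clarity, the conditional probability computed in Section~\ref{figV-section}
can be sketched as follows; the per-task event counts are assumed to have been
aggregated beforehand, and all column names are hypothetical.

\begin{verbatim}
import pyspark.sql.functions as F

def success_probability(per_task, event_col, k):
    """P(task terminates with FINISH | it experienced at least k events of
    the given type).  per_task: one row per task with columns n_evict,
    n_fail, n_kill, n_finish and final_type (hypothetical names)."""
    subset = per_task.filter(F.col(event_col) >= k)
    total = subset.count()
    if total == 0:
        return None
    return subset.filter(F.col("final_type") == "FINISH").count() / total

# Example: success_probability(per_task_df, "n_evict", 5)
\end{verbatim}
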
@ -765,7 +766,7 @@ traces is shown in Figure~\ref{fig:tableIII}. Additionally, a cluster-by-cluster
breakdown of the same data for the 2019 traces is shown in
Figure~\ref{fig:tableIII-csts}.

Each table from these figures shows the mean and the 95th percentile of the
number of termination events per task, broken down by task termination. In
addition, the table shows the mean number of \texttt{EVICT}, \texttt{FAIL},
\texttt{FINISH}, and \texttt{KILL} events for each task termination type.

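A sketch of how these statistics can be derived with PySpark is given below;
column names are hypothetical, and the 95th percentile is obtained with Spark's
approximate percentile aggregate.

\begin{verbatim}
import pyspark.sql.functions as F

def termination_event_stats(per_task):
    """per_task: one row per task with its final termination type and the
    counts of its termination events (hypothetical column names)."""
    return (per_task
            .groupBy("final_type")
            .agg(F.mean("n_events").alias("mean_events"),
                 F.expr("percentile_approx(n_events, 0.95)").alias("p95_events"),
                 F.mean("n_evict").alias("mean_evict"),
                 F.mean("n_fail").alias("mean_fail"),
                 F.mean("n_finish").alias("mean_finish"),
                 F.mean("n_kill").alias("mean_kill")))
\end{verbatim}
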
@ -807,10 +808,10 @@ overall data from the 2019 ones, and in Figure~\ref{fig:figureV-csts}, as a
cluster-by-cluster breakdown of the same data for the 2019 traces.

In Figure~\ref{fig:figureV} the 2011 and 2019 plots differ in their x-axis:
for 2011 data conditional probabilities are computed for a maximum event count
of 30, while for 2019 data they are computed for up to 50 events of a specific
kind. Nevertheless, another quite striking difference between the two plots can
be seen: while 2011 data has relatively smooth decreasing curves for all event
types, the curves in the 2019 data almost immediately plateau, with no
significant change easily observed after 5 events of any kind.

@ -872,8 +873,8 @@ Additionally, it is noteworthy that cluster A has no \texttt{EVICT}ed jobs.

\section{Analysis: Potential Causes of Unsuccessful Executions}\label{sec7}

This section re-applies the techniques used in Section V of the Ros\'a et al.\
paper~\cite{dsn-paper} to find patterns and interdependencies
between task and job events by gathering statistics on those events. In
particular, Section~\ref{tabIII-section} explores how the success of a
task is correlated with its own event patterns, which
@ -882,109 +883,177 @@ probabilities based on the number of task termination events of a specific type.
Finally, Section~\ref{tabIV-section} aims to find similar correlations, but at
the job level.

\subsection{Task Event Rates vs.\ Task Priority, Event Execution Time, and
Machine Concurrency}\label{fig7-section}
\input{figures/figure_7}

This analysis shows event rates (i.e.\ the relative percentage of termination
type events) over different configurations of task-level parameters.
Figure~\ref{fig:figureVII-a} and Figure~\ref{fig:figureVII-a-csts} show the
distribution of event rates over the various task priority tiers.
Figure~\ref{fig:figureVII-b} and Figure~\ref{fig:figureVII-b-csts} show the
distribution of event rates over the total event execution time. Finally,
Figure~\ref{fig:figureVII-c} and Figure~\ref{fig:figureVII-c-csts} show the
distribution of event rates over the metric of machine concurrency, defined as
the number of co-executing tasks on the machine at the moment the
termination event is recorded.

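To clarify how such event rates can be derived, the PySpark sketch below
computes, for each bucket of a chosen parameter (the priority tier in this
example), the relative percentage of each termination type; column names are
hypothetical.

\begin{verbatim}
import pyspark.sql.functions as F
from pyspark.sql import Window

def event_rates(terminations, bucket_col):
    """terminations: one row per termination event with a `type` column
    (EVICT/FAIL/FINISH/KILL) and the bucketed parameter `bucket_col`."""
    counts = terminations.groupBy(bucket_col, "type").count()
    per_bucket = Window.partitionBy(bucket_col)
    return counts.withColumn(
        "rate_pct", 100.0 * F.col("count") / F.sum("count").over(per_bucket))

# Example: event_rates(termination_events, "tier")
\end{verbatim}
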
From this analysis we can make the following observations:

\begin{itemize}
\item
The behaviour of the curves in the task priority distributions
(in Figure~\ref{fig:figureVII-a} and Figure~\ref{fig:figureVII-a-csts})
for the 2019 traces is almost the opposite of the
2011 ones, i.e.\ in-between priorities have higher kill rates while
priorities at the extremes have lower kill rates;
\item
The event execution time curves (in Figure~\ref{fig:figureVII-b} and
Figure~\ref{fig:figureVII-b-csts}) for the 2019 traces
are quite different from the 2011 ones: here there
seems to be a good correlation between short task execution times
and finish event rates, instead of the ``U-shaped'' curve found in the Ros\'a
et al.\ 2015 DSN paper~\cite{dsn-paper};
\item
The behaviour among different clusters for the event execution time
distributions in Figure~\ref{fig:figureVII-b-csts} seems quite uniform;
\item
The machine concurrency metric, for which a distribution of event rates is
computed in Figure~\ref{fig:figureVII-c} and
Figure~\ref{fig:figureVII-c-csts}, seems to play little role in the event
termination distribution, as for all concurrency factors the \texttt{KILL}
event rate is around 90\% with little fluctuation.
\end{itemize}
\subsection{Task Event Rates vs.\ Requested Resources, Resource Reservation, and
Resource Utilization}\label{fig8-section}
\input{figures/figure_8}

This analysis is concerned with the distribution of event rates over several
resource-related parameters.
Figure~\ref{fig:figureVIII-a} and Figure~\ref{fig:figureVIII-a-csts} show the
distribution of task event rates w.r.t.\ the amount of CPU the task has
requested, while Figure~\ref{fig:figureVIII-b} and
Figure~\ref{fig:figureVIII-b-csts} show task event rates vs.\ requested memory.
Figure~\ref{fig:figureVIII-c} and Figure~\ref{fig:figureVIII-c-csts} show the
distribution of task event rates w.r.t.\ the amount of CPU that has been
collectively requested on the machine where the task is running, while
Figure~\ref{fig:figureVIII-d} and Figure~\ref{fig:figureVIII-d-csts} show a
similar distribution but for memory. Finally, Figure~\ref{fig:figureVIII-e} and
Figure~\ref{fig:figureVIII-e-csts} show the distribution of task event rates
w.r.t.\ the amount of CPU actually used by the task, while
Figure~\ref{fig:figureVIII-f} and Figure~\ref{fig:figureVIII-f-csts} show task
event rates vs.\ used memory.

From this analysis we can make the following observations:

\begin{itemize}
\item In the 2019 traces, the amount of requested CPU resources seems to have
little effect on task termination, as can be evinced from
Figure~\ref{fig:figureVIII-a} and Figure~\ref{fig:figureVIII-a-csts}.
Instead, the task event rate distributions w.r.t.\ the amount of requested
memory (Figure~\ref{fig:figureVIII-b} and Figure~\ref{fig:figureVIII-b-csts})
show no discernible pattern;
\item
Overall, a significant increase in the killed event rate can be observed; it
seems to dominate all event rate measures;
\item
Among all clusters in Figure~\ref{fig:figureVIII-a-csts}, the dominance of
the killed event rate can be noted. In 2011, the success event rate curve
showed a more dominant behaviour;
\item
For each analysed distribution, clusters do not show a common behaviour of
the curves. Some are similar, but they are generally distinguishable;
\item
In Figure~\ref{fig:figureVIII-e} it can be seen that, while a drastic
decrease of the killed event rate curve is observed as CPU utilization
increases, the success event rate does not increase much.
\end{itemize}

\subsection{Job Event Rates vs.\ Job Size, Job Execution Time, and Machine
Locality}\label{fig9-section}
\input{figures/figure_9}

This analysis shows job event rates (i.e.\ the relative percentage of
termination type events) over different configurations of job size, job
execution time, and machine locality.

Figure~\ref{fig:figureIX-a} and Figure~\ref{fig:figureIX-a-csts} provide
the plots of job event rates versus the job size. Job size is defined as the
number of tasks belonging to the job.
Figure~\ref{fig:figureIX-b} and Figure~\ref{fig:figureIX-b-csts} provide
the plots of the job event rates versus execution time.
Figure~\ref{fig:figureIX-c} and Figure~\ref{fig:figureIX-c-csts} provide
the plots of the job event rates versus machine locality.
Machine locality is defined as the ratio between the number
of machines used to execute the tasks inside the job and the job size.

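Both metrics can be derived per job as in the short PySpark sketch below; the
column names are hypothetical.

\begin{verbatim}
import pyspark.sql.functions as F

def job_size_and_locality(task_events):
    """task_events: one row per task termination with hypothetical columns
    collection_id (the job), instance_index (the task) and machine_id."""
    return (task_events
            .groupBy("collection_id")
            .agg(F.countDistinct("instance_index").alias("job_size"),
                 F.countDistinct("machine_id").alias("n_machines"))
            .withColumn("machine_locality",
                        F.col("n_machines") / F.col("job_size")))
\end{verbatim}
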
By analysing these plots, we can make the following observations:

\begin{itemize}
\item
Significant variations in the behaviour of the curves can be noted
between clusters;
\item
There are no smooth gradients in the various curves, unlike in the
2011 traces;
\item
Killed jobs have higher event rates in general, and overall dominate
all event rate measures. As can be seen in Figure~\ref{fig:figureIX-a}, a
higher number of tasks (i.e., a higher job size) seems to be correlated with a
higher killed event rate more strongly in 2019 than in 2011. In
Figure~\ref{fig:figureIX-b}, we observe the best success event rate for a
job execution time of 4--10 minutes, while in 2011 the finish
event rate seemed to increase along with the job execution time;
\item
There still seems to be a strong correlation between short job execution
times and successful final termination, and likewise for kills and
higher job terminations. Especially for these two curves, their behaviour
suggests mirrored trends, in most cases also across clusters;
\item
As can be seen in Figure~\ref{fig:figureIX-c}, across all clusters, a machine
locality factor of 1 seems to lead to the highest success event rate, while
in 2011 the same machine locality factor led to the lowest success event
rate.
\end{itemize}

\section{Conclusions, Limitations and Future Work}\label{sec8}
In this report we analyzed the Google Borg 2019 traces and compared them with
their 2011 counterpart from the perspective of unsuccessful executions, their
impact on resources and their causes. We discovered that the impact of
unsuccessful executions (especially of \texttt{KILL}ed tasks and jobs) in the
new traces is still very relevant in terms of machine time and resources, even
more so than in 2011. We also discovered that unsuccessful job and task event
patterns still play a major role in the overall execution success of Borg jobs
and tasks. We finally discovered that unsuccessful job and task event rates
dominate the overall landscape of Borg's own logs, even when grouping tasks and
jobs by parameters such as priority, resource request, reservation and
utilization, and machine locality.

We can then conclude that the performed analyses show many clear trends
regarding the correlation of execution success with several parameters and
metadata. These trends can potentially be exploited to build better scheduling
algorithms and new predictive models that could determine whether an execution
has a high probability of failure based on its own properties and metadata. The
creation of such models could allow for computational resources to be saved and
used to either increase the throughput of higher-priority workloads or to allow
for a larger workload altogether.

The biggest limitation and threat to validity posed to this project is the
relative lack of information provided by Google on the true meaning of
unsuccessful terminations. Indeed, given the ``black box'' nature of the traces
and the scarcity of information in the traces
documentation~\cite{google-drive-marso}, it is not clear if unsuccessful
executions yield any useful computation result or not. Our assumption in this
report is that unsuccessful jobs and tasks do not produce any result and are
therefore just burdens on machine time and resources, but should this assumption
be incorrect then the interpretation of the analyses might change.

Given the significant computational time invested in obtaining the results shown
in this report and due to time and resource limitations, some of the analyses
were not completed on all clusters. Our future work will focus on finishing
these analyses, computing results for the missing clusters and obtaining an
overall picture of the 2019 Google Borg cluster traces w.r.t.\ failures and
their causes.
@ -11,7 +11,7 @@
\begin{figure}[p]
\figureVIII[0.49\textwidth]{rcpu-2011}
\figureVIII[0.49\textwidth]{rcpu-all}
\caption{Task event rates vs.\ requested CPU (expressed in \textit{NCUs}),
w.r.t.\ task termination for 2011 and 2019 (all clusters aggregated) traces.}\label{fig:figureVIII-a}
\end{figure}

@ -24,7 +24,7 @@
\figureVIII{rcpu-f}
\figureVIII{rcpu-g}
\figureVIII{rcpu-h}
\caption{Task event rates vs.\ requested CPU (expressed in \textit{NCUs}),
w.r.t.\ task termination for each cluster in the 2019 traces.}\label{fig:figureVIII-a-csts}
\end{figure}
@ -33,7 +33,7 @@
\begin{figure}[p]
\figureVIII[0.49\textwidth]{rram-2011}
\figureVIII[0.49\textwidth]{rram-all}
\caption{Task event rates vs.\ requested memory (expressed in \textit{NMUs}),
w.r.t.\ task termination for 2011 and 2019 (all clusters aggregated) traces.}\label{fig:figureVIII-b}
\end{figure}

@ -46,7 +46,7 @@
\figureVIII{rram-f}
\figureVIII{rram-g}
\figureVIII{rram-h}
\caption{Task event rates vs.\ requested memory (expressed in \textit{NMUs}),
w.r.t.\ task termination for each cluster in the 2019 traces.}\label{fig:figureVIII-b-csts}
\end{figure}
@ -55,7 +55,7 @@
\begin{figure}[p]
\figureVIII[0.49\textwidth]{rscpu-2011}
\figureVIII[0.49\textwidth]{rscpu-all}
\caption{Task event rates vs.\ reserved CPU (expressed in \textit{NCUs}),
w.r.t.\ task termination for 2011 and 2019 (clusters A,B,E,F aggregated) traces.}\label{fig:figureVIII-c}
\end{figure}

@ -65,7 +65,7 @@
\figureVIII[0.49\textwidth]{rscpu-e}
\hfill
\figureVIII[0.49\textwidth]{rscpu-f}
\caption{Task event rates vs.\ reserved CPU (expressed in \textit{NCUs}),
w.r.t.\ task termination for clusters A,B,E,F in the 2019 traces.}\label{fig:figureVIII-c-csts}
\end{figure}
@ -74,7 +74,7 @@
\begin{figure}[p]
\figureVIII[0.49\textwidth]{rsram-2011}
\figureVIII[0.49\textwidth]{rsram-all}
\caption{Task event rates vs.\ reserved memory (expressed in \textit{NMUs}),
w.r.t.\ task termination for 2011 and 2019 (clusters A,B,E,F aggregated) traces.}\label{fig:figureVIII-d}
\end{figure}

@ -84,7 +84,7 @@
\figureVIII[0.49\textwidth]{rsram-e}
\hfill
\figureVIII[0.49\textwidth]{rsram-f}
\caption{Task event rates vs.\ reserved memory (expressed in \textit{NMUs}),
w.r.t.\ task termination for clusters A,B,E,F in the 2019 traces.}\label{fig:figureVIII-d-csts}
\end{figure}
@ -93,7 +93,7 @@
\begin{figure}[p]
\figureVIII[0.49\textwidth]{ucpu-2011}
\figureVIII[0.49\textwidth]{ucpu-all}
\caption{Task event rates vs.\ used CPU (expressed in \textit{NCUs}),
w.r.t.\ task termination for 2011 and 2019 (clusters A-D aggregated) traces.}\label{fig:figureVIII-e}
\end{figure}

@ -103,7 +103,7 @@
\figureVIII[0.49\textwidth]{ucpu-c}
\hfill
\figureVIII[0.49\textwidth]{ucpu-d}
\caption{Task event rates vs.\ used CPU (expressed in \textit{NCUs}),
w.r.t.\ task termination for clusters A-D in the 2019 traces.}\label{fig:figureVIII-e-csts}
\end{figure}
@ -112,7 +112,7 @@
\begin{figure}[p]
\figureVIII[0.49\textwidth]{uram-2011}
\figureVIII[0.49\textwidth]{uram-all}
\caption{Task event rates vs.\ used memory (expressed in \textit{NMUs}),
w.r.t.\ task termination for 2011 and 2019 (clusters A-D aggregated) traces.}\label{fig:figureVIII-f}
\end{figure}

@ -122,7 +122,7 @@
\figureVIII[0.49\textwidth]{uram-c}
\hfill
\figureVIII[0.49\textwidth]{uram-d}
\caption{Task event rates vs.\ used memory (expressed in \textit{NMUs}),
w.r.t.\ task termination for clusters A-D in the 2019 traces.}\label{fig:figureVIII-f-csts}
\end{figure}
@@ -9,7 +9,7 @@
\begin{figure}[p]
\figureIX[0.49\textwidth]{taskcount-2011.pgf}
\figureIX[0.49\textwidth]{taskcount-all.pgf}
-\caption{Job event rates vs.\ job size and final job termination in 2011 and 2019 (all clusters aggregated) traces. The job size is equivalent to the number of tasks belonging to the job.}\label{fig:figureIX-a}
+\caption{Job event rates vs.\ job size and final job termination in 2011 and 2019 (all clusters aggregated) traces. The job size is defined as the number of tasks belonging to the job.}\label{fig:figureIX-a}
\end{figure}

\begin{figure}[p]
@@ -103,7 +103,7 @@ Unknown & Unknown & 8729 & 1.639218\% \\
0.479492 & 0.500000 & 2 & 0.000376\% \\
\bottomrule
\end{tabular}}{2019 data}
-\caption{Overview of machine configurations in term of CPU and Memory power in 2011 and 2019 (all clusters aggregated) traces. In the 2019 traces NCU stands for ``Normalized Compute Unit'' and NMU stands for ``Normalized Compute Unit'': both are $[0,1]$ normalizations of resource values. While memory was measured in terms of capacity, CPU power was measured in ``Google Compute Units'' (GCUs), an opaque umbrella metric used by Google that factors in CPU clock, number of cores/processors, and CPU ISA architecture.}\label{fig:machineconfigs}
+\caption{Overview of machine configurations in terms of CPU and memory power in 2011 and 2019 (all clusters aggregated) traces. In the 2019 traces, NCU stands for ``Normalized Compute Unit'' and NMU stands for ``Normalized Memory Unit'': both are $[0,1]$ normalizations of resource values. While memory was measured in terms of capacity, CPU power was measured in ``Google Compute Units'' (GCUs), an opaque umbrella metric used by Google that factors in CPU clock speed, number of cores/processors, and CPU ISA.}\label{fig:machineconfigs}
\end{figure}

\begin{figure}[p]
@@ -231,5 +231,5 @@ Unknown & Unknown & 1720 & 2.933251\% \\
0.591797 & 0.666992 & 500 & 0.852689\% \\
0.958984 & 1.000000 & 200 & 0.341076\% \\
}{\\\\\\\\\\}
-\caption{Overview of machine configurations in terms of CPU and RAM resources for each cluster in the 2019 traces. Refer to figure~\ref{fig:machineconfigs} for a column legend.}\label{fig:machineconfigs-csts}
+\caption{Overview of machine configurations in terms of CPU and RAM resources for each cluster in the 2019 traces. Refer to Figure~\ref{fig:machineconfigs} for a column legend.}\label{fig:machineconfigs-csts}
\end{figure}
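The $[0,1]$ normalization behind the NCU and NMU columns of Figures~\ref{fig:machineconfigs} and~\ref{fig:machineconfigs-csts} can be read as a rescaling of each machine's capacity by the largest capacity in the trace; a minimal sketch of this reading (the trace documentation does not spell out the reference value, so the denominator is an assumption, and $\mathrm{GCU}_m$ and $\mathrm{RAM}_m$ are our notation for the raw capacities of machine $m$):
\[
\mathrm{NCU}_m = \frac{\mathrm{GCU}_m}{\max_{k}\,\mathrm{GCU}_k},
\qquad
\mathrm{NMU}_m = \frac{\mathrm{RAM}_m}{\max_{k}\,\mathrm{RAM}_k}.
\]
Under this reading, a machine with $\mathrm{NCU} = 0.5$ offers half the compute capacity of the best-equipped machine, which is why the largest configurations in the 2019 tables sit at (or close to) $1.0$.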
@@ -10,8 +10,7 @@
\machinetimewaste[1]{2011 data}{cluster_2011.pgf}
\machinetimewaste[1]{2019 data}{cluster_all.pgf}
\caption{Relative task time spent in each execution phase
-w.r.t.\ task termination in 2011 and 2019 (all clusters aggregated) traces. The x-axis shows task termination type,
-Y axis shows total time \% spent. Colors break down the time in execution phases. ``Unknown'' execution times are
+w.r.t.\ task termination in 2011 and 2019 (all clusters aggregated) traces. The x-axis shows task termination type, while the y-axis shows total time \% spent. Colors break down the time in execution phases. ``Unknown'' execution times are
2019-specific and correspond to event time transitions that are not considered ``typical'' by Google.}\label{fig:machinetimewaste-rel}
\end{figure}

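The relative breakdown in Figure~\ref{fig:machinetimewaste-rel} can be stated more precisely: for every final termination type, each bar reports the share of total task time that falls into a given execution phase. A minimal formulation of this share (our notation, not taken from the trace documentation) is
\[
\mathrm{share}(t,p) = \frac{\sum_{i \in \mathcal{T}_t} T_i(p)}{\sum_{i \in \mathcal{T}_t} \sum_{q} T_i(q)},
\]
where $\mathcal{T}_t$ is the set of tasks whose final termination is $t$ and $T_i(p)$ is the time task $i$ spends in phase $p$; the shares over all phases, including the 2019-only ``Unknown'' phase, sum to 100\% for each termination type.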
@@ -50,7 +50,7 @@ EVICT & FAIL & FINISH & KILL \\
\toprule
\tableIIIh
\midrule
-EVICT & 78.710 (342) & 52.242 & 0.673 & 0.000 & 25.795 \\
+EVICT & 78.710 (342) & 52.242 & 0.673 & 0 & 25.795 \\
FAIL & 24.962 (26) & 0.290 & 23.635 & 0.348 & 0.691 \\
FINISH & 2.962 (2) & 0.022 & 0.012 & 2.915 & 0.013 \\
KILL & 8.763 (16) & 1.876 & 0.143 & 0.003 & 6.741 \\
@@ -68,52 +68,52 @@ FINISH & 2.962 (2) & 0.022 & 0.012 & 2.915 & 0.013 \\

\begin{figure}[p]
\tableIII{A}{%
-EVICT & 103.228 (719) & 73.694 & 0.769 & 0.000 & 28.766 \\
+EVICT & 103.228 (719) & 73.694 & 0.769 & 0 & 28.766 \\
FAIL & 11.819 (26) & 0.288 & 11.062 & 0.002 & 0.468 \\
FINISH & 2.185 (1) & 0.019 & 0.004 & 2.153 & 0.008 \\
KILL & 5.963 (11) & 2.350 & 0.214 & 0.003 & 3.396 \\
}
\tableIII{B}{%
-EVICT & 83.018 (394) & 64.817 & 0.240 & 0.000 & 17.962 \\
+EVICT & 83.018 (394) & 64.817 & 0.240 & 0 & 17.962 \\
FAIL & 20.851 (62) & 0.518 & 19.657 & 0.001 & 0.675 \\
FINISH & 2.995 (4) & 0.020 & 0.021 & 2.943 & 0.012 \\
KILL & 9.173 (12) & 3.351 & 0.276 & 0.004 & 5.541 \\
}
\tableIII{C}{%
-EVICT & 98.437 (444) & 73.716 & 1.813 & 0.000 & 22.908 \\
+EVICT & 98.437 (444) & 73.716 & 1.813 & 0 & 22.908 \\
FAIL & 52.010 (30) & 0.773 & 48.446 & 2.035 & 0.756 \\
FINISH & 2.507 (2) & 0.018 & 0.013 & 2.471 & 0.006 \\
KILL & 5.452 (6) & 1.533 & 0.116 & 0.004 & 3.799 \\
}
\tableIII{D}{%
-EVICT & 76.759 (366) & 62.001 & 0.700 & 0.000 & 14.058 \\
+EVICT & 76.759 (366) & 62.001 & 0.700 & 0 & 14.058 \\
FAIL & 62.314 (62) & 0.496 & 58.968 & 0.810 & 2.040 \\
FINISH & 3.877 (2) & 0.059 & 0.019 & 3.789 & 0.010 \\
KILL & 6.795 (6) & 1.960 & 0.151 & 0.002 & 4.682 \\
}
\tableIII{E}{%
-EVICT & 17.678 (72) & 11.781 & 0.106 & 0.000 & 5.791 \\
-FAIL & 112.384 (28) & 0.458 & 111.471 & 0.000 & 0.456 \\
+EVICT & 17.678 (72) & 11.781 & 0.106 & 0 & 5.791 \\
+FAIL & 112.384 (28) & 0.458 & 111.471 & 0 & 0.456 \\
FINISH & 2.029 (2) & 0.014 & 0.008 & 1.999 & 0.008 \\
-KILL & 13.505 (64) & 1.288 & 0.057 & 0.000 & 12.160 \\
+KILL & 13.505 (64) & 1.288 & 0.057 & 0 & 12.160 \\
}
\tableIII{F}{%
-EVICT & 70.146 (114) & 23.974 & 0.192 & 0.000 & 45.980 \\
-FAIL & 41.087 (54) & 0.279 & 39.257 & 0.000 & 1.550 \\
+EVICT & 70.146 (114) & 23.974 & 0.192 & 0 & 45.980 \\
+FAIL & 41.087 (54) & 0.279 & 39.257 & 0 & 1.550 \\
FINISH & 3.129 (4) & 0.019 & 0.004 & 3.008 & 0.098 \\
KILL & 10.288 (38) & 0.384 & 0.098 & 0.001 & 9.804 \\
}
\tableIII{G}{%
-EVICT & 136.032 (490) & 77.429 & 0.303 & 0.000 & 58.299 \\
-FAIL & 8.948 (8) & 0.016 & 8.593 & 0.000 & 0.339 \\
+EVICT & 136.032 (490) & 77.429 & 0.303 & 0 & 58.299 \\
+FAIL & 8.948 (8) & 0.016 & 8.593 & 0 & 0.339 \\
FINISH & 14.176 (2) & 0.015 & 0.002 & 14.154 & 0.005 \\
-KILL & 32.320 (164) & 6.909 & 0.135 & 0.000 & 25.276 \\
+KILL & 32.320 (164) & 6.909 & 0.135 & 0 & 25.276 \\
}
\tableIII{H}{%
-EVICT & 14.734 (40) & 6.733 & 0.837 & 0.000 & 7.165 \\
-FAIL & 41.067 (120) & 0.600 & 37.600 & 0.000 & 2.867 \\
+EVICT & 14.734 (40) & 6.733 & 0.837 & 0 & 7.165 \\
+FAIL & 41.067 (120) & 0.600 & 37.600 & 0 & 2.867 \\
FINISH & 3.681 (2) & 0.024 & 0.014 & 3.633 & 0.011 \\
-KILL & 17.976 (98) & 0.633 & 0.170 & 0.000 & 17.173 \\
+KILL & 17.976 (98) & 0.633 & 0.170 & 0 & 17.173 \\
}
\caption{Mean number of termination events and their distributions per
task type for each cluster in the 2019 traces. The tables show an
@@ -20,7 +20,7 @@ KILL & 86.8 (400) & $13.3$ & $20.9$ & $26.9$ & $62.7$ \\
\toprule
\tableIVh%
\midrule
-EVICT & 1.000 (1) & 1.000 & 0.000 & 0.000 & 0.000 \\
+EVICT & 1.000 (1) & 1.000 & 0 & 0 & 0 \\
FAIL & 43.126 (200) & 0.114 & 2.300 & 0.981 & 12.833 \\
FINISH & 3.074 (2) & 0.005 & 0.153 & 1.778 & 0.014 \\
KILL & 53.919 (178) & 0.235 & 0.103 & 0.288 & 11.337 \\
@@ -42,43 +42,43 @@ FINISH & 1.187 (1) & 0.005 & 0.001 & 1.073 & 0.024 \\
KILL & 16.533 (10) & 1.045 & 0.074 & 0.461 & 1.189 \\
}
\tableIV{B}{
-EVICT & 1.000 (1) & 1.000 & 0.000 & 0.000 & 0.000 \\
+EVICT & 1.000 (1) & 1.000 & 0 & 0 & 0 \\
FAIL & 74.368 (374) & 2.003 & 1.994 & 0.267 & 4.944 \\
FINISH & 6.304 (10) & 0.022 & 0.008 & 2.349 & 0.013 \\
KILL & 69.853 (234) & 1.696 & 0.158 & 0.614 & 3.009 \\
}
\tableIV{C}{
-EVICT & 1.000 (1) & 1.001 & 0.000 & 0.000 & 0.000 \\
+EVICT & 1.000 (1) & 1.001 & 0 & 0 & 0 \\
FAIL & 41.982 (200) & 3.484 & 0.998 & 0.376 & 3.998 \\
FINISH & 1.991 (1) & 0.022 & 0.017 & 1.565 & 0.017 \\
KILL & 110.681 (652) & 0.627 & 0.059 & 0.656 & 2.267 \\
}
\tableIV{D}{
-EVICT & 1.000 (1) & 1.000 & 0.000 & 0.000 & 0.000 \\
+EVICT & 1.000 (1) & 1.000 & 0 & 0 & 0 \\
FAIL & 43.356 (250) & 6.112 & 0.949 & 0.531 & 6.498 \\
FINISH & 2.109 (2) & 0.268 & 0.013 & 1.723 & 0.019 \\
KILL & 89.648 (283) & 1.013 & 0.054 & 0.283 & 3.256 \\
}
\tableIV{E}{
-EVICT & 1.000 (1) & 1.000 & 0.000 & 0.000 & 0.000 \\
+EVICT & 1.000 (1) & 1.000 & 0 & 0 & 0 \\
FAIL & 23.081 (25) & 0.247 & 0.666 & 0.717 & 1.588 \\
FINISH & 7.776 (2) & 0.019 & 0.029 & 1.934 & 0.021 \\
KILL & 88.790 (309) & 0.706 & 0.029 & 0.461 & 7.572 \\
}
\tableIV{F}{
-EVICT & 1.000 (1) & 1.000 & 0.000 & 0.000 & 0.000 \\
+EVICT & 1.000 (1) & 1.000 & 0 & 0 & 0 \\
FAIL & 17.161 (8) & 0.621 & 0.546 & 0.426 & 7.559 \\
FINISH & 2.941 (2) & 0.015 & 0.051 & 1.670 & 0.162 \\
KILL & 103.889 (361) & 0.183 & 0.064 & 0.417 & 5.824 \\
}
\tableIV{G}{
-EVICT & 1.000 (1) & 1.000 & 0.000 & 0.000 & 0.000 \\
+EVICT & 1.000 (1) & 1.000 & 0 & 0 & 0 \\
FAIL & 51.835 (250) & 0.556 & 3.335 & 0.608 & 20.352 \\
FINISH & 8.519 (36) & 0.002 & 0.630 & 1.760 & 0.005 \\
KILL & 37.055 (100) & 5.687 & 0.065 & 0.080 & 19.166 \\
}
\tableIV{H}{
-EVICT & 1.000 (1) & 1.000 & 0.000 & 0.000 & 0.000 \\
+EVICT & 1.000 (1) & 1.000 & 0 & 0 & 0 \\
FAIL & 20.504 (1) & 0.114 & 2.300 & 0.981 & 12.833 \\
FINISH & 4.278 (14) & 0.005 & 0.153 & 1.778 & 0.014 \\
KILL & 11.023 (3) & 0.235 & 0.103 & 0.288 & 11.337 \\
@@ -42,21 +42,22 @@
\toprule
\taskslowdownheader
\midrule
-Best effort batch & 11.06\% & 4139 & 113 & 7.84\\
Free & 42.85\% & 1374 & 8 & 1.15\\
+Best effort batch & 11.06\% & 4139 & 113 & 7.84\\
Mid & 2.71\% & 18187 & 157 & 2.55\\
Monitoring & 2.74\% & 834226 & 130 & 2.05\\
Production & 13.54\% & 54789 & 24 & 6.68\\
\bottomrule
\end{tabular}}
-\caption{Mean task slowdown for each cluster and each priority ``tier'' for 2011 and
-2019 data. \textbf{\% finished} is the percentage of tasks with
-\texttt{FINISH} termination w.r.t.\ priority, \textbf{Mean response [s] (last
-execution)} is the mean response time (queue+execution time, in seconds) for
-the last task execution w.r.t.\ priority, \textbf{Mean response [s] (all
-executions)} is the response time (in seconds) of all executions,
-\textbf{Mean slowdown} is the mean slowdown measure w.r.t.\ priority.
-Priorities with no successfully terminated jobs have been omitted.}\label{fig:taskslowdown}
+\caption{Mean task slowdown for each cluster and each priority ``tier'' for 2011
+and 2019 data. \textbf{\% finished} is the percentage of tasks with
+\texttt{FINISH} termination w.r.t.\ priority, \textbf{mean response [s]
+(last execution)} is the mean response time (queue+execution time, in
+seconds) for the last task execution w.r.t.\ priority, \textbf{mean
+response [s] (all executions)} is the mean response time (in seconds) over
+all executions, \textbf{mean slowdown} is the mean slowdown measure w.r.t.\
+priority. Priorities with no successfully terminated jobs have been
+omitted.}\label{fig:taskslowdown}
\end{figure}

\begin{figure}[p]
@@ -65,8 +66,8 @@
\toprule
\taskslowdownheader
\midrule
-Best effort batch & 212.62\% & 71108 & 14201 & 5.17 \\
Free & 0.33\% & 5769 & 1203 & 82.97 \\
+Best effort batch & 212.62\% & 71108 & 14201 & 5.17 \\
Mid & 46.22\% & 8510 & 9135 & 1.16 \\
Monitoring & 2.82\% & 1200998 & 1054458 & 2.86 \\
Production & 27.21\% & 4546 & 16845 & 4.12 \\
@@ -78,8 +79,8 @@
\toprule
\taskslowdownheader
\midrule
-Best effort batch & 71.84\% & 1018454 & 550288 & 8.47 \\
Free & 45.21\% & 12047 & 5588 & 1.18 \\
+Best effort batch & 71.84\% & 1018454 & 550288 & 8.47 \\
Mid & 8.82\% & 225147 & 336262 & 1.11 \\
Monitoring & 4.12\% & 2627612 & 2024679 & 1.51 \\
Production & 30.92\% & 182604 & 466329 & 9.71 \\
@@ -91,8 +92,8 @@
\toprule
\taskslowdownheader
\midrule
-Best effort batch & 52.96\% & 1236666 & 997117 & 7.40 \\
Free & 73.36\% & 172214 & 5553 & 1.12 \\
+Best effort batch & 52.96\% & 1236666 & 997117 & 7.40 \\
Mid & 95.4\% & 579844 & 248553 & 2.04 \\
Monitoring & 5.88\% & 2159459 & 1761833 & 1.74 \\
Production & 3.61\% & 352603 & 357993 & 4.14 \\
@@ -105,8 +106,8 @@
\toprule
\taskslowdownheader
\midrule
-Best effort batch & 50.56\% & 1154060 & 1135023 & 12.04 \\
Free & 42.82\% & 22831 & 5506 & 1.15 \\
+Best effort batch & 50.56\% & 1154060 & 1135023 & 12.04 \\
Mid & 86.34\% & 228762 & 225269 & 2.56 \\
Monitoring & 2.21\% & 1588844 & 913816 & 2.16 \\
Production & 6.53\% & 279565 & 349364 & 5.51 \\
@@ -119,8 +120,8 @@
\toprule
\taskslowdownheader
\midrule
-Best effort batch & 0.47\% & 280811 & 205838 & 8.06 \\
Free & 48.15\% & 33050 & 40073 & 1.44 \\
+Best effort batch & 0.47\% & 280811 & 205838 & 8.06 \\
Mid & 0.46\% & 62123 & 83322 & 10.31 \\
Monitoring & 37.71\% & 1415296 & 1263746 & 2.82 \\
Production & 1.96\% & 231639 & 414149 & 8.54 \\
@@ -133,8 +134,8 @@
\toprule
\taskslowdownheader
\midrule
-Best effort batch & 44.29\% & 1368306 & 1563086 & 6.14 \\
Free & 45.86\% & 187447 & 37069 & 1.09 \\
+Best effort batch & 44.29\% & 1368306 & 1563086 & 6.14 \\
Mid & 31.36\% & 200116 & 110201 & 7.60 \\
Monitoring & 8.42\% & 2079134 & 1682711 & 2.08 \\
Production & 3.65\% & 297168 & 492372 & 5.94 \\
@@ -146,8 +147,8 @@
\toprule
\taskslowdownheader
\midrule
-Best effort batch & 104.33\% & 294959 & 184724 & 19.06 \\
Free & 33.85\% & 64718 & 15473 & 1.14 \\
+Best effort batch & 104.33\% & 294959 & 184724 & 19.06 \\
Mid & 49.06\% & 732532 & 706124 & 3.86 \\
Monitoring & 4.36\% & 1991341 & 1676276 & 1.72 \\
Production & 26.75\% & 115953 & 399050 & 14.57 \\
@@ -159,8 +160,8 @@
\toprule
\taskslowdownheader
\midrule
-Best effort batch & 107.03\% & 947368 & 527812 & 7.33 \\
Free & 28.79\% & 310534 & 290058 & 1.12 \\
+Best effort batch & 107.03\% & 947368 & 527812 & 7.33 \\
Mid & 2.18\% & 338883 & 197440 & 6.49 \\
Monitoring & 4.96\% & 2309296 & 1808698 & 1.94 \\
Production & 2.7\% & 298799 & 470783 & 5.80 \\
@@ -168,6 +169,6 @@
\end{tabular}
}
\caption{Mean task slowdown for each task priority for single
-clusters in the 2019 traces. Refer to \ref{fig:taskslowdown} for a legend of the
-columns}\label{fig:taskslowdown-csts}
+clusters in the 2019 traces. Refer to Figure~\ref{fig:taskslowdown} for a legend
+of the columns.}\label{fig:taskslowdown-csts}
\end{figure}
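Consistent with the column definitions in Figures~\ref{fig:taskslowdown} and~\ref{fig:taskslowdown-csts}, the slowdown of a single task can be read as the ratio between the response time accumulated over all of its executions and the response time of its last execution; a sketch of the per-task measure under this reading (the notation $r_i(e)$ is ours):
\[
\mathrm{slowdown}_i = \frac{\sum_{e=1}^{n_i} r_i(e)}{r_i(n_i)},
\]
where task $i$ has $n_i$ executions and $r_i(e)$ is the response time (queue plus execution time) of its $e$-th execution. A task that finishes at its first attempt has a slowdown of $1$, while every failed or evicted execution before the final one pushes the value above $1$; the tables report the mean of this measure per priority tier.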