Merge branch 'master' of tea.maggioni.xyz:maggicl/bachelorThesis
commit 2752ad249f
3 changed files with 154 additions and 112 deletions
Binary file not shown.
|
@ -44,10 +44,10 @@ Switzerland]{Dr.}{Andrea}{Ros\'a}
|
||||||
datacenters, focusing in particular on unsuccessful executions of jobs and
|
datacenters, focusing in particular on unsuccessful executions of jobs and
|
||||||
tasks submitted by users. The objective of this project is to compare the
|
tasks submitted by users. The objective of this project is to compare the
|
||||||
resource waste caused by unsuccessful executions, their impact on application
|
resource waste caused by unsuccessful executions, their impact on application
|
||||||
performance, and their root causes. We will show the strong negative impact on
|
performance, and their root causes. We show the strong negative impact on
|
||||||
CPU and RAM usage and on task slowdown. We will analyze patterns of
|
CPU and RAM usage and on task slowdown. We analyze patterns of
|
||||||
unsuccessful jobs and tasks, particularly focusing on their interdependency.
|
unsuccessful jobs and tasks, particularly focusing on their interdependency.
|
||||||
Moreover, we will uncover their root causes by inspecting key workload and
|
Moreover, we uncover their root causes by inspecting key workload and
|
||||||
system attributes such as machine locality and concurrency level.}
|
system attributes such as machine locality and concurrency level.}
|
||||||
|
|
||||||
\begin{document}
|
\begin{document}
|
||||||
|
@ -82,19 +82,11 @@ and stored in JSONL format)\cite{google-drive-marso}, requiring a considerable
|
||||||
amount of computational power to analyze them and the implementation of special
|
amount of computational power to analyze them and the implementation of special
|
||||||
data engineering techniques for their analysis.
|
data engineering techniques for their analysis.
|
||||||
|
|
||||||
\input{figures/machine_configs}
|
|
||||||
|
|
||||||
An overview of the machine configurations in the cluster analyzed with the 2011
|
|
||||||
traces and in the 8 clusters composing the 2019 traces can be found in
|
|
||||||
figure~\ref{fig:machineconfigs}. Additionally, in
|
|
||||||
figure~\ref{fig:machineconfigs-csts}, the same machine configuration data is
|
|
||||||
provided for the 2019 traces providing a cluster-by-cluster distribution of the
|
|
||||||
machines.
|
|
||||||
|
|
||||||
This project aims to repeat the analysis performed in 2015 to highlight
|
This project aims to repeat the analysis performed in 2015 to highlight
|
||||||
similarities and differences in the workload this decade brought, and to expand the
|
similarities and differences in the workload this decade brought, and to expand the
|
||||||
old analysis to better understand the causes of failures and how to prevent
|
old analysis to better understand the causes of failures and how to prevent
|
||||||
them. Additionally, this report will provide an overview on the data engineering
|
them. Additionally, this report provides an overview of the data engineering
|
||||||
techniques used to perform the queries and analyses on the 2019 traces.
|
techniques used to perform the queries and analyses on the 2019 traces.
|
||||||
|
|
||||||
\subsection{Outline}
|
\subsection{Outline}
|
||||||
|
@ -111,7 +103,23 @@ conclusions.
|
||||||
|
|
||||||
\section{State of the art}\label{sec2}
|
\section{State of the art}\label{sec2}
|
||||||
|
|
||||||
\textbf{TBD (introduce only 2015 dsn paper)}
|
\begin{figure}[t]
|
||||||
|
\begin{center}
|
||||||
|
\begin{tabular}{cc}
|
||||||
|
\textbf{Cluster} & \textbf{Timezone} \\ \hline
|
||||||
|
A & America/New York \\
|
||||||
|
B & America/Chicago \\
|
||||||
|
C & America/New York \\
|
||||||
|
D & America/New York \\
|
||||||
|
E & Europe/Helsinki \\
|
||||||
|
F & America/Chicago \\
|
||||||
|
G & Asia/Singapore \\
|
||||||
|
H & Europe/Brussels \\
|
||||||
|
\end{tabular}
|
||||||
|
\end{center}
|
||||||
|
\caption{Approximate geographical location of each cluster in the 2019 Google
|
||||||
|
Borg traces, obtained from the datacenter's timezone.}\label{fig:clusters}
|
||||||
|
\end{figure}
|
||||||
|
|
||||||
In 2015, Dr.~Andrea Rosà et al.\ published a
|
In 2015, Dr.~Andrea Rosà et al.\ published a
|
||||||
research paper titled \textit{Understanding the Dark Side of Big Data Clusters:
|
research paper titled \textit{Understanding the Dark Side of Big Data Clusters:
|
||||||
|
@ -123,6 +131,30 @@ failures. The salient conclusion of that research is that actually lots of
|
||||||
computations performed by Google would eventually end in failure, then leading
|
computations performed by Google would eventually end in failure, then leading
|
||||||
to large amounts of computational power being wasted.
|
to large amounts of computational power being wasted.
|
||||||
|
|
||||||
|
However, with the release of the new 2019 traces, the results and conclusions
|
||||||
|
found by that paper could be potentially outdated in the current large-scale
|
||||||
|
computing world. The new traces not only provide updated data on Borg's
|
||||||
|
workload, but provide more data as well: the new traces contain data from 8
|
||||||
|
different Borg ``cells'' (i.e.\ clusters) in datacenters across the world,
|
||||||
|
from now on referred to as ``Cluster A'' to ``Cluster H''.
|
||||||
|
|
||||||
|
The geographical
|
||||||
|
location of each cluster can be consulted in Figure~\ref{fig:clusters}. The
|
||||||
|
information in that table was provided by the 2019 traces
|
||||||
|
documentation\cite{google-drive-marso}.
|
||||||
|
|
||||||
|
The new 2019 traces provide richer data even on a cluster-by-cluster basis. For
|
||||||
|
example, the amount and variety of server configurations per cluster increased
|
||||||
|
significantly from 2011.
|
||||||
|
An overview of the machine configurations in the cluster analyzed with the 2011
|
||||||
|
traces and in the 8 clusters composing the 2019 traces can be found in
|
||||||
|
Figure~\ref{fig:machineconfigs}. Additionally, in
|
||||||
|
Figure~\ref{fig:machineconfigs-csts}, the same machine configuration data is
|
||||||
|
provided for the 2019 traces with a cluster-by-cluster distribution of the
|
||||||
|
machines.
|
||||||
|
|
||||||
|
\input{figures/machine_configs}
|
||||||
|
|
||||||
\section{Background information}\label{sec3}
|
\section{Background information}\label{sec3}
|
||||||
|
|
||||||
\textit{Borg} is Google's own cluster management software able to run
|
\textit{Borg} is Google's own cluster management software able to run
|
||||||
|
@ -143,7 +175,7 @@ to large amounts of computational power being wasted.
|
||||||
% encoded and stored in the trace as rows of various tables. Among the
|
% encoded and stored in the trace as rows of various tables. Among the
|
||||||
% information events provide, the field ``type'' provides information on the
|
% information events provide, the field ``type'' provides information on the
|
||||||
% execution status of the job or task. This field can have several values,
|
% execution status of the job or task. This field can have several values,
|
||||||
% which are illustrated in figure~\ref{fig:eventtypes}.
|
% which are illustrated in Figure~\ref{fig:eventtypes}.
|
||||||
|
|
||||||
\subsection{Traces}
|
\subsection{Traces}
|
||||||
|
|
||||||
|
@ -173,7 +205,7 @@ status of a task itself.
|
||||||
\bottomrule
|
\bottomrule
|
||||||
\end{tabular}
|
\end{tabular}
|
||||||
\end{center}
|
\end{center}
|
||||||
\caption{Overview of job and task event types.}\label{fig:eventtypes}
|
\caption{Overview of job and task termination event types.}\label{fig:eventtypes}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
Figure~\ref{fig:eventtypes} shows the expected transitions between event
|
Figure~\ref{fig:eventtypes} shows the expected transitions between event
|
||||||
|
@ -238,6 +270,7 @@ The scope of this thesis focuses on the tables
|
||||||
\texttt{machine\_configs}, \texttt{instance\_events} and
|
\texttt{machine\_configs}, \texttt{instance\_events} and
|
||||||
\texttt{collection\_events}.
|
\texttt{collection\_events}.
|
||||||
|
|
||||||
|
|
||||||
\hypertarget{remark-on-traces-size}{%
|
\hypertarget{remark-on-traces-size}{%
|
||||||
\subsection{Remark on traces size}\label{remark-on-traces-size}}
|
\subsection{Remark on traces size}\label{remark-on-traces-size}}
|
||||||
|
|
||||||
|
@ -296,22 +329,24 @@ The chosen programming language for writing analysis scripts was Python.
|
||||||
Spark has very powerful native Python bindings in the form of the
|
Spark has very powerful native Python bindings in the form of the
|
||||||
\emph{PySpark} API, which were used to implement the various queries.
|
\emph{PySpark} API, which were used to implement the various queries.
|
||||||
|
|
||||||
|
|
||||||
\hypertarget{query-architecture}{%
|
\hypertarget{query-architecture}{%
|
||||||
\subsection{Query architecture}\label{query-architecture}}
|
\subsection{Query architecture}\label{query-architecture}}
|
||||||
|
|
||||||
\subsubsection{Overview}
|
\subsubsection{Overview}
|
||||||
|
|
||||||
In general, each query written to execute the analysis
|
In general, each query written to execute the analysis
|
||||||
follows a general Map-Reduce template.
|
follows a Map-Reduce template. Traces are first read, then parsed, and then
|
||||||
|
filtered by performing selections,
|
||||||
|
projections and computing new derived fields.
|
||||||
|
|
||||||
Traces are first read, then parsed, and then filtered by performing selections,
|
After this preparation phase, the
|
||||||
projections and computing new derived fields. After this preparation phase, the
|
|
||||||
trace records are often passed through a \texttt{groupby()} operation, which, by
|
trace records are often passed through a \texttt{groupby()} operation, which, by
|
||||||
choosing one or more record fields, sorts all the records into several ``bins''
|
choosing one or more record fields, sorts all the records into several ``bins''
|
||||||
containing records with matching values for the selected fields. Then, a map
|
containing records with matching values for the selected fields. Then, a map
|
||||||
operation is applied to each bin in order to derive some aggregated property
|
operation is applied to each bin in order to derive some aggregated property
|
||||||
value for each grouping. Finally, a reduce operation is applied to either
|
value for each grouping.
|
||||||
|
|
||||||
|
Finally, a reduce operation is applied to either
|
||||||
further aggregate those computed properties or to generate an aggregated data
|
further aggregate those computed properties or to generate an aggregated data
|
||||||
structure for storage purposes.
|
structure for storage purposes.
|
||||||
|
|
||||||
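As a minimal sketch of this template (the table path, field names and the final
aggregation are illustrative assumptions, not the actual analysis scripts), a
PySpark query following the read-filter-group-aggregate structure described
above could look like this:

\begin{verbatim}
# Illustrative PySpark sketch of the general query template.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("query-template").getOrCreate()

# read + parse: each Gzip-compressed JSONL shard is loaded in parallel
events = spark.read.json("/path/to/instance_events/*.json.gz")

# preparation: selection, projection and a derived field
prepared = (events
    .filter(F.col("type").isNotNull())
    .select("collection_id", "instance_index", "type", "time")
    .withColumn("is_termination",
                F.col("type").isin("FINISH", "KILL", "FAIL", "EVICT")))

# groupby + per-bin aggregation (one aggregated value per task)
per_task = (prepared
    .groupBy("collection_id", "instance_index")
    .agg(F.max("time").alias("last_event_time")))

# final reduce: collapse the per-task values into one summary figure
summary = per_task.agg(F.count("*").alias("n_tasks")).collect()
\end{verbatim}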
|
@ -372,12 +407,12 @@ appreciate their behaviour.
|
||||||
One example of an analysis script with average complexity and a fairly
|
One example of an analysis script with average complexity and a fairly
|
||||||
straightforward structure is the pair of scripts \texttt{task\_slowdown.py} and
|
straightforward structure is the pair of scripts \texttt{task\_slowdown.py} and
|
||||||
\texttt{task\_slowdown\_table.py} used to compute the ``task slowdown'' tables
|
\texttt{task\_slowdown\_table.py} used to compute the ``task slowdown'' tables
|
||||||
(namely the tables in figure~\ref{fig:taskslowdown}).
|
(namely the tables in Figure~\ref{fig:taskslowdown}).
|
||||||
|
|
||||||
``Slowdown'' is a task-wise measure of wasted execution time for tasks with a
|
``Slowdown'' is a task-wise measure of wasted execution time for tasks with a
|
||||||
\texttt{FINISH} termination type. It is computed as the total execution time of
|
\texttt{FINISH} termination type. It is computed as the total execution time of
|
||||||
the task divided by the execution time actually needed to complete the task
|
the task divided by the execution time actually needed to complete the task
|
||||||
(i.e. the total time of the last execution attempt, successful by definition).
|
(i.e.\ the total time of the last execution attempt, successful by definition).
|
||||||
|
|
||||||
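In symbols, for a task whose $n$ execution attempts have durations
$t_1, \dots, t_n$, where the $n$-th attempt is the final (successful) one, the
slowdown is
\[
  \mathrm{slowdown} = \frac{\sum_{i=1}^{n} t_i}{t_n},
\]
so a task that finishes at its first attempt has a slowdown of exactly~1, and
every unsuccessful attempt increases the measure.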
The analysis requires computing the mean task slowdown for each task priority
|
The analysis requires computing the mean task slowdown for each task priority
|
||||||
value, and additionally compute the percentage of tasks with successful
|
value, and additionally compute the percentage of tasks with successful
|
||||||
|
@ -385,7 +420,7 @@ terminations per priority. The query therefore needs to compute the execution
|
||||||
time of each execution attempt for each task, determine if each task has
|
time of each execution attempt for each task, determine if each task has
|
||||||
a successful termination or not, and finally combine this data to compute
|
a successful termination or not, and finally combine this data to compute
|
||||||
slowdown, mean slowdown and ultimately the final table found in
|
slowdown, mean slowdown and ultimately the final table found in
|
||||||
figure~\ref{fig:taskslowdown}.
|
Figure~\ref{fig:taskslowdown}.
|
||||||
|
|
||||||
\begin{figure}[t]
|
\begin{figure}[t]
|
||||||
\hspace{-0.075\textwidth}
|
\hspace{-0.075\textwidth}
|
||||||
|
@ -402,7 +437,7 @@ contains (among other data) all task event logs containing properties, event
|
||||||
types and timestamps. As already explained in the previous section, the logical
|
types and timestamps. As already explained in the previous section, the logical
|
||||||
table file is actually stored as several Gzip-compressed JSONL shards. This is
|
table file is actually stored as several Gzip-compressed JSONL shards. This is
|
||||||
very useful for processing purposes, since Spark is able to parse and load in
|
very useful for processing purposes, since Spark is able to parse and load in
|
||||||
memory each shard in parallel, i.e. using all processing cores on the server
|
memory each shard in parallel, i.e.\ using all processing cores on the server
|
||||||
used to run the queries.
|
used to run the queries.
|
||||||
|
|
||||||
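As a small illustration of this property (the path is a placeholder and the
session setup follows the sketch given earlier), the sharded table can be
loaded with a single call, and Spark will create roughly one partition per
shard, each parsed by a separate core:

\begin{verbatim}
# Sketch: loading a sharded, Gzip-compressed JSONL table (placeholder path).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
task_events = spark.read.json("instance_events/*.json.gz")

# Each Gzip shard is read as a whole (it is not splittable), so shards are
# parsed in parallel across the available cores.
print(task_events.rdd.getNumPartitions())
\end{verbatim}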
After loading the data, a selection and a projection operation are performed in
|
After loading the data, a selection and a projection operation are performed in
|
||||||
|
@ -436,18 +471,18 @@ Finally, the \texttt{task\_slowdown\_table.py} processes this intermediate
|
||||||
results to compute the percentage of successful tasks per execution and
|
results to compute the percentage of successful tasks per execution and
|
||||||
to compute slowdown values given the previously computed execution attempt time
|
to compute slowdown values given the previously computed execution attempt time
|
||||||
deltas. Then, the mean of the computed slowdown values is computed, resulting
|
deltas. Then, the mean of the computed slowdown values is computed, resulting
|
||||||
in the clear and concise tables found in figure~\ref{fig:taskslowdown}.
|
in the clear and concise tables found in Figure~\ref{fig:taskslowdown}.
|
||||||
|
|
||||||
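To make the arithmetic of the second script concrete, the following minimal
sketch (with made-up attempt durations and tier names, not data taken from the
traces) computes per-task slowdown values and their mean per priority tier:

\begin{verbatim}
# Hypothetical example: execution attempt durations (seconds) of FINISHed
# tasks, grouped by priority tier; the last attempt is the successful one.
from statistics import mean

attempts_by_tier = {
    "BEB":        [[10.0, 12.0, 30.0], [5.0, 5.0]],  # two tasks
    "Production": [[20.0]],                          # one task, no retries
}

def slowdown(attempts):
    # total time over all attempts divided by the final (successful) attempt
    return sum(attempts) / attempts[-1]

mean_slowdown = {tier: mean(slowdown(task) for task in tasks)
                 for tier, tasks in attempts_by_tier.items()}
# BEB: (52/30 + 10/5) / 2 = 1.87 (approx.), Production: 1.0
\end{verbatim}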
\section{Analysis: Performance Impact of Unsuccessful Executions}\label{sec5}
|
\section{Analysis: Performance Impact of Unsuccessful Executions}\label{sec5}
|
||||||
|
|
||||||
Our first investigation focuses on replicating the methodologies used in the
|
Our first investigation focuses on replicating the analysis done in the
|
||||||
2015 DSN Ros\'a et al.\ paper\cite{dsn-paper} regarding usage of machine time
|
Ros\'a et al.\ paper\cite{dsn-paper} regarding usage of machine time
|
||||||
and resources.
|
and resources.
|
||||||
|
|
||||||
In this section we perform several analyses focusing on how machine time and
|
In this section we perform several analyses focusing on how machine time and
|
||||||
resources are wasted, by means of a temporal vs. spatial resource analysis from
|
resources are wasted, by means of a temporal vs.\ spatial resource analysis from
|
||||||
the perspective of single tasks as well as jobs. We then compare the results
|
the perspective of single tasks as well as jobs. We then compare the results
|
||||||
from the 2019 traces to the ones that were obtained in 2015 to understand the
|
from the 2019 traces to the ones obtained in the earlier analysis to understand the
|
||||||
workload evolution inside Borg between 2011 and 2019.
|
workload evolution inside Borg between 2011 and 2019.
|
||||||
|
|
||||||
We discover that the spatial and temporal impact of unsuccessful
|
We discover that the spatial and temporal impact of unsuccessful
|
||||||
|
@ -458,22 +493,38 @@ termination event.
|
||||||
\subsection{Temporal Impact: Machine Time Waste}
|
\subsection{Temporal Impact: Machine Time Waste}
|
||||||
\input{figures/machine_time_waste}
|
\input{figures/machine_time_waste}
|
||||||
|
|
||||||
This analysis explores how machine time is distributed over task events and
|
The goal of this analysis is to understand how much time is spent doing
|
||||||
submissions. By partitioning the collection of all terminating tasks by their
|
useless computations by exploring how machine time is distributed over task
|
||||||
|
events and submissions.
|
||||||
|
|
||||||
|
Before delving into the analysis itself, we define three kinds of events in a
|
||||||
|
task's lifecycle:
|
||||||
|
|
||||||
|
\begin{description}
|
||||||
|
\item[submission:] when a task is added or re-added to the Borg
|
||||||
|
system queue, waiting to be scheduled;
|
||||||
|
\item[scheduling:] when a task is removed from the Borg queue and
|
||||||
|
its actual execution of potentially useful computations starts;
|
||||||
|
\item[termination:] when a task terminates its computations either
|
||||||
|
successfully or unsuccessfully.
|
||||||
|
\end{description}
|
||||||
|
|
||||||
|
By partitioning the set of all terminating tasks by their
|
||||||
termination event, the analysis aims to measure the total time spent by tasks in
|
termination event, the analysis aims to measure the total time spent by tasks in
|
||||||
3 different execution phases:
|
3 different execution phases:
|
||||||
|
|
||||||
\begin{description}
|
\begin{description}
|
||||||
\item[resubmission time:] the total of all time deltas between every task
|
\item[resubmission time:] the total of all time intervals between every task
|
||||||
termination event and the immediately succeeding task submission event, i.e.
|
termination event and the immediately succeeding task submission event, i.e.\
|
||||||
the total time spent by tasks waiting to be resubmitted in Borg after a
|
the total time spent by tasks waiting to be resubmitted in Borg after a
|
||||||
termination;
|
termination;
|
||||||
\item[queue time:] the total of all time deltas between every task submission
|
\item[queue time:] the total of all time intervals between every task submission
|
||||||
event and the following task scheduling event, i.e. the total time spent by
|
event and the following task scheduling event, i.e.\ the total time spent by
|
||||||
tasks queuing before execution;
|
tasks queuing before execution;
|
||||||
\item[running time:] the total of all time deltas between every task scheduling
|
\item[running time:] the total of all time intervals between every task
|
||||||
event and the following task termination event, i.e. the total time spent by
|
scheduling event and the following task termination event, i.e.\ the total
|
||||||
tasks ``executing'' (i.e. performing useful computations) in the clusters.
|
time spent by tasks ``executing'' (i.e.\ performing potentially useful
|
||||||
|
computations) in the clusters.
|
||||||
\end{description}
|
\end{description}
|
||||||
|
|
||||||
In the 2019 traces, an additional ``Unknown'' measure is counted. This measure
|
In the 2019 traces, an additional ``Unknown'' measure is counted. This measure
|
||||||
|
@ -482,17 +533,16 @@ events do not allow to safely assume in which execution phase a task may be.
|
||||||
Unknown measures are mostly caused by faults and missed event writes in the task
|
Unknown measures are mostly caused by faults and missed event writes in the task
|
||||||
event log that was used to generate the traces.
|
event log that was used to generate the traces.
|
||||||
|
|
||||||
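The following minimal sketch (with simplified event names and timestamps, not
the trace schema) shows how these totals can be accumulated by walking a task's
chronologically ordered event list; transitions that do not match any of the
three expected patterns contribute to the ``Unknown'' measure:

\begin{verbatim}
# Simplified sketch: accumulate per-phase time deltas from ordered events.
SUBMIT, SCHEDULE, TERMINATE = "submit", "schedule", "terminate"

def phase_totals(events):
    totals = {"resubmission": 0, "queue": 0, "running": 0, "unknown": 0}
    for (t0, kind0), (t1, kind1) in zip(events, events[1:]):
        delta = t1 - t0
        if kind0 == TERMINATE and kind1 == SUBMIT:
            totals["resubmission"] += delta
        elif kind0 == SUBMIT and kind1 == SCHEDULE:
            totals["queue"] += delta
        elif kind0 == SCHEDULE and kind1 == TERMINATE:
            totals["running"] += delta
        else:  # transition not considered "typical"
            totals["unknown"] += delta
    return totals

evts = [(0, SUBMIT), (5, SCHEDULE), (20, TERMINATE),
        (22, SUBMIT), (30, SCHEDULE), (60, TERMINATE)]
# phase_totals(evts) -> resubmission: 2, queue: 13, running: 45, unknown: 0
\end{verbatim}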
The analysis results are depicted in figure~\ref{fig:machinetimewaste-rel} as a
|
The analysis results are depicted in Figure~\ref{fig:machinetimewaste-rel} as a
|
||||||
comparison between the 2011 and 2019 traces, aggregating the data from all
|
comparison between the 2011 and 2019 traces, aggregating the data from all
|
||||||
clusters. Additionally, in figure~\ref{fig:machinetimewaste-rel-csts}
|
clusters. Additionally, in Figure~\ref{fig:machinetimewaste-rel-csts}
|
||||||
a cluster-by-cluster breakdown is provided for the 2019 traces.
|
a cluster-by-cluster breakdown is provided for the 2019 traces.
|
||||||
|
|
||||||
The striking difference between 2011 and 2019 data is in the machine time
|
The striking difference between 2011 and 2019 data is in the machine time
|
||||||
distribution per task termination type. In the 2019 traces, 94.38\% of global
|
distribution per task termination type. In the 2019 traces, 94.38\% of global
|
||||||
machine time is spent on tasks that are eventually \texttt{KILL}ed.
|
machine time is spent on tasks that are eventually \texttt{KILL}ed.
|
||||||
\texttt{FINISH}, \texttt{EVICT} and \texttt{FAIL} tasks respectively register
|
\texttt{FINISH}, \texttt{EVICT} and \texttt{FAIL} tasks respectively register
|
||||||
totals of 4.20\%, 1.18\% and 0.25\% machine time, maintaining a analogous
|
totals of 4.20\%, 1.18\% and 0.25\% machine time.
|
||||||
distribution between them to their distribution in the 2011 traces.
|
|
||||||
|
|
||||||
Considering instead the distribution between execution phase times, the
|
Considering instead the distribution between execution phase times, the
|
||||||
comparison shows very similar behaviour between the two traces, having the
|
comparison shows very similar behaviour between the two traces, having the
|
||||||
|
@ -508,32 +558,36 @@ w.r.t.\ of accuracy of task event logging.
|
||||||
|
|
||||||
Considering instead the behaviour of each single cluster in the 2019 traces, no
|
Considering instead the behaviour of each single cluster in the 2019 traces, no
|
||||||
significant difference between them can be observed. The only notable difference
|
significant difference between them can be observed. The only notable difference
|
||||||
lies between the ``Running time``-``Unknown time'' ratio in \texttt{KILL}ed
|
lies in the ``Running time''-``Unknown time'' ratio of \texttt{KILL}ed
|
||||||
tasks, which is at its highest in cluster A (at 30.78\% by 58.71\% of global
|
tasks, which is at its highest in cluster A (at 30.78\% by 58.71\% of global
|
||||||
machine time) and at its lowest in cluster H (at 8.06\% by 84.77\% of global
|
machine time) and at its lowest in cluster H (at 8.06\% by 84.77\% of global
|
||||||
machine time).
|
machine time).
|
||||||
|
|
||||||
|
The takeaway from this analysis is that in the 2019 traces the vast majority of
|
||||||
|
machine time is wasted executing tasks that are eventually \texttt{KILL}ed,
|
||||||
|
i.e.\ unsuccessful.
|
||||||
|
|
||||||
\subsection{Average Slowdown per Task}
|
\subsection{Average Slowdown per Task}
|
||||||
\input{figures/task_slowdown}
|
\input{figures/task_slowdown}
|
||||||
|
|
||||||
This analysis aims to measure the figure of ``slowdown'', which is defined as
|
This analysis aims to measure the average of an ad-hoc defined parameter we call
|
||||||
the ratio between the response time (i.e\. queue time and running time) of the
|
``slowdown''. We define it as the ratio between the total response time across
|
||||||
last execution of a given task and the total response time across all
|
all executions of the task and the response time (i.e.\ queue time and running
|
||||||
executions of said task. This metric is especially useful to analyze the impact
|
time) of the last execution of said task. This metric is especially useful to
|
||||||
of unsuccesful executions on each task total execution time w.r.t.\ the intrinsic
|
analyze the impact of unsuccessful executions on each task's total execution time
|
||||||
workload (i.e.\ computational time) of tasks.
|
w.r.t.\ the intrinsic workload (i.e.\ computational time) of tasks.
|
||||||
|
|
||||||
Refer to figure~\ref{fig:taskslowdown} for a comparison between the 2011 and
|
Refer to Figure~\ref{fig:taskslowdown} for a comparison between the 2011 and
|
||||||
2019 mean task slowdown measures broken down by task priority. Additionally, said
|
2019 mean task slowdown measures broken down by task priority. Additionally, said
|
||||||
means are computed on a cluster-by-cluster basis for 2019 data in
|
means are computed on a cluster-by-cluster basis for 2019 data in
|
||||||
figure~\ref{fig:taskslowdown-csts}.
|
Figure~\ref{fig:taskslowdown-csts}.
|
||||||
|
|
||||||
In 2015 Ros\'a et al.\cite{dsn-paper} measured the mean task slowdown for each task
|
In 2015 Ros\'a et al.\cite{dsn-paper} measured the mean task slowdown for each task
|
||||||
priority value, which at the time were $[0,11]$ numeric values. However,
|
priority value; at the time, priorities were numeric values between 0 and 11. However,
|
||||||
in 2019 traces, task priorities are given as a $[0,500]$ numeric value.
|
in the 2019 traces, task priorities are given as a numeric value between 0 and 500.
|
||||||
Therefore, to allow for an easier comparison, mean task slowdown values are
|
Therefore, to allow an easier comparison, mean task slowdown values are computed
|
||||||
computed by task priority tier over the 2019 data. Priority tiers are
|
by task priority tier over the 2019 data. Priority tiers are semantically
|
||||||
semantically relevant priority ranges defined in the Tirmazi et al.
|
relevant priority ranges defined in the Tirmazi et al.\
|
||||||
2020 paper\cite{google-marso-19} that introduced the 2019 traces. Equivalent priority
|
2020 paper\cite{google-marso-19} that introduced the 2019 traces. Equivalent priority
|
||||||
tiers are also provided next to the 2011 priority values in the table covering
|
tiers are also provided next to the 2011 priority values in the table covering
|
||||||
the 2015 analysis.
|
the 2015 analysis.
|
||||||
|
@ -547,9 +601,9 @@ though this column shows the mean response time across all executions.
|
||||||
\textbf{Mean slowdown} instead provides the mean slowdown value for each task
|
\textbf{Mean slowdown} instead provides the mean slowdown value for each task
|
||||||
priority/tier.
|
priority/tier.
|
||||||
|
|
||||||
Comparing the tables in figure~\ref{fig:taskslowdown} we observe that the
|
Comparing the tables in Figure~\ref{fig:taskslowdown} we observe that the
|
||||||
maximum mean slowdown measure for 2019 data (i.e.\ 7.84, for the BEB tier) is almost
|
maximum mean slowdown measure for 2019 data (i.e.\ 7.84, for the BEB tier) is
|
||||||
double of the maximum measure in 2011 data (i.e.\ 3.39, for priority $3$
|
almost double the maximum measure in 2011 data (i.e.\ 3.39, for priority $3$
|
||||||
corresponding to the BEB tier). The ``Best effort batch'' tier, as the name
|
corresponding to the BEB tier). The ``Best effort batch'' tier, as the name
|
||||||
suggests, is a lower priority tier where failures are more tolerated. Therefore,
|
suggests, is a lower priority tier where failures are more tolerated. Therefore,
|
||||||
due to the increased concurrency in the 2019 clusters compared to 2011 and the
|
due to the increased concurrency in the 2019 clusters compared to 2011 and the
|
||||||
|
@ -566,7 +620,7 @@ executions: while the mean response is overall shorter in time in the 2019
|
||||||
traces by an order of magnitude, the new traces show an overall significantly
|
traces by an order of magnitude, the new traces show an overall significantly
|
||||||
higher mean response time than in the 2011 data.
|
higher mean response time than in the 2011 data.
|
||||||
|
|
||||||
Across 2019 single clusters (as in figure~\ref{fig:taskslowdown-csts}), the data
|
Across 2019 single clusters (as in Figure~\ref{fig:taskslowdown-csts}), the data
|
||||||
shows a mostly uniform behaviour, other than for some noteworthy mean slowdown
|
shows a mostly uniform behaviour, other than for some noteworthy mean slowdown
|
||||||
spikes. Indeed, cluster A has 82.97 mean slowdown in the ``Free'' tier,
|
spikes. Indeed, cluster A has 82.97 mean slowdown in the ``Free'' tier,
|
||||||
cluster G has 19.06 and 14.57 mean slowdown in the ``BEB'' and ``Production''
|
cluster G has 19.06 and 14.57 mean slowdown in the ``BEB'' and ``Production''
|
||||||
|
@ -585,9 +639,9 @@ Due to limited computational resources w.r.t.\ the data analysis process, the
|
||||||
resource usage for clusters E to H in the 2019 traces is missing. However, a
|
resource usage for clusters E to H in the 2019 traces is missing. However, a
|
||||||
comparison between 2011 resource usage and the aggregated resource usage of
|
comparison between 2011 resource usage and the aggregated resource usage of
|
||||||
clusters A to D in the 2019 traces can be found in
|
clusters A to D in the 2019 traces can be found in
|
||||||
figure~\ref{fig:spatialresourcewaste-actual}. Additionally, a
|
Figure~\ref{fig:spatialresourcewaste-actual}. Additionally, a
|
||||||
cluster-by-cluster breakdown for the 2019 data can be found in
|
cluster-by-cluster breakdown for the 2019 data can be found in
|
||||||
figure~\ref{fig:spatialresourcewaste-actual-csts}.
|
Figure~\ref{fig:spatialresourcewaste-actual-csts}.
|
||||||
|
|
||||||
From these figures it is clear that, compared to the relatively even
|
From these figures it is clear that, compared to the relatively even
|
||||||
distribution of used resources in the 2011 traces, the distribution of resources
|
distribution of used resources in the 2011 traces, the distribution of resources
|
||||||
|
@ -598,14 +652,14 @@ all other task termination types have a significantly lower resource usage:
|
||||||
\texttt{EVICT}ed, \texttt{FAIL}ed and \texttt{FINISH}ed tasks register respectively
|
\texttt{EVICT}ed, \texttt{FAIL}ed and \texttt{FINISH}ed tasks register respectively
|
||||||
8.53\%, 3.17\% and 2.02\% CPU usage and 9.03\%, 4.45\%, and 1.66\% memory usage.
|
8.53\%, 3.17\% and 2.02\% CPU usage and 9.03\%, 4.45\%, and 1.66\% memory usage.
|
||||||
This resource distribution can also be found in the data from individual
|
This resource distribution can also be found in the data from individual
|
||||||
clusters in figure~\ref{fig:spatialresourcewaste-actual-csts}, with always more
|
clusters in Figure~\ref{fig:spatialresourcewaste-actual-csts}, with always more
|
||||||
than 80\% of resources devoted to \texttt{KILL}ed tasks.
|
than 80\% of resources devoted to \texttt{KILL}ed tasks.
|
||||||
|
|
||||||
Considering now requested resources instead of used ones, a comparison between
|
Considering now requested resources instead of used ones, a comparison between
|
||||||
2011 and the aggregation of all A-H clusters of the 2019 traces can be found in
|
2011 and the aggregation of all A-H clusters of the 2019 traces can be found in
|
||||||
figure~\ref{fig:spatialresourcewaste-requested}. Additionally, a
|
Figure~\ref{fig:spatialresourcewaste-requested}. Additionally, a
|
||||||
cluster-by-cluster breakdown for single 2019 clusters can be found in
|
cluster-by-cluster breakdown for single 2019 clusters can be found in
|
||||||
figure~\ref{fig:spatialresourcewaste-requested-csts}.
|
Figure~\ref{fig:spatialresourcewaste-requested-csts}.
|
||||||
|
|
||||||
Here \texttt{KILL}ed jobs dominate the distribution of resources even more,
|
Here \texttt{KILL}ed jobs dominate the distribution of resources even more,
|
||||||
reaching a global 97.21\% of CPU allocation and a global 96.89\% of memory
|
reaching a global 97.21\% of CPU allocation and a global 96.89\% of memory
|
||||||
|
@ -615,7 +669,7 @@ respective CPU allocation figures of 2.73\%, 0.06\% and 0.0012\% and memory
|
||||||
allocation figures of 3.04\%, 0.06\% and 0.012\%.
|
allocation figures of 3.04\%, 0.06\% and 0.012\%.
|
||||||
|
|
||||||
Behaviour across clusters (as
|
Behaviour across clusters (as
|
||||||
evinced in figure~\ref{fig:spatialresourcewaste-requested-csts}) in terms of
|
evinced in Figure~\ref{fig:spatialresourcewaste-requested-csts}) in terms of
|
||||||
requested resources is pretty homogeneous, with the exception of cluster A
|
requested resources is pretty homogeneous, with the exception of cluster A
|
||||||
having a relatively high 2.85\% CPU and 3.42\% memory resource requests from
|
having a relatively high 2.85\% CPU and 3.42\% memory resource requests from
|
||||||
\texttt{EVICT}ed tasks and cluster E having a noteworthy 1.67\% CPU and 1.31\%
|
\texttt{EVICT}ed tasks and cluster E having a noteworthy 1.67\% CPU and 1.31\%
|
||||||
|
@ -651,9 +705,9 @@ the task-level events, namely \texttt{EVICT}, \texttt{FAIL}, \texttt{FINISH}
|
||||||
and \texttt{KILL} termination events.
|
and \texttt{KILL} termination events.
|
||||||
|
|
||||||
A comparison of the termination event distribution between the 2011 and 2019
|
A comparison of the termination event distribution between the 2011 and 2019
|
||||||
traces is shown in figure~\ref{fig:tableIII}. Additionally, a cluster-by-cluster
|
traces is shown in Figure~\ref{fig:tableIII}. Additionally, a cluster-by-cluster
|
||||||
breakdown of the same data for the 2019 traces is shown in
|
breakdown of the same data for the 2019 traces is shown in
|
||||||
figure~\ref{fig:tableIII-csts}.
|
Figure~\ref{fig:tableIII-csts}.
|
||||||
|
|
||||||
Each table in these figures shows the mean and the 95-th percentile of the
|
Each table in these figures shows the mean and the 95-th percentile of the
|
||||||
number of termination events per task, broken down by task termination. In
|
number of termination events per task, broken down by task termination. In
|
||||||
|
@ -677,7 +731,7 @@ jobs and their \texttt{EVICT} events (1.876 on average per task with a 8.763
|
||||||
event overall average).
|
event overall average).
|
||||||
|
|
||||||
Considering cluster-by-cluster behaviour in the 2019 traces (as reported in
|
Considering cluster-by-cluster behaviour in the 2019 traces (as reported in
|
||||||
figure~\ref{fig:tableIII-csts}) the general observations still hold for each
|
Figure~\ref{fig:tableIII-csts}) the general observations still hold for each
|
||||||
cluster, albeit with event count averages having different magnitudes. Notably,
|
cluster, albeit with event count averages having different magnitudes. Notably,
|
||||||
cluster E registers the highest per-event average, with \texttt{FAIL}ed tasks
|
cluster E registers the highest per-event average, with \texttt{FAIL}ed tasks
|
||||||
experiencing 111.471 \texttt{FAIL} events out of 112.384 total events on average.
|
experiencing 111.471 \texttt{FAIL} events out of 112.384 total events on average.
|
||||||
|
@ -692,11 +746,11 @@ given number of unsuccessful events could affect the termination of the task it
|
||||||
belongs to.
|
belongs to.
|
||||||
|
|
||||||
Conditional probabilities of each unsuccessful event type are shown in the form
|
Conditional probabilities of each unsuccessful event type are shown in the form
|
||||||
of a plot in figure~\ref{fig:figureV}, comparing the 2011 traces with the
|
of a plot in Figure~\ref{fig:figureV}, comparing the 2011 traces with the
|
||||||
overall data from the 2019 ones, and in figure~\ref{fig:figureV-csts}, as a
|
overall data from the 2019 ones, and in Figure~\ref{fig:figureV-csts}, as a
|
||||||
cluster-by-cluster breakdown of the same data for the 2019 traces.
|
cluster-by-cluster breakdown of the same data for the 2019 traces.
|
||||||
|
|
||||||
In figure~\ref{fig:figureV} the 2011 and 2019 plots differ in their x-axis:
|
In Figure~\ref{fig:figureV} the 2011 and 2019 plots differ in their x-axis:
|
||||||
for 2011 data conditional probabilities are computed for a maximum event count
|
for 2011 data conditional probabilities are computed for a maximum event count
|
||||||
of 30, while for 2019 data they are computed for up to 50 events of a specific
|
of 30, while for 2019 data they are computed for up to 50 events of a specific
|
||||||
kind. Nevertheless, another quite striking difference between the two plots can
|
kind. Nevertheless, another quite striking difference between the two plots can
|
||||||
|
@ -716,7 +770,7 @@ The \texttt{FAIL} probability curve has instead 18.55\%, 1.79\%, 14.49\%,
|
||||||
2.08\%, 2.40\%, and 1.29\% success probabilities for the same range.
|
2.08\%, 2.40\%, and 1.29\% success probabilities for the same range.
|
||||||
|
|
||||||
Considering cluster-to-cluster behaviour in the 2019 traces (as shown in
|
Considering cluster-to-cluster behaviour in the 2019 traces (as shown in
|
||||||
figure~\ref{fig:figureV-csts}), some clusters show quite similar behaviour to
|
Figure~\ref{fig:figureV-csts}), some clusters show quite similar behaviour to
|
||||||
the aggregated plot (namely clusters A, F, and H), while some other clusters
|
the aggregated plot (namely clusters A, F, and H), while some other clusters
|
||||||
show strongly oscillating probability distribution function curves for
|
show strongly oscillating probability distribution function curves for
|
||||||
\texttt{EVICT} and \texttt{FINISH} curves. \texttt{KILL} behaviour is instead
|
\texttt{EVICT} and \texttt{FINISH} curves. \texttt{KILL} behaviour is instead
|
||||||
|
@ -725,15 +779,15 @@ homogeneous even on a single cluster basis.
|
||||||
\subsection{Unsuccessful Job Event Patterns}\label{tabIV-section}
|
\subsection{Unsuccessful Job Event Patterns}\label{tabIV-section}
|
||||||
\input{figures/table_iv}
|
\input{figures/table_iv}
|
||||||
|
|
||||||
This analysis uses very similar techniques to the ones used in
|
The analysis uses very similar techniques to the ones used in
|
||||||
Section~\ref{tabIII-section}, but focuses on the job level instead. The aim is
|
Section~\ref{tabIII-section}, but focuses on the job level instead. The aim is
|
||||||
to better understand the relationship between the task and job levels and how
|
to better understand the relationship between the task and job levels and how
|
||||||
task-level termination events can influence the termination state of a job.
|
task-level termination events can influence the termination state of a job.
|
||||||
|
|
||||||
A comparison of the analyzed parameters between the 2011 and 2019
|
A comparison of the analyzed parameters between the 2011 and 2019
|
||||||
traces is shown in figure~\ref{fig:tableIV}. Additionally, a cluster-by-cluster
|
traces is shown in Figure~\ref{fig:tableIV}. Additionally, a cluster-by-cluster
|
||||||
breakdown of the same data for the 2019 traces is shown in
|
breakdown of the same data for the 2019 traces is shown in
|
||||||
figure~\ref{fig:tableIV-csts}.
|
Figure~\ref{fig:tableIV-csts}.
|
||||||
|
|
||||||
Considering the distribution of number of tasks in a job, the 2019 traces show a
|
Considering the distribution of number of tasks in a job, the 2019 traces show a
|
||||||
decrease in the mean figure (e.g.\ for \texttt{FAIL}ed jobs, with a mean 60.5
|
decrease in the mean figure (e.g.\ for \texttt{FAIL}ed jobs, with a mean 60.5
|
||||||
|
@ -751,7 +805,7 @@ the \texttt{FINISH}ed job category has a new event distribution too, with
|
||||||
\texttt{FINISH} task events being the most popular at 1.778 events per job in
|
\texttt{FINISH} task events being the most popular at 1.778 events per job in
|
||||||
the 2019 traces.
|
the 2019 traces.
|
||||||
|
|
||||||
The cluster-by-cluster comparison in figure~\ref{fig:tableIV-csts} shows that
|
The cluster-by-cluster comparison in Figure~\ref{fig:tableIV-csts} shows that
|
||||||
the number of tasks per job is generally distributed similarly to the
|
the number of tasks per job is generally distributed similarly to the
|
||||||
aggregated data, with only cluster H having remarkably low mean and 95-th
|
aggregated data, with only cluster H having remarkably low mean and 95-th
|
||||||
percentiles overall. Event-wise, for \texttt{EVICT}ed, \texttt{FINISH}ed,
|
percentiles overall. Event-wise, for \texttt{EVICT}ed, \texttt{FINISH}ed,
|
||||||
|
@ -772,31 +826,18 @@ probabilities based on the number of task termination events of a specific type.
|
||||||
Finally, Section~\ref{tabIV-section} aims to find similar correlations, but at
|
Finally, Section~\ref{tabIV-section} aims to find similar correlations, but at
|
||||||
the job level.
|
the job level.
|
||||||
|
|
||||||
|
\section{Analysis: Potential Causes of Unsuccessful Executions}
|
||||||
|
|
||||||
In this section, we search for the root causes of different unsuccessful jobs
|
The aim of this section is to analyze several task-level and job-level
|
||||||
and events, and derive their implications on system design. Our analysis resorts
|
parameters in order to find correlations with the success of an execution. By
|
||||||
to a black-box approach due to the limited information available on the system.
|
using the techniques from Section V of the Ros\'a et al.\
|
||||||
We consider two levels of statistics, i.e., events vs. jobs, where the former
|
paper\cite{dsn-paper}, we analyze
|
||||||
directly impacts spatial and temporal waste, whereas the latter is directly
|
task events' metadata, the use of CPU and memory resources at the task level,
|
||||||
correlated to the performance perceived by users. For the event analysis, we
|
and job metadata respectively in Section~\ref{fig7-section},
|
||||||
focus on task priority, event execution time, machine concurrency, and requested
|
Section~\ref{fig8-section} and Section~\ref{fig9-section}.
|
||||||
resources. Moreover, to see the impact of resource efficiency on tasks
|
|
||||||
executions, we correlate events with resource reservation and utilization on
|
|
||||||
machines. As for the job analysis, we study the job size, machine locality, and
|
|
||||||
job execution time.
|
|
||||||
|
|
||||||
In the following analysis, we present how different event/job types happen, with
|
\subsection{Event rates vs.\ task priority, event execution time, and machine
|
||||||
respect to different ranges of attributes. For each type $i$, we compute the
|
concurrency.}\label{fig7-section}
|
||||||
metric of event (job) rate, defined as the number of type $i$ events (jobs)
|
|
||||||
divided by the total number of events (jobs). Event/job rates are computed for
|
|
||||||
each range of attributes. For example, one can compute the eviction rate for
|
|
||||||
priorities in the range $[0,1]$ as the number of eviction events that involved
|
|
||||||
priorities [0,1] divided by the total number of events for priorities $[0,1] .$
|
|
||||||
One can also view event/job rates as the probability that events/jobs end with
|
|
||||||
certain types of outcomes.
|
|
||||||
|
|
||||||
\subsection{Event rates vs. task priority, event execution time, and machine
|
|
||||||
concurrency.}
|
|
||||||
|
|
||||||
\input{figures/figure_7}
|
\input{figures/figure_7}
|
||||||
|
|
||||||
|
@ -825,17 +866,18 @@ Refer to figures \ref{fig:figureVII-a}, \ref{fig:figureVII-b}, and
|
||||||
\end{itemize}
|
\end{itemize}
|
||||||
|
|
||||||
\subsection{Event Rates vs. Requested Resources, Resource Reservation, and
|
\subsection{Event Rates vs. Requested Resources, Resource Reservation, and
|
||||||
Resource Utilization}
|
Resource Utilization}\label{fig8-section}
|
||||||
\input{figures/figure_8}
|
\input{figures/figure_8}
|
||||||
|
|
||||||
Refer to figure~\ref{fig:figureVIII-a}, figure~\ref{fig:figureVIII-a-csts}
|
Refer to Figure~\ref{fig:figureVIII-a}, Figure~\ref{fig:figureVIII-a-csts},
|
||||||
figure~\ref{fig:figureVIII-b}, figure~\ref{fig:figureVIII-b-csts}
|
Figure~\ref{fig:figureVIII-b}, Figure~\ref{fig:figureVIII-b-csts},
|
||||||
figure~\ref{fig:figureVIII-c}, figure~\ref{fig:figureVIII-c-csts}
|
Figure~\ref{fig:figureVIII-c}, Figure~\ref{fig:figureVIII-c-csts},
|
||||||
figure~\ref{fig:figureVIII-d}, figure~\ref{fig:figureVIII-d-csts}
|
Figure~\ref{fig:figureVIII-d}, Figure~\ref{fig:figureVIII-d-csts},
|
||||||
figure~\ref{fig:figureVIII-e}, figure~\ref{fig:figureVIII-e-csts}
|
Figure~\ref{fig:figureVIII-e}, Figure~\ref{fig:figureVIII-e-csts},
|
||||||
figure~\ref{fig:figureVIII-f}, and figure~\ref{fig:figureVIII-f-csts}.
|
Figure~\ref{fig:figureVIII-f}, and Figure~\ref{fig:figureVIII-f-csts}.
|
||||||
|
|
||||||
\subsection{Job Rates vs. Job Size, Job Execution Time, and Machine Locality}
|
\subsection{Job Rates vs. Job Size, Job Execution Time, and Machine Locality
|
||||||
|
}\label{fig9-section}
|
||||||
\input{figures/figure_9}
|
\input{figures/figure_9}
|
||||||
|
|
||||||
Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
|
Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
|
||||||
|
|
|
@ -9,8 +9,8 @@
|
||||||
\begin{figure}[p]
|
\begin{figure}[p]
|
||||||
\machinetimewaste[1]{2011 data}{cluster_2011.pgf}
|
\machinetimewaste[1]{2011 data}{cluster_2011.pgf}
|
||||||
\machinetimewaste[1]{2019 data}{cluster_all.pgf}
|
\machinetimewaste[1]{2019 data}{cluster_all.pgf}
|
||||||
\caption{Relative task time (in milliseconds) spent in each execution phase
|
\caption{Relative task time spent in each execution phase
|
||||||
w.r.t. task termination in 2011 and 2019 traces. X axis shows task termination type,
|
w.r.t.\ task termination in 2011 and 2019 (all clusters aggregated) traces. The x-axis shows task termination type,
|
||||||
the y-axis shows the percentage of total time spent. Colors break down the time into execution phases. ``Unknown'' execution times are
|
the y-axis shows the percentage of total time spent. Colors break down the time into execution phases. ``Unknown'' execution times are
|
||||||
2019-specific and correspond to event time transitions that are not considered ``typical'' by Google.}\label{fig:machinetimewaste-rel}
|
2019-specific and correspond to event time transitions that are not considered ``typical'' by Google.}\label{fig:machinetimewaste-rel}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
@ -24,6 +24,6 @@ Y axis shows total time \% spent. Colors break down the time in execution phases
|
||||||
\machinetimewaste{Cluster F}{cluster_f.pgf}
|
\machinetimewaste{Cluster F}{cluster_f.pgf}
|
||||||
\machinetimewaste{Cluster G}{cluster_g.pgf}
|
\machinetimewaste{Cluster G}{cluster_g.pgf}
|
||||||
\machinetimewaste{Cluster H}{cluster_h.pgf}
|
\machinetimewaste{Cluster H}{cluster_h.pgf}
|
||||||
\caption{Relative task time (in milliseconds) spent in each execution phase w.r.t. clusters in the
|
\caption{Relative task time spent in each execution phase w.r.t.\ clusters in the
|
||||||
2019 trace. Refer to figure~\ref{fig:machinetimewaste-rel} for axes description.}\label{fig:machinetimewaste-rel-csts}
|
2019 trace. Refer to Figure~\ref{fig:machinetimewaste-rel} for axes description.}\label{fig:machinetimewaste-rel-csts}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
|