report work

Claudio Maggioni 2021-05-27 15:20:08 +02:00
parent e200cea3ab
commit d6780ffa6c
2 changed files with 49 additions and 27 deletions



@ -449,10 +449,7 @@ computing slowdown values given the previously computed execution attempt time
deltas. Finally, the mean of the slowdown values is computed, resulting
in the clear and concise tables found in figure~\ref{fig:taskslowdown}.
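
As a minimal sketch of the quantity being averaged, assume that the slowdown of
a task is the ratio between the total response time over all of its execution
attempts and the response time of its last attempt. This definition is an
assumption made here for illustration and is not stated in this section; the
symbols $r_{t,i}$ (response time of attempt $i$ of task $t$), $n_t$ (number of
attempts of task $t$) and $T$ (the set of tasks considered) are likewise
hypothetical names:
\[
  s_t = \frac{\sum_{i=1}^{n_t} r_{t,i}}{r_{t,n_t}},
  \qquad
  \bar{s} = \frac{1}{|T|} \sum_{t \in T} s_t
\]
Under this assumption a task that completes on its first attempt has $s_t = 1$,
and every additional attempt pushes $s_t$ above 1.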
\section{Analysis: Performance Input of Unsuccessful Executions}
\input{figures/machine_time_waste}
Our first investigation focuses on replicating the methodologies used in the
2015 DSN Ros\'a et al.\ paper\cite{vino-paper} regarding usage of machine time
@ -465,6 +462,7 @@ from the 2019 traces to the ones that were obtained in 2015 to understand the
workload evolution inside Borg between 2011 and 2019.
\subsection{Temporal Impact: Machine Time Waste}
\input{figures/machine_time_waste}
This analysis explores how machine time is distributed over task events and
submissions. By partitioning the collection of all terminating tasks by their
@ -565,35 +563,59 @@ higher machine time spent for unsuccessful executions (as observed in the
previous analysis) and increased slowdown rate for this class is not particularly
surprising.
The number of non-successful task terminations in the 2019 traces is also rather
high when compared to 2011 data, as can be evinced from the low percentage of
\texttt{FINISH}ed tasks across priority tiers.
Another noteworthy difference is in the mean response times for all and last
executions: while the mean response time is overall shorter in the 2019
traces by an order of magnitude, the new traces show an overall significantly
higher mean response time than in the 2011 data.
Across individual 2019 clusters (as in figure~\ref{fig:taskslowdown-csts}), the
data shows mostly uniform behaviour, apart from some noteworthy mean slowdown
spikes. Indeed, cluster A has 82.97 mean slowdown in the ``Free'' tier,
cluster G has 19.06 and 14.57 mean slowdown in the ``BEB'' and ``Production''
tiers respectively, and cluster D has 12.04 mean slowdown in its ``Free'' tier.
\subsection{Spatial Impact: Resource Waste}
\input{figures/spatial_resource_waste}
In this analysis we aim to understand how physical resources of machines
in the Borg cluster are used to complete tasks. In particular, we compare how
CPU and memory resource allocation and usage are distributed among tasks based
on their termination type.
Due to limited computational resources w.r.t.\ the data analysis process, the
resource usage for clusters E to H in the 2019 traces is missing. However, a
comparison between 2011 resource usage and the aggregated resource usage of
clusters A to D in the 2019 traces can be found in
figure~\ref{fig:spatialresourcewaste-actual}. Additionally, a
cluster-by-cluster breakdown for the 2019 data can be found in
figure~\ref{fig:spatialresourcewaste-actual-csts}.
From these figures it is clear that, compared to the relatively even
distribution of used resources in the 2011 traces, the distribution of resources
in the 2019 Borg clusters became strikingly uneven, registering a combined
86.29\% of CPU resource usage and 84.86\% of memory usage for \texttt{KILL}ed
tasks. In contrast, all other task termination types have a significantly lower
resource usage: \texttt{EVICT}ed, \texttt{FAIL}ed and \texttt{FINISH}ed tasks
register respectively 8.53\%, 3.17\% and 2.02\% CPU usage and 9.03\%, 4.45\%
and 1.66\% memory usage.
This resource distribution can also be found in the data from individual
clusters in figure~\ref{fig:spatialresourcewaste-actual-csts}, where more
than 80\% of resources are always devoted to \texttt{KILL}ed tasks.
With roughly 98\% of CPU and memory resources used by ultimately
non-successful tasks, it is clear that spatial resource waste is high in the
2019 traces.
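
As a quick check of this aggregate share, the percentages reported above for
\texttt{KILL}ed, \texttt{EVICT}ed and \texttt{FAIL}ed tasks can be summed
directly (no figures other than those already quoted are used):
\[
  86.29\% + 8.53\% + 3.17\% = 97.99\% \quad \mbox{(CPU)}
\]
\[
  84.86\% + 9.03\% + 4.45\% = 98.34\% \quad \mbox{(memory)}
\]
The CPU sum differs by 0.01 percentage points from $100\% - 2.02\% = 97.98\%$,
presumably because the individual shares are rounded to two decimal places.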
\textbf{TBD figure~\ref{fig:spatialresourcewaste-requested}}
\input{figures/table_iii} % has table III and table IV in it
\input{figures/figure_5}
\hypertarget{reserved-and-actual-resource-usage-of-tasks}{%
\subsection{Reserved and actual resource usage of
tasks}\label{reserved-and-actual-resource-usage-of-tasks}}
Refer to figures \ref{fig:spatialresourcewaste-actual} and
\ref{fig:spatialresourcewaste-requested}.
\textbf{Observations}:
\begin{itemize}
\item
Most (measured and requested) resources are used by killed jobs, even
more than in the 2011 traces.
\item
Behaviour is rather homogeneous across datacenters, with the exception
of cluster G, where a large number of LOST-terminated tasks acquired 70\% of
both CPU and RAM.
\end{itemize}
Refer to figure \ref{fig:tableIII}.
\textbf{Observations}: