report work

Claudio Maggioni 2021-05-31 16:38:58 +02:00
parent 2d1b357500
commit 96db36d8d6
6 changed files with 57 additions and 40 deletions

Binary file not shown.


@ -68,7 +68,7 @@ scheduling, priority management, and failures of a real production workload.
This data was 2009 This data was 2009
This data was the foundation of the 2015 Ros\'a et al.\ paper This data was the foundation of the 2015 Ros\'a et al.\ paper
\textit{Understanding the Dark Side of Big Data Clusters: An Analysis beyond \textit{Understanding the Dark Side of Big Data Clusters: An Analysis beyond
Failures}\cite{vino-paper}, which in its many conclusions highlighted the need Failures}\cite{dsn-paper}, which in its many conclusions highlighted the need
for better cluster management highlighting the high amount of failures found in for better cluster management highlighting the high amount of failures found in
the traces. the traces.
@ -103,7 +103,7 @@ techniques used to perform the queries and analyses on the 2019 traces.
In 2015, Dr.~Andrea Rosà et al.\ published a In 2015, Dr.~Andrea Rosà et al.\ published a
research paper titled \textit{Understanding the Dark Side of Big Data Clusters: research paper titled \textit{Understanding the Dark Side of Big Data Clusters:
An Analysis beyond Failures}\cite{vino-paper} in which they performed several An Analysis beyond Failures}\cite{dsn-paper} in which they performed several
analysis on unsuccessful executions in the Google's 2011 Borg cluster traces analysis on unsuccessful executions in the Google's 2011 Borg cluster traces
with the aim of identifying their resource waste, their impacts on the with the aim of identifying their resource waste, their impacts on the
performance of the application, and any causes that may lie behind such performance of the application, and any causes that may lie behind such
@ -145,7 +145,7 @@ In general events can be of two kinds, there are events that are relative to the
status of the schedule, and there are other events that are relative to the status of the schedule, and there are other events that are relative to the
status of a task itself. status of a task itself.
\begin{figure}[h] \begin{figure}[t]
\begin{center} \begin{center}
\begin{tabular}{p{3cm}p{12cm}} \begin{tabular}{p{3cm}p{12cm}}
\toprule \toprule
@ -167,7 +167,7 @@ status of a task itself.
Figure~\ref{fig:eventTypes} shows the expected transitions between event Figure~\ref{fig:eventTypes} shows the expected transitions between event
types. types.
\begin{figure}[h] \begin{figure}[t]
\centering \centering
\resizebox{\textwidth}{!}{% \resizebox{\textwidth}{!}{%
\includegraphics{./figures/event_types.png}} \includegraphics{./figures/event_types.png}}
@ -253,8 +253,8 @@ comes from 8 Borg cells spanning 8 different datacenters located in different
geographical positions, all focused on computational oriented workloads. The geographical positions, all focused on computational oriented workloads. The
data collection time span matches the entire month of May 2019. data collection time span matches the entire month of May 2019.
Due to the inherent complexity in analyzing traces of this size, novel Due to the inherent complexity in analyzing traces of this size, non-trivial
bleeding-edge data engineering tecniques were adopted to performed the required data engineering tecniques were adopted to performed the required
computations. We used the framework Apache Spark to perform efficient and computations. We used the framework Apache Spark to perform efficient and
parallel Map-Reduce computations. In this section, we discuss the technical parallel Map-Reduce computations. In this section, we discuss the technical
details behind our approach. details behind our approach.
@ -324,9 +324,9 @@ possibility and insert back the omitted record attributes.
\subsubsection{The queries} \subsubsection{The queries}
Most queries use only two or three fields in each trace records, while the Most queries use only two or three fields in each trace records, while the
original table records often are made of a couple of dozen fields. In order to save original table records often are made of a couple of dozen fields. In order to
memory during the query, a projection is often applied to the data by the means save memory during the query, a projection is often applied to the data by the
of a \texttt{.map()} operation over the entire trace set, performed using means of a \texttt{.map()} operation over the entire trace set, performed using
Spark's RDD API. Spark's RDD API.
Another operation that is often necessary to perform prior to the Map-Reduce Another operation that is often necessary to perform prior to the Map-Reduce
@ -375,9 +375,9 @@ successful termination or not, and finally combine this data to compute
slowdown, mean slowdown and ultimately the final table found in slowdown, mean slowdown and ultimately the final table found in
figure~\ref{fig:taskslowdown}. figure~\ref{fig:taskslowdown}.
\begin{figure}[h] \begin{figure}[t]
\centering \hspace{-0.075\textwidth}
\includegraphics[width=.75\textwidth]{figures/task_slowdown_query.png} \includegraphics[width=1.15\textwidth]{figures/task_slowdown_query.png}
\caption{Diagram of the script used for the ``task slowdown'' \caption{Diagram of the script used for the ``task slowdown''
query.}\label{fig:taskslowdownquery} query.}\label{fig:taskslowdownquery}
\end{figure} \end{figure}
@ -429,7 +429,7 @@ in the clear and coincise tables found in figure~\ref{fig:taskslowdown}.
\section{Analysis: Performance Input of Unsuccessful Executions} \section{Analysis: Performance Input of Unsuccessful Executions}
Our first investigation focuses on replicating the methodologies used in the Our first investigation focuses on replicating the methodologies used in the
2015 DSN Ros\'a et al.\ paper\cite{vino-paper} regarding usage of machine time 2015 DSN Ros\'a et al.\ paper\cite{dsn-paper} regarding usage of machine time
and resources. and resources.
In this section we perform several analyses focusing on how machine time and In this section we perform several analyses focusing on how machine time and
@ -516,7 +516,7 @@ Refer to figure~\ref{fig:taskslowdown} for a comparison between the 2011 and
means are computed on a cluster-by-cluster basis for 2019 data in means are computed on a cluster-by-cluster basis for 2019 data in
figure~\ref{fig:taskslowdown-csts}. figure~\ref{fig:taskslowdown-csts}.
In 2015 Ros\'a et al.\cite{vino-paper} measured mean task slowdown per each task In 2015 Ros\'a et al.\cite{dsn-paper} measured mean task slowdown per each task
priority value, which at the time were $[0,11]$ numeric values. However, priority value, which at the time were $[0,11]$ numeric values. However,
in 2019 traces, task priorities are given as a $[0,500]$ numeric value. in 2019 traces, task priorities are given as a $[0,500]$ numeric value.
Therefore, to allow for an easier comparison, mean task slowdown values are Therefore, to allow for an easier comparison, mean task slowdown values are
@ -614,12 +614,29 @@ With more than 98\% of both CPU and memory resources used by
non-successful tasks, it is clear the spatial resource waste is high in the 2019 non-successful tasks, it is clear the spatial resource waste is high in the 2019
traces. traces.
\section{Analysis: Pattern and Models for Task and Job Events} \section{Analysis: Patterns of Task and Job Events}
This section aims to use some of the techniques used in section IV of
the Ros\'a et al.\ paper\cite{dsn-paper} to find patterns and interdependencies
between task and job events by gathering statistics on those events.
\subsection{Unsuccessful Task Event Patterns} \subsection{Unsuccessful Task Event Patterns}
\input{figures/table_iii} % has table III and table IV in it \input{figures/table_iii} % has table III and table IV in it
Refer to figure \ref{fig:tableIII}. In this analysis we compute the distribution of termination events by type at
the task-level events and the conditional probability of a task succesfully
terminating given a number of \texttt{EVICT}, \texttt{FAIL} and \texttt{FINISH}
termination events during the task execution.
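As a sketch of the quantity involved, and assuming it is estimated as an
empirical relative frequency, the conditional probability for an event type $e$
and an event count $n$ can be written as follows. The symbols $T_e(n)$ and
$S_e(n)$ are notation introduced here for illustration only, and defining
success as a last termination event of type \texttt{FINISH} is likewise an
assumption.
\[
  P(\mathrm{success} \mid n_e = n) = \frac{|S_e(n)|}{|T_e(n)|}
\]
Here $T_e(n)$ is the set of tasks that experienced exactly $n$ events of type
$e$, and $S_e(n) \subseteq T_e(n)$ is the subset of those tasks whose last
termination event is \texttt{FINISH}. The ratio is undefined when $T_e(n)$ is
empty.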
A comparison of the termination event distribution between the 2011 and 2019
traces is shown in figure~\ref{fig:tableIII}. Additionally, a cluster-by-cluster
breakdown of the same data for the 2019 traces is shown in
figure~\ref{fig:tableIII-csts}.
Each table in these figures shows the mean and the 95-th percentile of the
number of termination events per task, broken down by task termination type. In
addition, the table shows the mean number of \texttt{EVICT}, \texttt{FAIL},
\texttt{FINISH}, and \texttt{KILL} events for each task termination type.
\textbf{Observations}: \textbf{Observations}:
@ -636,22 +653,7 @@ Refer to figure \ref{fig:tableIII}.
2019 traces. 2019 traces.
\end{itemize} \end{itemize}
\subsection{Unsuccessful Job Event Patterns} \subsubsection{Conditional Probability of Task Success}
\textbf{Observations}:
\begin{itemize}
\item
Again the mean number of tasks is significantly higher than the 2011
traces, indicating a higher complexity of workloads
\item
Cluster A has no evicted jobs
\item
The number of events is however lower than the event means in the 2011
traces
\end{itemize}
\subsection{Conditional Probability of Task Success}
\input{figures/figure_5} \input{figures/figure_5}
Refer to figure \ref{fig:figureV}. Refer to figure \ref{fig:figureV}.
@ -669,6 +671,21 @@ Refer to figure \ref{fig:figureV}.
lot for small \# evts differences. This may be due to an uneven lot for small \# evts differences. This may be due to an uneven
distribution of \# evts in the traces. distribution of \# evts in the traces.
\end{itemize} \end{itemize}
\subsection{Unsuccessful Job Event Patterns}
\textbf{Observations}:
\begin{itemize}
\item
Again the mean number of tasks is significantly higher than the 2011
traces, indicating a higher complexity of workloads
\item
Cluster A has no evicted jobs
\item
The number of events is however lower than the event means in the 2011
traces
\end{itemize}
\section{Analysis: Potential Causes of Unsuccessful Executions} \section{Analysis: Potential Causes of Unsuccessful Executions}


@ -231,5 +231,5 @@ Unknown & Unknown & 1720 & 2.933251\% \\
0.591797 & 0.666992 & 500 & 0.852689\% \\ 0.591797 & 0.666992 & 500 & 0.852689\% \\
0.958984 & 1.000000 & 200 & 0.341076\% \\ 0.958984 & 1.000000 & 200 & 0.341076\% \\
}{\\\\\\\\\\} }{\\\\\\\\\\}
\caption{Overview of machine configurations in terms of CPU and RAM resources for each cluster in the 2019 traces. Refer to figure~\ref{fig:machineconfig} for a column legend.}\label{fig:machineconfigs-csts} \caption{Overview of machine configurations in terms of CPU and RAM resources for each cluster in the 2019 traces. Refer to figure~\ref{fig:machineconfigs} for a column legend.}\label{fig:machineconfigs-csts}
\end{figure} \end{figure}


@ -9,7 +9,7 @@
\begin{figure}[p] \begin{figure}[p]
\spatialresourcewaste[0.5\textwidth]{used-2011} \spatialresourcewaste[0.5\textwidth]{used-2011}
\spatialresourcewaste[0.5\textwidth]{used-all} \spatialresourcewaste[0.5\textwidth]{used-all}
\caption{Percentages of CPU and RAM resources used by tasks w.r.t. task termination type in 2011 and 2019 traces (total of clusters A to D). The x axis is the type of resource, y-axis is the percentage of resource used and color represents task termination. Numeric values are displayed below the graph as a table.}\label{fig:spatialresourcewaste-requested} \caption{Percentages of CPU and RAM resources used by tasks w.r.t.\ task termination type in 2011 and 2019 traces (total of clusters A to D). The x axis is the type of resource, y-axis is the percentage of resource used and color represents task termination. Numeric values are displayed below the graph as a table.}\label{fig:spatialresourcewaste-actual}
\end{figure} \end{figure}
\begin{figure}[p] \begin{figure}[p]
@ -17,16 +17,16 @@
\spatialresourcewaste{used-b} \spatialresourcewaste{used-b}
\spatialresourcewaste{used-c} \spatialresourcewaste{used-c}
\spatialresourcewaste{used-d} \spatialresourcewaste{used-d}
\caption{Percentages of CPU and RAM resources used by tasks w.r.t. task termination type for clusters A to D in 2019 traces. Refer to figure~\ref{fig:spatialresourcewaste-requested} for plot explaination.}\label{fig:spatialresourcewaste-actual-csts} \caption{Percentages of CPU and RAM resources used by tasks w.r.t.\ task termination type for clusters A to D in 2019 traces. Refer to figure~\ref{fig:spatialresourcewaste-actual} for plot explaination.}\label{fig:spatialresourcewaste-actual-csts}
\end{figure} \end{figure}
\begin{figure}[p] \begin{figure}[p]
\spatialresourcewaste[0.5\textwidth]{requested-2011} \spatialresourcewaste[0.5\textwidth]{requested-2011}
\spatialresourcewaste[0.5\textwidth]{requested-all} \spatialresourcewaste[0.5\textwidth]{requested-all}
\caption{Percentages of CPU and RAM resources requested by tasks w.r.t. task termination type in 2011 and 2019 traces. The x axis is the type of resource, y-axis is the percentage of resource used and color represents task termination. Numeric values are displayed below the graph as a table.}\label{fig:spatialresourcewaste-actual} \caption{Percentages of CPU and RAM resources requested by tasks w.r.t.\ task termination type in 2011 and 2019 traces. The x axis is the type of resource, y-axis is the percentage of resource used and color represents task termination. Numeric values are displayed below the graph as a table.}\label{fig:spatialresourcewaste-requested}
\end{figure} \end{figure}
\begin{figure}[p] \begin{figure}
\spatialresourcewaste{requested-a} \spatialresourcewaste{requested-a}
\spatialresourcewaste{requested-b} \spatialresourcewaste{requested-b}
\spatialresourcewaste{requested-c} \spatialresourcewaste{requested-c}
@ -35,5 +35,5 @@
\spatialresourcewaste{requested-f} \spatialresourcewaste{requested-f}
\spatialresourcewaste{requested-g} \spatialresourcewaste{requested-g}
\spatialresourcewaste{requested-h} \spatialresourcewaste{requested-h}
\caption{Percentages of CPU and RAM resources requested by tasks w.r.t. task termination type for in 2019 traces. Refer to figure~\ref{fig:spatialresourcewaste-requested} for plot explaination.}\label{fig:spatialresourcewaste-actual-csts} \caption{Percentages of CPU and RAM resources requested by tasks w.r.t.\ task termination type for in 2019 traces. Refer to figure~\ref{fig:spatialresourcewaste-requested} for plot explaination.}\label{fig:spatialresourcewaste-requested-csts}
\end{figure} \end{figure}


@ -63,7 +63,7 @@ FINISH & 2.962 (2) & 0.022 & 0.012 & 2.915 & 0.013 \\
tables show an tables show an
overall mean accompanied by the 95-th percentile of all termination overall mean accompanied by the 95-th percentile of all termination
events, followed by the mean of events per event type of each events, followed by the mean of events per event type of each
termination event.} termination event.}\label{fig:tableIII}
\end{figure} \end{figure}
\begin{figure}[p] \begin{figure}[p]


@ -14,7 +14,7 @@ booktitle = {EuroSys'20},
address = {Heraklion, Crete} address = {Heraklion, Crete}
} }
@INPROCEEDINGS{vino-paper, @INPROCEEDINGS{dsn-paper,
author={Rosà, Andrea and Chen, Lydia Y. and Binder, Walter}, author={Rosà, Andrea and Chen, Lydia Y. and Binder, Walter},
booktitle={2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks}, booktitle={2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks},
title={Understanding the Dark Side of Big Data Clusters: An Analysis beyond Failures}, title={Understanding the Dark Side of Big Data Clusters: An Analysis beyond Failures},