report work
parent 2d1b357500
commit 96db36d8d6
6 changed files with 57 additions and 40 deletions
Binary file not shown.
@@ -68,7 +68,7 @@ scheduling, priority management, and failures of a real production workload.
 This data was 2009
 This data was the foundation of the 2015 Ros\'a et al.\ paper
 \textit{Understanding the Dark Side of Big Data Clusters: An Analysis beyond
-Failures}\cite{vino-paper}, which in its many conclusions highlighted the need
+Failures}\cite{dsn-paper}, which in its many conclusions highlighted the need
 for better cluster management highlighting the high amount of failures found in
 the traces.
 
@@ -103,7 +103,7 @@ techniques used to perform the queries and analyses on the 2019 traces.
 
 In 2015, Dr.~Andrea Rosà et al.\ published a
 research paper titled \textit{Understanding the Dark Side of Big Data Clusters:
-An Analysis beyond Failures}\cite{vino-paper} in which they performed several
+An Analysis beyond Failures}\cite{dsn-paper} in which they performed several
 analyses on unsuccessful executions in Google's 2011 Borg cluster traces
 with the aim of identifying their resource waste, their impacts on the
 performance of the application, and any causes that may lie behind such
@@ -145,7 +145,7 @@ In general events can be of two kinds, there are events that are relative to the
 status of the schedule, and there are other events that are relative to the
 status of a task itself.
 
-\begin{figure}[h]
+\begin{figure}[t]
 \begin{center}
 \begin{tabular}{p{3cm}p{12cm}}
 \toprule
@@ -167,7 +167,7 @@ status of a task itself.
 Figure~\ref{fig:eventTypes} shows the expected transitions between event
 types.
 
-\begin{figure}[h]
+\begin{figure}[t]
 \centering
 \resizebox{\textwidth}{!}{%
 \includegraphics{./figures/event_types.png}}
@@ -253,8 +253,8 @@ comes from 8 Borg cells spanning 8 different datacenters located in different
 geographical positions, all focused on computational oriented workloads. The
 data collection time span matches the entire month of May 2019.
 
-Due to the inherent complexity in analyzing traces of this size, novel
-bleeding-edge data engineering techniques were adopted to perform the required
+Due to the inherent complexity in analyzing traces of this size, non-trivial
+data engineering techniques were adopted to perform the required
 computations. We used the framework Apache Spark to perform efficient and
 parallel Map-Reduce computations. In this section, we discuss the technical
 details behind our approach.
@@ -324,9 +324,9 @@ possibility and insert back the omitted record attributes.
 \subsubsection{The queries}
 
 Most queries use only two or three fields in each trace record, while the
-original table records often are made of a couple of dozen fields. In order to save
-memory during the query, a projection is often applied to the data by the means
-of a \texttt{.map()} operation over the entire trace set, performed using
+original table records often are made of a couple of dozen fields. In order to
+save memory during the query, a projection is often applied to the data by the
+means of a \texttt{.map()} operation over the entire trace set, performed using
 Spark's RDD API.
 
 Another operation that is often necessary to perform prior to the Map-Reduce
@@ -375,9 +375,9 @@ successful termination or not, and finally combine this data to compute
 slowdown, mean slowdown and ultimately the final table found in
 figure~\ref{fig:taskslowdown}.
 
-\begin{figure}[h]
-\centering
-\includegraphics[width=.75\textwidth]{figures/task_slowdown_query.png}
+\begin{figure}[t]
+\hspace{-0.075\textwidth}
+\includegraphics[width=1.15\textwidth]{figures/task_slowdown_query.png}
 \caption{Diagram of the script used for the ``task slowdown''
 query.}\label{fig:taskslowdownquery}
 \end{figure}
@@ -429,7 +429,7 @@ in the clear and concise tables found in figure~\ref{fig:taskslowdown}.
 \section{Analysis: Performance Input of Unsuccessful Executions}
 
 Our first investigation focuses on replicating the methodologies used in the
-2015 DSN Ros\'a et al.\ paper\cite{vino-paper} regarding usage of machine time
+2015 DSN Ros\'a et al.\ paper\cite{dsn-paper} regarding usage of machine time
 and resources.
 
 In this section we perform several analyses focusing on how machine time and
@@ -516,7 +516,7 @@ Refer to figure~\ref{fig:taskslowdown} for a comparison between the 2011 and
 means are computed on a cluster-by-cluster basis for 2019 data in
 figure~\ref{fig:taskslowdown-csts}.
 
-In 2015 Ros\'a et al.\cite{vino-paper} measured mean task slowdown per each task
+In 2015 Ros\'a et al.\cite{dsn-paper} measured mean task slowdown per each task
 priority value, which at the time were $[0,11]$ numeric values. However,
 in 2019 traces, task priorities are given as a $[0,500]$ numeric value.
 Therefore, to allow for an easier comparison, mean task slowdown values are
@@ -614,12 +614,29 @@ With more than 98\% of both CPU and memory resources used by
 non-successful tasks, it is clear the spatial resource waste is high in the 2019
 traces.
 
-\section{Analysis: Pattern and Models for Task and Job Events}
+\section{Analysis: Patterns of Task and Job Events}
 
+This section aims to use some of the techniques used in section IV of
+the Ros\'a et al.\ paper\cite{dsn-paper} to find patterns and interdependencies
+between task and job events by gathering event statistics at those events.
+
 \subsection{Unsuccessful Task Event Patterns}
 \input{figures/table_iii} % has table III and table IV in it
 
-Refer to figure \ref{fig:tableIII}.
+In this analysis we compute the distribution of termination events by type at
+the task-level events and the conditional probability of a task successfully
+terminating given a number of \texttt{EVICT}, \texttt{FAIL} and \texttt{FINISH}
+termination events during the task execution.
 
+A comparison of the termination event distribution between the 2011 and 2019
+traces is shown in figure~\ref{fig:tableIII}. Additionally, a cluster-by-cluster
+breakdown of the same data for the 2019 traces is shown in
+figure~\ref{fig:tableIII-csts}.
+
+Each table from these figures shows the mean and the 95-th percentile of the
+number of termination events per task, broken down by task termination. In
+addition, the table shows the mean number of \texttt{EVICT}, \texttt{FAIL},
+\texttt{FINISH}, and \texttt{KILL} for each task event termination.
+
 \textbf{Observations}:
 
@@ -636,22 +653,7 @@ Refer to figure \ref{fig:tableIII}.
 2019 traces.
 \end{itemize}
 
-\subsection{Unsuccessful Job Event Patterns}
-
-\textbf{Observations}:
-
-\begin{itemize}
-\item
-Again the mean number of tasks is significantly higher than the 2011
-traces, indicating a higher complexity of workloads
-\item
-Cluster A has no evicted jobs
-\item
-The number of events is however lower than the event means in the 2011
-traces
-\end{itemize}
-
-\subsection{Conditional Probability of Task Success}
+\subsubsection{Conditional Probability of Task Success}
 \input{figures/figure_5}
 
 Refer to figure \ref{fig:figureV}.
@@ -669,6 +671,21 @@ Refer to figure \ref{fig:figureV}.
 lot for small \# evts differences. This may be due to an uneven
 distribution of \# evts in the traces.
 \end{itemize}
+\subsection{Unsuccessful Job Event Patterns}
+
+\textbf{Observations}:
+
+\begin{itemize}
+\item
+Again the mean number of tasks is significantly higher than the 2011
+traces, indicating a higher complexity of workloads
+\item
+Cluster A has no evicted jobs
+\item
+The number of events is however lower than the event means in the 2011
+traces
+\end{itemize}
+
 
 \section{Analysis: Potential Causes of Unsuccessful Executions}
 
@@ -231,5 +231,5 @@ Unknown & Unknown & 1720 & 2.933251\% \\
 0.591797 & 0.666992 & 500 & 0.852689\% \\
 0.958984 & 1.000000 & 200 & 0.341076\% \\
 }{\\\\\\\\\\}
-\caption{Overview of machine configurations in terms of CPU and RAM resources for each cluster in the 2019 traces. Refer to figure~\ref{fig:machineconfig} for a column legend.}\label{fig:machineconfigs-csts}
+\caption{Overview of machine configurations in terms of CPU and RAM resources for each cluster in the 2019 traces. Refer to figure~\ref{fig:machineconfigs} for a column legend.}\label{fig:machineconfigs-csts}
 \end{figure}
@@ -9,7 +9,7 @@
 \begin{figure}[p]
 \spatialresourcewaste[0.5\textwidth]{used-2011}
 \spatialresourcewaste[0.5\textwidth]{used-all}
-\caption{Percentages of CPU and RAM resources used by tasks w.r.t. task termination type in 2011 and 2019 traces (total of clusters A to D). The x axis is the type of resource, y-axis is the percentage of resource used and color represents task termination. Numeric values are displayed below the graph as a table.}\label{fig:spatialresourcewaste-requested}
+\caption{Percentages of CPU and RAM resources used by tasks w.r.t.\ task termination type in 2011 and 2019 traces (total of clusters A to D). The x axis is the type of resource, y-axis is the percentage of resource used and color represents task termination. Numeric values are displayed below the graph as a table.}\label{fig:spatialresourcewaste-actual}
 \end{figure}
 
 \begin{figure}[p]
@@ -17,16 +17,16 @@
 \spatialresourcewaste{used-b}
 \spatialresourcewaste{used-c}
 \spatialresourcewaste{used-d}
-\caption{Percentages of CPU and RAM resources used by tasks w.r.t. task termination type for clusters A to D in 2019 traces. Refer to figure~\ref{fig:spatialresourcewaste-requested} for plot explanation.}\label{fig:spatialresourcewaste-actual-csts}
+\caption{Percentages of CPU and RAM resources used by tasks w.r.t.\ task termination type for clusters A to D in 2019 traces. Refer to figure~\ref{fig:spatialresourcewaste-actual} for plot explanation.}\label{fig:spatialresourcewaste-actual-csts}
 \end{figure}
 
 \begin{figure}[p]
 \spatialresourcewaste[0.5\textwidth]{requested-2011}
 \spatialresourcewaste[0.5\textwidth]{requested-all}
-\caption{Percentages of CPU and RAM resources requested by tasks w.r.t. task termination type in 2011 and 2019 traces. The x axis is the type of resource, y-axis is the percentage of resource used and color represents task termination. Numeric values are displayed below the graph as a table.}\label{fig:spatialresourcewaste-actual}
+\caption{Percentages of CPU and RAM resources requested by tasks w.r.t.\ task termination type in 2011 and 2019 traces. The x axis is the type of resource, y-axis is the percentage of resource used and color represents task termination. Numeric values are displayed below the graph as a table.}\label{fig:spatialresourcewaste-requested}
 \end{figure}
 
-\begin{figure}[p]
+\begin{figure}
 \spatialresourcewaste{requested-a}
 \spatialresourcewaste{requested-b}
 \spatialresourcewaste{requested-c}
@@ -35,5 +35,5 @@
 \spatialresourcewaste{requested-f}
 \spatialresourcewaste{requested-g}
 \spatialresourcewaste{requested-h}
-\caption{Percentages of CPU and RAM resources requested by tasks w.r.t. task termination type in 2019 traces. Refer to figure~\ref{fig:spatialresourcewaste-requested} for plot explanation.}\label{fig:spatialresourcewaste-actual-csts}
+\caption{Percentages of CPU and RAM resources requested by tasks w.r.t.\ task termination type in 2019 traces. Refer to figure~\ref{fig:spatialresourcewaste-requested} for plot explanation.}\label{fig:spatialresourcewaste-requested-csts}
 \end{figure}
@@ -63,7 +63,7 @@ FINISH & 2.962 (2) & 0.022 & 0.012 & 2.915 & 0.013 \\
 tables show an
 overall mean accompanied by the 95-th percentile of all termination
 events, followed by the mean of events per event type of each
-termination event.}
+termination event.}\label{fig:tableIII}
 \end{figure}
 
 \begin{figure}[p]
@@ -14,7 +14,7 @@ booktitle = {EuroSys'20},
 address = {Heraklion, Crete}
 }
 
-@INPROCEEDINGS{vino-paper,
+@INPROCEEDINGS{dsn-paper,
 author={Rosà, Andrea and Chen, Lydia Y. and Binder, Walter},
 booktitle={2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks},
 title={Understanding the Dark Side of Big Data Clusters: An Analysis beyond Failures},