report work
parent
2d1b357500
commit
96db36d8d6
6 changed files with 57 additions and 40 deletions
Binary file not shown.
|
@ -68,7 +68,7 @@ scheduling, priority management, and failures of a real production workload.
|
||||||
This data was 2009
|
This data was 2009
|
||||||
This data was the foundation of the 2015 Ros\'a et al.\ paper
|
This data was the foundation of the 2015 Ros\'a et al.\ paper
|
||||||
\textit{Understanding the Dark Side of Big Data Clusters: An Analysis beyond
|
\textit{Understanding the Dark Side of Big Data Clusters: An Analysis beyond
|
||||||
Failures}\cite{vino-paper}, which in its many conclusions highlighted the need
|
Failures}\cite{dsn-paper}, which in its many conclusions highlighted the need
|
||||||
for better cluster management given the high number of failures found in
|
for better cluster management given the high number of failures found in
|
||||||
the traces.
|
the traces.
|
||||||
|
|
||||||
|
@ -103,7 +103,7 @@ techniques used to perform the queries and analyses on the 2019 traces.
|
||||||
|
|
||||||
In 2015, Dr.~Andrea Rosà et al.\ published a
|
In 2015, Dr.~Andrea Rosà et al.\ published a
|
||||||
research paper titled \textit{Understanding the Dark Side of Big Data Clusters:
|
research paper titled \textit{Understanding the Dark Side of Big Data Clusters:
|
||||||
An Analysis beyond Failures}\cite{vino-paper} in which they performed several
|
An Analysis beyond Failures}\cite{dsn-paper} in which they performed several
|
||||||
analyses of unsuccessful executions in Google's 2011 Borg cluster traces
|
analyses of unsuccessful executions in Google's 2011 Borg cluster traces
|
||||||
with the aim of identifying their resource waste, their impacts on the
|
with the aim of identifying their resource waste, their impacts on the
|
||||||
performance of the application, and any causes that may lie behind such
|
performance of the application, and any causes that may lie behind such
|
||||||
|
@ -145,7 +145,7 @@ In general events can be of two kinds, there are events that are relative to the
|
||||||
status of the schedule, and there are other events that are relative to the
|
status of the schedule, and there are other events that are relative to the
|
||||||
status of a task itself.
|
status of a task itself.
|
||||||
|
|
||||||
\begin{figure}[h]
|
\begin{figure}[t]
|
||||||
\begin{center}
|
\begin{center}
|
||||||
\begin{tabular}{p{3cm}p{12cm}}
|
\begin{tabular}{p{3cm}p{12cm}}
|
||||||
\toprule
|
\toprule
|
||||||
|
@ -167,7 +167,7 @@ status of a task itself.
|
||||||
Figure~\ref{fig:eventTypes} shows the expected transitions between event
|
Figure~\ref{fig:eventTypes} shows the expected transitions between event
|
||||||
types.
|
types.
|
||||||
|
|
||||||
\begin{figure}[h]
|
\begin{figure}[t]
|
||||||
\centering
|
\centering
|
||||||
\resizebox{\textwidth}{!}{%
|
\resizebox{\textwidth}{!}{%
|
||||||
\includegraphics{./figures/event_types.png}}
|
\includegraphics{./figures/event_types.png}}
|
||||||
|
@ -253,8 +253,8 @@ comes from 8 Borg cells spanning 8 different datacenters located in different
|
||||||
geographical positions, all focused on computationally oriented workloads. The
|
geographical positions, all focused on computationally oriented workloads. The
|
||||||
data collection time span matches the entire month of May 2019.
|
data collection time span matches the entire month of May 2019.
|
||||||
|
|
||||||
Due to the inherent complexity in analyzing traces of this size, novel
|
Due to the inherent complexity in analyzing traces of this size, non-trivial
|
||||||
bleeding-edge data engineering techniques were adopted to perform the required
|
data engineering techniques were adopted to perform the required
|
||||||
computations. We used the Apache Spark framework to perform efficient and
|
computations. We used the Apache Spark framework to perform efficient and
|
||||||
parallel Map-Reduce computations. In this section, we discuss the technical
|
parallel Map-Reduce computations. In this section, we discuss the technical
|
||||||
details behind our approach.
|
details behind our approach.
|
||||||
|
@ -324,9 +324,9 @@ possibility and insert back the omitted record attributes.
|
||||||
\subsubsection{The queries}
|
\subsubsection{The queries}
|
||||||
|
|
||||||
Most queries use only two or three fields of each trace record, while the
|
Most queries use only two or three fields of each trace record, while the
|
||||||
original table records often are made of a couple of dozen fields. In order to save
|
original table records often are made of a couple of dozen fields. In order to
|
||||||
memory during the query, a projection is often applied to the data by means
|
save memory during the query, a projection is often applied to the data by
|
||||||
of a \texttt{.map()} operation over the entire trace set, performed using
|
means of a \texttt{.map()} operation over the entire trace set, performed using
|
||||||
Spark's RDD API.
|
Spark's RDD API.
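
The projection described above can be sketched in plain Python. This is only an illustration under stated assumptions: the record layout and the field names (\texttt{time}, \texttt{collection\_id}, \texttt{instance\_index}, \texttt{type}, \texttt{priority}, \texttt{machine\_id}) and the helper \texttt{project} are hypothetical, not the actual trace schema or the script used in the report. With Spark, the same function would be passed to the RDD's \texttt{map()} method instead of Python's built-in \texttt{map}.

```python
# Hypothetical trace records: the field names are illustrative assumptions,
# not the real schema of the 2019 traces.
records = [
    {"time": 10, "collection_id": 1, "instance_index": 0,
     "type": 3, "priority": 200, "machine_id": 7},
    {"time": 25, "collection_id": 1, "instance_index": 1,
     "type": 6, "priority": 200, "machine_id": 9},
]


def project(record, fields=("collection_id", "instance_index", "type")):
    """Keep only the fields a query needs, dropping everything else."""
    return tuple(record[f] for f in fields)


# Built-in map() stands in for the RDD map operation here;
# on a Spark RDD this would read: projected = rdd.map(project)
projected = list(map(project, records))
```

Each projected tuple holds three values instead of six, which is where the memory saving comes from when the same idea is applied to records with a couple of dozen fields.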
|
||||||
|
|
||||||
Another operation that is often necessary to perform prior to the Map-Reduce
|
Another operation that is often necessary to perform prior to the Map-Reduce
|
||||||
|
@ -375,9 +375,9 @@ successful termination or not, and finally combine this data to compute
|
||||||
slowdown, mean slowdown and ultimately the final table found in
|
slowdown, mean slowdown and ultimately the final table found in
|
||||||
figure~\ref{fig:taskslowdown}.
|
figure~\ref{fig:taskslowdown}.
|
||||||
|
|
||||||
\begin{figure}[h]
|
\begin{figure}[t]
|
||||||
\centering
|
\hspace{-0.075\textwidth}
|
||||||
\includegraphics[width=.75\textwidth]{figures/task_slowdown_query.png}
|
\includegraphics[width=1.15\textwidth]{figures/task_slowdown_query.png}
|
||||||
\caption{Diagram of the script used for the ``task slowdown''
|
\caption{Diagram of the script used for the ``task slowdown''
|
||||||
query.}\label{fig:taskslowdownquery}
|
query.}\label{fig:taskslowdownquery}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
@ -429,7 +429,7 @@ in the clear and concise tables found in figure~\ref{fig:taskslowdown}.
|
||||||
\section{Analysis: Performance Input of Unsuccessful Executions}
|
\section{Analysis: Performance Input of Unsuccessful Executions}
|
||||||
|
|
||||||
Our first investigation focuses on replicating the methodologies used in the
|
Our first investigation focuses on replicating the methodologies used in the
|
||||||
2015 DSN Ros\'a et al.\ paper\cite{vino-paper} regarding usage of machine time
|
2015 DSN Ros\'a et al.\ paper\cite{dsn-paper} regarding usage of machine time
|
||||||
and resources.
|
and resources.
|
||||||
|
|
||||||
In this section we perform several analyses focusing on how machine time and
|
In this section we perform several analyses focusing on how machine time and
|
||||||
|
@ -516,7 +516,7 @@ Refer to figure~\ref{fig:taskslowdown} for a comparison between the 2011 and
|
||||||
means are computed on a cluster-by-cluster basis for 2019 data in
|
means are computed on a cluster-by-cluster basis for 2019 data in
|
||||||
figure~\ref{fig:taskslowdown-csts}.
|
figure~\ref{fig:taskslowdown-csts}.
|
||||||
|
|
||||||
In 2015 Ros\'a et al.\cite{vino-paper} measured mean task slowdown for each task
|
In 2015 Ros\'a et al.\cite{dsn-paper} measured mean task slowdown for each task
|
||||||
priority value, which at the time was a numeric value in $[0,11]$. However,
|
priority value, which at the time was a numeric value in $[0,11]$. However,
|
||||||
in the 2019 traces, task priorities are given as a numeric value in $[0,500]$.
|
in the 2019 traces, task priorities are given as a numeric value in $[0,500]$.
|
||||||
Therefore, to allow for an easier comparison, mean task slowdown values are
|
Therefore, to allow for an easier comparison, mean task slowdown values are
|
||||||
|
@ -614,12 +614,29 @@ With more than 98\% of both CPU and memory resources used by
|
||||||
non-successful tasks, it is clear the spatial resource waste is high in the 2019
|
non-successful tasks, it is clear the spatial resource waste is high in the 2019
|
||||||
traces.
|
traces.
|
||||||
|
|
||||||
\section{Analysis: Pattern and Models for Task and Job Events}
|
\section{Analysis: Patterns of Task and Job Events}
|
||||||
|
|
||||||
|
This section applies some of the techniques used in section IV of
|
||||||
|
the Ros\'a et al.\ paper\cite{dsn-paper} to find patterns and interdependencies
|
||||||
|
between task and job events by gathering statistics on those events.
|
||||||
|
|
||||||
\subsection{Unsuccessful Task Event Patterns}
|
\subsection{Unsuccessful Task Event Patterns}
|
||||||
\input{figures/table_iii} % has table III and table IV in it
|
\input{figures/table_iii} % has table III and table IV in it
|
||||||
|
|
||||||
Refer to figure \ref{fig:tableIII}.
|
In this analysis we compute the distribution of termination events by type at
|
||||||
|
the task level and the conditional probability of a task successfully
|
||||||
|
terminating given a number of \texttt{EVICT}, \texttt{FAIL} and \texttt{FINISH}
|
||||||
|
termination events during the task execution.
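
As a rough illustration of the conditional probability mentioned above, the following plain-Python sketch estimates the probability that a task terminates successfully given how many events of one type it experienced. It rests on assumptions not stated in the report: each task is represented as a non-empty list of termination event names in chronological order, and a task counts as successful when its last event is \texttt{FINISH}. The function name \texttt{success\_probability} and the sample data are hypothetical.

```python
from collections import defaultdict


def success_probability(tasks, event_type):
    """Estimate P(task ends with FINISH | task had n events of event_type).

    `tasks` is a list of non-empty, chronologically ordered lists of
    termination event names (an assumed representation).
    """
    totals = defaultdict(int)     # number of tasks with n such events
    successes = defaultdict(int)  # of those, how many ended with FINISH
    for events in tasks:
        n = events.count(event_type)
        totals[n] += 1
        if events[-1] == "FINISH":
            successes[n] += 1
    return {n: successes[n] / totals[n] for n in totals}


# Made-up sample data, for illustration only.
tasks = [
    ["EVICT", "FINISH"],
    ["EVICT", "KILL"],
    ["FINISH"],
    ["FAIL", "FAIL", "KILL"],
]
probs = success_probability(tasks, "EVICT")
```

In this toy data set, tasks with zero and with one \texttt{EVICT} event both succeed half of the time; on the real traces the same grouping would be done per event type and per event count.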
|
||||||
|
|
||||||
|
A comparison of the termination event distribution between the 2011 and 2019
|
||||||
|
traces is shown in figure~\ref{fig:tableIII}. Additionally, a cluster-by-cluster
|
||||||
|
breakdown of the same data for the 2019 traces is shown in
|
||||||
|
figure~\ref{fig:tableIII-csts}.
|
||||||
|
|
||||||
|
Each table in these figures shows the mean and the 95-th percentile of the
|
||||||
|
number of termination events per task, broken down by task termination. In
|
||||||
|
addition, each table shows the mean number of \texttt{EVICT}, \texttt{FAIL},
|
||||||
|
\texttt{FINISH}, and \texttt{KILL} events for each task termination type.
|
||||||
|
|
||||||
\textbf{Observations}:
|
\textbf{Observations}:
|
||||||
|
|
||||||
|
@ -636,22 +653,7 @@ Refer to figure \ref{fig:tableIII}.
|
||||||
2019 traces.
|
2019 traces.
|
||||||
\end{itemize}
|
\end{itemize}
|
||||||
|
|
||||||
\subsection{Unsuccessful Job Event Patterns}
|
\subsubsection{Conditional Probability of Task Success}
|
||||||
|
|
||||||
\textbf{Observations}:
|
|
||||||
|
|
||||||
\begin{itemize}
|
|
||||||
\item
|
|
||||||
Again, the mean number of tasks is significantly higher than in the 2011
|
|
||||||
traces, indicating a higher complexity of workloads
|
|
||||||
\item
|
|
||||||
Cluster A has no evicted jobs
|
|
||||||
\item
|
|
||||||
The number of events is however lower than the event means in the 2011
|
|
||||||
traces
|
|
||||||
\end{itemize}
|
|
||||||
|
|
||||||
\subsection{Conditional Probability of Task Success}
|
|
||||||
\input{figures/figure_5}
|
\input{figures/figure_5}
|
||||||
|
|
||||||
Refer to figure \ref{fig:figureV}.
|
Refer to figure \ref{fig:figureV}.
|
||||||
|
@ -669,6 +671,21 @@ Refer to figure \ref{fig:figureV}.
|
||||||
lot for small \# evts differences. This may be due to an uneven
|
lot for small \# evts differences. This may be due to an uneven
|
||||||
distribution of \# evts in the traces.
|
distribution of \# evts in the traces.
|
||||||
\end{itemize}
|
\end{itemize}
|
||||||
|
\subsection{Unsuccessful Job Event Patterns}
|
||||||
|
|
||||||
|
\textbf{Observations}:
|
||||||
|
|
||||||
|
\begin{itemize}
|
||||||
|
\item
|
||||||
|
Again, the mean number of tasks is significantly higher than in the 2011
|
||||||
|
traces, indicating a higher complexity of workloads
|
||||||
|
\item
|
||||||
|
Cluster A has no evicted jobs
|
||||||
|
\item
|
||||||
|
The number of events is however lower than the event means in the 2011
|
||||||
|
traces
|
||||||
|
\end{itemize}
|
||||||
|
|
||||||
|
|
||||||
\section{Analysis: Potential Causes of Unsuccessful Executions}
|
\section{Analysis: Potential Causes of Unsuccessful Executions}
|
||||||
|
|
||||||
|
|
|
@ -231,5 +231,5 @@ Unknown & Unknown & 1720 & 2.933251\% \\
|
||||||
0.591797 & 0.666992 & 500 & 0.852689\% \\
|
0.591797 & 0.666992 & 500 & 0.852689\% \\
|
||||||
0.958984 & 1.000000 & 200 & 0.341076\% \\
|
0.958984 & 1.000000 & 200 & 0.341076\% \\
|
||||||
}{\\\\\\\\\\}
|
}{\\\\\\\\\\}
|
||||||
\caption{Overview of machine configurations in terms of CPU and RAM resources for each cluster in the 2019 traces. Refer to figure~\ref{fig:machineconfig} for a column legend.}\label{fig:machineconfigs-csts}
|
\caption{Overview of machine configurations in terms of CPU and RAM resources for each cluster in the 2019 traces. Refer to figure~\ref{fig:machineconfigs} for a column legend.}\label{fig:machineconfigs-csts}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
|
@ -9,7 +9,7 @@
|
||||||
\begin{figure}[p]
|
\begin{figure}[p]
|
||||||
\spatialresourcewaste[0.5\textwidth]{used-2011}
|
\spatialresourcewaste[0.5\textwidth]{used-2011}
|
||||||
\spatialresourcewaste[0.5\textwidth]{used-all}
|
\spatialresourcewaste[0.5\textwidth]{used-all}
|
||||||
\caption{Percentages of CPU and RAM resources used by tasks w.r.t. task termination type in 2011 and 2019 traces (total of clusters A to D). The x axis is the type of resource, y-axis is the percentage of resource used and color represents task termination. Numeric values are displayed below the graph as a table.}\label{fig:spatialresourcewaste-requested}
|
\caption{Percentages of CPU and RAM resources used by tasks w.r.t.\ task termination type in 2011 and 2019 traces (total of clusters A to D). The x axis is the type of resource, y-axis is the percentage of resource used and color represents task termination. Numeric values are displayed below the graph as a table.}\label{fig:spatialresourcewaste-actual}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
\begin{figure}[p]
|
\begin{figure}[p]
|
||||||
|
@ -17,16 +17,16 @@
|
||||||
\spatialresourcewaste{used-b}
|
\spatialresourcewaste{used-b}
|
||||||
\spatialresourcewaste{used-c}
|
\spatialresourcewaste{used-c}
|
||||||
\spatialresourcewaste{used-d}
|
\spatialresourcewaste{used-d}
|
||||||
\caption{Percentages of CPU and RAM resources used by tasks w.r.t. task termination type for clusters A to D in 2019 traces. Refer to figure~\ref{fig:spatialresourcewaste-requested} for an explanation of the plot.}\label{fig:spatialresourcewaste-actual-csts}
|
\caption{Percentages of CPU and RAM resources used by tasks w.r.t.\ task termination type for clusters A to D in 2019 traces. Refer to figure~\ref{fig:spatialresourcewaste-actual} for an explanation of the plot.}\label{fig:spatialresourcewaste-actual-csts}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
\begin{figure}[p]
|
\begin{figure}[p]
|
||||||
\spatialresourcewaste[0.5\textwidth]{requested-2011}
|
\spatialresourcewaste[0.5\textwidth]{requested-2011}
|
||||||
\spatialresourcewaste[0.5\textwidth]{requested-all}
|
\spatialresourcewaste[0.5\textwidth]{requested-all}
|
||||||
\caption{Percentages of CPU and RAM resources requested by tasks w.r.t. task termination type in 2011 and 2019 traces. The x axis is the type of resource, y-axis is the percentage of resource used and color represents task termination. Numeric values are displayed below the graph as a table.}\label{fig:spatialresourcewaste-actual}
|
\caption{Percentages of CPU and RAM resources requested by tasks w.r.t.\ task termination type in 2011 and 2019 traces. The x axis is the type of resource, y-axis is the percentage of resource used and color represents task termination. Numeric values are displayed below the graph as a table.}\label{fig:spatialresourcewaste-requested}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
\begin{figure}[p]
|
\begin{figure}
|
||||||
\spatialresourcewaste{requested-a}
|
\spatialresourcewaste{requested-a}
|
||||||
\spatialresourcewaste{requested-b}
|
\spatialresourcewaste{requested-b}
|
||||||
\spatialresourcewaste{requested-c}
|
\spatialresourcewaste{requested-c}
|
||||||
|
@ -35,5 +35,5 @@
|
||||||
\spatialresourcewaste{requested-f}
|
\spatialresourcewaste{requested-f}
|
||||||
\spatialresourcewaste{requested-g}
|
\spatialresourcewaste{requested-g}
|
||||||
\spatialresourcewaste{requested-h}
|
\spatialresourcewaste{requested-h}
|
||||||
\caption{Percentages of CPU and RAM resources requested by tasks w.r.t. task termination type in 2019 traces. Refer to figure~\ref{fig:spatialresourcewaste-requested} for an explanation of the plot.}\label{fig:spatialresourcewaste-actual-csts}
|
\caption{Percentages of CPU and RAM resources requested by tasks w.r.t.\ task termination type in 2019 traces. Refer to figure~\ref{fig:spatialresourcewaste-requested} for an explanation of the plot.}\label{fig:spatialresourcewaste-requested-csts}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
|
@ -63,7 +63,7 @@ FINISH & 2.962 (2) & 0.022 & 0.012 & 2.915 & 0.013 \\
|
||||||
tables show an
|
tables show an
|
||||||
overall mean accompanied by the 95-th percentile of all termination
|
overall mean accompanied by the 95-th percentile of all termination
|
||||||
events, followed by the mean of events per event type of each
|
events, followed by the mean of events per event type of each
|
||||||
termination event.}
|
termination event.}\label{fig:tableIII}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
\begin{figure}[p]
|
\begin{figure}[p]
|
||||||
|
|
|
@ -14,7 +14,7 @@ booktitle = {EuroSys'20},
|
||||||
address = {Heraklion, Crete}
|
address = {Heraklion, Crete}
|
||||||
}
|
}
|
||||||
|
|
||||||
@INPROCEEDINGS{vino-paper,
|
@INPROCEEDINGS{dsn-paper,
|
||||||
author={Rosà, Andrea and Chen, Lydia Y. and Binder, Walter},
|
author={Rosà, Andrea and Chen, Lydia Y. and Binder, Walter},
|
||||||
booktitle={2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks},
|
booktitle={2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks},
|
||||||
title={Understanding the Dark Side of Big Data Clusters: An Analysis beyond Failures},
|
title={Understanding the Dark Side of Big Data Clusters: An Analysis beyond Failures},
|
||||||
|
|