diff --git a/report/Claudio_Maggioni_report.pdf b/report/Claudio_Maggioni_report.pdf index 5c31684e..bea848b4 100644 Binary files a/report/Claudio_Maggioni_report.pdf and b/report/Claudio_Maggioni_report.pdf differ diff --git a/report/Claudio_Maggioni_report.tex b/report/Claudio_Maggioni_report.tex index 6059715e..83b24f45 100644 --- a/report/Claudio_Maggioni_report.tex +++ b/report/Claudio_Maggioni_report.tex @@ -68,7 +68,7 @@ scheduling, priority management, and failures of a real production workload. This data was 2009 This data was the foundation of the 2015 Ros\'a et al.\ paper \textit{Understanding the Dark Side of Big Data Clusters: An Analysis beyond -Failures}\cite{vino-paper}, which in its many conclusions highlighted the need +Failures}\cite{dsn-paper}, which in its many conclusions highlighted the need for better cluster management highlighting the high amount of failures found in the traces. @@ -103,7 +103,7 @@ techniques used to perform the queries and analyses on the 2019 traces. In 2015, Dr.~Andrea Rosà et al.\ published a research paper titled \textit{Understanding the Dark Side of Big Data Clusters: -An Analysis beyond Failures}\cite{vino-paper} in which they performed several +An Analysis beyond Failures}\cite{dsn-paper} in which they performed several analysis on unsuccessful executions in the Google's 2011 Borg cluster traces with the aim of identifying their resource waste, their impacts on the performance of the application, and any causes that may lie behind such @@ -145,7 +145,7 @@ In general events can be of two kinds, there are events that are relative to the status of the schedule, and there are other events that are relative to the status of a task itself. -\begin{figure}[h] +\begin{figure}[t] \begin{center} \begin{tabular}{p{3cm}p{12cm}} \toprule @@ -167,7 +167,7 @@ status of a task itself. Figure~\ref{fig:eventTypes} shows the expected transitions between event types. -\begin{figure}[h] +\begin{figure}[t] \centering \resizebox{\textwidth}{!}{% \includegraphics{./figures/event_types.png}} @@ -253,8 +253,8 @@ comes from 8 Borg cells spanning 8 different datacenters located in different geographical positions, all focused on computational oriented workloads. The data collection time span matches the entire month of May 2019. -Due to the inherent complexity in analyzing traces of this size, novel -bleeding-edge data engineering tecniques were adopted to performed the required +Due to the inherent complexity in analyzing traces of this size, non-trivial +data engineering tecniques were adopted to performed the required computations. We used the framework Apache Spark to perform efficient and parallel Map-Reduce computations. In this section, we discuss the technical details behind our approach. @@ -324,9 +324,9 @@ possibility and insert back the omitted record attributes. \subsubsection{The queries} Most queries use only two or three fields in each trace records, while the -original table records often are made of a couple of dozen fields. In order to save -memory during the query, a projection is often applied to the data by the means -of a \texttt{.map()} operation over the entire trace set, performed using +original table records often are made of a couple of dozen fields. In order to +save memory during the query, a projection is often applied to the data by the +means of a \texttt{.map()} operation over the entire trace set, performed using Spark's RDD API. Another operation that is often necessary to perform prior to the Map-Reduce @@ -375,9 +375,9 @@ successful termination or not, and finally combine this data to compute slowdown, mean slowdown and ultimately the final table found in figure~\ref{fig:taskslowdown}. -\begin{figure}[h] -\centering -\includegraphics[width=.75\textwidth]{figures/task_slowdown_query.png} +\begin{figure}[t] +\hspace{-0.075\textwidth} +\includegraphics[width=1.15\textwidth]{figures/task_slowdown_query.png} \caption{Diagram of the script used for the ``task slowdown'' query.}\label{fig:taskslowdownquery} \end{figure} @@ -429,7 +429,7 @@ in the clear and coincise tables found in figure~\ref{fig:taskslowdown}. \section{Analysis: Performance Input of Unsuccessful Executions} Our first investigation focuses on replicating the methodologies used in the -2015 DSN Ros\'a et al.\ paper\cite{vino-paper} regarding usage of machine time +2015 DSN Ros\'a et al.\ paper\cite{dsn-paper} regarding usage of machine time and resources. In this section we perform several analyses focusing on how machine time and @@ -516,7 +516,7 @@ Refer to figure~\ref{fig:taskslowdown} for a comparison between the 2011 and means are computed on a cluster-by-cluster basis for 2019 data in figure~\ref{fig:taskslowdown-csts}. -In 2015 Ros\'a et al.\cite{vino-paper} measured mean task slowdown per each task +In 2015 Ros\'a et al.\cite{dsn-paper} measured mean task slowdown per each task priority value, which at the time were $[0,11]$ numeric values. However, in 2019 traces, task priorities are given as a $[0,500]$ numeric value. Therefore, to allow for an easier comparison, mean task slowdown values are @@ -614,12 +614,29 @@ With more than 98\% of both CPU and memory resources used by non-successful tasks, it is clear the spatial resource waste is high in the 2019 traces. -\section{Analysis: Pattern and Models for Task and Job Events} +\section{Analysis: Patterns of Task and Job Events} + +This section aims to use some of the tecniques used in section IV of +the Ros\'a et al.\ paper\cite{dsn-paper} to find patterns and interpendencies +between task and job events by gathering event statistics at those events. \subsection{Unsuccessful Task Event Patterns} \input{figures/table_iii} % has table III and table IV in it -Refer to figure \ref{fig:tableIII}. +In this analysis we compute the distribution of termination events by type at +the task-level events and the conditional probability of a task succesfully +terminating given a number of \texttt{EVICT}, \texttt{FAIL} and \texttt{FINISH} +termination events during the task execution. + +A comparison of the termination event distribution between the 2011 and 2019 +traces is shown in figure~\ref{fig:tableIII}. Additionally, a cluster-by-cluster +breakdown of the same data for the 2019 traces is shown in +figure~\ref{fig:tableIII-csts}. + +Each table from these figure shows the mean and the 95-th percentile of the +number of termination events per task, broke down by task termination. In +addition, the table shows the mean number of \texttt{EVICT}, \texttt{FAIL}, +\texttt{FINISH}, and \texttt{KILL} for each task event termination. \textbf{Observations}: @@ -636,22 +653,7 @@ Refer to figure \ref{fig:tableIII}. 2019 traces. \end{itemize} -\subsection{Unsuccessful Job Event Patterns} - -\textbf{Observations}: - -\begin{itemize} -\item - Again the mean number of tasks is significantly higher than the 2011 - traces, indicating a higher complexity of workloads -\item - Cluster A has no evicted jobs -\item - The number of events is however lower than the event means in the 2011 - traces -\end{itemize} - -\subsection{Conditional Probability of Task Success} +\subsubsection{Conditional Probability of Task Success} \input{figures/figure_5} Refer to figure \ref{fig:figureV}. @@ -669,6 +671,21 @@ Refer to figure \ref{fig:figureV}. lot for small \# evts differences. This may be due to an uneven distribution of \# evts in the traces. \end{itemize} +\subsection{Unsuccessful Job Event Patterns} + +\textbf{Observations}: + +\begin{itemize} +\item + Again the mean number of tasks is significantly higher than the 2011 + traces, indicating a higher complexity of workloads +\item + Cluster A has no evicted jobs +\item + The number of events is however lower than the event means in the 2011 + traces +\end{itemize} + \section{Analysis: Potential Causes of Unsuccessful Executions} diff --git a/report/figures/machine_configs.tex b/report/figures/machine_configs.tex index f9d28352..edb807fc 100644 --- a/report/figures/machine_configs.tex +++ b/report/figures/machine_configs.tex @@ -231,5 +231,5 @@ Unknown & Unknown & 1720 & 2.933251\% \\ 0.591797 & 0.666992 & 500 & 0.852689\% \\ 0.958984 & 1.000000 & 200 & 0.341076\% \\ }{\\\\\\\\\\} -\caption{Overview of machine configurations in terms of CPU and RAM resources for each cluster in the 2019 traces. Refer to figure~\ref{fig:machineconfig} for a column legend.}\label{fig:machineconfigs-csts} +\caption{Overview of machine configurations in terms of CPU and RAM resources for each cluster in the 2019 traces. Refer to figure~\ref{fig:machineconfigs} for a column legend.}\label{fig:machineconfigs-csts} \end{figure} diff --git a/report/figures/spatial_resource_waste.tex b/report/figures/spatial_resource_waste.tex index 4d514ef1..0a28e56c 100644 --- a/report/figures/spatial_resource_waste.tex +++ b/report/figures/spatial_resource_waste.tex @@ -9,7 +9,7 @@ \begin{figure}[p] \spatialresourcewaste[0.5\textwidth]{used-2011} \spatialresourcewaste[0.5\textwidth]{used-all} - \caption{Percentages of CPU and RAM resources used by tasks w.r.t. task termination type in 2011 and 2019 traces (total of clusters A to D). The x axis is the type of resource, y-axis is the percentage of resource used and color represents task termination. Numeric values are displayed below the graph as a table.}\label{fig:spatialresourcewaste-requested} + \caption{Percentages of CPU and RAM resources used by tasks w.r.t.\ task termination type in 2011 and 2019 traces (total of clusters A to D). The x axis is the type of resource, y-axis is the percentage of resource used and color represents task termination. Numeric values are displayed below the graph as a table.}\label{fig:spatialresourcewaste-actual} \end{figure} \begin{figure}[p] @@ -17,16 +17,16 @@ \spatialresourcewaste{used-b} \spatialresourcewaste{used-c} \spatialresourcewaste{used-d} - \caption{Percentages of CPU and RAM resources used by tasks w.r.t. task termination type for clusters A to D in 2019 traces. Refer to figure~\ref{fig:spatialresourcewaste-requested} for plot explaination.}\label{fig:spatialresourcewaste-actual-csts} + \caption{Percentages of CPU and RAM resources used by tasks w.r.t.\ task termination type for clusters A to D in 2019 traces. Refer to figure~\ref{fig:spatialresourcewaste-actual} for plot explaination.}\label{fig:spatialresourcewaste-actual-csts} \end{figure} \begin{figure}[p] \spatialresourcewaste[0.5\textwidth]{requested-2011} \spatialresourcewaste[0.5\textwidth]{requested-all} - \caption{Percentages of CPU and RAM resources requested by tasks w.r.t. task termination type in 2011 and 2019 traces. The x axis is the type of resource, y-axis is the percentage of resource used and color represents task termination. Numeric values are displayed below the graph as a table.}\label{fig:spatialresourcewaste-actual} + \caption{Percentages of CPU and RAM resources requested by tasks w.r.t.\ task termination type in 2011 and 2019 traces. The x axis is the type of resource, y-axis is the percentage of resource used and color represents task termination. Numeric values are displayed below the graph as a table.}\label{fig:spatialresourcewaste-requested} \end{figure} -\begin{figure}[p] +\begin{figure} \spatialresourcewaste{requested-a} \spatialresourcewaste{requested-b} \spatialresourcewaste{requested-c} @@ -35,5 +35,5 @@ \spatialresourcewaste{requested-f} \spatialresourcewaste{requested-g} \spatialresourcewaste{requested-h} - \caption{Percentages of CPU and RAM resources requested by tasks w.r.t. task termination type for in 2019 traces. Refer to figure~\ref{fig:spatialresourcewaste-requested} for plot explaination.}\label{fig:spatialresourcewaste-actual-csts} + \caption{Percentages of CPU and RAM resources requested by tasks w.r.t.\ task termination type for in 2019 traces. Refer to figure~\ref{fig:spatialresourcewaste-requested} for plot explaination.}\label{fig:spatialresourcewaste-requested-csts} \end{figure} diff --git a/report/figures/table_iii.tex b/report/figures/table_iii.tex index 8c68bb6d..d3403c0c 100644 --- a/report/figures/table_iii.tex +++ b/report/figures/table_iii.tex @@ -63,7 +63,7 @@ FINISH & 2.962 (2) & 0.022 & 0.012 & 2.915 & 0.013 \\ tables show an overall mean accompanied by the 95-th percentile of all termination events, followed by the mean of events per event type of each - termination event.} + termination event.}\label{fig:tableIII} \end{figure} \begin{figure}[p] diff --git a/report/references.bib b/report/references.bib index 87e5b67b..21528d43 100644 --- a/report/references.bib +++ b/report/references.bib @@ -14,7 +14,7 @@ booktitle = {EuroSys'20}, address = {Heraklion, Crete} } -@INPROCEEDINGS{vino-paper, +@INPROCEEDINGS{dsn-paper, author={Rosà, Andrea and Chen, Lydia Y. and Binder, Walter}, booktitle={2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks}, title={Understanding the Dark Side of Big Data Clusters: An Analysis beyond Failures},