report work

Claudio Maggioni 2021-05-31 15:52:43 +02:00
parent 657410ea9a
commit 2d1b357500
2 changed files with 30 additions and 85 deletions

Binary file not shown.


@ -97,6 +97,20 @@ old analysis to understand even better the causes of failures and how to prevent
them. Additionally, this report will provide an overview on the data engineering
techniques used to perform the queries and analyses on the 2019 traces.
\section{State of the art}
\textbf{TBD (introduce only 2015 dsn paper)}
In 2015, Dr.~Andrea Rosà et al.\ published a
research paper titled \textit{Understanding the Dark Side of Big Data Clusters:
An Analysis beyond Failures}\cite{vino-paper} in which they performed several
analyses of unsuccessful executions in Google's 2011 Borg cluster traces
with the aim of identifying their resource waste, their impact on the
performance of the application, and any causes that may lie behind such
failures. The salient conclusion of that research is that many of the
computations performed by Google eventually end in failure, leading
to large amounts of computational power being wasted.
\section{Background information}
\textit{Borg} is Google's own cluster management software able to run
@ -131,33 +145,12 @@ In general events can be of two kinds, there are events that are relative to the
status of the schedule, and there are other events that are relative to the
status of a task itself.
% \subsection{Rosà et al.~2015 DSN paper}
In 2015, Dr.~Andrea Rosà, Lydia Y. Chen and Prof.~Walter Binder published a
research paper titled \textit{Understanding the Dark Side of Big Data Clusters:
An Analysis beyond Failures}\cite{vino-paper} in which they performed several
analyses of unsuccessful executions in Google's 2011 Borg cluster traces
with the aim of identifying their resource waste, their impact on the
performance of the application, and any causes that may lie behind such
failures. The salient conclusion of that research is that many of the
computations performed by Google eventually end in failure, leading
to large amounts of computational power being wasted.
\begin{figure}[h]
\begin{center}
\begin{tabular}{p{3cm}p{12cm}}
\toprule
\textbf{Type code} & \textbf{Description} \\
\midrule
% SUGGESTION: NEVER DELETE ANYTHING UNLESS IT IS COMPLETELY USELESS. IN MANY
% CASES COMMENTING OUT IS FINE, SINCE COMMENTS DO NOT AFFECT
% COMPILATION.
% \texttt{QUEUE} & The job or task was marked not eligible for scheduling
% by Borg's scheduler, and thus Borg will move the job/task in a long
% wait queue\\
% \texttt{SUBMIT} & The job or task was submitted to Borg for execution\\
% \texttt{ENABLE} & The job or task became eligible for scheduling\\
% \texttt{SCHEDULE} & The job or task's execution started\\
\texttt{EVICT} & The job or task was terminated in order to free
computational resources for a higher-priority job\\
\texttt{FAIL} & The job or task terminated its execution unsuccessfully
@ -165,15 +158,6 @@ to large amounts of computational power being wasted.
\texttt{FINISH} & The job or task terminated successfully\\
\texttt{KILL} & The job or task terminated its execution because of a
manual request to stop it\\
% \texttt{LOST} & It is assumed a job or task has been terminated, but
% due to missing data there is insufficient information to identify when
% or how\\
% \texttt{UPDATE\_PENDING} & The metadata (scheduling class, resource
% requirements, \ldots) of the job/task was updated while the job was
% waiting to be scheduled\\
% \texttt{UPDATE\_RUNNING} & The metadata (scheduling class, resource
% requirements, \ldots) of the job/task was updated while the job was in
% execution\\
\bottomrule
\end{tabular}
\end{center}
@ -259,12 +243,9 @@ science technologies like Apache Spark were used to achieve efficient
and parallelized computations. This approach is discussed with further
detail in the following section.
\hypertarget{project-requirements-and-analysis}{%
\section{Project requirements and
analysis}\label{project-requirements-and-analysis}}
\section{Project Requirements and Analysis Methodology}
\textbf{TBD} (describe our objective with this analysis in detail)
The aim of this thesis is to repeat the analysis performed in 2015 on the
The aim of this project is to repeat the analysis performed in 2015 on the
dataset Google has released in 2019 in order to find similarities and
differences with the previous analysis, and ultimately find whether
computational power is indeed wasted in this new workload as well. The 2019 data
@ -272,10 +253,6 @@ comes from 8 Borg cells spanning 8 different datacenters located in different
geographical positions, all focused on computation-oriented workloads. The
data collection time span matches the entire month of May 2019.
\hypertarget{analysis-methodology}{%
\section{Analysis methodology}\label{analysis-methodology}}
Due to the inherent complexity in analyzing traces of this size, novel
bleeding-edge data engineering techniques were adopted to perform the required
computations. We used the framework Apache Spark to perform efficient and
@ -461,6 +438,11 @@ the perspective of single tasks as well as jobs. We then compare the results
from the 2019 traces to the ones that were obtained in 2015 to understand the
workload evolution inside Borg between 2011 and 2019.
We discover that the spatial and temporal impact of unsuccessful
executions is very significant, more than in the 2011 traces. In particular,
resource usage is overall dominated by tasks with a final \texttt{KILL}
termination event.
\subsection{Temporal Impact: Machine Time Waste}
\input{figures/machine_time_waste}
@ -669,11 +651,6 @@ Refer to figure \ref{fig:tableIII}.
traces
\end{itemize}
\hypertarget{probability-of-task-successful-termination-given-its-unsuccesful-events}{%
\subsection{Probability of task successful termination given its
unsuccessful
events}\label{probability-of-task-successful-termination-given-its-unsuccesful-events}}
\subsection{Conditional Probability of Task Success}
\input{figures/figure_5}
@ -692,9 +669,6 @@ Refer to figure \ref{fig:figureV}.
lot for small \# evts differences. This may be due to an uneven
distribution of \# evts in the traces.
\end{itemize}
\hypertarget{correlation-between-task-events-metadata-and-task-termination}{%
\subsection{Correlation between task events' metadata and task
termination}\label{correlation-between-task-events-metadata-and-task-termination}}
\section{Analysis: Potential Causes of Unsuccessful Executions}
@ -729,10 +703,15 @@ Refer to figures \ref{fig:figureVII-a}, \ref{fig:figureVII-b}, and
\subsection{Event Rates vs. Requested Resources, Resource Reservation, and
Resource Utilization}
\subsection{Figure 8 tbd}
\input{figures/figure_8}
Refer to figure~\ref{fig:figureVIII-a}, figure~\ref{fig:figureVIII-a-csts},
figure~\ref{fig:figureVIII-b}, figure~\ref{fig:figureVIII-b-csts},
figure~\ref{fig:figureVIII-c}, figure~\ref{fig:figureVIII-c-csts},
figure~\ref{fig:figureVIII-d}, figure~\ref{fig:figureVIII-d-csts},
figure~\ref{fig:figureVIII-e}, figure~\ref{fig:figureVIII-e-csts},
figure~\ref{fig:figureVIII-f}, and figure~\ref{fig:figureVIII-f-csts}.
\subsection{Job Rates vs. Job Size, Job Execution Time, and Machine Locality}
\input{figures/figure_9}
@ -759,44 +738,10 @@ Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
the highest success event rate
\end{itemize}
\hypertarget{mean-number-of-tasks-and-event-distribution-per-task-type}{%
\subsection{Mean number of tasks and event distribution per task
type}\label{mean-number-of-tasks-and-event-distribution-per-task-type}}
\hypertarget{potential-causes-of-unsuccesful-executions}{%
\subsection{Potential causes of unsuccessful
executions}\label{potential-causes-of-unsuccesful-executions}}
\textbf{TBD}
\hypertarget{implementation-issues-analysis-limitations}{%
\section{Implementation issues -- Analysis
limitations}\label{implementation-issues-analysis-limitations}}
\hypertarget{discussion-on-unknown-fields}{%
\subsection{Discussion on unknown
fields}\label{discussion-on-unknown-fields}}
\textbf{TBD}
\hypertarget{limitation-on-computation-resources-required-for-the-analysis}{%
\subsection{Limitation on computation resources required for the
analysis}\label{limitation-on-computation-resources-required-for-the-analysis}}
\textbf{TBD}
\hypertarget{other-limitations}{%
\subsection{Other limitations \ldots{}}\label{other-limitations}}
\textbf{TBD}
\hypertarget{conclusions-and-future-work-or-possible-developments}{%
\section{Conclusions and future work or possible
developments}\label{conclusions-and-future-work-or-possible-developments}}
\section{Conclusions, Future Work and Possible Developments}
\textbf{TBD}
\newpage
\printbibliography
\end{document}