report work
parent
657410ea9a
commit
2d1b357500
2 changed files with 30 additions and 85 deletions
Binary file not shown.
|
@ -97,6 +97,20 @@ old analysis to understand even better the causes of failures and how to prevent
|
||||||
them. Additionally, this report will provide an overview of the data engineering
|
them. Additionally, this report will provide an overview of the data engineering
|
||||||
techniques used to perform the queries and analyses on the 2019 traces.
|
techniques used to perform the queries and analyses on the 2019 traces.
|
||||||
|
|
||||||
|
\section{State of the art}
|
||||||
|
|
||||||
|
\textbf{TBD (introduce only the 2015 DSN paper)}
|
||||||
|
|
||||||
|
In 2015, Dr.~Andrea Rosà et al.\ published a
|
||||||
|
research paper titled \textit{Understanding the Dark Side of Big Data Clusters:
|
||||||
|
An Analysis beyond Failures}\cite{vino-paper} in which they performed several
|
||||||
|
analyses of unsuccessful executions in Google's 2011 Borg cluster traces
|
||||||
|
with the aim of identifying their resource waste, their impacts on the
|
||||||
|
performance of the application, and any causes that may lie behind such
|
||||||
|
failures. The salient conclusion of that research is that a large share of the
|
||||||
|
computations performed by Google eventually end in failure, thus leading
|
||||||
|
to large amounts of computational power being wasted.
|
||||||
|
|
||||||
\section{Background information}
|
\section{Background information}
|
||||||
|
|
||||||
\textit{Borg} is Google's own cluster management software able to run
|
\textit{Borg} is Google's own cluster management software able to run
|
||||||
|
@ -131,33 +145,12 @@ In general events can be of two kinds, there are events that are relative to the
|
||||||
status of the schedule, and there are other events that are relative to the
|
status of the schedule, and there are other events that are relative to the
|
||||||
status of a task itself.
|
status of a task itself.
|
||||||
|
|
||||||
% \subsection{Rosà et al.~2015 DSN paper}
|
|
||||||
|
|
||||||
In 2015, Dr.~Andrea Rosà, Lydia Y. Chen and Prof.~Walter Binder published a
|
|
||||||
research paper titled \textit{Understanding the Dark Side of Big Data Clusters:
|
|
||||||
An Analysis beyond Failures}\cite{vino-paper} in which they performed several
|
|
||||||
analyses of unsuccessful executions in Google's 2011 Borg cluster traces
|
|
||||||
with the aim of identifying their resource waste, their impacts on the
|
|
||||||
performance of the application, and any causes that may lie behind such
|
|
||||||
failures. The salient conclusion of that research is that actually lots of
|
|
||||||
computations performed by Google would eventually end in failure, then leading
|
|
||||||
to large amounts of computational power being wasted.
|
|
||||||
|
|
||||||
\begin{figure}[h]
|
\begin{figure}[h]
|
||||||
\begin{center}
|
\begin{center}
|
||||||
\begin{tabular}{p{3cm}p{12cm}}
|
\begin{tabular}{p{3cm}p{12cm}}
|
||||||
\toprule
|
\toprule
|
||||||
\textbf{Type code} & \textbf{Description} \\
|
\textbf{Type code} & \textbf{Description} \\
|
||||||
\midrule
|
\midrule
|
||||||
% SUGGESTION: NEVER DELETE ANYTHING UNLESS IT IS COMPLETELY USELESS;
|
|
||||||
% IN MANY CASES COMMENTING OUT IS FINE, SINCE COMMENTS DO NOT AFFECT
|
|
||||||
% COMPILATION.
|
|
||||||
% \texttt{QUEUE} & The job or task was marked not eligible for scheduling
|
|
||||||
% by Borg's scheduler, and thus Borg will move the job/task in a long
|
|
||||||
% wait queue\\
|
|
||||||
% \texttt{SUBMIT} & The job or task was submitted to Borg for execution\\
|
|
||||||
% \texttt{ENABLE} & The job or task became eligible for scheduling\\
|
|
||||||
% \texttt{SCHEDULE} & The job or task's execution started\\
|
|
||||||
\texttt{EVICT} & The job or task was terminated in order to free
|
\texttt{EVICT} & The job or task was terminated in order to free
|
||||||
computational resources for a higher-priority job\\
|
computational resources for a higher-priority job\\
|
||||||
\texttt{FAIL} & The job or task terminated its execution unsuccessfully
|
\texttt{FAIL} & The job or task terminated its execution unsuccessfully
|
||||||
|
@ -165,15 +158,6 @@ to large amounts of computational power being wasted.
|
||||||
\texttt{FINISH} & The job or task terminated successfully\\
|
\texttt{FINISH} & The job or task terminated successfully\\
|
||||||
\texttt{KILL} & The job or task terminated its execution because of a
|
\texttt{KILL} & The job or task terminated its execution because of a
|
||||||
manual request to stop it\\
|
manual request to stop it\\
|
||||||
% \texttt{LOST} & It is assumed a job or task has been terminated, but
|
|
||||||
% due to missing data there is insufficient information to identify when
|
|
||||||
% or how\\
|
|
||||||
% \texttt{UPDATE\_PENDING} & The metadata (scheduling class, resource
|
|
||||||
% requirements, \ldots) of the job/task was updated while the job was
|
|
||||||
% waiting to be scheduled\\
|
|
||||||
% \texttt{UPDATE\_RUNNING} & The metadata (scheduling class, resource
|
|
||||||
% requirements, \ldots) of the job/task was updated while the job was in
|
|
||||||
% execution\\
|
|
||||||
\bottomrule
|
\bottomrule
|
||||||
\end{tabular}
|
\end{tabular}
|
||||||
\end{center}
|
\end{center}
|
||||||
|
@ -259,12 +243,9 @@ science technologies like Apache Spark were used to achieve efficient
|
||||||
and parallelized computations. This approach is discussed with further
|
and parallelized computations. This approach is discussed with further
|
||||||
detail in the following section.
|
detail in the following section.
|
||||||
|
|
||||||
\hypertarget{project-requirements-and-analysis}{%
|
\section{Project Requirements and Analysis Methodology}
|
||||||
\section{Project requirements and
|
|
||||||
analysis}\label{project-requirements-and-analysis}}
|
|
||||||
|
|
||||||
\textbf{TBD} (describe our objective with this analysis in detail)
|
The aim of this project is to repeat the analysis performed in 2015 on the
|
||||||
The aim of this thesis is to repeat the analysis performed in 2015 on the
|
|
||||||
dataset Google has released in 2019 in order to find similarities and
|
dataset Google has released in 2019 in order to find similarities and
|
||||||
differences with the previous analysis, and ultimately find whether
|
differences with the previous analysis, and ultimately find whether
|
||||||
computational power is indeed wasted in this new workload as well. The 2019 data
|
computational power is indeed wasted in this new workload as well. The 2019 data
|
||||||
|
@ -272,10 +253,6 @@ comes from 8 Borg cells spanning 8 different datacenters located in different
|
||||||
geographical positions, all focused on computation-oriented workloads. The
|
geographical positions, all focused on computation-oriented workloads. The
|
||||||
data collection time span matches the entire month of May 2019.
|
data collection time span matches the entire month of May 2019.
|
||||||
|
|
||||||
|
|
||||||
\hypertarget{analysis-methodology}{%
|
|
||||||
\section{Analysis methodology}\label{analysis-methodology}}
|
|
||||||
|
|
||||||
Due to the inherent complexity in analyzing traces of this size, novel
|
Due to the inherent complexity in analyzing traces of this size, novel
|
||||||
bleeding-edge data engineering techniques were adopted to perform the required
|
bleeding-edge data engineering techniques were adopted to perform the required
|
||||||
computations. We used the Apache Spark framework to perform efficient and
|
computations. We used the Apache Spark framework to perform efficient and
|
||||||
|
@ -461,6 +438,11 @@ the perspective of single tasks as well as jobs. We then compare the results
|
||||||
from the 2019 traces to the ones that were obtained in 2015 to understand the
|
from the 2019 traces to the ones that were obtained in 2015 to understand the
|
||||||
workload evolution inside Borg between 2011 and 2019.
|
workload evolution inside Borg between 2011 and 2019.
|
||||||
|
|
||||||
|
We discover that the spatial and temporal impact of unsuccessful
|
||||||
|
executions is very significant, more than in the 2011 traces. In particular,
|
||||||
|
resource usage is overall dominated by tasks with a final \texttt{KILL}
|
||||||
|
termination event.
|
||||||
|
|
||||||
\subsection{Temporal Impact: Machine Time Waste}
|
\subsection{Temporal Impact: Machine Time Waste}
|
||||||
\input{figures/machine_time_waste}
|
\input{figures/machine_time_waste}
|
||||||
|
|
||||||
|
@ -669,11 +651,6 @@ Refer to figure \ref{fig:tableIII}.
|
||||||
traces
|
traces
|
||||||
\end{itemize}
|
\end{itemize}
|
||||||
|
|
||||||
\hypertarget{probability-of-task-successful-termination-given-its-unsuccesful-events}{%
|
|
||||||
\subsection{Probability of task successful termination given its
|
|
||||||
unsuccessful
|
|
||||||
events}\label{probability-of-task-successful-termination-given-its-unsuccesful-events}}
|
|
||||||
|
|
||||||
\subsection{Conditional Probability of Task Success}
|
\subsection{Conditional Probability of Task Success}
|
||||||
\input{figures/figure_5}
|
\input{figures/figure_5}
|
||||||
|
|
||||||
|
@ -692,9 +669,6 @@ Refer to figure \ref{fig:figureV}.
|
||||||
lot for small \# evts differences. This may be due to an uneven
|
lot for small \# evts differences. This may be due to an uneven
|
||||||
distribution of \# evts in the traces.
|
distribution of \# evts in the traces.
|
||||||
\end{itemize}
|
\end{itemize}
|
||||||
\hypertarget{correlation-between-task-events-metadata-and-task-termination}{%
|
|
||||||
\subsection{Correlation between task events' metadata and task
|
|
||||||
termination}\label{correlation-between-task-events-metadata-and-task-termination}}
|
|
||||||
|
|
||||||
\section{Analysis: Potential Causes of Unsuccessful Executions}
|
\section{Analysis: Potential Causes of Unsuccessful Executions}
|
||||||
|
|
||||||
|
@ -729,10 +703,15 @@ Refer to figures \ref{fig:figureVII-a}, \ref{fig:figureVII-b}, and
|
||||||
|
|
||||||
\subsection{Event Rates vs. Requested Resources, Resource Reservation, and
|
\subsection{Event Rates vs. Requested Resources, Resource Reservation, and
|
||||||
Resource Utilization}
|
Resource Utilization}
|
||||||
|
|
||||||
\subsection{Figure 8 tbd}
|
|
||||||
\input{figures/figure_8}
|
\input{figures/figure_8}
|
||||||
|
|
||||||
|
Refer to figure~\ref{fig:figureVIII-a}, figure~\ref{fig:figureVIII-a-csts},
|
||||||
|
figure~\ref{fig:figureVIII-b}, figure~\ref{fig:figureVIII-b-csts},
|
||||||
|
figure~\ref{fig:figureVIII-c}, figure~\ref{fig:figureVIII-c-csts},
|
||||||
|
figure~\ref{fig:figureVIII-d}, figure~\ref{fig:figureVIII-d-csts},
|
||||||
|
figure~\ref{fig:figureVIII-e}, figure~\ref{fig:figureVIII-e-csts},
|
||||||
|
figure~\ref{fig:figureVIII-f}, and figure~\ref{fig:figureVIII-f-csts}.
|
||||||
|
|
||||||
\subsection{Job Rates vs. Job Size, Job Execution Time, and Machine Locality}
|
\subsection{Job Rates vs. Job Size, Job Execution Time, and Machine Locality}
|
||||||
\input{figures/figure_9}
|
\input{figures/figure_9}
|
||||||
|
|
||||||
|
@ -759,44 +738,10 @@ Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
|
||||||
the highest success event rate
|
the highest success event rate
|
||||||
\end{itemize}
|
\end{itemize}
|
||||||
|
|
||||||
\hypertarget{mean-number-of-tasks-and-event-distribution-per-task-type}{%
|
\section{Conclusions, Future Work and Possible Developments}
|
||||||
\subsection{Mean number of tasks and event distribution per task
|
|
||||||
type}\label{mean-number-of-tasks-and-event-distribution-per-task-type}}
|
|
||||||
|
|
||||||
|
|
||||||
\hypertarget{potential-causes-of-unsuccesful-executions}{%
|
|
||||||
\subsection{Potential causes of unsuccessful
|
|
||||||
executions}\label{potential-causes-of-unsuccesful-executions}}
|
|
||||||
|
|
||||||
\textbf{TBD}
|
|
||||||
|
|
||||||
\hypertarget{implementation-issues-analysis-limitations}{%
|
|
||||||
\section{Implementation issues -- Analysis
|
|
||||||
limitations}\label{implementation-issues-analysis-limitations}}
|
|
||||||
|
|
||||||
\hypertarget{discussion-on-unknown-fields}{%
|
|
||||||
\subsection{Discussion on unknown
|
|
||||||
fields}\label{discussion-on-unknown-fields}}
|
|
||||||
|
|
||||||
\textbf{TBD}
|
|
||||||
|
|
||||||
\hypertarget{limitation-on-computation-resources-required-for-the-analysis}{%
|
|
||||||
\subsection{Limitation on computation resources required for the
|
|
||||||
analysis}\label{limitation-on-computation-resources-required-for-the-analysis}}
|
|
||||||
|
|
||||||
\textbf{TBD}
|
|
||||||
|
|
||||||
\hypertarget{other-limitations}{%
|
|
||||||
\subsection{Other limitations \ldots{}}\label{other-limitations}}
|
|
||||||
|
|
||||||
\textbf{TBD}
|
|
||||||
|
|
||||||
\hypertarget{conclusions-and-future-work-or-possible-developments}{%
|
|
||||||
\section{Conclusions and future work or possible
|
|
||||||
developments}\label{conclusions-and-future-work-or-possible-developments}}
|
|
||||||
|
|
||||||
\textbf{TBD}
|
\textbf{TBD}
|
||||||
|
|
||||||
|
\newpage
|
||||||
\printbibliography
|
\printbibliography
|
||||||
|
|
||||||
\end{document}
|
\end{document}
|
||||||
|
|