report
parent 744a4025a1
commit f8045b560c
2 changed files with 38 additions and 2 deletions
Binary file not shown.
@@ -949,8 +949,44 @@ Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
 the highest success event rate
 \end{itemize}
 
-\section{Conclusions, Future Work and Possible Developments}\label{sec8}
-\textbf{TBD}
+\section{Conclusions, Limitations and Future Work}\label{sec8}
+In this report we analyze the Google Borg 2019 traces and compare them with
+their 2011 counterpart from the perspective of failures, their impact on
+resources and their causes. We discover that the impact of non-successful
+executions (especially of \texttt{KILL}ed tasks and jobs) in the new traces is
+still very relevant in terms of machine time and resources, even more so than in
+2011. We also discover that unsuccessful job and task event patterns still play
+a major role in the overall execution success of Borg jobs and tasks. Finally,
+we discover that unsuccessful job and task event rates dominate the overall
+landscape of Borg's own logs, even when grouping tasks and jobs by parameters
+such as priority, resource request, reservation and utilization, and machine
+locality.
+
+We can therefore conclude that the performed analyses show many clear trends
+regarding the correlation of execution success with several parameters and
+metadata. These trends can potentially be exploited to build better scheduling
+algorithms and new predictive models that could estimate whether an execution
+has a high probability of failure based on its own properties and metadata. The
+creation of such models could allow computational resources to be saved and
+used either to increase the throughput of higher priority workloads or to allow
+for a larger workload altogether.
+
+The biggest limitation and threat to validity for this project is the
+relative lack of information provided by Google on the true meaning of
+unsuccessful terminations. Indeed, given the ``black box'' nature of the traces
+and the scarcity of information in the trace
+documentation\cite{google-drive-marso}, it is not clear whether unsuccessful
+executions yield any useful computational result. Our assumption in this
+report is that unsuccessful jobs and tasks do not produce any result and are
+therefore just burdens on machine time and resources, but should this assumption
+be incorrect then the interpretation of the analyses might change significantly.
+
+Because of the significant computational time required to obtain the results
+shown in this report, and given our time and resource limitations, some of the
+analyses were not completed. Our future work will focus on finishing these
+analyses, namely by computing results for the missing clusters and obtaining a
+true overall picture of the 2019 Google Borg cluster traces w.r.t.\ failures
+and their causes.
 
 \newpage
 \printbibliography%