report
parent 744a4025a1
commit f8045b560c
2 changed files with 38 additions and 2 deletions
Binary file not shown.
@@ -949,8 +949,44 @@ Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
the highest success event rate
\end{itemize}

\section{Conclusions, Future Work and Possible Developments}\label{sec8}
\textbf{TBD}
\section{Conclusions, Limitations and Future Work}\label{sec8}
In this report we analyze the Google Borg 2019 traces and compare them with
their 2011 counterpart from the perspective of failures, their impact on
resources, and their causes. We discover that the impact of non-successful
executions (especially of \texttt{KILL}ed tasks and jobs) in the new traces is
still very relevant in terms of machine time and resources, even more so than in
2011. We also discover that unsuccessful job and task event patterns still play
a major role in the overall execution success of Borg jobs and tasks. Finally,
we discover that unsuccessful job and task event rates dominate the overall
landscape of Borg's own logs, even when grouping tasks and jobs by parameters
such as priority, resource request, reservation and utilization, and machine
locality.

We can then conclude that the performed analyses show several clear trends
regarding the correlation of execution success with several parameters and
metadata. These trends can potentially be exploited to build better scheduling
algorithms and new predictive models
that could estimate whether an execution has a high probability of failure based
on its own properties and metadata. The creation of such models could allow
computational resources to be saved and used either to increase the throughput
of higher priority workloads or to allow for a larger workload altogether.

The biggest limitation and threat to validity for this project is the
relative lack of information provided by Google on the true meaning of
unsuccessful terminations. Indeed, given the ``black box'' nature of the traces
and the scarcity of information in the trace
documentation\cite{google-drive-marso}, it is not clear whether unsuccessful
executions yield any useful computation result. Our assumption in this
report is that unsuccessful jobs and tasks do not produce any result and are
therefore just burdens on machine time and resources; should this assumption
be incorrect, the interpretation of the analyses might change significantly.

Given the significant computational time invested in obtaining the results shown
in this report, and due to time and resource limitations, some of the analyses
were not completed. Our future work will focus on finishing these analyses,
namely by computing results for the missing clusters and obtaining a true
overall picture of the 2019 Google Borg cluster traces w.r.t.\ failures and
their causes.

\newpage
\printbibliography%