Claudio Maggioni 2021-06-18 15:30:50 +02:00
parent 0fa930ae56
commit b9c1159307
2 changed files with 38 additions and 2 deletions

Binary file not shown.

@@ -949,8 +949,44 @@ Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
the highest success event rate
\end{itemize}
\section{Conclusions, Future Work and Possible Developments}\label{sec8}
\textbf{TBD}
\section{Conclusions, Limitations and Future Work}\label{sec8}
In this report we analyze the Google Borg 2019 traces and compare them with
their 2011 counterpart from the perspective of failures, their impact on
resources and their causes. We discover that the impact of non-successful
executions (especially of \texttt{KILL}ed tasks and jobs) in the new traces is
still very relevant in terms of machine time and resources, even more so than in
2011. We also discover that unsuccessful job and task event patterns still play
a major role in the overall execution success of Borg jobs and tasks. We finally
discover that unsuccessful job and task event rates dominate the overall
landscape of Borg's own logs, even when grouping tasks and jobs by parameters
such as priority, resource request, reservation and utilization, and machine
locality.

We can therefore conclude that the performed analyses show several clear trends
in the correlation of execution success with several parameters and
metadata. These trends could be exploited to build better scheduling
algorithms and new predictive models that estimate whether an execution has a
high probability of failure based on its own properties and metadata. Such
models could allow computational resources to be saved and used either to
increase the throughput of higher-priority workloads or to accommodate a larger
workload altogether.

The biggest limitation and threat to validity for this project is the
relative lack of information provided by Google on the true meaning of
unsuccessful terminations. Indeed, given the ``black box'' nature of the traces
and the scarcity of information in the trace
documentation\cite{google-drive-marso}, it is not clear whether unsuccessful
executions yield any useful computation result. Our assumption in this
report is that unsuccessful jobs and tasks do not produce any result and are
therefore just burdens on machine time and resources. Should this assumption
be incorrect, the interpretation of the analyses might change significantly.

Given the significant computational time required to obtain the results shown
in this report, and due to time and resource limitations, some of the analyses
were not completed. Our future work will focus on finishing these analyses,
namely by computing results for the missing clusters and obtaining a true
overall picture of the 2019 Google Borg cluster traces w.r.t.\ failures and
their causes.
\newpage
\printbibliography%