report
parent a05bd53fe6
commit 744a4025a1
3 changed files with 124 additions and 17 deletions
Binary file not shown.
|
@ -48,7 +48,7 @@ performance, and their root causes. We show the strong negative impact on
|
||||||
CPU and RAM usage and on task slowdown. We analyze patterns of
|
CPU and RAM usage and on task slowdown. We analyze patterns of
|
||||||
unsuccessful jobs and tasks, particularly focusing on their interdependency.
|
unsuccessful jobs and tasks, particularly focusing on their interdependency.
|
||||||
Moreover, we uncover their root causes by inspecting key workload and
|
Moreover, we uncover their root causes by inspecting key workload and
|
||||||
system attributes such asmachine locality and concurrency level.}
|
system attributes such as machine locality and concurrency level.}
|
||||||
|
|
||||||
\begin{document}
|
\begin{document}
|
||||||
\maketitle
|
\maketitle
|
||||||
|
@ -78,7 +78,7 @@ workload due to improvements in computational technology, but also providing
|
||||||
data from 8 different \textit{Borg} cells from datacenters located all over the
|
data from 8 different \textit{Borg} cells from datacenters located all over the
|
||||||
world.
|
world.
|
||||||
|
|
||||||
\subsection{Motivation}
|
%\subsection{Motivation}
|
||||||
Even by glancing at some of the spatial and temporal analyses performed on the
|
Even by glancing at some of the spatial and temporal analyses performed on the
|
||||||
Google Borg traces in this report, it is evident that unsuccessful executions
|
Google Borg traces in this report, it is evident that unsuccessful executions
|
||||||
play a major role in the waste of resources in clusterized computations. For
|
play a major role in the waste of resources in clusterized computations. For
|
||||||
|
@ -104,7 +104,7 @@ can be of interest for understanding the behaviour of failures in
|
||||||
similar clusterized systems, and could potentially be used to build predictive
|
similar clusterized systems, and could potentially be used to build predictive
|
||||||
models to mitigate or erase the resource impact of unsuccessful executions.
|
models to mitigate or erase the resource impact of unsuccessful executions.
|
||||||
|
|
||||||
\subsection{Challenges}
|
%\subsection{Challenges}
|
||||||
Given that the new 2019 Google Borg cluster traces are about 100 times larger
|
Given that the new 2019 Google Borg cluster traces are about 100 times larger
|
||||||
than the 2011 ones, and given that the entire compressed traces package has a
|
than the 2011 ones, and given that the entire compressed traces package has a
|
||||||
non-trivial size (weighing approximately 8 TiB\cite{google-drive-marso}), the
|
non-trivial size (weighing approximately 8 TiB\cite{google-drive-marso}), the
|
||||||
|
@ -116,7 +116,7 @@ span of 3 months. Additionally, the analysis scripts have been written by
|
||||||
exploiting the power of parallel computing, following most of the time a
|
exploiting the power of parallel computing, following most of the time a
|
||||||
MapReduce-like structure.
|
MapReduce-like structure.
|
||||||
|
|
||||||
\subsection{Contribution}
|
%\subsection{Contribution}
|
||||||
This project aims to repeat the analysis performed in 2015 DSN Ros\'a et al.\
|
This project aims to repeat the analysis performed in 2015 DSN Ros\'a et al.\
|
||||||
paper\cite{dsn-paper} to highlight similarities and differences in Google Borg
|
paper\cite{dsn-paper} to highlight similarities and differences in Google Borg
|
||||||
workload and the behaviour and patterns of executions within it. Thanks to this
|
workload and the behaviour and patterns of executions within it. Thanks to this
|
||||||
|
@ -125,7 +125,7 @@ prevent them. Additionally, given the technical challenge this analysis posed,
|
||||||
the report aims to provide an overview of some basic data engineering techniques
|
the report aims to provide an overview of some basic data engineering techniques
|
||||||
for big data applications.
|
for big data applications.
|
||||||
|
|
||||||
\subsection{Outline}
|
%\subsection{Outline}
|
||||||
The report is structured as follows. Section~\ref{sec2} contains information
|
The report is structured as follows. Section~\ref{sec2} contains information
|
||||||
about the current state of the art for Google Borg cluster traces.
|
about the current state of the art for Google Borg cluster traces.
|
||||||
Section~\ref{sec3} provides an overview including technical background
|
Section~\ref{sec3} provides an overview including technical background
|
||||||
|
@ -137,8 +137,7 @@ performance input of unsuccessful executions, the patterns of task and job
|
||||||
events, and the potential causes of unsuccessful executions. Finally,
|
events, and the potential causes of unsuccessful executions. Finally,
|
||||||
Section~\ref{sec8} contains the conclusions.
|
Section~\ref{sec8} contains the conclusions.
|
||||||
|
|
||||||
|
\section{State of the Art}\label{sec2}
|
||||||
\section{State of the art}\label{sec2}
|
|
||||||
|
|
||||||
\begin{figure}[t]
|
\begin{figure}[t]
|
||||||
\begin{center}
|
\begin{center}
|
||||||
|
@ -170,9 +169,10 @@ failures. The salient conclusion of that research is that actually lots of
|
||||||
computations performed by Google would eventually end in failure, then leading
|
computations performed by Google would eventually end in failure, then leading
|
||||||
to large amounts of computational power being wasted.
|
to large amounts of computational power being wasted.
|
||||||
|
|
||||||
However, with the release of the new 2019 traces, the results and conclusions
|
However, with the release of the new 2019 traces\cite{google-marso-19},
|
||||||
found by that paper could be potentially outdated in the current large-scale
|
the results and conclusions found by that paper could be potentially outdated
|
||||||
computing world. The new traces not only provide updated data on Borg's
|
in the current large-scale computing world.
|
||||||
|
The new traces not only provide updated data on Borg's
|
||||||
workload, but provide more data as well: the new traces contain data from 8
|
workload, but provide more data as well: the new traces contain data from 8
|
||||||
different Borg ``cells'' (i.e.\ clusters) in datacenters across the world,
|
different Borg ``cells'' (i.e.\ clusters) in datacenters across the world,
|
||||||
from now on referred to as ``Cluster A'' to ``Cluster H''.
|
from now on referred to as ``Cluster A'' to ``Cluster H''.
|
||||||
|
@ -184,16 +184,33 @@ documentation\cite{google-drive-marso}.
|
||||||
|
|
||||||
The new 2019 traces provide richer data even on a cluster by cluster basis. For
|
The new 2019 traces provide richer data even on a cluster by cluster basis. For
|
||||||
example, the amount and variety of server configurations per cluster increased
|
example, the amount and variety of server configurations per cluster increased
|
||||||
significantly from 2011.
|
significantly from 2011. An overview of the machine configurations in the cluster
|
||||||
An overview of the machine configurations in the cluster analyzed with the 2011
|
analyzed with the 2011 traces and in the 8 clusters composing the 2019 traces
|
||||||
traces and in the 8 clusters composing the 2019 traces can be found in
|
can be found in Figure~\ref{fig:machineconfigs} and in
|
||||||
Figure~\ref{fig:machineconfigs}. Additionally, in
|
Figure~\ref{fig:machineconfigs-csts} on a cluster-by-cluster basis.
|
||||||
Figure~\ref{fig:machineconfigs-csts}, the same machine configuration data is
|
|
||||||
provided for the 2019 traces providing a cluster-by-cluster distribution of the
|
|
||||||
machines.
|
|
||||||
|
|
||||||
\input{figures/machine_configs}
|
\input{figures/machine_configs}
|
||||||
|
|
||||||
|
There are two main works covering the new data,
|
||||||
|
one being the paper \textit{Borg: The Next Generation}\cite{google-marso-19},
|
||||||
|
which compares the overall features of the trace with the 2011
|
||||||
|
one\cite{google-marso-11}\cite{github-marso}, and one covering the features and
|
||||||
|
performance of
|
||||||
|
\textit{Autopilot}\cite{james-muratore}, software that provides autoscaling
|
||||||
|
features in Borg. The new traces have also been analyzed from the execution
|
||||||
|
priority perspective\cite{down-under}, as well as from a cluster-by-cluster
|
||||||
|
comparison\cite{golf-course} given the multi-cluster nature of the new traces.
|
||||||
|
|
||||||
|
Other studies have been performed in similar big-data systems focusing on the
|
||||||
|
failure of hardware components and software
|
||||||
|
bugs\cite{9}\cite{10}\cite{11}\cite{12}.
|
||||||
|
|
||||||
|
However, the community has not yet performed any research on the new Borg
|
||||||
|
traces analysing unsuccessful executions, their possible causes, and the
|
||||||
|
relationships between tasks and jobs. Therefore, the only current research in
|
||||||
|
this field (besides this report) is the 2015 Ros\'a et al.\
|
||||||
|
paper\cite{dsn-paper}.
|
||||||
|
|
||||||
\section{Background information}\label{sec3}
|
\section{Background information}\label{sec3}
|
||||||
|
|
||||||
\textit{Borg} is Google's own cluster management software able to run
|
\textit{Borg} is Google's own cluster management software able to run
|
||||||
|
|
|
@ -26,4 +26,94 @@ address = {Heraklion, Crete}
|
||||||
|
|
||||||
@misc{google-drive-marso, title={Google cluster-usage traces v3.pdf}, url={https://drive.google.com/file/d/10r6cnJ5cJ89fPWCgj7j4LtLBqYN9RiI9/view}, journal={Google Drive}, publisher={Google}, author={Wilkes, John}, year={2020}, month={Aug}}
|
@misc{google-drive-marso, title={Google cluster-usage traces v3.pdf}, url={https://drive.google.com/file/d/10r6cnJ5cJ89fPWCgj7j4LtLBqYN9RiI9/view}, journal={Google Drive}, publisher={Google}, author={Wilkes, John}, year={2020}, month={Aug}}
|
||||||
|
|
||||||
|
@INPROCEEDINGS{down-under,
|
||||||
|
author={Lasantha, Dimuth and Ray, Biplob},
|
||||||
|
booktitle={2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom)},
|
||||||
|
title={Priority Based Modeling and Comparative Study of Google Cloud Resources between 2011 and 2019},
|
||||||
|
year={2020},
|
||||||
|
volume={},
|
||||||
|
number={},
|
||||||
|
pages={1310-1317},
|
||||||
|
doi={10.1109/TrustCom50675.2020.00176}}
|
||||||
|
|
||||||
|
@inproceedings{james-muratore,
|
||||||
|
author = {Rzadca, Krzysztof and Findeisen, Pawel and Swiderski, Jacek and Zych, Przemyslaw and Broniek, Przemyslaw and Kusmierek, Jarek and Nowak, Pawel and Strack, Beata and Witusowski, Piotr and Hand, Steven and Wilkes, John},
|
||||||
|
title = {Autopilot: Workload Autoscaling at Google},
|
||||||
|
year = {2020},
|
||||||
|
isbn = {9781450368827},
|
||||||
|
publisher = {Association for Computing Machinery},
|
||||||
|
address = {New York, NY, USA},
|
||||||
|
url = {https://doi.org/10.1145/3342195.3387524},
|
||||||
|
doi = {10.1145/3342195.3387524},
|
||||||
|
booktitle = {Proceedings of the Fifteenth European Conference on Computer Systems},
|
||||||
|
articleno = {16},
|
||||||
|
numpages = {16},
|
||||||
|
location = {Heraklion, Greece},
|
||||||
|
series = {EuroSys '20}
|
||||||
|
}
|
||||||
|
|
||||||
|
@INPROCEEDINGS{golf-course, author={Lin, Yuhui and Barker, Adam and Ceesay, Sheriffo}, booktitle={2020 IEEE International Conference on Big Data (Big Data)}, title={Exploring Characteristics of Inter-cluster Machines and Cloud Applications on Google Clusters}, year={2020}, volume={}, number={}, pages={2785-2794}, doi={10.1109/BigData50022.2020.9377802}}
|
||||||
|
|
||||||
|
@ARTICLE{9,
|
||||||
|
author={Schroeder, Bianca and Gibson, Garth A.},
|
||||||
|
journal={IEEE Transactions on Dependable and Secure Computing},
|
||||||
|
title={A Large-Scale Study of Failures in High-Performance Computing Systems},
|
||||||
|
year={2010},
|
||||||
|
volume={7},
|
||||||
|
number={4},
|
||||||
|
pages={337-350},
|
||||||
|
doi={10.1109/TDSC.2009.4}}
|
||||||
|
|
||||||
|
@article{10,
|
||||||
|
author = {Schroeder, Bianca and Pinheiro, Eduardo and Weber, Wolf-Dietrich},
|
||||||
|
title = {DRAM Errors in the Wild: A Large-Scale Field Study},
|
||||||
|
year = {2011},
|
||||||
|
issue_date = {February 2011},
|
||||||
|
publisher = {Association for Computing Machinery},
|
||||||
|
address = {New York, NY, USA},
|
||||||
|
volume = {54},
|
||||||
|
number = {2},
|
||||||
|
issn = {0001-0782},
|
||||||
|
url = {https://doi.org/10.1145/1897816.1897844},
|
||||||
|
doi = {10.1145/1897816.1897844},
|
||||||
|
journal = {Commun. ACM},
|
||||||
|
month = feb,
|
||||||
|
pages = {100--107},
|
||||||
|
numpages = {8}
|
||||||
|
}
|
||||||
|
|
||||||
|
@inproceedings{11,
|
||||||
|
author = {Lu, Shan and Park, Soyeon and Seo, Eunsoo and Zhou, Yuanyuan},
|
||||||
|
title = {Learning from Mistakes: A Comprehensive Study on Real World Concurrency Bug Characteristics},
|
||||||
|
year = {2008},
|
||||||
|
isbn = {9781595939586},
|
||||||
|
publisher = {Association for Computing Machinery},
|
||||||
|
address = {New York, NY, USA},
|
||||||
|
url = {https://doi.org/10.1145/1346281.1346323},
|
||||||
|
doi = {10.1145/1346281.1346323},
|
||||||
|
booktitle = {Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems},
|
||||||
|
pages = {329--339},
|
||||||
|
numpages = {11},
|
||||||
|
keywords = {concurrency bug, concurrent program, bug characteristics},
|
||||||
|
location = {Seattle, WA, USA},
|
||||||
|
series = {ASPLOS XIII}
|
||||||
|
}
|
||||||
|
|
||||||
|
@inproceedings{12,
|
||||||
|
author = {Yuan, Ding and Park, Soyeon and Huang, Peng and Liu, Yang and Lee, Michael M. and Tang, Xiaoming and Zhou, Yuanyuan and Savage, Stefan},
|
||||||
|
title = {Be Conservative: Enhancing Failure Diagnosis with Proactive Logging},
|
||||||
|
year = {2012},
|
||||||
|
isbn = {9781931971966},
|
||||||
|
publisher = {USENIX Association},
|
||||||
|
address = {USA},
|
||||||
|
booktitle = {Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation},
|
||||||
|
pages = {293--306},
|
||||||
|
numpages = {14},
|
||||||
|
location = {Hollywood, CA, USA},
|
||||||
|
series = {OSDI'12}
|
||||||
|
}
|
||||||
|
|
||||||
@misc{google-proto-marso, title={Google 2019 Borg traces protobuffer specification}, url={https://github.com/google/cluster-data/blob/master/clusterdata_trace_format_v3.proto}, journal={GitHub}, publisher={Google}, author={Deng, Nan}, year={2020}, month={Aug}}
|
@misc{google-proto-marso, title={Google 2019 Borg traces protobuffer specification}, url={https://github.com/google/cluster-data/blob/master/clusterdata_trace_format_v3.proto}, journal={GitHub}, publisher={Google}, author={Deng, Nan}, year={2020}, month={Aug}}
|
||||||
|
|
||||||
|
@misc{github-marso, title={Borg cluster traces from Google}, url={https://github.com/google/cluster-data}, journal={GitHub}, publisher={Google}, author={Wilkes, John}, year={2020}, month={Aug}}
|
||||||
|
|
||||||
|
|