Claudio Maggioni 2021-06-17 21:50:22 +02:00
parent a05bd53fe6
commit 744a4025a1
3 changed files with 124 additions and 17 deletions

Binary file not shown.

@ -48,7 +48,7 @@ performance, and their root causes. We show the strong negative impact on
CPU and RAM usage and on task slowdown. We analyze patterns of
unsuccessful jobs and tasks, particularly focusing on their interdependency.
Moreover, we uncover their root causes by inspecting key workload and
system attributes such asmachine locality and concurrency level.}
system attributes such as machine locality and concurrency level.}
\begin{document}
\maketitle
@ -78,7 +78,7 @@ workload due to improvements in computational technology, but also providing
data from 8 different \textit{Borg} cells from datacenters located all over the
world.
\subsection{Motivation}
%\subsection{Motivation}
Even by glancing at some of the spatial and temporal analyses performed on the
Google Borg traces in this report, it is evident that unsuccessful executions
play a major role in the waste of resources in clusterized computations. For
@ -104,7 +104,7 @@ can be of interest for understanding the behaviour of failures in
similar clusterized systems, and could potentially be used to build predictive
models to mitigate or erase the resource impact of unsuccessful executions.
\subsection{Challenges}
%\subsection{Challenges}
Given that the new 2019 Google Borg cluster traces are about 100 times larger
than the 2011 ones, and given that the entire compressed traces package has a
non-trivial size (weighing approximately 8 TiB\cite{google-drive-marso}), the
@ -116,7 +116,7 @@ span of 3 months. Additionally, the analysis scripts have been written by
exploiting the power of parallel computing, mostly following a
MapReduce-like structure.
\subsection{Contribution}
%\subsection{Contribution}
This project aims to repeat the analysis performed in the 2015 DSN paper by
Ros\'a et al.\cite{dsn-paper} to highlight similarities and differences in Google Borg
workload and the behaviour and patterns of executions within it. Thanks to this
@ -125,7 +125,7 @@ prevent them. Additionally, given the technical challenge this analysis posed,
the report aims to provide an overview of some basic data engineering techniques
for big data applications.
\subsection{Outline}
%\subsection{Outline}
The report is structured as follows. Section~\ref{sec2} contains information
about the current state of the art for Google Borg cluster traces.
Section~\ref{sec3} provides an overview including technical background
@ -137,8 +137,7 @@ performance impact of unsuccessful executions, the patterns of task and job
events, and the potential causes of unsuccessful executions. Finally,
Section~\ref{sec8} contains the conclusions.
\section{State of the art}\label{sec2}
\section{State of the Art}\label{sec2}
\begin{figure}[t]
\begin{center}
@ -170,9 +169,10 @@ failures. The salient conclusion of that research is that actually lots of
computations performed by Google would eventually end in failure, thus leading
to large amounts of computational power being wasted.
However, with the release of the new 2019 traces, the results and conclusions
found by that paper could be potentially outdated in the current large-scale
computing world. The new traces not only provide updated data on Borg's
However, with the release of the new 2019 traces\cite{google-marso-19},
the results and conclusions found by that paper could be potentially outdated
in the current large-scale computing world.
The new traces not only provide updated data on Borg's
workload, but provide more data as well: the new traces contain data from 8
different Borg ``cells'' (i.e.\ clusters) in datacenters across the world,
from now on referred to as ``Cluster A'' to ``Cluster H''.
@ -184,16 +184,33 @@ documentation\cite{google-drive-marso}.
The new 2019 traces provide richer data even on a cluster by cluster basis. For
example, the amount and variety of server configurations per cluster increased
significantly from 2011.
An overview of the machine configurations in the cluster analyzed with the 2011
traces and in the 8 clusters composing the 2019 traces can be found in
Figure~\ref{fig:machineconfigs}. Additionally, in
Figure~\ref{fig:machineconfigs-csts}, the same machine configuration data is
provided for the 2019 traces providing a cluster-by-cluster distribution of the
machines.
significantly from 2011. An overview of the machine configurations in the cluster
analyzed with the 2011 traces and in the 8 clusters composing the 2019 traces
can be found in Figure~\ref{fig:machineconfigs} and in
Figure~\ref{fig:machineconfigs-csts} on a cluster-by-cluster basis.
\input{figures/machine_configs}
There are two main works covering the new data,
one being the paper \textit{Borg: The Next Generation}\cite{google-marso-19},
which compares the overall features of the trace with the 2011
one\cite{google-marso-11}\cite{github-marso}, and one covering the features and
performance of
\textit{Autopilot}\cite{james-muratore}, a software system that provides autoscaling
features in Borg. The new traces have also been analyzed from the execution
priority perspective\cite{down-under}, as well as from a cluster-by-cluster
comparison\cite{golf-course} given the multi-cluster nature of the new traces.
Other studies have been performed in similar big-data systems focusing on the
failure of hardware components and software
bugs\cite{9}\cite{10}\cite{11}\cite{12}.
However, the community has not yet performed any research on the new Borg
traces analysing unsuccessful executions, their possible causes, and the
relationships between tasks and jobs. Therefore, the only current research in
this field (besides this report) is the 2015 Ros\'a et al.\
paper\cite{dsn-paper}.
\section{Background information}\label{sec3}
\textit{Borg} is Google's own cluster management software able to run

@ -26,4 +26,94 @@ address = {Heraklion, Crete}
@misc{google-drive-marso, title={Google cluster-usage traces v3.pdf}, url={https://drive.google.com/file/d/10r6cnJ5cJ89fPWCgj7j4LtLBqYN9RiI9/view}, journal={Google Drive}, publisher={Google}, author={Wilkes, John}, year={2020}, month={Aug}}
@INPROCEEDINGS{down-under,
author={Lasantha, Dimuth and Ray, Biplob},
booktitle={2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom)},
title={Priority Based Modeling and Comparative Study of Google Cloud Resources between 2011 and 2019},
year={2020},
volume={},
number={},
pages={1310-1317},
doi={10.1109/TrustCom50675.2020.00176}}
@inproceedings{james-muratore,
author = {Rzadca, Krzysztof and Findeisen, Pawel and Swiderski, Jacek and Zych, Przemyslaw and Broniek, Przemyslaw and Kusmierek, Jarek and Nowak, Pawel and Strack, Beata and Witusowski, Piotr and Hand, Steven and Wilkes, John},
title = {Autopilot: Workload Autoscaling at Google},
year = {2020},
isbn = {9781450368827},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3342195.3387524},
doi = {10.1145/3342195.3387524},
booktitle = {Proceedings of the Fifteenth European Conference on Computer Systems},
articleno = {16},
numpages = {16},
location = {Heraklion, Greece},
series = {EuroSys '20}
}
@INPROCEEDINGS{golf-course, author={Lin, Yuhui and Barker, Adam and Ceesay, Sheriffo}, booktitle={2020 IEEE International Conference on Big Data (Big Data)}, title={Exploring Characteristics of Inter-cluster Machines and Cloud Applications on Google Clusters}, year={2020}, volume={}, number={}, pages={2785-2794}, doi={10.1109/BigData50022.2020.9377802}}
@ARTICLE{9,
author={Schroeder, Bianca and Gibson, Garth A.},
journal={IEEE Transactions on Dependable and Secure Computing},
title={A Large-Scale Study of Failures in High-Performance Computing Systems},
year={2010},
volume={7},
number={4},
pages={337-350},
doi={10.1109/TDSC.2009.4}}
@article{10,
author = {Schroeder, Bianca and Pinheiro, Eduardo and Weber, Wolf-Dietrich},
title = {DRAM Errors in the Wild: A Large-Scale Field Study},
year = {2011},
issue_date = {February 2011},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {54},
number = {2},
issn = {0001-0782},
url = {https://doi.org/10.1145/1897816.1897844},
doi = {10.1145/1897816.1897844},
journal = {Commun. ACM},
month = feb,
pages = {100--107},
numpages = {8}
}
@inproceedings{11,
author = {Lu, Shan and Park, Soyeon and Seo, Eunsoo and Zhou, Yuanyuan},
title = {Learning from Mistakes: A Comprehensive Study on Real World Concurrency Bug Characteristics},
year = {2008},
isbn = {9781595939586},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1346281.1346323},
doi = {10.1145/1346281.1346323},
booktitle = {Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems},
pages = {329--339},
numpages = {11},
keywords = {concurrency bug, concurrent program, bug characteristics},
location = {Seattle, WA, USA},
series = {ASPLOS XIII}
}
@inproceedings{12,
author = {Yuan, Ding and Park, Soyeon and Huang, Peng and Liu, Yang and Lee, Michael M. and Tang, Xiaoming and Zhou, Yuanyuan and Savage, Stefan},
title = {Be Conservative: Enhancing Failure Diagnosis with Proactive Logging},
year = {2012},
isbn = {9781931971966},
publisher = {USENIX Association},
address = {USA},
booktitle = {Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation},
pages = {293--306},
numpages = {14},
location = {Hollywood, CA, USA},
series = {OSDI'12}
}
@misc{google-proto-marso, title={Google 2019 Borg traces protobuffer specification}, url={https://github.com/google/cluster-data/blob/master/clusterdata_trace_format_v3.proto}, journal={GitHub}, publisher={Google}, author={Deng, Nan}, year={2020}, month={Aug}}
@misc{github-marso, title={Borg cluster traces from Google}, url={https://github.com/google/cluster-data}, journal={GitHub}, publisher={Google}, author={Wilkes, John}, year={2020}, month={Aug}}