diff --git a/report/Claudio_Maggioni_report.pdf b/report/Claudio_Maggioni_report.pdf index e847ef0c..7f3217c4 100644 Binary files a/report/Claudio_Maggioni_report.pdf and b/report/Claudio_Maggioni_report.pdf differ diff --git a/report/Claudio_Maggioni_report.tex b/report/Claudio_Maggioni_report.tex index fd63a3d2..2c4faec6 100644 --- a/report/Claudio_Maggioni_report.tex +++ b/report/Claudio_Maggioni_report.tex @@ -48,7 +48,7 @@ performance, and their root causes. We show the strong negative impact on CPU and RAM usage and on task slowdown. We analyze patterns of unsuccessful jobs and tasks, particularly focusing on their interdependency. Moreover, we uncover their root causes by inspecting key workload and -system attributes such asmachine locality and concurrency level.} +system attributes such as machine locality and concurrency level.} \begin{document} \maketitle @@ -78,7 +78,7 @@ workload due to improvements in computational technology, but also providing data from 8 different \textit{Borg} cells from datacenters located all over the world. -\subsection{Motivation} +%\subsection{Motivation} Even by glancing at some of the spatial and temporal analyses performed on the Google Borg traces in this report, it is evident that unsuccessful executions play a major role in the waste of resources in clusterized computations. For @@ -104,7 +104,7 @@ can be of interest for understanding the behaviour of failures in similar clusterized systems, and could potentially be used to build predictive models to mitigate or erase the resource impact of unsuccessful executions. -\subsection{Challenges} +%\subsection{Challenges} Given that the new 2019 Google Borg cluster traces are about 100 times larger than the 2011 ones, and given that the entire compressed traces package has a non-trivial size (weighing approximately 8 TiB\cite{google-drive-marso}), the @@ -116,7 +116,7 @@ span of 3 months. 
Additionally, the analysis scripts have been written by exploiting the power of parallel computing, following most of the time a MapReduce-like structure. -\subsection{Contribution} +%\subsection{Contribution} This project aims to repeat the analysis performed in 2015 DSN Ros\'a et al.\ paper\cite{dsn-paper} to highlight similarities and differences in Google Borg workload and the behaviour and patterns of executions within it. Thanks to this @@ -125,7 +125,7 @@ prevent them. Additionally, given the technical challenge this analysis posed, the report aims to provide an overview of some basic data engineering techniques for big data applications. -\subsection{Outline} +%\subsection{Outline} The report is structured as follows. Section~\ref{sec2} contains information about the current state of the art for Google Borg cluster traces. Section~\ref{sec3} provides an overview including technical background @@ -137,8 +137,7 @@ performance input of unsuccessful executions, the patterns of task and job events, and the potential causes of unsuccessful executions. Finally, Section~\ref{sec8} contains the conclusions. - -\section{State of the art}\label{sec2} +\section{State of the Art}\label{sec2} \begin{figure}[t] \begin{center} @@ -170,9 +169,10 @@ failures. The salient conclusion of that research is that actually lots of computations performed by Google would eventually end in failure, then leading to large amounts of computational power being wasted. -However, with the release of the new 2019 traces, the results and conclusions -found by that paper could be potentially outdated in the current large-scale -computing world. The new traces not only provide updated data on Borg's +However, with the release of the new 2019 traces\cite{google-marso-19}, +the results and conclusions found by that paper could be potentially outdated +in the current large-scale computing world. 
+The new traces not only provide updated data on Borg's workload, but provide more data as well: the new traces contain data from 8 different Borg ``cells'' (i.e.\ clusters) in datacenters across the world, from now on referred as ``Cluster A'' to ``Cluster H''. @@ -184,16 +184,33 @@ documentation\cite{google-drive-marso}. The new 2019 traces provide richer data even on a cluster by cluster basis. For example, the amount and variety of server configurations per cluster increased -significantly from 2011. -An overview of the machine configurations in the cluster analyzed with the 2011 -traces and in the 8 clusters composing the 2019 traces can be found in -Figure~\ref{fig:machineconfigs}. Additionally, in -Figure~\ref{fig:machineconfigs-csts}, the same machine configuration data is -provided for the 2019 traces providing a cluster-by-cluster distribution of the -machines. +significantly from 2011. An overview of the machine configurations in the cluster +analyzed with the 2011 traces and in the 8 clusters composing the 2019 traces +can be found in Figure~\ref{fig:machineconfigs} and in +Figure~\ref{fig:machineconfigs-csts} on a cluster-by-cluster basis. \input{figures/machine_configs} +There are two main works covering the new data, +one being the paper \textit{Borg: The Next Generation}\cite{google-marso-19}, +which compares the overall features of the trace with the 2011 +one\cite{google-marso-11}\cite{github-marso}, and one covering the features and +performance of +\textit{Autopilot}\cite{james-muratore}, software that provides autoscaling +features in Borg. The new traces have also been analyzed from the execution +priority perspective\cite{down-under}, as well as through a cluster-by-cluster +comparison\cite{golf-course} given the multi-cluster nature of the new traces. + +Other studies have been performed on similar big-data systems, focusing on the +failure of hardware components and software +bugs\cite{9}\cite{10}\cite{11}\cite{12}. 
+ +However, the community has not yet performed any research on the new Borg +traces analysing unsuccessful executions, their possible causes, and the +relationships between tasks and jobs. Therefore, the only current research in +this field (besides this report) is the 2015 Ros\'a et al.\ +paper\cite{dsn-paper}. + \section{Background information}\label{sec3} \textit{Borg} is Google's own cluster management software able to run diff --git a/report/references.bib b/report/references.bib index 21528d43..e15680c7 100644 --- a/report/references.bib +++ b/report/references.bib @@ -26,4 +26,94 @@ address = {Heraklion, Crete} @misc{google-drive-marso, title={Google cluster-usage traces v3.pdf}, url={https://drive.google.com/file/d/10r6cnJ5cJ89fPWCgj7j4LtLBqYN9RiI9/view}, journal={Google Drive}, publisher={Google}, author={Wilkes, John}, year={2020}, month={Aug}} +@INPROCEEDINGS{down-under, + author={Lasantha, Dimuth and Ray, Biplob}, + booktitle={2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom)}, + title={Priority Based Modeling and Comparative Study of Google Cloud Resources between 2011 and 2019}, + year={2020}, + volume={}, + number={}, + pages={1310-1317}, + doi={10.1109/TrustCom50675.2020.00176}} + +@inproceedings{james-muratore, +author = {Rzadca, Krzysztof and Findeisen, Pawel and Swiderski, Jacek and Zych, Przemyslaw and Broniek, Przemyslaw and Kusmierek, Jarek and Nowak, Pawel and Strack, Beata and Witusowski, Piotr and Hand, Steven and Wilkes, John}, +title = {Autopilot: Workload Autoscaling at Google}, +year = {2020}, +isbn = {9781450368827}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/3342195.3387524}, +doi = {10.1145/3342195.3387524}, +booktitle = {Proceedings of the Fifteenth European Conference on Computer Systems}, +articleno = {16}, +numpages = {16}, +location = {Heraklion, Greece}, +series = {EuroSys '20} +} + 
+@INPROCEEDINGS{golf-course, author={Lin, Yuhui and Barker, Adam and Ceesay, Sheriffo}, booktitle={2020 IEEE International Conference on Big Data (Big Data)}, title={Exploring Characteristics of Inter-cluster Machines and Cloud Applications on Google Clusters}, year={2020}, volume={}, number={}, pages={2785-2794}, doi={10.1109/BigData50022.2020.9377802}} + +@ARTICLE{9, + author={Schroeder, Bianca and Gibson, Garth A.}, + journal={IEEE Transactions on Dependable and Secure Computing}, + title={A Large-Scale Study of Failures in High-Performance Computing Systems}, + year={2010}, + volume={7}, + number={4}, + pages={337-350}, + doi={10.1109/TDSC.2009.4}} + +@article{10, +author = {Schroeder, Bianca and Pinheiro, Eduardo and Weber, Wolf-Dietrich}, +title = {DRAM Errors in the Wild: A Large-Scale Field Study}, +year = {2011}, +issue_date = {February 2011}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +volume = {54}, +number = {2}, +issn = {0001-0782}, +url = {https://doi.org/10.1145/1897816.1897844}, +doi = {10.1145/1897816.1897844}, +journal = {Commun. ACM}, +month = feb, +pages = {100–107}, +numpages = {8} +} + +@inproceedings{11, +author = {Lu, Shan and Park, Soyeon and Seo, Eunsoo and Zhou, Yuanyuan}, +title = {Learning from Mistakes: A Comprehensive Study on Real World Concurrency Bug Characteristics}, +year = {2008}, +isbn = {9781595939586}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/1346281.1346323}, +doi = {10.1145/1346281.1346323}, +booktitle = {Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems}, +pages = {329–339}, +numpages = {11}, +keywords = {concurrency bug, concurrent program, bug characteristics}, +location = {Seattle, WA, USA}, +series = {ASPLOS XIII} +} + +@inproceedings{12, +author = {Yuan, Ding and Park, Soyeon and Huang, Peng and Liu, Yang and Lee, Michael M. 
and Tang, Xiaoming and Zhou, Yuanyuan and Savage, Stefan}, +title = {Be Conservative: Enhancing Failure Diagnosis with Proactive Logging}, +year = {2012}, +isbn = {9781931971966}, +publisher = {USENIX Association}, +address = {USA}, +booktitle = {Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation}, +pages = {293–306}, +numpages = {14}, +location = {Hollywood, CA, USA}, +series = {OSDI'12} +} + @misc{google-proto-marso, title={Google 2019 Borg traces protobuffer specification}, url={https://github.com/google/cluster-data/blob/master/clusterdata_trace_format_v3.proto}, journal={GitHub}, publisher={Google}, author={Deng, Nan}, year={2020}, month={Aug}} + +@misc{github-marso, title={Borg cluster traces from Google}, url={https://github.com/google/cluster-data}, journal={GitHub}, publisher={Google}, author={Wilkes, John}, year={2020}, month={Aug}} +