diff --git a/report/Claudio_Maggioni_report.pdf b/report/Claudio_Maggioni_report.pdf
index 9c5e626d..a04499e6 100644
Binary files a/report/Claudio_Maggioni_report.pdf and b/report/Claudio_Maggioni_report.pdf differ
diff --git a/report/Claudio_Maggioni_report.tex b/report/Claudio_Maggioni_report.tex
index 9a20d546..ba3d7081 100644
--- a/report/Claudio_Maggioni_report.tex
+++ b/report/Claudio_Maggioni_report.tex
@@ -43,9 +43,35 @@ system attributes such as machine locality and concurrency level.}
 \tableofcontents
 \newpage
 
-\hypertarget{introduction-including-motivation}{%
-\section{Introduction (including
-Motivation)}\label{introduction-including-motivation}}
+\section{Introduction}
+In today's world there is an ever-growing demand for efficient, large-scale
+computations. The rising trend of ``big data'' has put the need for efficient
+management of large-scale parallelized computing at an all-time high. This fact
+also increases the demand for research in the field of distributed systems, in
+particular on how to schedule computations effectively, avoid wasting
+resources, and avoid failures.
+
+In 2011 Google released a month-long data trace of its own \textit{Borg}
+cluster management system, containing detailed data on the scheduling, priority
+management, and failures of a real production workload. This data was the
+foundation of the 2015 Ros\'a et al.\ paper \textit{Understanding the Dark Side
+of Big Data Clusters: An Analysis beyond Failures}, which, among its many
+conclusions, highlighted the need for better cluster management in light of the
+high number of failures found in the traces.
+
+In 2019 Google released an updated version of the \textit{Borg} cluster traces,
+not only containing data from a far bigger workload, thanks to the exponential
+growth predicted by Moore's law, but also providing data from 8 different
+\textit{Borg} cells in datacenters across the world. These new traces are about
+100 times larger than the old ones, weighing approximately 8~TiB when
+compressed and stored in JSONL format, and thus require considerable
+computational power and dedicated data engineering techniques to analyze.
+
+This project aims to repeat the 2015 analysis to highlight the similarities and
+differences in workload that this decade brought, and to expand the old
+analysis to better understand the causes of failures and how to prevent them.
+Additionally, this report provides an overview of the data engineering
+techniques used to perform the queries and analyses on the 2019 traces.
 
 \hypertarget{state-of-the-art}{%
 \section{State of the Art}\label{state-of-the-art}}