diff --git a/report/Claudio_Maggioni_report.pdf b/report/Claudio_Maggioni_report.pdf
index 6be89b2a..ea60945c 100644
Binary files a/report/Claudio_Maggioni_report.pdf and b/report/Claudio_Maggioni_report.pdf differ
diff --git a/report/Claudio_Maggioni_report.tex b/report/Claudio_Maggioni_report.tex
index b723a70e..14baddd7 100644
--- a/report/Claudio_Maggioni_report.tex
+++ b/report/Claudio_Maggioni_report.tex
@@ -1,11 +1,11 @@
 \documentclass{usiinfbachelorproject}
 \title{Understanding and Comparing Unsuccessful Executions in Large Datacenters}
 \author{Claudio Maggioni}
-
-\usepackage[parfill]{parskip}
-\setlength{\parskip}{7pt}
+\usepackage{enumitem}
+\usepackage{parskip}
+\setlength{\parskip}{5pt}
 \setlength{\parindent}{0pt}
-
+%\usepackage[printfigures]{figcaps}
 \usepackage{xcolor}
 \usepackage{amsmath}
 \usepackage{subcaption}
@@ -93,42 +93,36 @@
 are encoded and stored in the trace as rows of various tables. Among the
 information events provide, the field ``type'' provides information on
 the execution status of the job or task. This field can have the
 following values:
-
-\begin{itemize}
-\item
-  \textbf{QUEUE}: The job or task was marked not eligible for scheduling
+\begin{center}
+\begin{tabular}{p{3cm}p{12cm}}
+\toprule
+\textbf{Type code} & \textbf{Description} \\
+\midrule
+ \texttt{QUEUE} & The job or task was marked not eligible for scheduling
   by Borg's scheduler, and thus Borg will move the job/task in a long
-  wait queue;
-\item
-  \textbf{SUBMIT}: The job or task was submitted to Borg for execution;
-\item
-  \textbf{ENABLE}: The job or task became eligible for scheduling;
-\item
-  \textbf{SCHEDULE}: The job or task's execution started;
-\item
-  \textbf{EVICT}: The job or task was terminated in order to free
-  computational resources for an higher priority job;
-\item
-  \textbf{FAIL}: The job or task terminated its execution unsuccesfully
-  due to a failure;
-\item
-  \textbf{FINISH}: The job or task terminated succesfully;
-\item
-  \textbf{KILL}: The job or task terminated its execution because of a
-  manual request to stop it;
-\item
-  \textbf{LOST}: It is assumed a job or task is has been terminated, but
-  due to missing data there is insufficent information to identify when
-  or how;
-\item
-  \textbf{UPDATE\_PENDING}: The metadata (scheduling class, resource
-  requirements, \ldots) of the job/task was updated while the job was
-  waiting to be scheduled;
-\item
-  \textbf{UPDATE\_RUNNING}: The metadata (scheduling class, resource
-  requirements, \ldots) of the job/task was updated while the job was in
-  execution;
-\end{itemize}
+  wait queue\\
+\texttt{SUBMIT} & The job or task was submitted to Borg for execution\\
+\texttt{ENABLE} & The job or task became eligible for scheduling\\
+\texttt{SCHEDULE} & The job or task's execution started\\
+\texttt{EVICT} & The job or task was terminated in order to free
+  computational resources for a higher-priority job\\
+\texttt{FAIL} & The job or task terminated its execution unsuccessfully
+  due to a failure\\
+\texttt{FINISH} & The job or task terminated successfully\\
+\texttt{KILL} & The job or task terminated its execution because of a
+  manual request to stop it\\
+\texttt{LOST} & It is assumed the job or task has been terminated, but
+  due to missing data there is insufficient information to identify when
+  or how\\
+\texttt{UPDATE\_PENDING} & The metadata (scheduling class, resource
+  requirements, \ldots) of the job/task was updated while the job was
+  waiting to be scheduled\\
+\texttt{UPDATE\_RUNNING} & The metadata (scheduling class, resource
+  requirements, \ldots) of the job/task was updated while the job was in
+  execution\\
+\bottomrule
+\end{tabular}
+\end{center}

 Figure~\ref{fig:eventTypes} shows the expected transitions between event
 types.
@@ -177,22 +171,16 @@
 file segments) where each carriage return separated line represents a
 single record for that table.

 There are namely 5 different table ``files'':
-
-\begin{itemize}
-\item
-  \texttt{machine\_configs}, which is a table containing each physical
+\begin{description}
+\item[\texttt{machine\_configs},] which is a table containing each physical
   machine's configuration and its evolution over time;
-\item
-  \texttt{instance\_events}, which is a table of task events;
-\item
-  \texttt{collection\_events}, which is a table of job events;
-\item
-  \texttt{machine\_attributes}, which is a table containing (obfuscated)
+\item[\texttt{instance\_events},] which is a table of task events;
+\item[\texttt{collection\_events},] which is a table of job events;
+\item[\texttt{machine\_attributes},] which is a table containing (obfuscated)
   metadata about each physical machine and its evolution over time;
-\item
-  \texttt{instance\_usage}, which contains resource (CPU/RAM) measures
+\item[\texttt{instance\_usage},] which contains resource (CPU/RAM) measures
   of jobs and tasks running on the single machines.
-\end{itemize}
+\end{description}

 The scope of this thesis focuses on the tables
 \texttt{machine\_configs}, \texttt{instance\_events} and
@@ -224,7 +212,11 @@
 analysis}\label{project-requirements-and-analysis}}

 \hypertarget{analysis-methodology}{%
 \section{Analysis methodology}\label{analysis-methodology}}

-\textbf{TBD}
+Due to the inherent complexity of analyzing traces of this size,
+distributed data engineering techniques were adopted to perform the
+required computations. We used the Apache Spark framework to perform
+efficient, parallel Map-Reduce computations. In this section, we discuss
+the technical details behind our approach.

 \hypertarget{introduction-on-apache-spark}{%
 \subsection{Introduction on Apache
 Spark}\label{introduction-on-apache-spark}}
@@ -302,15 +294,16 @@
 the presence of incomplete data (i.e.~records which contain fields whose
 values is unknown). This filtering is performed using the
 \texttt{.filter()} operation of Spark's RDD API.

-The core of each query is often a \texttt{groupby()} followed by a \texttt{map()}
-operation on the aggregated data. The \texttt{groupby()} groups the set of all records
-into several subsets of records each having something in common. Then, each of
-this small clusters is reduced with a \texttt{map()} operation to a single
-record. The motivation behind this computation is often to analyze a time
-series of several different traces of programs. This is implemented by
-\texttt{groupby()}-ing records by program id, and then \texttt{map()}-ing each program
-trace set by sorting by time the traces and computing the desired property in
-the form of a record.
+The core of each query is often a \texttt{groupby()} followed by a
+\texttt{map()} operation on the aggregated data. The \texttt{groupby()} groups
+the set of all records into several subsets of records, each having something
+in common. Then, each of these smaller groups is reduced with a \texttt{map()}
+operation to a single record. The motivation behind this way of computing data
+is that the analyses in this thesis often need to examine the behaviour of
+tasks or jobs over time by looking at their events. These
+queries are therefore implemented by \texttt{groupby()}-ing records by task or
+job, and then \texttt{map()}-ing each set of event records, sorting them by
+time and performing the desired computation on the resulting chronological
+event log.

 Sometimes intermediate results are saved in Spark's parquet format in
 order to compute and save intermediate results beforehand.
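+To illustrate this pattern, a minimal sketch in PySpark is shown below. It
+assumes the events are stored as one JSON record per line, and the field and
+path names (\texttt{collection\_id}, \texttt{time}, \texttt{type},
+\texttt{collection\_events/}) are illustrative rather than the exact ones used
+in the queries:
+
+\begin{verbatim}
+import json
+from pyspark.sql import SparkSession
+
+spark = SparkSession.builder.appName("trace-analysis").getOrCreate()
+sc = spark.sparkContext
+
+def parse(line):
+    # one JSON record per line; keep only the fields we need
+    # (hypothetical field names, for illustration only)
+    r = json.loads(line)
+    return (r.get("collection_id"), r.get("time"), r.get("type"))
+
+events = (sc.textFile("collection_events/*.json.gz")
+            .map(parse)
+            # discard incomplete records, as described above
+            .filter(lambda r: None not in r))
+
+def last_type(records):
+    # sort one job's events chronologically and keep the final state
+    return sorted(records, key=lambda r: r[1])[-1][2]
+
+final_states = (events
+                .groupBy(lambda r: r[0])  # group events by job id
+                .map(lambda kv: (kv[0], last_type(kv[1]))))
+
+# save the intermediate result in parquet format
+final_states.toDF(["collection_id", "final_type"]) \
+            .write.mode("overwrite").parquet("final_states.parquet")
+\end{verbatim}
+
+Grouping by job id keeps each job's event log small enough to be sorted in
+memory inside a single \texttt{map()} task, while the jobs themselves are
+processed in parallel across the cluster.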