report progress
This commit is contained in:
parent c5ceae561c
commit d2f896f3ed
2 changed files with 55 additions and 62 deletions
Binary file not shown.
@@ -1,11 +1,11 @@
\documentclass{usiinfbachelorproject}
\title{Understanding and Comparing Unsuccessful Executions in Large Datacenters}
\author{Claudio Maggioni}

\usepackage{enumitem}
\usepackage{parskip}
\setlength{\parskip}{5pt}
\setlength{\parindent}{0pt}

%\usepackage[printfigures]{figcaps}
\usepackage{xcolor}
\usepackage{amsmath}
\usepackage{subcaption}

@@ -93,42 +93,36 @@ are encoded and stored in the trace as rows of various tables. Among the
information events provide, the field ``type'' indicates
the execution status of the job or task. This field can have the
following values:

\begin{center}
\begin{tabular}{p{3cm}p{12cm}}
\toprule
\textbf{Type code} & \textbf{Description} \\
\midrule
\texttt{QUEUE} & The job or task was marked not eligible for scheduling
by Borg's scheduler, and thus Borg will move the job/task into a long
wait queue\\
\texttt{SUBMIT} & The job or task was submitted to Borg for execution\\
\texttt{ENABLE} & The job or task became eligible for scheduling\\
\texttt{SCHEDULE} & The job or task's execution started\\
\texttt{EVICT} & The job or task was terminated in order to free
computational resources for a higher priority job\\
\texttt{FAIL} & The job or task terminated its execution unsuccessfully
due to a failure\\
\texttt{FINISH} & The job or task terminated successfully\\
\texttt{KILL} & The job or task terminated its execution because of a
manual request to stop it\\
\texttt{LOST} & It is assumed the job or task has been terminated, but
due to missing data there is insufficient information to identify when
or how\\
\texttt{UPDATE\_PENDING} & The metadata (scheduling class, resource
requirements, \ldots) of the job/task was updated while the job was
waiting to be scheduled\\
\texttt{UPDATE\_RUNNING} & The metadata (scheduling class, resource
requirements, \ldots) of the job/task was updated while the job was in
execution\\
\bottomrule
\end{tabular}
\end{center}

Figure~\ref{fig:eventTypes} shows the expected transitions between event
types.

@@ -177,22 +171,16 @@ file segments) where each carriage return separated line represents a
single record for that table.

There are five different table ``files'', namely:

\begin{description}
\item[\texttt{machine\_configs},] which is a table containing each physical
machine's configuration and its evolution over time;
\item[\texttt{instance\_events},] which is a table of task events;
\item[\texttt{collection\_events},] which is a table of job events;
\item[\texttt{machine\_attributes},] which is a table containing (obfuscated)
metadata about each physical machine and its evolution over time;
\item[\texttt{instance\_usage},] which contains resource (CPU/RAM) measures
of jobs and tasks running on individual machines.
\end{description}

The scope of this thesis focuses on the tables
\texttt{machine\_configs}, \texttt{instance\_events} and

@@ -224,7 +212,11 @@ analysis}\label{project-requirements-and-analysis}}
\hypertarget{analysis-methodology}{%
\section{Analysis methodology}\label{analysis-methodology}}

Due to the inherent complexity of analyzing traces of this size, modern data
engineering techniques were adopted to perform the required computations. We
used the Apache Spark framework to perform efficient and parallel Map-Reduce
computations. In this section, we discuss the technical details behind our
approach.

\hypertarget{introduction-on-apache-spark}{%
\subsection{Introduction on Apache

@@ -302,15 +294,16 @@ the presence of incomplete data (i.e.~records which contain fields whose values
is unknown). This filtering is performed using the \texttt{.filter()} operation
of Spark's RDD API.
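
As a purely illustrative sketch of this step, loading and filtering one table
could look roughly as follows. The input path, the assumption that each line
holds one JSON-encoded record, and the field names \texttt{type} and
\texttt{time} are hypothetical and only serve to show the shape of the
\texttt{.filter()} call:

\begin{verbatim}
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trace-analysis").getOrCreate()
sc = spark.sparkContext

# Hypothetical input: one JSON-encoded event record per line.
events = sc.textFile("instance_events-*.json.gz").map(json.loads)

# Keep only records whose relevant fields have known values.
complete_events = events.filter(
    lambda r: r.get("type") is not None and r.get("time") is not None)
\end{verbatim}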

The core of each query is often a \texttt{groupby()} followed by a
\texttt{map()} operation on the aggregated data. The \texttt{groupby()} groups
the set of all records into several subsets of records, each having something
in common. Then, each of these smaller subsets is reduced with a \texttt{map()}
operation to a single record. The motivation behind this way of computing data
is that the analysis in this thesis often requires studying the behaviour over
time of either tasks or jobs by looking at their events. These queries are
therefore implemented by \texttt{groupby()}-ing records by task or job, and
then \texttt{map()}-ing each set of event records, sorting them by time and
performing the desired computation on the obtained chronological event log.
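
A minimal sketch of this \texttt{groupby()}-then-\texttt{map()} pattern,
continuing the hypothetical example above (the grouping key
\texttt{collection\_id} and the summary fields are assumptions for
illustration, not the actual queries used in this thesis), might be:

\begin{verbatim}
# Group events by job, sort each group chronologically, and reduce the
# resulting event log to a single summary record per job.
def summarize(job_id, job_events):
    log = sorted(job_events, key=lambda e: e["time"])
    return {"collection_id": job_id,
            "first_event": log[0]["type"],
            "last_event": log[-1]["type"],
            "n_events": len(log)}

per_job = (complete_events
           .groupBy(lambda e: e["collection_id"])
           .map(lambda kv: summarize(kv[0], kv[1])))
\end{verbatim}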

Sometimes intermediate results are saved in Spark's Parquet format, so that
expensive computations can be performed once beforehand and their results
reused by later queries.
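
For instance, the intermediate table from the sketch above could be persisted
and read back roughly as follows (the file name and the conversion through the
DataFrame API are again assumptions for illustration only):

\begin{verbatim}
from pyspark.sql import Row

# Persist the intermediate result as Parquet so later queries can
# read it back instead of recomputing it.
per_job.map(lambda r: Row(**r)).toDF() \
       .write.mode("overwrite").parquet("per_job_summary.parquet")

# A later query starts from the precomputed table.
cached = spark.read.parquet("per_job_summary.parquet")
\end{verbatim}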