---
documentclass: usiinfbachelorproject
title: Understanding and Comparing Unsuccessful Executions in Large Datacenters
author: Claudio Maggioni
pandoc-options:
- --filter=pandoc-include
- --latex-engine-opt=--shell-escape
- --latex-engine-opt=--enable-write18
header-includes:
- |
```{=latex}
\usepackage{subcaption}
\usepackage{booktabs}
\usepackage{graphicx}
\captionsetup{labelfont={bf}}
%\subtitle{The (optional) subtitle}
\versiondate{\today}
\begin{committee}
\advisor[Universit\`a della Svizzera Italiana,
Switzerland]{Prof.}{Walter}{Binder}
\assistant[Universit\`a della Svizzera Italiana,
Switzerland]{Dr.}{Andrea}{Ros\'a}
\end{committee}
\abstract{The project aims at comparing two different traces coming from large
datacenters, focusing in particular on unsuccessful executions of jobs and
tasks submitted by users. The objective of this project is to compare the
resource waste caused by unsuccessful executions, their impact on application
performance, and their root causes. We will show the strong negative impact on
CPU and RAM usage and on task slowdown. We will analyze patterns of
unsuccessful jobs and tasks, particularly focusing on their interdependency.
Moreover, we will uncover their root causes by inspecting key workload and
system attributes such as machine locality and concurrency level.}
```
---

# Introduction (including Motivation)

# State of the Art

- Introduce Ros\'a 2015 DSN paper on analysis
- Describe Google Borg clusters
- Describe the traces' contents
- Differences between the 2011 and 2019 traces

# Project requirements and analysis

(describe our objective with this analysis in detail)

# Analysis methodology

## Technical overview of traces' file format and schema

## Overview of the challenging aspects of the analysis (data size, schema, available computation resources)

## Introduction to Apache Spark

## General description of the Apache Spark query workflow
The analysis of the Google 2019 Borg cluster traces was conducted using Apache
Spark and its Python 3 API (PySpark). Spark was used to execute a series of
queries performing various sums and aggregations over the entire dataset
provided by Google.
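
The following is a minimal sketch (not the project's actual setup) of how a
PySpark session can be created and the JSONL trace files loaded through the
RDD API; the trace path is only a placeholder:

```python
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session for the analysis scripts.
spark = SparkSession.builder \
    .appName("google-2019-borg-trace-analysis") \
    .getOrCreate()
sc = spark.sparkContext

# The traces are stored as (compressed) JSONL files; the path below is a
# placeholder for the actual location of the trace files.
instance_events = sc.textFile("/path/to/traces/instance_events-*.json.gz")
```
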
In general, each query follows a Map-Reduce template: traces are first read,
parsed, and filtered by performing selections, projections, and the computation
of new derived fields. Then, the trace records are often grouped by one of
their fields, clustering related data together before a reduce or fold
operation is applied to each group.

Most of the input data is in JSONL format and adheres to a schema that Google
provided in the form of a Protocol Buffer specification[^1].

[^1]: [Google 2019 Borg traces Protobuf specification on GitHub](https://github.com/google/cluster-data/blob/master/clusterdata_trace_format_v3.proto)

One of the main quirks of the traces is that fields having a "zero" value
(i.e. a value like 0 or the empty string) are often omitted from the JSON
records. When reading the traces in Apache Spark it is therefore necessary to
check for this possibility and to populate those zero fields when they are
omitted.
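
Building on the sketch above, this normalization can be done while parsing
each JSON line; the fields listed below are only illustrative examples of
fields that may be omitted:

```python
import json

# Parse a JSONL line and fill in fields that were omitted because they hold
# their "zero" value (0 or the empty string). The listed fields are examples.
def parse_event(line):
    record = json.loads(line)
    record.setdefault("priority", 0)
    record.setdefault("alloc_collection_id", 0)
    return record

events = instance_events.map(parse_event)
```
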
Most queries use only two or three fields of each trace record, while the
original records often consist of a couple of dozen fields. In order to save
memory during the query, a projection is often applied to the data by means of
a .map() operation over the entire trace set, performed using Spark's RDD API.
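
Continuing the sketch, such a projection could look as follows (the selected
field names are again illustrative):

```python
# Keep only the fields needed by the query (here: collection id, event type
# and timestamp), discarding the rest of each record to save memory.
projected = events.map(lambda r: (r.get("collection_id"),
                                  r.get("type"),
                                  r.get("time")))
```
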
Another operation that often needs to be performed prior to the Map-Reduce core
of each query is record filtering, which is usually motivated by the presence
of incomplete data (i.e. records containing fields whose values are unknown).
This filtering is performed using the .filter() operation of Spark's RDD API.
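
In the running sketch, the filtering step could be expressed as:

```python
# Discard incomplete records, i.e. those whose event type or timestamp is
# unknown, before the Map-Reduce core of the query.
filtered = projected.filter(lambda r: r[1] is not None and r[2] is not None)
```
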
The core of each query is often a groupBy() followed by a map() operation on
the aggregated data. The groupBy() clusters the set of all records into several
subsets of records that have something in common. Each of these subsets is then
reduced to a single record with a map() operation. The motivation behind this
computation is often to analyze the time series formed by the traces of each
program. This is implemented by groupBy()-ing records by program id, and then
map()-ing each per-program trace set, sorting its traces by time and computing
the desired property in the form of a record.
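
A sketch of this groupBy()/map() core is shown below; the per-program property
computed here (the number of recorded events) is only a stand-in for the actual
per-query logic:

```python
# Group events by program (collection) id, sort each per-program trace by
# time, and reduce it to a single summary record.
def summarize(trace):
    ordered = sorted(trace, key=lambda r: r[2])   # sort events by timestamp
    return (ordered[0][0], len(ordered))          # (collection id, # events)

per_program = (filtered
               .groupBy(lambda r: r[0])                 # key: collection id
               .map(lambda kv: summarize(list(kv[1]))))
```
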
Intermediate results are sometimes saved in Spark's Parquet format, so that
they can be computed once and reused by subsequent queries.
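
For instance, a per-program summary such as the one sketched above could be
written to Parquet once and read back by later queries (paths are
placeholders):

```python
# Save the intermediate result once...
per_program.toDF(["collection_id", "n_events"]) \
    .write.mode("overwrite") \
    .parquet("/path/to/intermediate/per_program.parquet")

# ...and reuse it later without re-reading the raw traces.
cached = spark.read.parquet("/path/to/intermediate/per_program.parquet")
```
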

## General query script design

## Ad-hoc presentation of some analysis scripts (with diagrams)

# Analysis and observations

## Overview of machine configurations in each cluster

\input{figures/machine_configs}

Refer to figure \ref{fig:machineconfigs}.

**Observations**:

- Machine configurations are considerably more varied than the ones in the 2011
  traces
- Some clusters show more machine configuration variability than others

## Analysis of execution time per each execution phase

\input{figures/machine_time_waste}

Refer to figures \ref{fig:machinetimewaste-abs} and
\ref{fig:machinetimewaste-rel}.

**Observations**:

- Across all clusters, almost 50% of the time is spent in "unknown"
  transitions, i.e. in time slices related to state transitions that Google
  does not consider "typical". This is mostly due to the trace log being
  intermittent in recording all state transitions.
- 80% of the time spent in KILL and LOST is unknown. This is predictable, since
  both states indicate that the job execution is not stable (in particular LOST
  is used when the state logging itself is unstable).
- From the absolute graph we see that the time "wasted" on jobs that do not
  terminate with FINISH is very significant.
- Execution is the most significant task phase, followed by queuing time and
  scheduling time (the "ready" state).
- In the absolute graph we see that a significant amount of time is spent
  re-scheduling evicted jobs (the "evicted" state).
- Cluster A has unusually high queuing times.

## Task slowdown

\input{figures/task_slowdown}

Refer to figure \ref{fig:taskslowdown}.

**Observations**:

- Priority values differ from the 0-11 values used in the 2011 traces. A
  conversion table is provided by Google.
- For some priorities (e.g. 101 for cluster D) the relative number of finishing
  tasks is very low and the mean slowdown is very high (315). This behaviour
  differs from the relatively homogeneous values of the 2011 traces.
- Some slowdown values cannot be computed, since either some tasks have a 0ns
  execution time or, for some priorities, no tasks in the traces terminate
  successfully. More raw data on these exceptions is available in the Jupyter
  notebooks.
- The % of finishing jobs is relatively low compared with the 2011 traces.

## Reserved and actual resource usage of tasks

\input{figures/spatial_resource_waste}

Refer to figures \ref{fig:spatialresourcewaste-actual} and
\ref{fig:spatialresourcewaste-requested}.

**Observations**:

- Most (measured and requested) resources are used by killed jobs, even more
  than in the 2011 traces.
- Behaviour is rather homogeneous across datacenters, with the exception of
  cluster G, where LOST-terminated tasks acquired 70% of both CPU and RAM.

## Correlation between task events' metadata and task termination

\input{figures/figure_7}

Refer to figures \ref{fig:figureVII-a}, \ref{fig:figureVII-b}, and
\ref{fig:figureVII-c}.

**Observations**:

- There are no smooth curves in this figure either, unlike in the 2011 traces.
- The behaviour of the curves in 7a (priority) is almost the opposite of 2011,
  i.e. in-between priorities have higher kill rates while priorities at the
  extrema have lower kill rates. This could also be due to the inherent
  distribution of job terminations.
- The event execution time curves are quite different from 2011: here there
  seems to be a good correlation between short task execution times and finish
  event rates, instead of the U-shaped curve found in the 2015 DSN paper.
- In figure \ref{fig:figureVII-b} cluster behaviour seems quite uniform.
- Machine concurrency seems to play little role in the event termination
  distribution, as the kill rate is at 90% for all concurrency factors.

## Correlation between task events' resource metadata and task termination

## Correlation between job events' metadata and job termination

\input{figures/figure_9}

Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
\ref{fig:figureIX-c}.

**Observations**:

- Behaviour varies a lot between clusters.
- There are no "smooth" gradients in the various curves, unlike in the 2011
  traces.
- Killed jobs have higher event rates in general, and overall dominate all
  event rate measures.
- There still seems to be a correlation between short job execution times and
  successful final termination, and likewise between kills and longer execution
  times.
- Across all clusters, a machine locality factor of 1 seems to lead to the
  highest success event rate.

## Mean number of tasks and event distribution per task type

\input{figures/table_iii}

Refer to figure \ref{fig:tableIII}.

**Observations**:

- The mean number of events per task is an order of magnitude higher than in
  the 2011 traces.
- Generally speaking, the event type with the highest mean is the termination
  event of the task.
- The "# evts" mean is higher than the sum of all the other event type means,
  since there appear to be many more non-termination events in the 2019 traces.

## Mean number of tasks and event distribution per job type

\input{figures/table_iv}

Refer to figure \ref{fig:tableIV}.

**Observations**:

- Again, the mean number of tasks is significantly higher than in the 2011
  traces, indicating a higher complexity of workloads.
- Cluster A has no evicted jobs.
- The number of events is, however, lower than the event means in the 2011
  traces.

## Probability of task successful termination given its unsuccessful events

\input{figures/figure_5}

Refer to figure \ref{fig:figureV}.

**Observations**:

- Behaviour is very different from cluster to cluster.
- Unlike in 2011, there is no easy conclusion on the correlation between the
  probability of success and the # of events of a specific type.
- Clusters B, C and D in particular have very jagged lines that vary a lot for
  small differences in # evts. This may be due to an uneven distribution of
  # evts in the traces.

## Potential causes of unsuccessful executions

# Implementation issues -- Analysis limitations

## Discussion on unknown fields

## Limitations on the computation resources required for the analysis

## Other limitations ...

# Conclusions and future work or possible developments
<!-- vim: set ts=2 sw=2 et tw=80: -->