---
documentclass: usiinfbachelorproject
title: Understanding and Comparing Unsuccessful Executions in Large Datacenters
author: Claudio Maggioni
pandoc-options:
  - --filter=pandoc-include
  - --latex-engine-opt=--shell-escape
  - --latex-engine-opt=--enable-write18
header-includes:
  - |
    ```{=latex}
    \usepackage{subcaption}
    \usepackage{booktabs}
    \usepackage{graphicx}

    \captionsetup{labelfont={bf}}
    %\subtitle{The (optional) subtitle}

    \versiondate{\today}

    \begin{committee}
    \advisor[Universit\`a della Svizzera Italiana,
    Switzerland]{Prof.}{Walter}{Binder}
    \assistant[Universit\`a della Svizzera Italiana,
    Switzerland]{Dr.}{Andrea}{Ros\'a}
    \end{committee}

    \abstract{The project aims at comparing two different traces coming from
    large datacenters, focusing in particular on unsuccessful executions of
    jobs and tasks submitted by users. The objective of this project is to
    compare the resource waste caused by unsuccessful executions, their impact
    on application performance, and their root causes. We will show their
    strong negative impact on CPU and RAM usage and on task slowdown. We will
    analyze patterns of unsuccessful jobs and tasks, particularly focusing on
    their interdependency. Moreover, we will uncover their root causes by
    inspecting key workload and system attributes such as machine locality and
    concurrency level.}
    ```
---

# Introduction (including Motivation)

# State of the Art

- Introduce the Ros\'a et al. 2015 DSN paper and its analysis
- Describe the Google Borg clusters
- Describe the contents of the traces
- Differences between the 2011 and 2019 traces

# Project requirements and analysis

(describe our objective with this analysis in detail)

# Analysis methodology

## Technical overview of the traces' file format and schema

## Overview of challenging aspects of the analysis (data size, schema, available computational resources)

## Introduction to Apache Spark

## General description of the Apache Spark analysis workflow

The analysis of the Google 2019 Borg cluster traces was conducted using Apache
Spark and its Python 3 API (pyspark). Spark was used to execute a series of
queries performing various sums and aggregations over the entire dataset
provided by Google.
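
As a minimal sketch (assuming a standard pyspark installation; the application
name and the memory setting are illustrative placeholders), a query script
starts by obtaining a Spark session and, through it, the RDD API used below:

```python
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session; the application name and executor memory
# are illustrative and depend on the machine running the analysis.
spark = (SparkSession.builder
         .appName("borg-trace-analysis")
         .config("spark.executor.memory", "8g")
         .getOrCreate())

# The low-level RDD API used by most queries is reached through the context.
sc = spark.sparkContext
```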

Each query follows a general Map-Reduce template: the traces are first read
and parsed, then filtered and transformed by performing selections,
projections, and the computation of new derived fields. The trace records are
then often grouped by one of their fields, clustering related data together
before a reduce or fold operation is applied to each group.
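
The sketch below illustrates this template on a single JSONL trace file: the
records are parsed, incomplete ones are discarded, the fields of interest are
projected, and the result is grouped and reduced. The file name, the field
names (`collection_id`, `time`, `type`) and the final aggregation (an event
count per collection) are illustrative stand-ins, not the exact queries used
in the analysis.

```python
import json

# Read the raw JSONL trace: one JSON object per line.
raw = sc.textFile("instance_events-000000000000.json.gz")

records = (raw
           .map(json.loads)                           # parse
           .filter(lambda r: "collection_id" in r)    # drop incomplete records
           .map(lambda r: (r["collection_id"],        # project to needed fields
                           (r.get("time", 0), r.get("type", 0)))))

# Group related records together and reduce each group to a single value,
# here simply counting the events observed per collection.
events_per_collection = (records
                         .groupByKey()
                         .mapValues(lambda evs: len(list(evs))))

print(events_per_collection.take(5))
```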

Most input data is in JSONL format and adheres to a schema Google provided in
the form of a protocol buffer specification[^1].

[^1]: [Google 2019 Borg traces protocol buffer specification on GitHub](
https://github.com/google/cluster-data/blob/master/clusterdata_trace_format_v3.proto)

One of the main quirks of the traces is that fields with a "zero" value (i.e.
a value such as 0 or the empty string) are often omitted from the JSON object
records. When reading the traces in Apache Spark it is therefore necessary to
check for this possibility and populate those zero fields when they are
omitted.
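
A minimal sketch of this defaulting step is shown below; the selection of
fields and their "zero" values is illustrative, not the complete list used in
the analysis.

```python
import json

# Fields that the traces omit when their value is "zero", together with the
# default to substitute in that case (illustrative selection of fields).
ZERO_DEFAULTS = {"time": 0, "type": 0, "priority": 0, "user": ""}

def with_defaults(record):
    """Return a copy of the record with omitted zero-valued fields filled in."""
    return {**ZERO_DEFAULTS, **record}

events = (sc.textFile("collection_events-000000000000.json.gz")
            .map(json.loads)
            .map(with_defaults))
```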

Most queries use only two or three fields of each trace record, while the
original records often consist of a couple of dozen fields. In order to save
memory during the query, a projection is often applied to the data by means of
a .map() operation over the entire trace set, performed using Spark's RDD API.
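
Continuing from the parsed `events` RDD above, such a projection can be
sketched as follows (the retained field names are illustrative):

```python
# Keep only the two or three fields the query actually needs, discarding the
# rest of each record to reduce memory usage during the query.
projected = events.map(lambda r: {
    "collection_id": r.get("collection_id"),
    "time": r["time"],
    "type": r["type"],
})
```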

Another operation that often needs to be performed before the Map-Reduce core
of each query is a record filtering step, which is usually motivated by the
presence of incomplete data (i.e. records containing fields whose values are
unknown). This filtering is performed using the .filter() operation of Spark's
RDD API.
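
Again continuing from the projected records, this filtering step can be
sketched as follows (the completeness check on `collection_id` is only an
illustrative example):

```python
# Discard incomplete records before the Map-Reduce core of the query,
# here those whose collection identifier is unknown.
complete = projected.filter(lambda r: r.get("collection_id") is not None)
```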

The core of each query is often a groupBy followed by a map() operation on the
aggregated data. The groupBy partitions the set of all records into several
subsets of records that have something in common. Each of these small groups
is then reduced to a single record with a .map() operation. The motivation
behind this computation is often to analyze a time series made of several
different traces of programs. This is implemented by groupBy()-ing the records
by program id, and then map()-ing each program's trace set by sorting its
traces by time and computing the desired property in the form of a record.
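
A sketch of this grouping pattern is shown below; the per-group "property"
computed here (the time span between the first and last event of each
collection) is only an example of the kind of derived record a query may
produce.

```python
def time_span(evts):
    """Sort a collection's events by time and compute an example property:
    the interval between its first and last recorded event."""
    ordered = sorted(evts, key=lambda r: r["time"])
    return ordered[-1]["time"] - ordered[0]["time"]

spans = (complete
         .groupBy(lambda r: r["collection_id"])   # one group per program/collection
         .mapValues(time_span))                   # reduce each group to one record
```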

Intermediate results are sometimes persisted in the Parquet columnar format
supported by Spark, so that they can be computed once beforehand and reused by
later queries.
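
A sketch of this caching step, with an illustrative output path, is shown
below; the RDD result is first converted to a DataFrame so it can be written
with Spark's Parquet writer.

```python
from pyspark.sql import Row

# Persist an intermediate result as Parquet so later queries can reload it
# without recomputing it from the raw traces.
(spark.createDataFrame(spans.map(lambda kv: Row(collection_id=kv[0],
                                                time_span=kv[1])))
      .write.mode("overwrite")
      .parquet("intermediate/collection_time_spans.parquet"))

# Later queries then start from the cached result:
cached = spark.read.parquet("intermediate/collection_time_spans.parquet")
```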

## General query script design

## Ad-hoc presentation of some analysis scripts (with diagrams)

# Analysis (with observations)

## machine_configs

\input{figures/machine_configs}

Refer to figure \ref{fig:machineconfigs}.

**Observations**:

- machine configurations are definitely more varied than the ones in the 2011
  traces
- some clusters have more machine variability than others

## machine_time_waste

\input{figures/machine_time_waste}

Refer to figures \ref{fig:machinetimewaste-abs} and
\ref{fig:machinetimewaste-rel}.

**Observations**:

- Across all clusters, almost 50% of the time is spent in "unknown"
  transitions, i.e. time slices related to state transitions that Google does
  not consider "typical". This is mostly due to the trace log being
  intermittent when recording all state transitions.
- 80% of the time spent in KILL and LOST is unknown. This is predictable,
  since both states indicate that the job execution is not stable (in
  particular, LOST is used when the state logging itself is unstable)
- From the absolute graph we see that the time "wasted" on jobs terminated
  with a non-FINISH event is very significant
- Execution is the most significant task phase, followed by queuing time and
  scheduling time (the "ready" state)
- In the absolute graph we see that a significant amount of time is spent
  re-scheduling evicted jobs (the "evicted" state)
- Cluster A has unusually high queuing times

## task_slowdown

\input{figures/task_slowdown}

Refer to figure \ref{fig:taskslowdown}.

**Observations**:

- Priority values differ from the 0-11 values used in the 2011 traces. A
  conversion table is provided by Google;
- For some priorities (e.g. 101 for cluster D) the relative number of
  finishing tasks is very low and the mean slowdown is very high (315). This
  behaviour differs from the relatively homogeneous values in the 2011 traces.
- Some slowdown values cannot be computed, since either some tasks have a 0ns
  execution time or, for some priorities, no tasks in the traces terminate
  successfully. More raw data on these exceptions is available in the Jupyter
  notebooks.
- The percentage of finishing jobs is relatively low compared with the 2011
  traces.

## spatial_resource_waste

\input{figures/spatial_resource_waste}

Refer to figures \ref{fig:spatialresourcewaste-actual} and
\ref{fig:spatialresourcewaste-requested}.

**Observations**:

- Most (measured and requested) resources are used by killed jobs, even more
  than in the 2011 traces.
- Behaviour is rather homogeneous across datacenters, with the exception of
  cluster G, where a large number of LOST-terminated tasks acquired 70% of
  both CPU and RAM

## figure_7

\input{figures/figure_7}

Refer to figures \ref{fig:figureVII-a}, \ref{fig:figureVII-b}, and
\ref{fig:figureVII-c}.

**Observations**:

- No smooth curves in this figure either, unlike the 2011 traces
- The behaviour of the curves for 7a (priority) is almost the opposite of
  2011, i.e. in-between priorities have higher kill rates while priorities at
  the extremes have lower kill rates. This could also be due to the inherent
  distribution of job terminations;
- Event execution time curves are quite different from 2011: here there seems
  to be a good correlation between short task execution times and finish event
  rates, instead of the U-shaped curve in the 2015 DSN paper
- In figure \ref{fig:figureVII-b} cluster behaviour seems quite uniform
- Machine concurrency seems to play little role in the event termination
  distribution, as the kill rate is at 90% for all concurrency factors.

## figure_8

## figure_9

\input{figures/figure_9}

Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
\ref{fig:figureIX-c}.

**Observations**:

- Behaviour varies a lot between clusters
- There are no "smooth" gradients in the various curves, unlike in the 2011
  traces
- Killed jobs have higher event rates in general, and overall dominate all
  event rate measures
- There still seems to be a correlation between short job execution times and
  successful final termination, and likewise between kills and longer job
  execution times
- Across all clusters, a machine locality factor of 1 seems to lead to the
  highest success event rate

## table_iii, table_iv, figure_v

## Potential causes of unsuccessful executions

# Implementation issues -- Analysis limitations

## Discussion on unknown fields

## Limitations on the computational resources required for the analysis

## Other limitations ...

# Conclusions and future work or possible developments

<!-- vim: set ts=2 sw=2 et tw=80: -->