| documentclass | title | author | pandoc-options | header-includes |
|---|---|---|---|---|
| usiinfbachelorproject | Understanding and Comparing Unsuccessful Executions in Large Datacenters | Claudio Maggioni | | |
Introduction (including Motivation)
State of the Art
- Introduce Rosà's 2015 DSN paper on analysis
- Describe Google Borg clusters
- Describe Traces contents
- Differences between 2011 and 2019 traces
Project requirements and analysis
(describe our objective with this analysis in detail)
Analysis methodology
Technical overview of traces' file format and schema
Overview of challenging aspects of the analysis (data size, schema, available computation resources)
Introduction to Apache Spark
General description of the Apache Spark workflow
The analysis of the Google 2019 Borg cluster traces was conducted using Apache Spark and its Python 3 API (pyspark). Spark was used to execute a series of queries performing various sums and aggregations over the entire dataset provided by Google.
Each query follows a general Map-Reduce template: traces are first read, parsed, and filtered by performing selections, projections, and the computation of new derived fields. The trace records are then often grouped by one of their fields, clustering related data together, before a reduce or fold operation is applied to each group.
Most input data is in JSONL format and adheres to a schema Google provided in the form of a protocol buffer specification.
One of the main quirks in the traces is that fields with a "zero" value (i.e. a value like 0 or the empty string) are often omitted from the JSON object records. When reading the traces in Apache Spark, it is therefore necessary to check for this possibility and populate those fields with their zero values when they are omitted.
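A minimal sketch of this zero-field handling is shown below; the field names, default values, and file path are illustrative placeholders, not the actual trace schema or paths used by the analysis scripts.

```python
import json
from pyspark.sql import SparkSession

# Illustrative defaults: the real scripts cover the full trace schema.
DEFAULTS = {"time": 0, "type": 0, "collection_id": 0, "priority": 0}

spark = SparkSession.builder.appName("borg-trace-read").getOrCreate()
sc = spark.sparkContext

def parse_record(line):
    rec = json.loads(line)
    # Google omits fields whose value is "zero" (0 or the empty string),
    # so missing keys are re-populated with a default value here.
    for field, default in DEFAULTS.items():
        rec.setdefault(field, default)
    return rec

# "instance_events-*.json.gz" is a placeholder for the actual trace files.
records = sc.textFile("instance_events-*.json.gz").map(parse_record)
```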
Most queries use only two or three fields of each trace record, while the original records often consist of a couple of dozen fields. In order to save memory during the query, a projection is often applied to the data by means of a .map() operation over the entire trace set, performed using Spark's RDD API.
Another operation that often needs to be performed before the Map-Reduce core of each query is record filtering, which is usually motivated by the presence of incomplete data (i.e. records containing fields whose values are unknown). This filtering is performed using the .filter() operation of Spark's RDD API.
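The following sketch illustrates the projection and filtering steps, continuing from the `records` RDD above; the selected fields and the filtering condition are assumptions chosen for illustration.

```python
# Projection: keep only the fields the query actually needs.
slim = records.map(lambda r: (r["collection_id"], r["time"], r["type"]))

# Filtering: drop records with an unknown timestamp
# (missing timestamps were defaulted to 0 during parsing).
valid = slim.filter(lambda r: r[1] is not None and r[1] > 0)
```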
The core of each query is often a groupBy() followed by a map() operation on the aggregated data. The groupBy() partitions the set of all records into several subsets of records that have something in common. Each of these groups is then reduced with a .map() operation to a single record. The motivation behind this computation is often to analyze the time series formed by the traces of a single program. This is implemented by groupBy()-ing records by program id and then map()-ing each program's trace set, sorting its traces by time and computing the desired property in the form of a record.
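A sketch of this groupBy/map core is shown below, again assuming the illustrative tuples from the previous snippets; the `summarize` helper and the derived fields are hypothetical stand-ins for the per-query computation.

```python
def summarize(events):
    # Sort each program's events by time and derive a summary record.
    events = sorted(events, key=lambda e: e[1])
    first, last = events[0], events[-1]
    return {"collection_id": first[0],
            "span": last[1] - first[1],   # time between first and last event
            "n_events": len(events)}

per_job = (valid
           .groupBy(lambda e: e[0])                   # group by program id
           .map(lambda kv: summarize(list(kv[1]))))   # reduce each group
```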
Intermediate results are sometimes saved in Spark's Parquet format, so that they can be computed once and reused by later queries.
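As a sketch of this caching step, the intermediate RDD above could be written to and read back from Parquet as follows; column names and paths are illustrative assumptions.

```python
# Convert the per-job records to a DataFrame and persist them as Parquet.
df = per_job.map(lambda r: (r["collection_id"], r["span"], r["n_events"])) \
            .toDF(["collection_id", "span", "n_events"])
df.write.mode("overwrite").parquet("intermediate/per_job.parquet")

# Later queries can reload the cached result instead of re-scanning the traces.
per_job_df = spark.read.parquet("intermediate/per_job.parquet")
```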
General Query script design
Ad-hoc presentation of some analysis scripts (with diagrams)
Analysis (with observations)
machine_configs
\input{figures/machine_configs}
Refer to figure \ref{fig:machineconfigs}.
Observations:
- machine configurations are definitely more varied than the ones in the 2011 traces
- some clusters show more machine configuration variability than others
machine_time_waste
\input{figures/machine_time_waste}
Refer to figures \ref{fig:machinetimewaste-abs} and \ref{fig:machinetimewaste-rel}.
Observations:
- Across all clusters, almost 50% of the time is spent in "unknown" transitions, i.e. time slices associated with state transitions that Google does not list as "typical". This is mostly due to the trace log being intermittent in recording state transitions.
- 80% of the time spent in the KILL and LOST states is unknown. This is expected, since both states indicate that the job execution is not stable (in particular, LOST is used when the state logging itself is unstable)
- From the absolute graph we see that the time "wasted" on jobs that terminate without finishing is very significant
- Execution is the most significant task phase, followed by queuing time and scheduling time ("ready" state)
- In the absolute graph we see that a significant amount of time is spent to re-schedule evicted jobs ("evicted" state)
- Cluster A has unusually high queuing times
task_slowdown
\input{figures/task_slowdown}
Refer to figure \ref{fig:taskslowdown}
Observations:
- Priority values differ from the 0-11 range used in the 2011 traces. A conversion table is provided by Google
- For some priorities (e.g. 101 for cluster D) the relative number of finishing tasks is very low and the mean slowdown is very high (315). This behaviour differs from the relatively homogeneous values of the 2011 traces.
- Some slowdown values cannot be computed, since either some tasks have a 0 ns execution time or, for some priorities, no tasks in the traces terminate successfully. More raw data on these exceptions is available in the Jupyter notebooks.
- The percentage of finishing jobs is relatively low compared with the 2011 traces.
spatial_resource_waste
\input{figures/spatial_resource_waste}
Refer to figures \ref{fig:spatialresourcewaste-actual} and \ref{fig:spatialresourcewaste-requested}.
Observations:
- Most (measured and requested) resources are used by killed jobs, even more so than in the 2011 traces.
- Behaviour is rather homogeneous across datacenters, with the exception of cluster G, where LOST-terminated tasks acquired 70% of both CPU and RAM
figure_7
\input{figures/figure_7}
Refer to figures \ref{fig:figureVII-a}, \ref{fig:figureVII-b}, and \ref{fig:figureVII-c}.
Observations:
- No smooth curves in this figure either, unlike in the 2011 traces
- The behaviour of the curves for 7a (priority) is almost the opposite of 2011: in-between priorities have higher kill rates while priorities at the extrema have lower kill rates. This could also be due to the inherent distribution of job terminations
- The event execution time curves are quite different from the 2011 ones: here there seems to be a good correlation between short task execution times and finish event rates, instead of the U-shaped curve observed in the 2015 DSN paper
- In figure \ref{fig:figureVII-b} cluster behaviour seems quite uniform
- Machine concurrency seems to play little role in the event termination distribution, since for all concurrency factors the kill rate is at 90%.
figure_8
figure_9
\input{figures/figure_9}
Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and \ref{fig:figureIX-c}.
Observations:
- Behaviour varies a lot between clusters
- There are no "smooth" gradients in the various curves unlike in the 2011 traces
- Killed jobs have higher event rates in general, and overall dominate all event rate measures
- There still seems to be a correlation between short job execution times and successful final termination, and a similar correlation between kill events and longer execution times
- Across all clusters, a machine locality factor of 1 seems to lead to the highest success event rate