2021-05-17 14:27:17 +00:00
|
|
|
\documentclass{usiinfbachelorproject}
|
2021-05-20 15:30:06 +00:00
|
|
|
\usepackage{multirow}
|
2021-05-17 16:50:25 +00:00
|
|
|
\usepackage{enumitem}
|
2021-05-18 14:29:54 +00:00
|
|
|
\usepackage{fontawesome5}
|
2021-05-20 10:33:36 +00:00
|
|
|
\usepackage{pgf}
|
2021-05-18 14:29:54 +00:00
|
|
|
\usepackage{tikz}
|
|
|
|
\usetikzlibrary{fit,arrows,calc,positioning}
|
2021-05-17 16:50:25 +00:00
|
|
|
\usepackage{parskip}
|
2021-05-19 13:33:10 +00:00
|
|
|
%\usepackage[printfigures]{figcaps} % figures at the end of the file
|
2021-05-17 15:12:46 +00:00
|
|
|
\usepackage{xcolor}
|
2021-05-17 14:27:17 +00:00
|
|
|
\usepackage{amsmath}
|
|
|
|
\usepackage{subcaption}
|
|
|
|
\usepackage{graphicx}
|
2021-05-20 10:33:36 +00:00
|
|
|
\usepackage[backend=biber,style=numeric,citestyle=ieee]{biblatex}
|
|
|
|
\usepackage{booktabs}
|
|
|
|
\usepackage{pgfplots}
|
2021-05-20 15:30:06 +00:00
|
|
|
\usepackage{array}
|
|
|
|
\newcolumntype{C}[1]{>{\centering\arraybackslash}p{#1}}
|
2021-05-20 10:33:36 +00:00
|
|
|
|
|
|
|
\usepgfplotslibrary{external}
|
|
|
|
|
2021-05-19 13:33:10 +00:00
|
|
|
\addbibresource{references.bib}
|
2021-05-17 14:27:17 +00:00
|
|
|
|
2021-05-20 10:33:36 +00:00
|
|
|
\setlength{\parskip}{5pt}
|
|
|
|
\setlength{\parindent}{0pt}
|
|
|
|
|
2021-05-17 14:27:17 +00:00
|
|
|
\captionsetup{labelfont={bf}}
|
|
|
|
|
2021-05-19 13:33:10 +00:00
|
|
|
\title{Understanding and Comparing Unsuccessful Executions in Large Datacenters}
|
|
|
|
%\subtitle{The (optional) subtitle}
|
|
|
|
\author{Claudio Maggioni}
|
2021-05-17 14:27:17 +00:00
|
|
|
\versiondate{\today}
|
|
|
|
|
|
|
|
\begin{committee}
|
|
|
|
\advisor[Universit\`a della Svizzera Italiana,
|
|
|
|
Switzerland]{Prof.}{Walter}{Binder}
|
|
|
|
\assistant[Universit\`a della Svizzera Italiana,
|
|
|
|
Switzerland]{Dr.}{Andrea}{Ros\'a}
|
|
|
|
\end{committee}
|
|
|
|
|
|
|
|
\abstract{The project aims at comparing two different traces coming from large
|
|
|
|
datacenters, focusing in particular on unsuccessful executions of jobs and
|
|
|
|
tasks submitted by users. The objective of this project is to compare the
|
|
|
|
resource waste caused by unsuccessful executions, their impact on application
|
|
|
|
performance, and their root causes. We will show the strong negative impact on
|
|
|
|
CPU and RAM usage and on task slowdown. We will analyze patterns of
|
|
|
|
unsuccessful jobs and tasks, particularly focusing on their interdependency.
|
|
|
|
Moreover, we will uncover their root causes by inspecting key workload and
|
|
|
|
system attributes such asmachine locality and concurrency level.}
|
|
|
|
|
|
|
|
\begin{document}
|
2021-05-17 15:12:46 +00:00
|
|
|
\maketitle
|
2021-05-17 14:27:17 +00:00
|
|
|
\tableofcontents
|
|
|
|
\newpage
|
|
|
|
|
2021-05-20 10:33:36 +00:00
|
|
|
\section{Introduction} In today's world there is an ever growing demand for
|
|
|
|
efficient, large scale computations. The rising trend of ``big data'' put the
|
|
|
|
need for efficient management of large scaled parallelized computing at an all
|
|
|
|
time high. This fact also increases the demand for research in the field of
|
|
|
|
distributed systems, in particular in how to schedule computations effectively,
|
|
|
|
avoid wasting resources and avoid failures.
|
2021-05-19 12:06:38 +00:00
|
|
|
|
2021-05-20 10:33:36 +00:00
|
|
|
In 2011 Google released a month long data trace of their own cluster management
|
|
|
|
system\cite{google-marso-11} \textit{Borg}, containing a lot of data regarding
|
2021-05-19 13:33:10 +00:00
|
|
|
scheduling, priority management, and failures of a real production workload.
|
2021-05-20 10:33:36 +00:00
|
|
|
This data was 2009
|
2021-05-19 13:33:10 +00:00
|
|
|
This data was the foundation of the 2015 Ros\'a et al.\ paper
|
|
|
|
\textit{Understanding the Dark Side of Big Data Clusters: An Analysis beyond
|
|
|
|
Failures}\cite{vino-paper}, which in its many conclusions highlighted the need
|
|
|
|
for better cluster management highlighting the high amount of failures found in
|
|
|
|
the traces.
|
|
|
|
|
|
|
|
In 2019 Google released an updated version of the \textit{Borg} cluster
|
2021-05-20 10:33:36 +00:00
|
|
|
traces\cite{google-marso-19}, not only containing data from a far bigger
|
|
|
|
workload due to improvements in computational technology, but also providing
|
|
|
|
data from 8 different \textit{Borg} cells from datacenters located all over the
|
|
|
|
world. These new traces are therefore about 100 times larger than the old
|
|
|
|
traces, weighing in terms of storage spaces approximately 8TiB (when compressed
|
|
|
|
and stored in JSONL format)\cite{google-drive-marso}, requiring a considerable
|
|
|
|
amount of computational power to analyze them and the implementation of special
|
|
|
|
data engineering techniques for analysis of the data.
|
2021-05-19 12:06:38 +00:00
|
|
|
|
|
|
|
This project aims to repeat the analysis performed in 2015 to highlight
|
|
|
|
similarities and differences in workload this decade brought, and expanding the
|
|
|
|
old analysis to understand even better the causes of failures and how to prevent
|
|
|
|
them. Additionally, this report will provide an overview on the data engineering
|
2021-05-20 10:33:36 +00:00
|
|
|
techniques used to perform the queries and analyses on the 2019 traces.
|
2021-05-17 14:27:17 +00:00
|
|
|
|
2021-05-19 16:55:09 +00:00
|
|
|
\section{Background information}
|
2021-05-17 14:27:17 +00:00
|
|
|
|
2021-05-20 10:33:36 +00:00
|
|
|
\textit{Borg} is Google's own cluster management software able to run
|
|
|
|
thousands of different jobs. Among the various cluster management services it
|
|
|
|
provides, the main ones are: job queuing, scheduling, allocation, and
|
|
|
|
deallocation due to higher priority computations.
|
|
|
|
|
|
|
|
The core structure of Borg is a cell, a set of
|
|
|
|
machines usually all within the same cluster, whose work is allocated by the
|
|
|
|
same cluster-management system and hence a cell is handled as a unit. Each
|
|
|
|
cell may run large computational workload that is submitted to Borg. Such
|
|
|
|
workload is called ``job'', which outlines the computations that a user wants
|
|
|
|
to run and is made up of several ``tasks''. A task is an executable program,
|
|
|
|
consisting of multiple processes, which has to be run on a single machine.
|
|
|
|
Those tasks may be ran sequentially or in parallel, and the condition for a
|
|
|
|
job's successful termination is nontrivial.
|
|
|
|
% Both tasks and jobs lifecyles are represented by several events, which are
|
|
|
|
% encoded and stored in the trace as rows of various tables. Among the
|
|
|
|
% information events provide, the field ``type'' provides information on the
|
|
|
|
% execution status of the job or task. This field can have several values,
|
|
|
|
% which are illustrated in figure~\ref{fig:eventtypes}.
|
|
|
|
|
|
|
|
\subsection{Traces}
|
|
|
|
|
|
|
|
The data relative to the events happening while Borg cell
|
|
|
|
processes the workload is then encoded and stored as rows of several tables that
|
|
|
|
make up a single usage trace. Such data comes from the information obtained by
|
|
|
|
the cell's management system and single machines that make up the cell. Each
|
|
|
|
table is identified by a key, usually a timestamp.
|
|
|
|
|
|
|
|
In general events can be of two kinds, there are events that are relative to the
|
|
|
|
status of the schedule, and there are other events that are relative to the
|
|
|
|
status of a task itself.
|
|
|
|
|
|
|
|
% \subsection{Rosà et al.~2015 DSN paper}
|
|
|
|
|
|
|
|
In 2015, Dr.~Andrea Rosà, Lydia Y. Chen and Prof.~Walter Binder published a
|
|
|
|
research paper titled \textit{Understanding the Dark Side of Big Data Clusters:
|
|
|
|
An Analysis beyond Failures}\cite{vino-paper} in which they performed several
|
|
|
|
analysis on unsuccessful executions in the Google's 2011 Borg cluster traces
|
|
|
|
with the aim of identifying their resource waste, their impacts on the
|
|
|
|
performance of the application, and any causes that may lie behind such
|
|
|
|
failures. The salient conclusion of that research is that actually lots of
|
|
|
|
computations performed by Google would eventually end in failure, then leading
|
|
|
|
to large amounts of computational power being wasted.
|
2021-05-19 13:33:10 +00:00
|
|
|
|
|
|
|
\begin{figure}[h]
|
2021-05-17 16:50:25 +00:00
|
|
|
\begin{center}
|
|
|
|
\begin{tabular}{p{3cm}p{12cm}}
|
|
|
|
\toprule
|
|
|
|
\textbf{Type code} & \textbf{Description} \\
|
|
|
|
\midrule
|
2021-05-20 10:33:36 +00:00
|
|
|
% SUGGERIMENTO, NON CANCELLARE MAI, A MENO CHE NON SONO COSE COMPLETAMENTE
|
|
|
|
% INUTILI, IN MOLTI CASI VA BENE COMMENTARE, INTANTO NON INFLUISCONO CON LA
|
|
|
|
% COMPILAZIONE.
|
|
|
|
% \texttt{QUEUE} & The job or task was marked not eligible for scheduling
|
|
|
|
% by Borg's scheduler, and thus Borg will move the job/task in a long
|
|
|
|
% wait queue\\
|
|
|
|
% \texttt{SUBMIT} & The job or task was submitted to Borg for execution\\
|
|
|
|
% \texttt{ENABLE} & The job or task became eligible for scheduling\\
|
|
|
|
% \texttt{SCHEDULE} & The job or task's execution started\\
|
2021-05-17 16:50:25 +00:00
|
|
|
\texttt{EVICT} & The job or task was terminated in order to free
|
|
|
|
computational resources for an higher priority job\\
|
|
|
|
\texttt{FAIL} & The job or task terminated its execution unsuccesfully
|
|
|
|
due to a failure\\
|
|
|
|
\texttt{FINISH} & The job or task terminated succesfully\\
|
|
|
|
\texttt{KILL} & The job or task terminated its execution because of a
|
|
|
|
manual request to stop it\\
|
2021-05-20 10:33:36 +00:00
|
|
|
% \texttt{LOST} & It is assumed a job or task is has been terminated, but
|
|
|
|
% due to missing data there is insufficent information to identify when
|
|
|
|
% or how\\
|
|
|
|
% \texttt{UPDATE\_PENDING} & The metadata (scheduling class, resource
|
|
|
|
% requirements, \ldots) of the job/task was updated while the job was
|
|
|
|
% waiting to be scheduled\\
|
|
|
|
% \texttt{UPDATE\_RUNNING} & The metadata (scheduling class, resource
|
|
|
|
% requirements, \ldots) of the job/task was updated while the job was in
|
|
|
|
% execution\\
|
2021-05-17 16:50:25 +00:00
|
|
|
\bottomrule
|
|
|
|
\end{tabular}
|
|
|
|
\end{center}
|
2021-05-19 13:33:10 +00:00
|
|
|
\caption{Overview of job and task event types.}\label{fig:eventtypes}
|
|
|
|
\end{figure}
|
2021-05-17 14:27:17 +00:00
|
|
|
|
2021-05-17 15:12:46 +00:00
|
|
|
Figure~\ref{fig:eventTypes} shows the expected transitions between event
|
2021-05-17 14:27:17 +00:00
|
|
|
types.
|
|
|
|
|
2021-05-18 15:37:42 +00:00
|
|
|
\begin{figure}[h]
|
2021-05-17 14:27:17 +00:00
|
|
|
\centering
|
2021-05-17 15:12:46 +00:00
|
|
|
\resizebox{\textwidth}{!}{%
|
|
|
|
\includegraphics{./figures/event_types.png}}
|
2021-05-17 14:27:17 +00:00
|
|
|
\caption{Typical transitions between task/job event types according to
|
2021-05-17 15:12:46 +00:00
|
|
|
Google\label{fig:eventTypes}}
|
2021-05-17 14:27:17 +00:00
|
|
|
\end{figure}
|
|
|
|
|
|
|
|
\hypertarget{traces-contents}{%
|
|
|
|
\subsection{Traces contents}\label{traces-contents}}
|
|
|
|
|
|
|
|
The traces provided by Google contain mainly a collection of job and
|
|
|
|
task events spanning a month of execution of the 8 different clusters.
|
|
|
|
In addition to this data, some additional data on the machines'
|
|
|
|
configuration in terms of resources (i.e.~amount of CPU and RAM) and
|
|
|
|
additional machine-related metadata.
|
|
|
|
|
|
|
|
Due to Google's policy, most identification related data (like job/task
|
|
|
|
IDs, raw resource amounts and other text values) were obfuscated prior
|
|
|
|
to the release of the traces. One obfuscation that is noteworthy in the
|
|
|
|
scope of this thesis is related to CPU and RAM amounts, which are
|
|
|
|
expressed respetively in NCUs (\emph{Normalized Compute Units}) and NMUs
|
|
|
|
(\emph{Normalized Memory Units}).
|
|
|
|
|
|
|
|
NCUs and NMUs are defined based on the raw machine resource
|
|
|
|
distributions of the machines within the 8 clusters. A machine having 1
|
|
|
|
NCU CPU power and 1 NMU memory size has the maximum amount of raw CPU
|
|
|
|
power and raw RAM size found in the clusters. While RAM size is measured
|
|
|
|
in bytes for normalization purposes, CPU power was measured in GCU
|
|
|
|
(\emph{Google Compute Units}), a proprietary CPU power measurement unit
|
|
|
|
used by Google that combines several parameters like number of
|
|
|
|
processors and cores, clock frequency, and architecture (i.e.~ISA).
|
|
|
|
|
|
|
|
\hypertarget{overview-of-traces-format}{%
|
|
|
|
\subsection{Overview of traces'
|
|
|
|
format}\label{overview-of-traces-format}}
|
|
|
|
|
|
|
|
The traces have a collective size of approximately 8TiB and are stored
|
|
|
|
in a Gzip-compressed JSONL (JSON lines) format, which means that each
|
|
|
|
table is represented by a single logical ``file'' (stored in several
|
|
|
|
file segments) where each carriage return separated line represents a
|
|
|
|
single record for that table.
|
|
|
|
|
|
|
|
There are namely 5 different table ``files'':
|
2021-05-17 16:50:25 +00:00
|
|
|
\begin{description}
|
|
|
|
\item[\texttt{machine\_configs},] which is a table containing each physical
|
2021-05-17 14:27:17 +00:00
|
|
|
machine's configuration and its evolution over time;
|
2021-05-17 16:50:25 +00:00
|
|
|
\item[\texttt{instance\_events},] which is a table of task events;
|
|
|
|
\item[\texttt{collection\_events},] which is a table of job events;
|
|
|
|
\item[\texttt{machine\_attributes},] which is a table containing (obfuscated)
|
2021-05-17 14:27:17 +00:00
|
|
|
metadata about each physical machine and its evolution over time;
|
2021-05-17 16:50:25 +00:00
|
|
|
\item[\texttt{instance\_usage},] which contains resource (CPU/RAM) measures
|
2021-05-17 14:27:17 +00:00
|
|
|
of jobs and tasks running on the single machines.
|
2021-05-17 16:50:25 +00:00
|
|
|
\end{description}
|
2021-05-17 14:27:17 +00:00
|
|
|
|
|
|
|
The scope of this thesis focuses on the tables
|
|
|
|
\texttt{machine\_configs}, \texttt{instance\_events} and
|
|
|
|
\texttt{collection\_events}.
|
|
|
|
|
|
|
|
\hypertarget{remark-on-traces-size}{%
|
|
|
|
\subsection{Remark on traces size}\label{remark-on-traces-size}}
|
|
|
|
|
|
|
|
While the 2011 Google Borg traces were relatively small, with a total
|
|
|
|
size in the order of the tens of gigabytes, the 2019 traces are quite
|
|
|
|
challenging to analyze due to their sheer size. As stated before, the
|
|
|
|
traces have a total size of 8 TiB when stored in the format provided by
|
|
|
|
Google. Even when broken down to table ``files'', unitary sizes still
|
|
|
|
reach the single tebibyte mark (namely for \texttt{machine\_configs},
|
|
|
|
the largest table in the trace).
|
|
|
|
|
|
|
|
Due to this constraints, a careful data engineering based approach was
|
|
|
|
used when reproducing the 2015 DSN paper analysis. Bleeding edge data
|
|
|
|
science technologies like Apache Spark were used to achieve efficient
|
|
|
|
and parallelized computations. This approach is discussed with further
|
|
|
|
detail in the following section.
|
|
|
|
|
|
|
|
\hypertarget{project-requirements-and-analysis}{%
|
|
|
|
\section{Project requirements and
|
|
|
|
analysis}\label{project-requirements-and-analysis}}
|
|
|
|
|
|
|
|
\textbf{TBD} (describe our objective with this analysis in detail)
|
2021-05-20 10:33:36 +00:00
|
|
|
The aim of this thesis is to repeat the analysis performed in 2015 on the
|
|
|
|
dataset Google has released in 2019 in order to find similarities and
|
|
|
|
differences with the previous analysis, and ultimately find whether
|
|
|
|
computational power is indeed wasted in this new workload as well. The 2019 data
|
|
|
|
comes from 8 Borg cells spanning 8 different datacenters located in different
|
|
|
|
geographical positions, all focused on computational oriented workloads. The
|
|
|
|
data collection time span matches the entire month of May 2019.
|
|
|
|
|
2021-05-17 14:27:17 +00:00
|
|
|
|
|
|
|
\hypertarget{analysis-methodology}{%
|
|
|
|
\section{Analysis methodology}\label{analysis-methodology}}
|
|
|
|
|
2021-05-17 16:50:25 +00:00
|
|
|
Due to the inherent complexity in analyzing traces of this size, novel
|
|
|
|
bleeding-edge data engineering tecniques were adopted to performed the required
|
|
|
|
computations. We used the framework Apache Spark to perform efficient and
|
|
|
|
parallel Map-Reduce computations. In this section, we discuss the technical
|
|
|
|
details behind our approach.
|
2021-05-17 14:27:17 +00:00
|
|
|
|
|
|
|
\hypertarget{introduction-on-apache-spark}{%
|
|
|
|
\subsection{Introduction on Apache
|
|
|
|
Spark}\label{introduction-on-apache-spark}}
|
|
|
|
|
|
|
|
Apache Spark is a unified analytics engine for large-scale data
|
|
|
|
processing. In layman's terms, Spark is really useful to parallelize
|
|
|
|
computations in a fast and streamlined way.
|
|
|
|
|
|
|
|
In the scope of this thesis, Spark was used essentially as a Map-Reduce
|
|
|
|
framework for computing aggregated results on the various tables. Due to
|
|
|
|
the sharded nature of table ``files'', Spark is able to spawn a thread
|
|
|
|
per file and run computations using all processors on the server
|
|
|
|
machines used to run the analysis.
|
|
|
|
|
|
|
|
Spark is also quite powerful since it provides automated thread pooling
|
|
|
|
services, and it is able to efficiently store and cache intermediate
|
|
|
|
computation on secondary storage without any additional effort required
|
|
|
|
from the data engineer. This feature was especially useful due to the
|
|
|
|
sheer size of the analyzed data, since the computations required to
|
|
|
|
store up to 1TiB of intermediate data on disk.
|
|
|
|
|
|
|
|
The chosen programming language for writing analysis scripts was Python.
|
|
|
|
Spark has very powerful native Python bindings in the form of the
|
|
|
|
\emph{PySpark} API, which were used to implement the various queries.
|
|
|
|
|
2021-05-20 10:33:36 +00:00
|
|
|
|
2021-05-17 14:27:17 +00:00
|
|
|
\hypertarget{query-architecture}{%
|
|
|
|
\subsection{Query architecture}\label{query-architecture}}
|
|
|
|
|
2021-05-17 15:12:46 +00:00
|
|
|
\subsubsection{Overview}
|
|
|
|
|
|
|
|
In general, each query written to execute the analysis
|
|
|
|
follows a general Map-Reduce template.
|
|
|
|
|
|
|
|
Traces are first read, then parsed, and then filtered by performing selections,
|
|
|
|
projections and computing new derived fields. After this preparation phase, the
|
|
|
|
trace records are often passed through a \texttt{groupby()} operation, which by
|
|
|
|
choosing one or many record fields sorts all the records into several ``bins''
|
|
|
|
containing records with matching values for the selected fields. Then, a map
|
|
|
|
operation is applied to each bin in order to derive some aggregated property
|
|
|
|
value for each grouping. Finally, a reduce operation is applied to either
|
|
|
|
further aggregate those computed properties or to generate an aggregated data
|
|
|
|
structure for storage purposes.
|
|
|
|
|
|
|
|
\subsubsection{Parsing table files}
|
|
|
|
|
|
|
|
As stated before, table ``files'' are composed of several Gzip-compressed
|
|
|
|
shards of JSONL record data. The specification for the types and constraints
|
|
|
|
of each record is outlined by Google in the form of a protobuffer specification
|
|
|
|
file found in the trace release
|
2021-05-19 13:33:10 +00:00
|
|
|
package\cite{google-proto-marso}. This file was used as
|
2021-05-17 15:12:46 +00:00
|
|
|
the oracle specification and was a critical reference for writing the query
|
|
|
|
code that checks, parses and carefully sanitizes the various JSONL records
|
|
|
|
prior to actual computations.
|
|
|
|
|
|
|
|
The JSONL encoding of traces records is often performed with non-trivial rules
|
|
|
|
that required careful attention. One of these involved fields that have a
|
|
|
|
logically-wise ``zero'' value (i.e.~values like ``0'' or the empty string). For
|
|
|
|
these values the key-value pair in the JSON object is outright omitted. When
|
|
|
|
reading the traces in Apache Spark is therefore necessary to check for this
|
|
|
|
possibility and insert back the omitted record attributes.
|
|
|
|
|
|
|
|
\subsubsection{The queries}
|
|
|
|
|
|
|
|
Most queries use only two or three fields in each trace records, while the
|
|
|
|
original table records often are made of a couple of dozen fields. In order to save
|
|
|
|
memory during the query, a projection is often applied to the data by the means
|
|
|
|
of a \texttt{.map()} operation over the entire trace set, performed using
|
|
|
|
Spark's RDD API.
|
|
|
|
|
|
|
|
Another operation that is often necessary to perform prior to the Map-Reduce
|
|
|
|
core of each query is a record filtering process, which is often motivated by
|
|
|
|
the presence of incomplete data (i.e.~records which contain fields whose values
|
|
|
|
is unknown). This filtering is performed using the \texttt{.filter()} operation
|
|
|
|
of Spark's RDD API.
|
|
|
|
|
2021-05-17 16:50:25 +00:00
|
|
|
The core of each query is often a \texttt{groupby()} followed by a
|
|
|
|
\texttt{map()} operation on the aggregated data. The \texttt{groupby()} groups
|
|
|
|
the set of all records into several subsets of records each having something in
|
|
|
|
common. Then, each of this small clusters is reduced with a \texttt{map()}
|
|
|
|
operation to a single record. The motivation behind this way of computing data
|
|
|
|
is that for the analysis in this thesis it is often necessary to analyze the
|
|
|
|
behaviour w.r.t. time of either task or jobs by looking at their events. These
|
|
|
|
queries are therefore implemented by \texttt{groupby()}-ing records by task or
|
|
|
|
job, and then \texttt{map()}-ing each set of event records sorting them by time
|
|
|
|
and performing the desired computation on the obtained chronological event log.
|
2021-05-17 15:12:46 +00:00
|
|
|
|
|
|
|
Sometimes intermediate results are saved in Spark's parquet format in order to
|
|
|
|
compute and save intermediate results beforehand.
|
2021-05-17 14:27:17 +00:00
|
|
|
|
2021-05-18 15:37:42 +00:00
|
|
|
\subsection{Query script design}
|
2021-05-17 14:27:17 +00:00
|
|
|
|
2021-05-18 15:37:42 +00:00
|
|
|
In this section we aim to show the general complexity behind the implementations
|
|
|
|
of query scripts by explaining in detail some sampled scripts to better
|
|
|
|
appreciate their behaviour.
|
|
|
|
|
|
|
|
\subsubsection{The ``task slowdown'' query script}
|
|
|
|
|
|
|
|
One example of analysis script with average complexity and a pretty
|
|
|
|
straightforward structure is the pair of scripts \texttt{task\_slowdown.py} and
|
|
|
|
\texttt{task\_slowdown\_table.py} used to compute the ``task slowdown'' tables
|
|
|
|
(namely the tables in figure~\ref{fig:taskslowdown}).
|
|
|
|
|
|
|
|
``Slowdown'' is a task-wise measure of wasted execution time for tasks with a
|
|
|
|
\texttt{FINISH} termination type. It is computed as the total execution time of
|
|
|
|
the task divided by the execution time actually needed to complete the task
|
|
|
|
(i.e. the total time of the last execution attempt, successful by definition).
|
|
|
|
|
|
|
|
The analysis requires to compute the mean task slowdown for each task priority
|
|
|
|
value, and additionally compute the percentage of tasks with successful
|
|
|
|
terminations per priority. The query therefore needs to compute the execution
|
|
|
|
time of each execution attempt for each task, determine if each task has
|
|
|
|
successful termination or not, and finally combine this data to compute
|
|
|
|
slowdown, mean slowdown and ultimately the final table found in
|
|
|
|
figure~\ref{fig:taskslowdown}.
|
|
|
|
|
|
|
|
\begin{figure}[h]
|
2021-05-18 14:29:54 +00:00
|
|
|
\centering
|
|
|
|
\includegraphics[width=.75\textwidth]{figures/task_slowdown_query.png}
|
2021-05-18 15:37:42 +00:00
|
|
|
\caption{Diagram of the script used for the ``task slowdown''
|
|
|
|
query.}\label{fig:taskslowdownquery}
|
2021-05-18 14:29:54 +00:00
|
|
|
\end{figure}
|
|
|
|
|
2021-05-18 15:37:42 +00:00
|
|
|
Figure~\ref{fig:taskslowdownquery} shows a schematic representation of the query
|
|
|
|
structure.
|
|
|
|
|
|
|
|
The query first starts reading the \texttt{instance\_events} table, which
|
|
|
|
contains (among other data) all task event logs containing properties, event
|
|
|
|
types and timestamps. As already explained in the previous section, the logical
|
|
|
|
table file is actually stored as several Gzip-compressed JSONL shards. This is
|
|
|
|
very useful for processing purposes, since Spark is able to parse and load in
|
|
|
|
memory each shard in parallel, i.e. using all processing cores on the server
|
|
|
|
used to run the queries.
|
|
|
|
|
|
|
|
After loading the data, a selection and a projection operation are performed in
|
|
|
|
the preparation phase so as to ``clean up'' the records and fields that are not
|
|
|
|
needed, leaving only useful information to feed in the ``group by'' phase. In
|
|
|
|
this query, the selection phase removes all records that do not represent task
|
|
|
|
events or that contain an unknown task ID or a null event timestamp. In the 2019
|
|
|
|
traces it is quite common to find incomplete records, since the log process is
|
|
|
|
unable to capture the sheer amount of events generated by all jobs in a exact
|
|
|
|
and deterministic fashion.
|
|
|
|
|
|
|
|
Then, after the preparation stage is complete, the task event records are
|
|
|
|
grouped in several bins, one per task ID\@. Performing this operation the
|
|
|
|
collection of unsorted task event types is rearranged to form groups of task
|
|
|
|
events all relating to a single task.
|
|
|
|
|
|
|
|
These obtained collections of task events are then sorted by timestamp and
|
|
|
|
processed to compute intermediate data relating to execution attempt times and
|
|
|
|
task termination counts. After the task events are sorted, the script iterates
|
|
|
|
over the events in chronological order, storing each execution attempt time and
|
|
|
|
registering all execution termination types by checking the event type field.
|
|
|
|
The task termination is then equal to the last execution termination type,
|
|
|
|
following the definition originally given in the 2015 Ros\'a et al. DSN paper.
|
|
|
|
|
|
|
|
If the task termination is determined to be unsuccessful, the tally counter of
|
|
|
|
task terminations for the matching task property is increased. Otherwise, all
|
|
|
|
the task termination attempt time deltas are returned. Tallies and time deltas
|
|
|
|
are saved in an intermediate time file for fine-grained processing.
|
|
|
|
|
|
|
|
Finally, the \texttt{task\_slowdown\_table.py} processes this intermediate
|
|
|
|
results to compute the percentage of successful tasks per execution and
|
|
|
|
computing slowdown values given the previously computed execution attempt time
|
|
|
|
deltas. Finally, the mean of the computed slowdown values is computed resulting
|
|
|
|
in the clear and coincise tables found in figure~\ref{fig:taskslowdown}.
|
|
|
|
|
2021-05-17 14:27:17 +00:00
|
|
|
|
|
|
|
\hypertarget{ad-hoc-presentation-of-some-analysis-scripts}{%
|
|
|
|
\subsection{Ad-Hoc presentation of some analysis
|
|
|
|
scripts}\label{ad-hoc-presentation-of-some-analysis-scripts}}
|
|
|
|
|
|
|
|
\textbf{TBD} (with diagrams)
|
|
|
|
|
|
|
|
\hypertarget{analysis-and-observations}{%
|
|
|
|
\section{Analysis and observations}\label{analysis-and-observations}}
|
|
|
|
|
|
|
|
\hypertarget{overview-of-machine-configurations-in-each-cluster}{%
|
|
|
|
\subsection{Overview of machine configurations in each
|
|
|
|
cluster}\label{overview-of-machine-configurations-in-each-cluster}}
|
|
|
|
|
|
|
|
\input{figures/machine_configs}
|
|
|
|
|
|
|
|
Refer to figure \ref{fig:machineconfigs}.
|
|
|
|
|
|
|
|
\textbf{Observations}:
|
|
|
|
|
|
|
|
\begin{itemize}
|
|
|
|
\item
|
|
|
|
machine configurations are definitely more varied than the ones in the
|
|
|
|
2011 traces
|
|
|
|
\item
|
|
|
|
some clusters have more machine variability
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
\hypertarget{analysis-of-execution-time-per-each-execution-phase}{%
|
|
|
|
\subsection{Analysis of execution time per each execution
|
|
|
|
phase}\label{analysis-of-execution-time-per-each-execution-phase}}
|
|
|
|
|
|
|
|
\input{figures/machine_time_waste}
|
|
|
|
|
|
|
|
Refer to figures \ref{fig:machinetimewaste-abs} and
|
|
|
|
\ref{fig:machinetimewaste-rel}.
|
|
|
|
|
|
|
|
\textbf{Observations}:
|
|
|
|
|
|
|
|
\begin{itemize}
|
|
|
|
\item
|
|
|
|
Across all cluster almost 50\% of time is spent in ``unknown''
|
|
|
|
transitions, i.e. there are some time slices that are related to a
|
|
|
|
state transition that Google says are not ``typical'' transitions.
|
|
|
|
This is mostly due to the trace log being intermittent when recording
|
|
|
|
all state transitions.
|
|
|
|
\item
|
|
|
|
80\% of the time spent in KILL and LOST is unknown. This is
|
|
|
|
predictable, since both states indicate that the job execution is not
|
|
|
|
stable (in particular LOST is used when the state logging itself is
|
|
|
|
unstable)
|
|
|
|
\item
|
|
|
|
From the absolute graph we see that the time ``wasted'' on non-finish
|
|
|
|
terminated jobs is very significant
|
|
|
|
\item
|
|
|
|
Execution is the most significant task phase, followed by queuing time
|
|
|
|
and scheduling time (``ready'' state)
|
|
|
|
\item
|
|
|
|
In the absolute graph we see that a significant amount of time is
|
|
|
|
spent to re-schedule evicted jobs (``evicted'' state)
|
|
|
|
\item
|
|
|
|
Cluster A has unusually high queuing times
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
\hypertarget{task-slowdown}{%
|
|
|
|
\subsection{Task slowdown}\label{task-slowdown}}
|
|
|
|
|
|
|
|
\input{figures/task_slowdown}
|
|
|
|
|
|
|
|
Refer to figure \ref{fig:taskslowdown}
|
|
|
|
|
|
|
|
\textbf{Observations}:
|
|
|
|
|
|
|
|
\begin{itemize}
|
|
|
|
\item
|
|
|
|
Priority values are different from 0-11 values in the 2011 traces. A
|
|
|
|
conversion table is provided by Google;
|
|
|
|
\item
|
|
|
|
For some priorities (e.g.~101 for cluster D) the relative number of
|
|
|
|
finishing task is very low and the mean slowdown is very high (315).
|
|
|
|
This behaviour differs from the relatively homogeneous values from the
|
|
|
|
2011 traces.
|
|
|
|
\item
|
|
|
|
Some slowdown values cannot be computed since either some tasks have a
|
|
|
|
0ns execution time or for some priorities no tasks in the traces
|
|
|
|
terminate successfully. More raw data on those exception is in
|
|
|
|
Jupyter.
|
|
|
|
\item
|
|
|
|
The \% of finishing jobs is relatively low comparing with the 2011
|
|
|
|
traces.
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
\hypertarget{reserved-and-actual-resource-usage-of-tasks}{%
|
|
|
|
\subsection{Reserved and actual resource usage of
|
|
|
|
tasks}\label{reserved-and-actual-resource-usage-of-tasks}}
|
|
|
|
|
|
|
|
\input{figures/spatial_resource_waste}
|
|
|
|
|
|
|
|
Refer to figures \ref{fig:spatialresourcewaste-actual} and
|
|
|
|
\ref{fig:spatialresourcewaste-requested}.
|
|
|
|
|
|
|
|
\textbf{Observations}:
|
|
|
|
|
|
|
|
\begin{itemize}
|
|
|
|
\item
|
|
|
|
Most (mesasured and requested) resources are used by killed job, even
|
|
|
|
more than in the 2011 traces.
|
|
|
|
\item
|
|
|
|
Behaviour is rather homogeneous across datacenters, with the exception
|
|
|
|
of cluster G where a lot of LOST-terminated tasks acquired 70\% of
|
|
|
|
both CPU and RAM
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
\hypertarget{correlation-between-task-events-metadata-and-task-termination}{%
|
|
|
|
\subsection{Correlation between task events' metadata and task
|
|
|
|
termination}\label{correlation-between-task-events-metadata-and-task-termination}}
|
|
|
|
|
|
|
|
\input{figures/figure_7}
|
|
|
|
|
|
|
|
Refer to figures \ref{fig:figureVII-a}, \ref{fig:figureVII-b}, and
|
|
|
|
\ref{fig:figureVII-c}.
|
|
|
|
|
|
|
|
\textbf{Observations}:
|
|
|
|
|
|
|
|
\begin{itemize}
|
|
|
|
\item
|
|
|
|
No smooth curves in this figure either, unlike 2011 traces
|
|
|
|
\item
|
|
|
|
The behaviour of curves for 7a (priority) is almost the opposite of
|
|
|
|
2011, i.e. in-between priorities have higher kill rates while
|
|
|
|
priorities at the extremum have lower kill rates. This could also be
|
|
|
|
due bt the inherent distribution of job terminations;
|
|
|
|
\item
|
|
|
|
Event execution time curves are quite different than 2011, here it
|
|
|
|
seems there is a good correlation between short task execution times
|
|
|
|
and finish event rates, instead of the U shape curve in 2015 DSN
|
|
|
|
\item
|
|
|
|
In figure \ref{fig:figureVII-b} cluster behaviour seems quite uniform
|
|
|
|
\item
|
|
|
|
Machine concurrency seems to play little role in the event termination
|
|
|
|
distribution, as for all concurrency factors the kill rate is at 90\%.
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
\hypertarget{correlation-between-task-events-resource-metadata-and-task-termination}{%
|
|
|
|
\subsection{Correlation between task events' resource metadata and task
|
|
|
|
termination}\label{correlation-between-task-events-resource-metadata-and-task-termination}}
|
|
|
|
|
|
|
|
\hypertarget{correlation-between-job-events-metadata-and-job-termination}{%
|
|
|
|
\subsection{Correlation between job events' metadata and job
|
|
|
|
termination}\label{correlation-between-job-events-metadata-and-job-termination}}
|
|
|
|
|
|
|
|
\input{figures/figure_9}
|
|
|
|
|
|
|
|
Refer to figures \ref{fig:figureIX-a}, \ref{fig:figureIX-b}, and
|
|
|
|
\ref{fig:figureIX-c}.
|
|
|
|
|
|
|
|
\textbf{Observations}:
|
|
|
|
|
|
|
|
\begin{itemize}
|
|
|
|
\item
|
|
|
|
Behaviour between cluster varies a lot
|
|
|
|
\item
|
|
|
|
There are no ``smooth'' gradients in the various curves unlike in the
|
|
|
|
2011 traces
|
|
|
|
\item
|
|
|
|
Killed jobs have higher event rates in general, and overall dominate
|
|
|
|
all event rates measures
|
|
|
|
\item
|
|
|
|
There still seems to be a correlation between short execution job
|
|
|
|
times and successfull final termination, and likewise for kills and
|
|
|
|
higher job terminations
|
|
|
|
\item
|
|
|
|
Across all clusters, a machine locality factor of 1 seems to lead to
|
|
|
|
the highest success event rate
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
\hypertarget{mean-number-of-tasks-and-event-distribution-per-task-type}{%
|
|
|
|
\subsection{Mean number of tasks and event distribution per task
|
|
|
|
type}\label{mean-number-of-tasks-and-event-distribution-per-task-type}}
|
|
|
|
|
|
|
|
\input{figures/table_iii}
|
|
|
|
|
|
|
|
Refer to figure \ref{fig:tableIII}.
|
|
|
|
|
|
|
|
\textbf{Observations}:
|
|
|
|
|
|
|
|
\begin{itemize}
|
|
|
|
\item
|
|
|
|
The mean number of events per task is an order of magnitude higher
|
|
|
|
than in the 2011 traces
|
|
|
|
\item
|
|
|
|
Generally speaking, the event type with higher mean is the termination
|
|
|
|
event for the task
|
|
|
|
\item
|
|
|
|
The \# evts mean is higher than the sum of all other event type means,
|
|
|
|
since it appears there are a lot more non-termination events in the
|
|
|
|
2019 traces.
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
\hypertarget{mean-number-of-tasks-and-event-distribution-per-job-type}{%
|
|
|
|
\subsection{Mean number of tasks and event distribution per job
|
|
|
|
type}\label{mean-number-of-tasks-and-event-distribution-per-job-type}}
|
|
|
|
|
|
|
|
\input{figures/table_iv}
|
|
|
|
|
|
|
|
Refer to figure \ref{fig:tableIV}.
|
|
|
|
|
|
|
|
\textbf{Observations}:
|
|
|
|
|
|
|
|
\begin{itemize}
|
|
|
|
\item
|
|
|
|
Again the mean number of tasks is significantly higher than the 2011
|
|
|
|
traces, indicating a higher complexity of workloads
|
|
|
|
\item
|
|
|
|
Cluster A has no evicted jobs
|
|
|
|
\item
|
|
|
|
The number of events is however lower than the event means in the 2011
|
|
|
|
traces
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
\hypertarget{probability-of-task-successful-termination-given-its-unsuccesful-events}{%
|
|
|
|
\subsection{Probability of task successful termination given its
|
|
|
|
unsuccesful
|
|
|
|
events}\label{probability-of-task-successful-termination-given-its-unsuccesful-events}}
|
|
|
|
|
|
|
|
\input{figures/figure_5}
|
|
|
|
|
|
|
|
Refer to figure \ref{fig:figureV}.
|
|
|
|
|
|
|
|
\textbf{Observations}:
|
|
|
|
|
|
|
|
\begin{itemize}
|
|
|
|
\item
|
|
|
|
Behaviour is very different from cluster to cluster
|
|
|
|
\item
|
|
|
|
There is no easy conclusion, unlike in 2011, on the correlation
|
|
|
|
between succesful probability and \# of events of a specific type.
|
|
|
|
\item
|
|
|
|
Clusters B, C and D in particular have very unsmooth lines that vary a
|
|
|
|
lot for small \# evts differences. This may be due to an uneven
|
|
|
|
distribution of \# evts in the traces.
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
\hypertarget{potential-causes-of-unsuccesful-executions}{%
|
|
|
|
\subsection{Potential causes of unsuccesful
|
|
|
|
executions}\label{potential-causes-of-unsuccesful-executions}}
|
|
|
|
|
|
|
|
\textbf{TBD}
|
|
|
|
|
|
|
|
\hypertarget{implementation-issues-analysis-limitations}{%
|
|
|
|
\section{Implementation issues -- Analysis
|
|
|
|
limitations}\label{implementation-issues-analysis-limitations}}
|
|
|
|
|
|
|
|
\hypertarget{discussion-on-unknown-fields}{%
|
|
|
|
\subsection{Discussion on unknown
|
|
|
|
fields}\label{discussion-on-unknown-fields}}
|
|
|
|
|
|
|
|
\textbf{TBD}
|
|
|
|
|
|
|
|
\hypertarget{limitation-on-computation-resources-required-for-the-analysis}{%
|
|
|
|
\subsection{Limitation on computation resources required for the
|
|
|
|
analysis}\label{limitation-on-computation-resources-required-for-the-analysis}}
|
|
|
|
|
|
|
|
\textbf{TBD}
|
|
|
|
|
|
|
|
\hypertarget{other-limitations}{%
|
|
|
|
\subsection{Other limitations \ldots{}}\label{other-limitations}}
|
|
|
|
|
|
|
|
\textbf{TBD}
|
|
|
|
|
|
|
|
\hypertarget{conclusions-and-future-work-or-possible-developments}{%
|
|
|
|
\section{Conclusions and future work or possible
|
|
|
|
developments}\label{conclusions-and-future-work-or-possible-developments}}
|
|
|
|
|
|
|
|
\textbf{TBD}
|
|
|
|
|
2021-05-19 13:33:10 +00:00
|
|
|
\printbibliography
|
|
|
|
|
2021-05-17 14:27:17 +00:00
|
|
|
\end{document}
|
2021-05-18 15:37:42 +00:00
|
|
|
% vim: set ts=2 sw=2 et tw=80:
|