report: done explanation for task slowdown

Claudio Maggioni 2021-05-18 17:37:42 +02:00
parent 4d3b711ce0
commit f5eb1f30dd
4 changed files with 74 additions and 7 deletions


@@ -130,7 +130,7 @@ following values:
Figure~\ref{fig:eventTypes} shows the expected transitions between event
types.
\begin{figure}[h]
\centering
\resizebox{\textwidth}{!}{%
\includegraphics{./figures/event_types.png}}
@@ -311,17 +311,83 @@ and performing the desired computation on the obtained chronological event log.
Sometimes intermediate results are saved in Spark's Parquet format, so that
expensive computations can be performed once and their results reused by later
processing stages.
\subsection{Query script design}

In this section we aim to show the general complexity behind the query script
implementations by explaining some sampled scripts in detail, so that their
behaviour can be better appreciated.

\subsubsection{The ``task slowdown'' query script}
One example of an analysis script of average complexity and fairly
straightforward structure is the pair of scripts \texttt{task\_slowdown.py} and
\texttt{task\_slowdown\_table.py}, used to compute the ``task slowdown'' tables
(namely the tables in figure~\ref{fig:taskslowdown}).
``Slowdown'' is a task-wise measure of wasted execution time for tasks with a
\texttt{FINISH} termination type. It is computed as the total execution time of
the task divided by the execution time actually needed to complete the task
(i.e. the total time of the last execution attempt, successful by definition).
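Expressed as a formula, for a task whose $n$ execution attempts have durations
$t_1, \dots, t_n$, where the $n$-th attempt is the successful one:
\[
\mathit{slowdown} = \frac{\sum_{i=1}^{n} t_i}{t_n}
\]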
The analysis requires computing the mean task slowdown for each task priority
value, as well as the percentage of tasks with successful terminations per
priority. The query therefore needs to compute the execution time of each
execution attempt of each task, determine whether each task terminated
successfully or not, and finally combine these data to compute the slowdown,
the mean slowdown, and ultimately the final table found in
figure~\ref{fig:taskslowdown}.

\begin{figure}[h]
\centering
\includegraphics[width=.75\textwidth]{figures/task_slowdown_query.png}
\caption{Diagram of the script used for the ``task slowdown''
query.}\label{fig:taskslowdownquery}
\end{figure}
Figure~\ref{fig:taskslowdownquery} shows a schematic representation of the query
structure.
The query starts by reading the \texttt{instance\_events} table, which
contains (among other data) all task event logs with their properties, event
types and timestamps. As already explained in the previous section, the logical
table is actually stored as several Gzip-compressed JSONL shards. This is very
useful for processing purposes, since Spark is able to parse and load each
shard into memory in parallel, i.e.\ using all processing cores on the server
used to run the queries.
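As an illustration, this loading step could be expressed in PySpark as in the
following minimal sketch (the shard path pattern is an assumption, not the
actual layout used by the scripts):
\begin{verbatim}
# Minimal sketch of the loading phase; the path glob is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("task_slowdown").getOrCreate()

# Spark parses the Gzip-compressed JSONL shards in parallel, mapping
# each shard to a partition and using all available cores.
events = spark.read.json("instance_events/*.json.gz")
\end{verbatim}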
After loading the data, a selection and a projection operation are performed in
the preparation phase so as to ``clean up'' the records and fields that are not
needed, leaving only useful information to feed into the ``group by'' phase. In
this query, the selection phase removes all records that do not represent task
events or that contain an unknown task ID or a null event timestamp. In the
2019 traces it is quite common to find incomplete records, since the logging
process is unable to capture the sheer amount of events generated by all jobs
in an exact and deterministic fashion.
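Continuing the sketch above, the preparation phase could look as follows; the
column names are illustrative assumptions and do not necessarily match the
actual trace schema:
\begin{verbatim}
# Selection: drop records with an unknown task ID, a null timestamp
# or a missing event type. Projection: keep only the fields needed
# downstream. (Column names are assumed for illustration.)
task_events = (events
    .filter(events["collection_id"].isNotNull())
    .filter(events["instance_index"].isNotNull())
    .filter(events["time"].isNotNull())
    .filter(events["type"].isNotNull())
    .select("collection_id", "instance_index", "time",
            "type", "priority"))
\end{verbatim}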
Then, after the preparation stage is complete, the task event records are
grouped in several bins, one per task ID\@. Through this operation, the
collection of unsorted task events is rearranged into groups of events each
relating to a single task.
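Continuing the sketch, such a grouping could be performed on the underlying
RDD, for example by keying each event on its task ID (here assumed to be the
pair of \texttt{collection\_id} and \texttt{instance\_index}):
\begin{verbatim}
# Group events into one bin per task (sketch; the key choice is an
# assumption about how tasks are identified in the trace).
task_bins = (task_events.rdd
    .map(lambda e: ((e["collection_id"], e["instance_index"]), e))
    .groupByKey())
\end{verbatim}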
The resulting collections of task events are then sorted by timestamp and
processed to compute intermediate data about execution attempt times and task
termination counts. After the task events are sorted, the script iterates over
them in chronological order, storing the duration of each execution attempt and
registering all execution termination types by checking the event type field.
The task termination is then taken to be the termination type of the last
execution, following the definition originally given in the 2015 Ros\'a et al.
DSN paper.

If the task termination is determined to be unsuccessful, the tally counter of
task terminations for the matching task priority is increased. Otherwise, all
the execution attempt time deltas for the task are returned. Tallies and time
deltas are then saved in an intermediate file for fine-grained processing.
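A possible shape of this per-task computation is sketched below; the helper
function, its field names and the exact set of termination event types are
assumptions made for illustration:
\begin{verbatim}
# Per-task processing (sketch). Events arrive as one unsorted bin per
# task; we sort them chronologically, measure the duration of each
# execution attempt, and take the termination type of the last
# attempt as the task's termination.
TERMINATIONS = {"FINISH", "FAIL", "KILL", "EVICT"}

def process_task(events):
    events = sorted(events, key=lambda e: e["time"])
    deltas, start, termination = [], None, None
    for e in events:
        if e["type"] == "SCHEDULE":      # an execution attempt starts
            start = e["time"]
        elif e["type"] in TERMINATIONS:  # an execution attempt ends
            if start is not None:
                deltas.append(e["time"] - start)
                start = None
            termination = e["type"]      # the last one wins
    return termination, deltas
\end{verbatim}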
Finally, \texttt{task\_slowdown\_table.py} processes these intermediate
results, computing the percentage of successful tasks per priority and the
slowdown values given the previously computed execution attempt time deltas.
The mean of the computed slowdown values is then taken, resulting in the clear
and concise tables found in figure~\ref{fig:taskslowdown}.
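As an illustration of this aggregation step, the slowdown computation could
look like the following plain-Python sketch (not the actual script code):
\begin{verbatim}
# Slowdown of one task: total time across all execution attempts
# divided by the duration of the last (successful) attempt.
def mean_slowdown(attempt_deltas):
    slowdowns = [sum(d) / d[-1] for d in attempt_deltas
                 if d and d[-1] > 0]
    return sum(slowdowns) / len(slowdowns) if slowdowns else float("nan")
\end{verbatim}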
\hypertarget{ad-hoc-presentation-of-some-analysis-scripts}{%
\subsection{Ad-Hoc presentation of some analysis
@@ -599,3 +665,4 @@ developments}\label{conclusions-and-future-work-or-possible-developments}}
\textbf{TBD}
\end{document}
% vim: set ts=2 sw=2 et tw=80:
