more work on report

parent 7253804f26
commit 0d935c2112

5 changed files with 135 additions and 8 deletions

.~lock.status.ods# (new file)
@@ -0,0 +1 @@
,maggicl,Apple2gs.local,16.05.2021 14:55,file:///Users/maggicl/Library/Application%20Support/LibreOffice/4;

@@ -39,29 +39,141 @@ header-includes:
```
---

\tableofcontents

\newpage

# Introduction (including Motivation)

# State of the Art

## Introduction

**TBD**

## Rosà et al. 2015 DSN paper

**TBD**

## Google Borg

Borg is Google's own cluster management software. Among the various cluster
management services it provides, the main ones are job queuing, scheduling,
allocation, and deallocation due to higher-priority computations.

The data this thesis is based on comes from 8 Borg "cells" (i.e. clusters)
spanning 8 different datacenters, all focused on "compute" (i.e.
computation-oriented) workloads. The data collection timespan covers the
entire month of May 2019.

In Google's lingo a "job" is a large unit of computational workload made up of
several "tasks", i.e. executions of single executables, each running on a
single machine. A job may run tasks sequentially or in parallel, and the
condition for a job's successful termination is nontrivial.

The lifecycles of both tasks and jobs are represented by several events, which
are encoded and stored in the trace as rows of various tables. Among the
information carried by each event, the field "type" indicates the execution
status of the job or task. This field can have the following values:

- **QUEUE**: The job or task was marked not eligible for scheduling by Borg's
  scheduler, and was thus moved into a long wait queue;
- **SUBMIT**: The job or task was submitted to Borg for execution;
- **ENABLE**: The job or task became eligible for scheduling;
- **SCHEDULE**: The job or task's execution started;
- **EVICT**: The job or task was terminated in order to free computational
  resources for a higher-priority job;
- **FAIL**: The job or task terminated its execution unsuccessfully due to a
  failure;
- **FINISH**: The job or task terminated successfully;
- **KILL**: The job or task terminated its execution because of a manual
  request to stop it;
- **LOST**: The job or task is assumed to have terminated, but due to missing
  data there is insufficient information to identify when or how;
- **UPDATE_PENDING**: The metadata (scheduling class, resource requirements,
  ...) of the job/task was updated while the job was waiting to be scheduled;
- **UPDATE_RUNNING**: The metadata (scheduling class, resource requirements,
  ...) of the job/task was updated while the job was in execution.

Figure \ref{fig:eventTypes} shows the expected transitions between event types.

![Typical transitions between task/job event types according to Google
\label{fig:eventTypes}](./figures/event_types.png)

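As an illustration of how these event types can be used, the sketch below
counts task events per type over the whole trace. It is a minimal example, not
the actual query code of this thesis: the input path is hypothetical, and it
assumes the "type" field is stored as the symbolic names listed above (the raw
trace may encode it numerically).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("event-type-counts").getOrCreate()

# Hypothetical path; Spark reads the gzip-compressed JSONL directly
# (the trace format is described in the following sections).
events = spark.read.json("instance_events/*.json.gz")

# One row per event "type" value, most frequent first.
events.groupBy("type").count().orderBy("count", ascending=False).show()
```
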
## Traces contents

The traces provided by Google mainly contain a collection of job and task
events spanning a month of execution of the 8 different clusters. In addition,
they include data on the machines' configuration in terms of resources (i.e.
amount of CPU and RAM) and other machine-related metadata.

Due to Google's policy, most identification-related data (like job/task IDs,
raw resource amounts and other text values) were obfuscated prior to the
release of the traces. One obfuscation that is noteworthy in the scope of this
thesis is related to CPU and RAM amounts, which are expressed respectively in
NCUs (_Normalized Compute Units_) and NMUs (_Normalized Memory Units_).

NCUs and NMUs are defined based on the raw resource distributions of the
machines within the 8 clusters. A machine having 1 NCU of CPU power and 1 NMU
of memory size has the maximum amount of raw CPU power and raw RAM size found
across the clusters. While RAM size is measured in bytes for normalization
purposes, CPU power was measured in GCUs (_Google Compute Units_), a
proprietary CPU power measurement unit used by Google that combines several
parameters like number of processors and cores, clock frequency, and
architecture (i.e. ISA).

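To make the normalization concrete, the sketch below applies the definition
given above: each raw value is divided by the maximum raw value found across
all machines. The raw GCU and byte figures are invented for illustration, as
the public trace only ships the already-normalized values.

```python
# Hypothetical raw per-machine resources; the real raw values are not public.
raw_cpu_gcu = [18.0, 36.0, 72.0]          # CPU power in GCUs
raw_ram_bytes = [64e9, 128e9, 256e9]      # RAM size in bytes

# NCU/NMU normalization: divide by the maximum observed raw value,
# so the largest machine gets exactly 1.0.
ncu = [c / max(raw_cpu_gcu) for c in raw_cpu_gcu]      # [0.25, 0.5, 1.0]
nmu = [r / max(raw_ram_bytes) for r in raw_ram_bytes]  # [0.25, 0.5, 1.0]
```
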
## Overview of traces' format

The traces have a collective size of approximately 8 TiB and are stored in a
Gzip-compressed JSONL (JSON lines) format, which means that each table is
represented by a single logical "file" (stored in several file segments) where
each newline-separated line represents a single record for that table.

There are 5 different table "files":

- `machine_configs`, which is a table containing each physical machine's
  configuration and its evolution over time;
- `instance_events`, which is a table of task events;
- `collection_events`, which is a table of job events;
- `machine_attributes`, which is a table containing (obfuscated) metadata
  about each physical machine and its evolution over time;
- `instance_usage`, which contains resource (CPU/RAM) measures of jobs and
  tasks running on the single machines.

The scope of this thesis focuses on the tables `machine_configs`,
`instance_events` and `collection_events`.

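As a minimal sketch of how the in-scope tables can be loaded, the snippet
below reads the gzip-compressed JSONL segments with Spark, which decompresses
them transparently and parses one record per line. The directory layout is an
assumption, not the actual dataset path.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load-trace-tables").getOrCreate()

# One directory of gzip-compressed JSONL segments per table (hypothetical
# layout); Spark handles the decompression and JSON parsing natively.
machine_configs   = spark.read.json("machine_configs/*.json.gz")
instance_events   = spark.read.json("instance_events/*.json.gz")
collection_events = spark.read.json("collection_events/*.json.gz")

instance_events.printSchema()  # schema is inferred from the JSON records
```
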
## Remark on traces size

While the 2011 Google Borg traces were relatively small, with a total size in
the order of tens of gigabytes, the 2019 traces are quite challenging to
analyze due to their sheer size. As stated before, the traces have a total
size of 8 TiB when stored in the format provided by Google. Even when broken
down into table "files", individual sizes still reach the tebibyte mark
(namely for `machine_configs`, the largest table in the trace).

Due to these constraints, a careful approach based on data engineering was
used when reproducing the 2015 DSN paper analysis. Bleeding-edge data science
technologies like Apache Spark were used to achieve efficient and parallelized
computations. This approach is discussed in further detail in the following
section.

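A minimal sketch of the kind of Spark setup this approach implies is shown
below. The session parallelizes work across all local cores; the memory figure
is illustrative and not the configuration actually used for the analysis.

```python
from pyspark.sql import SparkSession

# Illustrative session setup: use every local core and give the driver
# enough memory to hold large aggregation results.
spark = (SparkSession.builder
         .appName("borg-2019-analysis")
         .master("local[*]")                    # parallelize on all cores
         .config("spark.driver.memory", "16g")  # illustrative figure
         .getOrCreate())
```
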
# Project requirements and analysis

**TBD** (describe our objective with this analysis in detail)

# Analysis methodology

**TBD**

## Overview on challenging aspects of analysis (data size, schema, available computation resources)

**TBD**

## Introduction on Apache Spark

**TBD**

## General description of the Apache Spark workflow

**TBD** (extract from the notes sent to Filippo shown below)

The analysis of the Google 2019 Borg cluster traces was conducted using Apache
Spark and its Python 3 API (pyspark). Spark was used to execute a series of
queries to perform various sums and aggregations over the entire dataset

@@ -110,7 +222,11 @@ compute and save intermediate results beforehand.

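The following is a hedged sketch of such a query: it prunes the events table
down to the few columns needed, aggregates terminations per job, and saves the
result as an intermediate file for later steps. Paths and column names
("collection_id" identifying the job a task belongs to) are assumptions based
on the trace documentation, not the actual scripts of this thesis.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Prune columns as early as possible so the full multi-terabyte rows are
# never shuffled (hypothetical path and field names).
events = (spark.read.json("instance_events/*.json.gz")
          .select("time", "type", "collection_id"))

# Count termination events per job and per termination type.
terminations = (events
                .filter(F.col("type").isin("EVICT", "FAIL", "FINISH", "KILL"))
                .groupBy("collection_id", "type")
                .count())

# Save the (much smaller) intermediate result for the downstream analyses.
terminations.write.mode("overwrite").parquet("intermediate/terminations")
```
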
## General Query script design

**TBD**

## Ad-Hoc presentation of some analysis scripts

**TBD** (with diagrams)

# Analysis and observations

@@ -271,14 +387,24 @@ Refer to figure \ref{fig:figureV}.

## Potential causes of unsuccessful executions

**TBD**

# Implementation issues -- Analysis limitations

## Discussion on unknown fields

**TBD**

## Limitations on computation resources required for the analysis

**TBD**

## Other limitations ...

**TBD**

# Conclusions and future work or possible developments

**TBD**

<!-- vim: set ts=2 sw=2 et tw=80: -->

report/figures/event_types.png (new binary file, 34 KiB)

status.ods (binary file changed)