
<!-- vim: set ts=2 sw=2 et tw=80: -->

# Thesis development and status

## Thesis objective
Google's comparison of the 2011 and 2020 cluster traces.

We redo the same thing, not in general but from the point of view of failures.
Take the Rosà 2015 paper (analysis part, paper “Understanding the Dark Side of
Big Data Clusters An Analysis beyond Failures - Rosa Chen Binder.pdf”) and redo
the analyses on the 2020 data. Then compare the 2015 analysis with the 2020
analysis (as in the Google paper).

Start the thesis with a general part that describes the traces and general
statistics in 2-3 pages. In the second part, redo the analyses (cite the Google
comparison as inspiration).

Diversify the analysis by data center (there are now 8).

Replicate the analysis for each data center.

*Motivation of the paper: there are many failures, why?*

Project deadline: notify via the Google Drive document once it is known from
Pezzè.

## Analysis from Rosa/Chen Paper
- [&#x2705; **machine_configs**] Table of distinct CPU/Memory configurations of machines and their distrib. (%)
  (Table I)
- [&#x2705; **machine_time_waste**] *III-A: Temporal impact: machine time waste*:
  Stacked histogram
  - Y-axis: normalized (%) aggregated machine time
  - X-axis: event type
  Three series:
  - Resubmission time: sum of all *subm. time* - *previous compl. time*
  - Queue time: sum of all *sched. time* - *subm. time*
  - Running time: sum of all *compl. time* - *sched. time*
- (%) total wasted time per unsuccessful event type
- (mins.) avg. wasted time per number of events for each event type
- breakdown of wasted time per *submission*, *scheduling*, *queue*
- [&#x2705; **task_slowdown**] *III-A-I: Average slowdown per task*: (Table II)
  For FINISH type tasks, compute *slowdown*, i.e. the mean (**ask Rosa**) of the
  *response times* of all events of a task over the *response time* of its last
  event (which is by def. FINISH). Response time is defined as *Queue time* +
  *Exec time*.
  Table II shows:
  - % of finish tasks
  - mean *response time* (all events)
  - mean *response time* (last event for each task)
  - mean *slowdown*
- *III-B: Spatial impact: resource waste*:
  Normalized % (y-axis) partition of *resource demand* (CPU, DISK, RAM, x-axis)
  used per task event type (distributions)
  - *resource demand*: UoM defined as RES (NCU/NMU) / s
- *IV-A-1 Table III: Mean number of events and their distribution per task type*:
  Mean and 95 %-ile number of events per each task type and mean number of
  events of each type
- *IV-A-2 Figure 5: Cond. probability of task success given # of unsuccessful
  evts for each type observed*:
  X-axis is # evts. Y-axis is probability the task will succeed. 3 distributions,
  one for EVICT, FAIL, and KILL. (# evts refers to events of that specific type)
- *IV-B Table IV: Mean number of tasks and evt. distribution per job type*:
  Like table III but for jobs (mean # of tasks + 95 %-ile, then avg. # of evts.
  of each type)
- *IV-B-1 Figure 6: Job Inter-Type Times*:
  *Inter-Type* time is defined as the time between completions of jobs of the
  same evt. type.
  Empirical CDF for distribution of job inter-type times for each evt. type.
  Curve fitting with Weibull, Exp., Gamma, Normal and Log-normal + KS test.
- *IV-C Table V: Dependencies between jobs and events*:
  Probability that a job terminates with a given evt. type if an event of
  another evt. type is observed ("probability matrix")
- *V-A Figure 7: Event rates vs. task priority, event execution time, machine
  concurrency*
  3 graphs with x-axes (classes of priority, exec. time intervals, and
  *concurrency* intervals), y-axis is Event rate (i.e. # of evts of that type /
  tot. # evts).  4 series per graph, one for each event type.
  - Note: priority classes are based on FREE, LOW, HIGH, PROD Borg "tiers"
  - *concurrency* is defined as # tasks running on the machine when the event is
    logged
  - *evt. execution time*: time between submission and execution of "event"
    (i.e. execution associated with event) (**queue time included**)
- *V-B Figure 8: Event rates vs. requested resources, resource reservation,
  resource utilization*:
  6 graphs, one for [CPU, RAM] X [requested, reserved, utilized]. X, Y, and
  series like Fig. 7
  - *reservation* is sum of reserved resources by all tasks executed on the
    machine at event time / resources on the machine
  - *utilization* is sum of used resources by all tasks executed on the machine
    at event time / resources on the machine
  - task-*requested* is the amount of resources requested by the event's task
- *V-C Figure 9: Job rates vs job size, job execution time and machine
  locality*:
  Like Fig 7/8, but for jobs
  - *job rate* = # of jobs of given type / tot. # jobs
  - *job size* = # of tasks in job
  - *machine locality* = ?
  - *job exec. time* includes **queue time**, like evt. exec. time
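
The time decomposition of III-A can be sketched in a few lines of Python. This
is only an illustration: the `Execution` record and its field names are
hypothetical and do not match the trace schema, and running time is assumed here
to be completion time minus scheduling time, so that the three components do not
overlap.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Execution:
    """One submission of a task (hypothetical record layout)."""
    submit: float    # submission time
    schedule: float  # scheduling time
    complete: float  # completion time
    event: str       # terminating event type: EVICT, FAIL, FINISH or KILL


def time_breakdown(executions: List[Execution]) -> Dict[str, float]:
    """Split a task's lifetime into resubmission, queue and running time."""
    runs = sorted(executions, key=lambda e: e.submit)
    # Resubmission time: gap between a completion and the next submission.
    resubmission = sum(cur.submit - prev.complete
                       for prev, cur in zip(runs, runs[1:]))
    # Queue time: waiting between submission and scheduling.
    queue = sum(e.schedule - e.submit for e in runs)
    # Running time (assumption): scheduling to completion.
    running = sum(e.complete - e.schedule for e in runs)
    return {"resubmission": resubmission, "queue": queue, "running": running}


# A task evicted once and then resubmitted until it finishes.
task = [Execution(0, 2, 10, "EVICT"), Execution(12, 13, 20, "FINISH")]
print(time_breakdown(task))
```

With this decomposition the three components of the example add up to the time
between the first submission and the last completion of the task.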

### Remarks from 2015 paper
- Event types are lingo for (FAIL, EVICT, FINISH, KILL)
- A task's (event) type is based on the type of its last event
- The task life cycle has three times:
  - Submission time: when task enters the cluster
  - Scheduling time: when task is loaded on a machine
  - Completion time: when task produces an event
  Of course after completion a task may be resubmitted (e.g. if task is evicted)
- Metrics measured are:
  - *requested*, *used*, *machine capacity*: resources for CPU, RAM, DISK
  - Priority (**Priorities are 0-11 in the 2015 traces, use conversion table**)
  - Execution time for jobs/tasks/events
  - Machine locality (*machines needed*/*job size*)
- Job data is sanitized:
  - Exclude jobs with no tasks
  - Exclude jobs with missing information
  - Exclude jobs out of trace bounds (started earlier or ended later than the
    trace)
- "Wasted time" and "Wasted resources" are time and resources spent on
  unsuccessful executions of tasks (i.e. executions without a FINISH event)
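
As a minimal sketch of the two definitions above (task type from the last
event, wasted time from executions without a FINISH event), the snippet below
uses a hypothetical tuple layout `(event type, scheduling time, completion
time)`. Counting only the time between scheduling and completion as wasted is an
assumption of this sketch; queue time could be included as well.

```python
from typing import List, Tuple

# (event type, scheduling time, completion time); hypothetical layout,
# executions are assumed to be listed in chronological order.
Run = Tuple[str, float, float]


def task_type(runs: List[Run]) -> str:
    """A task takes the type of its last event."""
    return runs[-1][0]


def wasted_time(runs: List[Run]) -> float:
    """Time spent on executions that did not end with a FINISH event."""
    return sum(end - start for event, start, end in runs if event != "FINISH")


runs = [("EVICT", 2, 10), ("FAIL", 13, 15), ("FINISH", 16, 30)]
print(task_type(runs), wasted_time(runs))
```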