2021-02-15 10:36:09 +00:00
<!-- vim: set ts=2 sw=2 et tw=80: -->

# Thesis development and status

## Thesis objective
Google's cluster comparison, 2011 vs. 2020.

We redo the same thing, not in general but from the point of view of failures.
Take the Rosà 2015 paper (analysis part; paper "Understanding the Dark Side of
Big Data Clusters An Analysis beyond Failures - Rosa Chen Binder.pdf") and redo
the analyses on the 2020 data. Then compare the 2015 analysis with the 2020
analysis (as in the Google paper).

Start the thesis with a general part that describes the traces and general
statistics in 2-3 pages. In the second part, we redo the analyses (cite the
Google comparison as the inspiration).

Differentiate the analyses by data center (there are now 8).

Replicate the analyses for each data center.

*Motivation of the paper: there are many failures; why?*

Project deadline: give notice via the Google Drive document once it is known
from Pezzè.

## Analysis from Rosa/Chen Paper
- [✅ **machine_configs**] Table of distinct CPU/memory configurations of
  machines and their distribution (%) (Table I)
- [✅ **machine_time_waste**] *III-A: Temporal impact: machine time waste*:
  Stacked histogram
  - Y-axis: normalized (%) aggregated machine time
  - X-axis: event type

  Three series:
  - Resubmission time: sum of all *subm. time* - *previous compl. time*
  - Queue time: sum of all *sched. time* - *subm. time*
  - Running time: sum of all *compl. time* - *subm. time*
  - (%) total wasted time per unsuccessful event type
  - (mins.) avg. wasted time per number of events for each event type
  - breakdown of wasted time per *submission*, *scheduling*, *queue*
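
The three series above can be sketched as follows. This is only an illustration: the tuple layout and the sample values are hypothetical, and running time follows the definition written in the list above (completion minus submission); swap in the scheduling time if running time is meant to exclude queueing.

```python
from collections import defaultdict

# Hypothetical layout, one tuple per task execution attempt (times in seconds):
# (event_type, prev_completion, submission, scheduling, completion).
# prev_completion is None for the first attempt of a task.
executions = [
    ("EVICT", None, 0.0, 2.0, 10.0),
    ("FAIL", 10.0, 12.0, 13.0, 20.0),
    ("FINISH", 20.0, 21.0, 22.0, 30.0),
]


def machine_time_by_type(executions):
    """Sum resubmission, queue and running time per terminating event type."""
    totals = defaultdict(lambda: {"resubmission": 0.0, "queue": 0.0, "running": 0.0})
    for event_type, prev_completion, submission, scheduling, completion in executions:
        series = totals[event_type]
        if prev_completion is not None:
            series["resubmission"] += submission - prev_completion
        series["queue"] += scheduling - submission
        # As written in the notes: completion minus submission.
        series["running"] += completion - submission
    return dict(totals)


def normalize(totals):
    """Express every series as a percentage of the grand total (the Y-axis)."""
    grand_total = sum(sum(series.values()) for series in totals.values())
    return {
        event_type: {name: 100.0 * value / grand_total for name, value in series.items()}
        for event_type, series in totals.items()
    }


totals = machine_time_by_type(executions)
shares = normalize(totals)
```

Each key of `shares` corresponds to one bar of the stacked histogram, and the three values inside it are the stacked segments.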
- [✅ **task_slowdown**] *III-A-I: Average slowdown per task*: (Table II)
  For FINISH type tasks, compute the *slowdown*, i.e. the mean (**ask Rosa**)
  of the *response times* of all the task's events over the *response time* of
  its last event (which is by def. FINISH). Response time is defined as
  *queue time* + *exec. time*

  Table II shows:
  - % of FINISH tasks
  - mean *response time* (all events)
  - mean *response time* (last event for each task)
  - mean *slowdown*
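
A minimal sketch of the slowdown computation described above. The data layout and the task names are hypothetical. Whether the numerator is a mean or a sum is still an open question in these notes, so the aggregate is a parameter here and defaults to the mean.

```python
from statistics import mean

# Hypothetical layout: task id -> (queue_time, exec_time) per event, in
# chronological order. Only FINISH tasks are listed, so the last entry of each
# list is the FINISH event.
finished_tasks = {
    "task-a": [(2.0, 8.0), (1.0, 4.0)],
    "task-b": [(1.0, 9.0)],
}


def response_time(queue_time, exec_time):
    """Response time as defined in the notes: queue time + exec. time."""
    return queue_time + exec_time


def slowdown(events, aggregate=mean):
    """Aggregate response time of all events over that of the last event."""
    times = [response_time(queue, execution) for queue, execution in events]
    return aggregate(times) / times[-1]


slowdowns = {task: slowdown(events) for task, events in finished_tasks.items()}
mean_slowdown = mean(list(slowdowns.values()))
```

Passing `aggregate=sum` gives the alternative reading, in which the slowdown counts the total response time spent on the task.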
- *III-B: Spatial impact: resource waste*:
  Normalized % (y-axis) partition of *resource demand* (CPU, DISK, RAM, x-axis)
  used per task event type (distributions)
  - *resource demand*: UoM defined as RES (NCU/NMU) / s
- *IV-A-1 Table III: Mean number of events and their distribution per task type*:
  Mean and 95th percentile of the number of events per task type, and mean
  number of events of each type
- *IV-A-2 Figure 5: Cond. probability of task success given # of unsuccessful
  evts for each type observed*:
  X-axis is # evts. Y-axis is the probability that the task will succeed. 3
  distributions, one each for EVICT, FAIL, and KILL. (# evts refers to events
  of that specific type)
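
The conditional probability plotted in Figure 5 could be estimated as sketched below. The per-task event lists are made-up sample data, and the estimator (the fraction of FINISH tasks among tasks that saw exactly n events of the given type) is an assumption about how the curve is built.

```python
from collections import Counter

# Hypothetical layout: one chronological list of event types per task.
# The last event decides the task type.
tasks = [
    ["FINISH"],
    ["EVICT", "FINISH"],
    ["EVICT", "EVICT", "KILL"],
    ["FAIL", "FINISH"],
    ["EVICT", "FAIL"],
]


def success_probability(tasks, event_type):
    """Map n -> P(task ends in FINISH | n events of event_type observed)."""
    observed = Counter()
    succeeded = Counter()
    for events in tasks:
        n = events.count(event_type)
        observed[n] += 1
        if events[-1] == "FINISH":
            succeeded[n] += 1
    return {n: succeeded[n] / observed[n] for n in sorted(observed)}


evict_curve = success_probability(tasks, "EVICT")
```

Calling the function once each for EVICT, FAIL and KILL yields the three distributions of the figure.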
- *IV-B Table IV: Mean number of tasks and evt. distribution per job type*:
  Like Table III but for jobs (mean # of tasks + 95th percentile, then avg. # of
  evts. of each type)
- *IV-B-1 Figure 6: Job Inter-Type Times*:
  *Inter-type time* is defined as the time between completions of jobs with the
  same evt. type.
  Empirical CDF of the distribution of job inter-type times for each evt. type.
  Curve fitting with Weibull, Exp., Gamma, Normal and Log-normal + KS test.
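
A standard-library sketch of this pipeline for one event type: inter-type times, empirical CDF, a maximum-likelihood fit, and the Kolmogorov-Smirnov distance. Only the exponential fit is shown; the other distributions and the KS p-value would need a statistics package. The completion times are invented, and taking gaps between consecutive completions is one reading of the definition above.

```python
import math


def inter_type_times(completion_times):
    """Gaps between consecutive completions of jobs of one event type."""
    ordered = sorted(completion_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def empirical_cdf(samples):
    """Return the sorted samples and the ECDF value at each of them."""
    ordered = sorted(samples)
    n = len(ordered)
    return ordered, [(i + 1) / n for i in range(n)]


def fit_exponential(samples):
    """Maximum-likelihood rate of an exponential distribution (1 / mean)."""
    return len(samples) / sum(samples)


def ks_statistic(samples, cdf):
    """Largest distance between the ECDF of the samples and a model CDF."""
    ordered = sorted(samples)
    n = len(ordered)
    distance = 0.0
    for i, x in enumerate(ordered):
        model = cdf(x)
        # The ECDF jumps from i/n to (i+1)/n at x; check both sides of the step.
        distance = max(distance, (i + 1) / n - model, model - i / n)
    return distance


# Hypothetical completion times (seconds) of jobs that ended with FAIL.
fail_completions = [3.0, 10.0, 4.0, 25.0, 12.0]
gaps = inter_type_times(fail_completions)
xs, ecdf = empirical_cdf(gaps)
rate = fit_exponential(gaps)
d_stat = ks_statistic(gaps, lambda x: 1.0 - math.exp(-rate * x))
```

A smaller `d_stat` means a closer fit, so the same function can rank the candidate distributions once their CDFs are available.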
- *IV-C Table V: Dependencies between jobs and events*:
  Probability that a job terminates with a given evt. type if an event of
  another evt. type is observed ("probability matrix")
- *V-A Figure 7: Event rates vs. task priority, event execution time, machine
  concurrency*
  3 graphs with x-axes (classes of priority, exec. time intervals, and
  *concurrency* intervals); y-axis is the event rate (i.e. # of evts of that
  type / tot. # evts). 4 series per graph, one for each event type.
  - Note: priority classes are based on the FREE, LOW, HIGH, PROD Borg "tiers"
  - *concurrency* is defined as # tasks running on the machine when the event is
    logged
  - *evt. execution time*: time between the submission and the execution of
    the "event" (i.e. the execution associated with the event) (**queue time
    included**)
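
A small sketch of the event rate for one of the three x-axes (priority class). The sample events are invented, and the denominator is assumed to be the number of events inside each class; if "tot. # evts" means the trace-wide total, the division changes accordingly.

```python
from collections import Counter, defaultdict

# Hypothetical (priority_class, event_type) pairs, one per logged event.
events = [
    ("FREE", "EVICT"), ("FREE", "EVICT"), ("FREE", "FINISH"), ("FREE", "KILL"),
    ("PROD", "FINISH"), ("PROD", "FINISH"), ("PROD", "FAIL"), ("PROD", "FINISH"),
]


def event_rates(events):
    """Map bucket -> {event type: share of that bucket's events}."""
    counts = defaultdict(Counter)
    for bucket, event_type in events:
        counts[bucket][event_type] += 1
    return {
        bucket: {etype: n / sum(by_type.values()) for etype, n in by_type.items()}
        for bucket, by_type in counts.items()
    }


rates = event_rates(events)
```

The same function works for exec. time intervals or concurrency intervals once each event is labelled with its interval instead of its priority class.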
- *V-B Figure 8: Event rates vs. requested resources, resource reservation,
  resource utilization*:
  6 graphs, one for each of [CPU, RAM] x [requested, reserved, utilized]. X, Y,
  and series like Fig. 7
  - *reservation* is the sum of the resources reserved by all tasks executing
    on the machine at event time / resources on the machine
  - *utilization* is the sum of the resources used by all tasks executing on
    the machine at event time / resources on the machine
  - task-*requested* is the amount of resources requested by the event's task
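
The reservation and utilization ratios defined above reduce to the same computation on different fields. The snapshot below is a hypothetical example of the tasks found on one machine at event time, with made-up normalized values.

```python
# Hypothetical machine capacity and per-task figures, in normalized units.
machine_capacity = {"cpu": 1.0, "ram": 0.5}
running_tasks = [
    {"reserved": {"cpu": 0.25, "ram": 0.125}, "used": {"cpu": 0.125, "ram": 0.0625}},
    {"reserved": {"cpu": 0.5, "ram": 0.25}, "used": {"cpu": 0.25, "ram": 0.125}},
]


def machine_ratio(tasks, capacity, field, resource):
    """Sum one field over all tasks on the machine, divided by its capacity."""
    return sum(task[field][resource] for task in tasks) / capacity[resource]


cpu_reservation = machine_ratio(running_tasks, machine_capacity, "reserved", "cpu")
ram_utilization = machine_ratio(running_tasks, machine_capacity, "used", "ram")
```

With `field="reserved"` the result is the *reservation* and with `field="used"` it is the *utilization*.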
- *V-C Figure 9: Job rates vs. job size, job execution time and machine
  locality*:
  Like Fig. 7/8, but for jobs
  - *job rate* = # of jobs of given type / tot. # jobs
  - *job size* = # of tasks in job
  - *machine locality* = ?
  - *job exec. time* includes **queue time**, like evt. exec. time

### Remarks from 2015 paper
- "Event type" is shorthand for one of FAIL, EVICT, FINISH, KILL
- A task's (event) type is the type of its last event
- A task's life cycle has the following times:
  - Submission time: when the task enters the cluster
  - Scheduling time: when the task is loaded on a machine
  - Completion time: when the task produces an event

  Of course, after completion a task may be resubmitted (e.g. if the task is
  evicted)
- Metrics measured are:
  - *requested*, *used*, *machine capacity*: resources for CPU, RAM, DISK
  - Priority (**Priorities are 0-11 in the 2015 traces, use conversion table**)
  - Execution time for jobs/tasks/events
  - Machine locality (*machines needed*/*job size*)
- Job data is sanitized:
  - Exclude jobs with no tasks
  - Exclude jobs with missing information
  - Exclude jobs out of trace bounds (started before or ended after the trace)
- "Wasted time" and "Wasted resources" are time and resources spent on
  unsuccessful executions of tasks (i.e. executions without a FINISH event)
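
The sanitization rules and the machine locality metric from the remarks above can be sketched as a simple filter. The job records, field names and trace bounds are hypothetical, and `None` is assumed to mark missing information.

```python
# Hypothetical trace bounds and job records (times in seconds).
TRACE_START, TRACE_END = 0.0, 1000.0
jobs = [
    {"id": 1, "tasks": 4, "machines": 2, "start": 10.0, "end": 200.0},
    {"id": 2, "tasks": 0, "machines": 0, "start": 20.0, "end": 30.0},
    {"id": 3, "tasks": 2, "machines": None, "start": 5.0, "end": 50.0},
    {"id": 4, "tasks": 3, "machines": 3, "start": 900.0, "end": 1200.0},
    {"id": 5, "tasks": 2, "machines": 1, "start": -50.0, "end": 40.0},
]


def is_usable(job):
    """Apply the three exclusion rules from the remarks."""
    if any(value is None for value in job.values()):
        return False  # missing information
    if job["tasks"] == 0:
        return False  # no tasks
    # Out of trace bounds: started before or ended after the trace.
    return TRACE_START <= job["start"] and job["end"] <= TRACE_END


def machine_locality(job):
    """Machines needed over job size (number of tasks)."""
    return job["machines"] / job["tasks"]


clean_jobs = [job for job in jobs if is_usable(job)]
localities = {job["id"]: machine_locality(job) for job in clean_jobs}
```

Filtering before computing locality also avoids dividing by zero for jobs without tasks.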