<!-- vim: set ts=2 sw=2 et tw=80: -->

# Thesis development and status

## Thesis objective
Reference: Google's comparison of the 2011 and 2020 clusters.

We redo the same thing, not in general but from the point of view of failures.
Take Rosà's 2015 paper (analysis part; paper “Understanding the Dark Side of
Big Data Clusters An Analysis beyond Failures - Rosa Chen Binder.pdf”) and redo
its analyses on the 2020 data. Then compare the 2015 analysis with the 2020
analysis (as in the Google paper).

Start the thesis with a general part that describes the traces and general
statistics in 2-3 pages. In the second part, redo the analyses (citing the
Google comparison as inspiration).

Differentiate the analysis by data center (there are now 8).

Replicate the analysis for each data center.

*Motivation of the paper: failures are many; why?*

Project deadline: give notice, once it is known from Pezzè, via the Google
Drive document.

## Analysis from Rosa/Chen Paper
- [✅ **machine_configs**] Table of distinct CPU/memory configurations of
  machines and their distribution (%) (Table I)
- *III-A: Temporal impact: machine time waste*:
  Stacked histogram
  - Y-axis: normalized (%) aggregated machine time
  - X-axis: event type
  - Three series:
    - Resubmission time: sum of all *subm. time* - *previous compl. time*
    - Queue time: sum of all *sched. time* - *subm. time*
    - Running time: sum of all *compl. time* - *sched. time*
  - (%) total wasted time per unsuccessful event type
  - (mins.) avg. wasted time per number of events for each event type
  - breakdown of wasted time per *submission*, *scheduling*, *queue*
- *III-A-I: Average slowdown per task*: (Table II)
  For FINISH type tasks, compute the *slowdown*, i.e. the mean (**ask Rosa**)
  of the *response times* of all of the task's events divided by the *response
  time* of its last event (which is FINISH by definition). Response time is
  defined as *Queue time* + *Exec time*.

  Table II shows:
  - % of FINISH tasks
  - mean *response time* (all events)
  - mean *response time* (last event for each task)
  - mean *slowdown*
- *III-B: Spatial impact: resource waste*:
  Normalized % (y-axis) partition of *resource demand* (CPU, DISK, RAM, x-axis)
  used per task event type (distributions)
  - *resource demand*: unit of measure defined as RES (NCU/NMU) / s
- *IV-A-1 Table III: Mean number of events and their distribution per task type*:
  Mean and 95th percentile of the number of events for each task type, and mean
  number of events of each type
- *IV-A-2 Figure 5: Cond. probability of task success given # of unsuccessful
  evts for each type observed*:
  X-axis is # evts. Y-axis is the probability that the task will succeed. 3
  distributions, one each for EVICT, FAIL, and KILL. (# evts refers to events
  of that specific type)
- *IV-B Table IV: Mean number of tasks and evt. distribution per job type*:
  Like Table III but for jobs (mean # of tasks + 95th percentile, then avg. # of
  evts. of each type)
- *IV-B-1 Figure 6: Job Inter-Type Times*:
  *Inter-Type* time is defined as the time between completions of jobs of the
  same evt. type.
  Empirical CDF of the job inter-type times for each evt. type.
  Curve fitting with Weibull, Exp., Gamma, Normal and Log-normal + KS test.
- *IV-C Table V: Dependencies between jobs and events*:
  Probability that a job terminates with a given evt. type if an event of
  another evt. type is observed ("probability matrix")
- *V-A Figure 7: Event rates vs. task priority, event execution time, machine
  concurrency*:
  3 graphs with x-axes (classes of priority, exec. time intervals, and
  *concurrency* intervals); y-axis is the event rate (i.e. # of evts of that
  type / tot. # evts). 4 series per graph, one for each event type.
  - Note: priority classes are based on FREE, LOW, HIGH, PROD Borg "tiers"
  - *concurrency* is defined as # tasks running on the machine when the event is
    logged
  - *evt. execution time*: time between submission and execution of the "event"
    (i.e. the execution associated with the event), **queue time included**
- *V-B Figure 8: Event rates vs. requested resources, resource reservation,
  resource utilization*:
  6 graphs, one for each of [CPU, RAM] X [requested, reserved, utilized]. X, Y,
  and series like Fig. 7
  - *reservation* is the sum of resources reserved by all tasks executing on the
    machine at event time / resources on the machine
  - *utilization* is the sum of resources used by all tasks executing on the
    machine at event time / resources on the machine
  - task-*requested* is the amount of resources requested by the event's task
- *V-C Figure 9: Job rates vs. job size, job execution time and machine
  locality*:
  Like Fig. 7/8, but for jobs
  - *job rate* = # of jobs of given type / tot. # jobs
  - *job size* = # of tasks in job
  - *machine locality* = *machines needed* / *job size* (per the remarks from
    the 2015 paper)
  - *job exec. time* includes **queue time**, like evt. exec. time
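
The conditional probability behind Figure 5 could be estimated roughly as
follows. This is only a sketch under stated assumptions, not the paper's code:
each task is assumed to be available as the time-ordered list of its event
types, the condition is read as "exactly n events of that type", and the name
`success_probability` and the input layout are hypothetical.

```python
from collections import Counter, defaultdict

UNSUCCESSFUL = ("EVICT", "FAIL", "KILL")


def success_probability(tasks):
    """Estimate P(task ends with FINISH | n events of a given unsuccessful type).

    `tasks` is an iterable of event-type lists, one per task, in time order
    (hypothetical layout). Returns {event_type: {n: probability}}.
    """
    totals = defaultdict(Counter)     # totals[type][n]: tasks with n events of type
    successes = defaultdict(Counter)  # same, restricted to tasks ending in FINISH
    for events in tasks:
        if not events:
            continue
        counts = Counter(events)
        finished = events[-1] == "FINISH"  # task type = type of its last event
        for etype in UNSUCCESSFUL:
            n = counts[etype]
            totals[etype][n] += 1
            if finished:
                successes[etype][n] += 1
    return {
        etype: {
            n: successes[etype][n] / totals[etype][n]
            for n in sorted(totals[etype])
        }
        for etype in UNSUCCESSFUL
    }


# Toy input (made up for illustration only)
example = [
    ["FINISH"],
    ["EVICT", "FINISH"],
    ["EVICT", "KILL"],
    ["FAIL", "FAIL", "FINISH"],
]
probs = success_probability(example)
```

Each inner dictionary gives one curve of the figure (x: # evts of that type, y:
probability of success). If the paper conditions on "at least n" events
instead, the counters would need to be accumulated over n.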
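
The job inter-type times and their empirical CDF (Figure 6) could be computed
along these lines. The sketch assumes jobs come as `(completion_time,
event_type)` pairs, which is a hypothetical layout, and it only illustrates the
KS distance against one of the five candidate families (an exponential fitted by
maximum likelihood); the other fits and the test p-values are left out.

```python
import math
from collections import defaultdict


def inter_type_times(jobs):
    """Gaps between completions of jobs that share a terminating event type.

    `jobs` is an iterable of (completion_time, event_type) pairs
    (hypothetical layout). Returns {event_type: [gap, ...]}.
    """
    by_type = defaultdict(list)
    for completion_time, etype in jobs:
        by_type[etype].append(completion_time)
    gaps = {}
    for etype, times in by_type.items():
        times.sort()
        gaps[etype] = [b - a for a, b in zip(times, times[1:])]
    return gaps


def ecdf(samples):
    """Empirical CDF as a sorted list of (value, cumulative fraction) points."""
    xs = sorted(samples)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


def ks_statistic_exponential(samples):
    """KS distance between the sample ECDF and an exponential fitted by MLE."""
    xs = sorted(samples)
    n = len(xs)
    rate = n / sum(xs)  # MLE of the exponential rate: 1 / sample mean
    d = 0.0
    for i, x in enumerate(xs):
        cdf = 1.0 - math.exp(-rate * x)
        # The ECDF jumps from i/n to (i+1)/n at x; check both sides of the step
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


# Toy input (made up for illustration only)
jobs = [(10, "FINISH"), (40, "KILL"), (25, "FINISH"), (100, "KILL"), (55, "FINISH")]
gaps = inter_type_times(jobs)
finish_ecdf = ecdf(gaps["FINISH"])
distance = ks_statistic_exponential(gaps["FINISH"])
```

Because the rate is estimated from the same sample, the standard KS critical
values are only approximate here; the distance is still useful for ranking the
candidate distributions against each other.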
### Remarks from 2015 paper
- Event types are lingo for (FAIL, EVICT, FINISH, KILL)
- A task's (event) type is the type of its last event
- A task's life cycle has these times:
  - Submission time: when the task enters the cluster
  - Scheduling time: when the task is loaded on a machine
  - Completion time: when the task produces an event

  Of course, after completion a task may be resubmitted (e.g. if the task is
  evicted)
- Metrics measured are:
  - *requested*, *used*, *machine capacity*: resources for CPU, RAM, DISK
  - Priority (**Priorities are 0-11 in the 2015 traces, use conversion table**)
  - Execution time for jobs/tasks/events
  - Machine locality (*machines needed*/*job size*)
- Job data is sanitized:
  - Exclude jobs with no tasks
  - Exclude jobs with missing information
  - Exclude jobs out of trace bounds (started before the trace begins or ended
    after it ends)
- "Wasted time" and "Wasted resources" are time and resources spent on
  unsuccessful executions of tasks (i.e. executions without a FINISH event)
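
The life-cycle times and the notion of wasted time above can be turned into a
small computation. The following is a minimal sketch, assuming each task is a
time-ordered list of executions `(submission, scheduling, completion,
event_type)`; this layout and the function names are hypothetical. Two further
assumptions: resubmission time is attributed to the execution that follows the
resubmission, and wasted time counts all three components of every execution
that does not end in FINISH.

```python
from collections import defaultdict


def time_breakdown(tasks):
    """Sum resubmission, queue and running time per execution event type.

    `tasks` is an iterable of per-task execution lists in time order; each
    execution is (submission, scheduling, completion, event_type)
    (hypothetical layout).
    """
    totals = defaultdict(lambda: {"resubmission": 0, "queue": 0, "running": 0})
    for executions in tasks:
        previous_completion = None
        for submission, scheduling, completion, etype in executions:
            bucket = totals[etype]
            if previous_completion is not None:
                # Time between the previous event and this resubmission
                bucket["resubmission"] += submission - previous_completion
            bucket["queue"] += scheduling - submission
            bucket["running"] += completion - scheduling
            previous_completion = completion
    return dict(totals)


def wasted_share(totals):
    """Fraction of all accounted time spent on executions not ending in FINISH."""
    per_type = {etype: sum(parts.values()) for etype, parts in totals.items()}
    grand_total = sum(per_type.values())
    wasted = sum(t for etype, t in per_type.items() if etype != "FINISH")
    return wasted / grand_total if grand_total else 0.0


# Toy input (made up for illustration only)
tasks = [
    [(0, 2, 10, "EVICT"), (12, 13, 20, "FINISH")],
    [(5, 6, 9, "FAIL")],
]
totals = time_breakdown(tasks)
share = wasted_share(totals)
```

Normalizing each event type's three totals by the grand total gives the bars of
a stacked histogram like the one in III-A.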