Added thesis-dev
parent
4d5ec72df3
commit
e455c6efbd
1 changed files with 116 additions and 0 deletions
116
thesis-dev.md
Normal file
@ -0,0 +1,116 @@
<!-- vim: set ts=2 sw=2 et tw=80: -->

# Thesis development and status

## Thesis objective
Google's comparison of the 2011 and 2020 clusters.

We redo the same thing, not in general but from the point of view of failures.
Take the Rosà 2015 paper (analysis part, paper “Understanding the Dark Side of
Big Data Clusters An Analysis beyond Failures - Rosa Chen Binder.pdf”) and redo
the analyses on the 2020 data. Then compare the 2015 analysis with the 2020
analysis (as in the Google paper).

Start the thesis with a general part that describes the traces and general
statistics in 2-3 pages. In the second part, redo the analyses (citing the
Google comparison as inspiration).

Differentiate the analysis by data center (there are now 8).

Replicate the analysis for each data center.

*Motivation of the paper: there are many failures; why?*

Project deadline: send notice, via the Google Drive document, once it is known
from Pezzè.

## Analysis from Rosa/Chen Paper
- Table of distinct CPU/Memory configurations of machines and their distrib. (%)
  (Table I)
- *III-A: Temporal impact: machine time waste*:
  Stacked histogram
  - Y-axis: normalized (%) aggregated machine time
  - X-axis: event type
  Three series:
  - Resubmission time: sum of all *subm. time* - *previous compl. time*
  - Queue time: sum of all *sched. time* - *subm. time*
  - Running time: sum of all *compl. time* - *sched. time*
  - (%) total wasted time per unsuccessful event type
  - (mins.) avg. wasted time per number of events for each event type
  - breakdown of wasted time per *submission*, *scheduling*, *queue*
- *III-A-I: Average slowdown per task*: (Table II)
  For FINISH type tasks, compute *slowdown*, i.e. the mean (**ask Rosa**) of the
  *response time* of every event of the task over the *response time* of the
  last event (which is by def. FINISH). Response time is defined as
  *Queue time* + *Exec time*.
  Table II shows:
  - % of FINISH tasks
  - mean *response time* (all events)
  - mean *response time* (last event for each task)
  - mean *slowdown*
- *III-B: Spatial impact: resource waste*:
  Normalized % (y-axis) partition of *resource demand* (CPU, DISK, RAM, x-axis)
  used per task event type (distributions)
  - *resource demand*: UoM defined as RES (NCU/NMU) / s
- *IV-A-1 Table III: Mean number of events and their distribution per task type*:
  Mean and 95 %-ile number of events for each task type and mean number of
  events of each type
- *IV-A-2 Figure 5: Cond. probability of task success given # of unsuccessful
  evts for each type observed*:
  X-axis is # evts. Y-axis is probability the task will succeed. 3 distributions,
  one each for EVICT, FAIL, and KILL. (# evts refers to events of that specific
  type)
- *IV-B Table IV: Mean number of tasks and evt. distribution per job type*:
  Like Table III but for jobs (mean # of tasks + 95 %-ile, then avg. # of evts.
  of each type)
- *IV-B-1 Figure 6: Job Inter-Type Times*:
  *Inter-Type* time is defined as the time between completions of jobs of the
  same evt. type.
  Empirical CDF for distribution of job inter-type times for each evt. type.
  Curve fitting with Weibull, Exp., Gamma, Normal and Log-normal + KS test.
- *IV-C Table V: Dependencies between jobs and events*:
  Probability that a job terminates with a given evt. type if an event of
  another evt. type is observed ("probability matrix")
- *V-A Figure 7: Event rates vs. task priority, event execution time, machine
  concurrency*:
  3 graphs with x-axes (classes of priority, exec. time intervals, and
  *concurrency* intervals), y-axis is Event rate (i.e. # of evts of that type /
  tot. # evts). 4 series per graph, one for each event type.
  - Note: priority classes are based on FREE, LOW, HIGH, PROD Borg "tiers"
  - *concurrency* is defined as # tasks running on the machine when the event is
    logged
  - *evt. execution time*: time between submission and execution of "event"
    (i.e. execution associated with event) (**queue time included**)
- *V-B Figure 8: Event rates vs. requested resources, resource reservation,
  resource utilization*:
  6 graphs, one for [CPU, RAM] X [requested, reserved, utilized]. X, Y, and
  series like Fig. 7
  - *reservation* is sum of reserved resources by all tasks executed on the
    machine at event time / resources on the machine
  - *utilization* is sum of used resources by all tasks executed on the machine
    at event time / resources on the machine
  - task-*requested* is the amount of resources requested by the event's task
- *V-C Figure 9: Job rates vs job size, job execution time and machine
  locality*:
  Like Fig 7/8, but for jobs
  - *job rate* = # of jobs of given type / tot. # jobs
  - *job size* = # of tasks in job
  - *machine locality* = ?
  - *job exec. time* includes **queue time**, like evt. exec. time

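A minimal sketch of the III-A time series and the III-A-I slowdown, to pin down
the definitions before running them on the traces. The `Attempt` record and the
function names are hypothetical (they are not from the paper or the trace
schema). Two assumptions are made: running time is taken as completion minus
scheduling time, so that the three series do not overlap, and slowdown is the
plain mean of all response times over the last response time, which is still
the open point to check with Rosa.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class Attempt:
    """One execution attempt of a task (hypothetical record layout)."""
    submit: float    # submission time, in seconds
    schedule: float  # scheduling time, in seconds
    complete: float  # completion time, in seconds
    event: str       # terminating event: EVICT, FAIL, FINISH or KILL


def time_breakdown(attempts: List[Attempt]) -> Dict[str, float]:
    """Split one task's time into resubmission, queue and running time.

    `attempts` must be sorted by submission time.
    """
    out = {"resubmission": 0.0, "queue": 0.0, "running": 0.0}
    previous: Optional[Attempt] = None
    for a in attempts:
        if previous is not None:
            # Time between the previous completion and this resubmission.
            out["resubmission"] += a.submit - previous.complete
        out["queue"] += a.schedule - a.submit
        # Assumption: running time starts at scheduling, not at submission.
        out["running"] += a.complete - a.schedule
        previous = a
    return out


def response_time(a: Attempt) -> float:
    """Response time = queue time + execution time of one attempt."""
    return (a.schedule - a.submit) + (a.complete - a.schedule)


def slowdown(attempts: List[Attempt]) -> Optional[float]:
    """Mean response time of all attempts over that of the last attempt.

    Returns None for tasks whose last event is not FINISH.
    """
    if not attempts or attempts[-1].event != "FINISH":
        return None
    last = response_time(attempts[-1])
    if last <= 0:
        return None
    return mean(response_time(a) for a in attempts) / last
```

For a task evicted once and then finished, e.g.
`[Attempt(0, 10, 40, "EVICT"), Attempt(50, 55, 75, "FINISH")]`, this gives 10 s
of resubmission time, 15 s of queue time, 50 s of running time, and a slowdown
of 1.3 (mean response time 32.5 s over 25 s).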
### Remarks from 2015 paper
- Event types are lingo for (FAIL, EVICT, FINISH, KILL)
- Task (event) type is based on the last event's type
- Task life cycle has these times:
  - Submission time: when task enters the cluster
  - Scheduling time: when task is loaded on a machine
  - Completion time: when task produces an event
  Of course after completion a task may be resubmitted (e.g. if task is evicted)
- Metrics measured are:
  - *requested*, *used*, *machine capacity*: resources for CPU, RAM, DISK
  - Priority (**Priorities are 0-11 in the 2015 traces, use conversion table**)
  - Execution time for jobs/tasks/events
  - Machine locality (*machines needed*/*job size*)
- Job data is sanitized:
  - Exclude jobs with no tasks
  - Exclude jobs with missing information
  - Exclude jobs out of trace bounds (started before or ended after the trace)
- "Wasted time" and "Wasted resources" are time and resources spent on
  unsuccessful executions of tasks (i.e. executions without a FINISH event)
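
The remarks above can be sketched in a few lines, as a starting point only. The
input layouts (tuples of `(timestamp, event_type)`, tuples of
`(event_type, duration)`, and job dicts with `tasks`, `start` and `end` keys)
are hypothetical and do not reflect the actual trace schema. Treating "missing
information" as a missing start or end time, and treating the trace bounds as
inclusive, are both assumptions.

```python
from collections import defaultdict

# Event types that mark an unsuccessful execution.
UNSUCCESSFUL = {"EVICT", "FAIL", "KILL"}


def task_type(events):
    """Type of a task = type of its last event.

    `events` is a non-empty list of (timestamp, event_type), in any order.
    """
    return max(events, key=lambda e: e[0])[1]


def wasted_time(executions):
    """Sum the time spent on executions that did not end with FINISH.

    `executions` is an iterable of (event_type, duration_seconds).
    Returns a dict mapping each unsuccessful event type to its total time.
    """
    wasted = defaultdict(float)
    for event_type, duration in executions:
        if event_type in UNSUCCESSFUL:
            wasted[event_type] += duration
    return dict(wasted)


def keep_job(job, trace_start, trace_end):
    """Sanitization filter for one job (hypothetical dict layout)."""
    if not job.get("tasks"):
        return False  # job with no tasks
    start, end = job.get("start"), job.get("end")
    if start is None or end is None:
        return False  # assumption: missing information = missing timestamps
    # Drop jobs that started before or ended after the trace.
    return trace_start <= start and end <= trace_end
```

Used as `[j for j in jobs if keep_job(j, trace_start, trace_end)]`, the filter
would run once before any of the per-job analyses.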