diff --git a/thesis-dev.md b/thesis-dev.md
new file mode 100644
index 00000000..68d78132
--- /dev/null
+++ b/thesis-dev.md
@@ -0,0 +1,116 @@
+
+
+# Thesis development and status
+
+## Thesis objective
+Google comparison of the 2011 and 2020 clusters
+
+We redo the same thing, not in general but from the point of view of failures.
+Take the Rosà 2015 paper (analysis part, paper “Understanding the Dark Side of
+Big Data Clusters An Analysis beyond Failures - Rosa Chen Binder.pdf”) and redo
+the analyses on the 2020 data. Then compare the 2015 and 2020 analyses (as in
+the Google paper).
+
+Start the thesis with a general part that describes the traces and general
+statistics in 2-3 pages. In the second part, redo the analyses (citing the
+Google comparison as inspiration).
+
+Differentiate the analysis by data center (there are now 8)
+
+Replicate the analysis for each data center
+
+*Motivation of the paper: failures are numerous, why?*
+
+Project deadline: give notice via the Google Drive document once it is known
+from Pezzè.
+
+## Analysis from Rosa/Chen Paper
+- Table of distinct CPU/Memory configurations of machines and their distrib. (%)
+  (Table I)
+- *III-A: Temporal impact: machine time waste*:
+  Stacked histogram
+  - Y-axis: normalized (%) aggregated machine time
+  - X-axis: event type
+  Three series:
+  - Resubmission time: sum of all *subm. time* - *previous compl. time*
+  - Queue time: sum of all *sched. time* - *subm. time*
+  - Running time: sum of all *compl. time* - *sched. time*
+- (%) total wasted time per unsuccessful event type
+- (mins.) avg. wasted time per number of events for each event type
+- breakdown of wasted time per *submission*, *scheduling*, *queue*
+- *III-A-I: Average slowdown per task*: (Table II)
+  For FINISH type tasks, compute *slowdown*, i.e. mean (**ask Rosa**) of all
+  *response time* for each task event over *response time* of last event (which
+  is by def. FINISH). Response time is defined as *Queue time* + *Exec time*
+  Table II shows:
+  - % of finish tasks
+  - mean *response time* (all events)
+  - mean *response time* (last event for each task)
+  - mean *slowdown*
+- *III-B: Spatial impact: resource waste*:
+  Normalized % (y-axis) partition of *resource demand* (CPU, DISK, RAM, x-axis)
+  used per task event type (distributions)
+  - *resource demand*: UoM defined as RES (NCU/NMU) / s
+- *IV-A-1 Table III: Mean number of events and their distribution per task type*:
+  Mean and 95th-percentile number of events for each task type, and mean number
+  of events of each type
+- *IV-A-2 Figure 5: Cond. probability of task success given # of unsuccessful
+  evts for each type observed*:
+  X-axis is # evts. Y-axis is probability the task will succeed. 3 distributions,
+  one for EVICT, FAIL, and KILL. (# evts refers to events of that specific type)
+- *IV-B Table IV: Mean number of tasks and evt. distribution per job type*:
+  Like Table III but for jobs (mean # of tasks + 95th percentile, then avg. #
+  of evts. of each type)
+- *IV-B-1 Figure 6: Job Inter-Type Times*:
+  *Inter-Type* is defined as time between completions of jobs of same evt. type.
+  Empirical CDF for distribution of job inter-type times for each evt. type.
+  Curve fitting with Weibull, Exp., Gamma, Normal and Log-normal + KS test.
+- *IV-C Table V: Dependencies between jobs and events*:
+  Probability that a job terminates with a given evt. type if an event of
+  another evt. type is observed ("probability matrix")
+- *V-A Figure 7: Event rates vs. task priority, event execution time, machine
+  concurrency*:
+  3 graphs with x-axes (classes of priority, exec. time intervals, and
+  *concurrency* intervals), y-axis is Event rate (i.e. # of evts of that type /
+  tot. # evts). 4 series per graph, one for each event type.
+  - Note: priority classes are based on FREE, LOW, HIGH, PROD Borg "tiers"
+  - *concurrency* is defined as # tasks running on the machine when the event is
+    logged
+  - *evt. execution time*: time between submission and execution of "event"
+    (i.e. execution associated with event) (**includes queue time**)
+- *V-B Figure 8: Event rates vs. requested resources, resource reservation,
+  resource utilization*:
+  6 graphs, one for [CPU, RAM] X [requested, reserved, utilized]. X, Y, and
+  series like Fig. 7
+  - *reservation* is sum of reserved resources by all tasks executed on the
+    machine at event time / resources on the machine
+  - *utilization* is sum of used resources by all tasks executed on the machine
+    at event time / resources on the machine
+  - task-*requested* is the amount of resources requested by the event's task
+- *V-C Figure 9: Job rates vs job size, job execution time and machine
+  locality*:
+  Like Fig 7/8, but for jobs
+  - *job rate* = # of jobs of given type / tot. # jobs
+  - *job size* = # of tasks in job
+  - *machine locality* = ?
+  - *job exec. time* includes **queue time**, like evt. exec. time
+
+### Remarks from 2015 paper
+- Event types are lingo for (FAIL, EVICT, FINISH, KILL)
+- Task (event) type is based on the last event's type
+- Task life cycle has times:
+  - Submission time: when task enters the cluster
+  - Scheduling time: when task is loaded on a machine
+  - Completion time: when task produces an event
+  Of course after completion a task may be resubmitted (e.g. if task is evicted)
+- Metrics measured are:
+  - *requested*, *used*, *machine capacity*: resources for CPU, RAM, DISK
+  - Priority (**Priorities are 0-11 in the 2015 traces, use conversion table**)
+  - Execution time for jobs/tasks/events
+  - Machine locality (*machines needed*/*job size*)
+- Job data is sanitized:
+  - Exclude jobs with no tasks
+  - Exclude jobs with missing information
+  - Exclude jobs out of trace bounds (started early or ended later than trace)
+- "Wasted time" and "Wasted resources" are time and resources spent on
+  unsuccessful executions of tasks (i.e. executions without a FINISH event)
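
A minimal sketch of the III-A time breakdown and the III-A-I slowdown from the notes above, for one task at a time. The `Execution` record and the function names are hypothetical, not taken from the paper or from the trace schema. Two assumptions are made: resubmission time is attributed to the event type of the execution that follows the gap, and slowdown uses the mean of all response times, which the notes still mark as an open question. Running time is taken here as completion minus scheduling, so that queue time plus running time equals the response time.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Execution:
    """One execution attempt of a task (hypothetical record layout)."""
    submit: float    # submission time: task enters the cluster
    schedule: float  # scheduling time: task is loaded on a machine
    complete: float  # completion time: task produces an event
    event: str       # terminating event: FINISH, FAIL, EVICT or KILL


def response_time(ex: Execution) -> float:
    """Queue time plus execution time of a single attempt."""
    return (ex.schedule - ex.submit) + (ex.complete - ex.schedule)


def time_breakdown(executions: List[Execution]) -> Dict[str, Dict[str, float]]:
    """Sum resubmission, queue and running time per event type.

    `executions` must be ordered by submission time. Assumption: the gap
    before a resubmission is charged to the execution that follows it.
    """
    totals: Dict[str, Dict[str, float]] = {}
    prev_complete = None
    for ex in executions:
        t = totals.setdefault(
            ex.event, {"resubmission": 0.0, "queue": 0.0, "running": 0.0}
        )
        if prev_complete is not None:
            t["resubmission"] += ex.submit - prev_complete
        t["queue"] += ex.schedule - ex.submit
        t["running"] += ex.complete - ex.schedule
        prev_complete = ex.complete
    return totals


def wasted_time(executions: List[Execution]) -> float:
    """Response time spent on executions that did not end with FINISH."""
    return float(sum(response_time(ex) for ex in executions if ex.event != "FINISH"))


def slowdown(executions: List[Execution]) -> float:
    """Mean response time of all attempts over that of the last attempt.

    Assumption: the aggregate is the mean (unconfirmed in the notes).
    Only defined for tasks whose last event is FINISH.
    """
    if not executions or executions[-1].event != "FINISH":
        raise ValueError("slowdown is only defined for FINISH tasks")
    times = [response_time(ex) for ex in executions]
    return (sum(times) / len(times)) / times[-1]


if __name__ == "__main__":
    attempts = [
        Execution(submit=0, schedule=2, complete=10, event="EVICT"),
        Execution(submit=12, schedule=13, complete=18, event="FINISH"),
    ]
    print(time_breakdown(attempts))
    print(wasted_time(attempts), slowdown(attempts))
```

Summing the per-task dictionaries over all tasks and normalizing by the grand total would give the stacked-histogram series. Swapping the mean for a sum in `slowdown` is a one-line change if that turns out to be the intended definition.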
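
The IV-A-2 item (Figure 5) can be sketched in the same spirit. The input layout, one ordered list of event types per task, is an assumption and not the trace format. "Given # of unsuccessful evts" is read here as "exactly n events of that type", which is also an assumption; an "at least n" reading would need a cumulative count instead.

```python
from collections import Counter, defaultdict
from typing import Dict, List

UNSUCCESSFUL = ("EVICT", "FAIL", "KILL")


def success_probability(tasks: List[List[str]]) -> Dict[str, Dict[int, float]]:
    """Estimate P(task ends with FINISH | task saw exactly n events of a type).

    `tasks` holds one ordered list of event types per task (hypothetical
    layout). The task type is the type of its last event, as in the remarks.
    """
    seen = defaultdict(Counter)      # seen[type][n]: tasks with n events of type
    finished = defaultdict(Counter)  # same, restricted to FINISH tasks
    for events in tasks:
        if not events:
            continue  # tasks without events carry no information
        counts = Counter(events)
        succeeded = events[-1] == "FINISH"
        for etype in UNSUCCESSFUL:
            n = counts[etype]
            seen[etype][n] += 1
            if succeeded:
                finished[etype][n] += 1
    return {
        etype: {
            n: finished[etype][n] / total
            for n, total in sorted(seen[etype].items())
        }
        for etype in UNSUCCESSFUL
    }


if __name__ == "__main__":
    sample = [
        ["EVICT", "FINISH"],
        ["EVICT", "KILL"],
        ["FINISH"],
        ["FAIL", "FAIL", "FINISH"],
    ]
    for etype, curve in success_probability(sample).items():
        print(etype, curve)
```

Each inner dictionary is one of the three curves: the keys are the x-axis (# evts of that type) and the values are the y-axis (probability the task succeeds).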