Added thesis-dev

2021-02-15 11:36:09 +01:00 · 2021-02-15 11:36:09 +01:00 · 4cd796d5f0
parent c28c2051e7
commit 4cd796d5f0
1 changed files with 116 additions and 0 deletions
--- a/thesis-dev.md
+++ b/thesis-dev.md
@@ -0,0 +1,116 @@
+<!-- vim: set ts=2 sw=2 et tw=80: -->
+
+# Thesis development and status
+
+## Thesis objective
+Google's cluster comparison, 2011 vs. 2020
+
+We redo the same analysis, not in general but from the point of view of
+failures. Take the Rosà 2015 paper (analysis part, paper “Understanding the Dark
+Side of Big Data Clusters An Analysis beyond Failures - Rosa Chen Binder.pdf”)
+and redo the analyses on the 2020 data. Then compare the 2015 and 2020 analyses
+(as in the Google paper).
+
+Start the thesis with a general part that describes the traces and general
+statistics in 2-3 pages. In the second part, redo the analyses (citing the
+Google comparison as inspiration).
+
+Differentiate the analysis by data center (there are now 8).
+
+Replicate the analysis for each data center.
+
+*Motivation of the paper: failures are numerous. Why?*
+
+Project deadline: report it via the Google Drive document once it is known from
+Pezzè.
+
+## Analysis from Rosa/Chen Paper
+- Table of distinct CPU/Memory configurations of machines and their distrib. (%)
+  (Table I)
+- *III-A: Temporal impact: machine time waste*:
+  Stacked histogram
+  - Y-axis: normalized (%) aggregated machine time
+  - X-axis: event type
+  Three series:
+  - Resubmission time: sum of all *subm. time* - *previous compl. time*
+  - Queue time: sum of all *sched. time* - *subm. time*
+  - Running time: sum of all *compl. time* - *sched. time*
+- (%) total wasted time per unsuccessful event type
+- (mins.) avg. wasted time per number of events for each event type
+- breakdown of wasted time per *submission*, *scheduling*, *queue*
+- *III-A-I: Average slowdown per task*: (Table II)
+  For FINISH type tasks, compute *slowdown*, i.e. mean (**ask Rosa**) of all
+  *response time* for each task event over *response time* of last event (which
+  is by def. FINISH). Response time is defined as *Queue time* + *Exec time*
+  Table II shows:
+  - % of finish tasks
+  - mean *response time* (all events)
+  - mean *response time* (last event for each task)
+  - mean *slowdown*
+- *III-B: Spatial impact: resource waste*:
+  Normalized % (y-axis) partition of *resource demand* (CPU, DISK, RAM, x-axis)
+  used per task event type (distributions)
+  - *resource demand*: UoM defined as RES (NCU/NMU) / s
+- *IV-A-1 Table III: Mean number of events and their distribution per task type*:
+  Mean and 95 %-ile number of events for each task type and mean number of
+  events of each type
+- *IV-A-2 Figure 5: Cond. probability of task success given # of unsuccessful
+  evts for each type observed*:
+  X-axis is # evts. Y-axis is probability the task will succeed. Three
+  distributions: EVICT, FAIL, and KILL. (# evts refers to events of that type)
+- *IV-B Table IV: Mean number of tasks and evt. distribution per job type*:
+  Like table III but for jobs (mean # of tasks + 95 %-ile, then avg. # of evts.
+  of each type)
+- *IV-B-1 Figure 6: Job Inter-Type Times*:
+  *Inter-Type* time is the time between job completions of the same evt. type
+  Empirical CDF for distribution of job inter-type times for each evt. type.
+  Curve fitting with Weibull, Exp., Gamma, Normal and Log-normal + KS test.
+- *IV-C Table V: Dependencies between jobs and events*:
+  Probability that a job terminates with a given evt. type if an event of
+  another evt. type is observed ("probability matrix")
+- *V-A Figure 7: Event rates vs. task priority, event execution time, machine
+  concurrency*
+  3 graphs with x-axes (classes of priority, exec. time intervals, and
+  *concurrency* intervals), y-axis is Event rate (i.e. # of evts of that type /
+  tot. # evts).  4 series per graph, one for each event type.
+  - Note: priority classes are based on FREE, LOW, HIGH, PROD Borg "tiers"
+  - *concurrency* is defined as # tasks running on the machine when the event is
+    logged
+  - *evt. execution time*: time between submission and execution of "event"
+    (i.e. execution associated with event) (**includes queue time**)
+- *V-B Figure 8: Event rates vs. requested resources, resource reservation,
+  resource utilization*:
+  6 graphs, one for [CPU, RAM] X [requested, reserved, utilized]. X, Y, and
+  series like Fig. 7
+  - *reservation* is sum of reserved resources by all tasks executed on the
+    machine at event time / resources on the machine
+  - *utilization* is sum of used resources by all tasks executed on the machine
+    at event time / resources on the machine
+  - task-*requested* is the amount of resources requested by the event's task
+- *V-C Figure 9: Job rates vs job size, job execution time and machine
+  locality*:
+  Like Fig 7/8, but for jobs
+  - *job rate* = # of jobs of given type / tot. # jobs
+  - *job size* = # of tasks in job
+  - *machine locality* = *machines needed* / *job size*
+  - *job exec. time* includes **queue time**, like evt. exec. time
+
+### Remarks from 2015 paper
+- Event types are lingo for (FAIL, EVICT, FINISH, KILL)
+- A task's (event) type is based on its last event's type
+- The task life cycle has these times:
+  - Submission time: when task enters the cluster
+  - Scheduling time: when task is loaded on a machine
+  - Completion time: when task produces an event
+  Of course after completion a task may be resubmitted (e.g. if task is evicted)
+- Metrics measured are:
+  - *requested*, *used*, *machine capacity*: resources for CPU, RAM, DISK
+  - Priority (**Priorities are 0-11 in the 2015 traces, use conversion table**)
+  - Execution time for jobs/tasks/events
+  - Machine locality (*machines needed*/*job size*)
+- Job data is sanitized:
+  - Exclude jobs with no tasks
+  - Exclude jobs with missing information
+  - Exclude jobs out of trace bounds (started before or ended after the trace)
+- "Wasted time" and "Wasted resources" are time and resources spent on
+  unsuccessful executions of tasks (i.e. executions without a FINISH event)
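
The task life cycle and the "wasted time" definition in the notes above can be sketched roughly as follows. This is a minimal illustration, not the 2015 paper's code and not the Google trace schema: the record fields (`task`, `submit`, `schedule`, `complete`, `event`) and both function names are hypothetical. It assumes running time is measured from scheduling to completion so that the three components do not overlap, and it attributes each resubmission gap to the execution that follows it, which is also an assumption.

```python
from collections import defaultdict


def time_breakdown(executions):
    """Sum resubmission, queue and running time per terminating event type.

    `executions` is an iterable of dicts, one per task execution attempt,
    with hypothetical keys: task, submit, schedule, complete, event.
    """
    totals = defaultdict(
        lambda: {"resubmission": 0.0, "queue": 0.0, "running": 0.0}
    )
    prev_completion = {}  # task id -> completion time of its previous attempt

    # Process each task's attempts in submission order.
    for ex in sorted(executions, key=lambda e: (e["task"], e["submit"])):
        parts = totals[ex["event"]]
        if ex["task"] in prev_completion:
            # Gap between the previous completion and this resubmission
            # (assumption: charged to the attempt that follows the gap).
            parts["resubmission"] += ex["submit"] - prev_completion[ex["task"]]
        parts["queue"] += ex["schedule"] - ex["submit"]
        # Assumption: running time starts at scheduling, not at submission.
        parts["running"] += ex["complete"] - ex["schedule"]
        prev_completion[ex["task"]] = ex["complete"]
    return dict(totals)


def wasted_share(totals):
    """Fraction of all machine time spent on attempts that did not FINISH."""
    per_type = {ev: sum(parts.values()) for ev, parts in totals.items()}
    grand_total = sum(per_type.values())
    return {
        ev: value / grand_total
        for ev, value in per_type.items()
        if ev != "FINISH"
    }
```

Normalizing the per-type totals by the grand total gives values in the same spirit as the stacked histogram of section III-A; how the paper itself attributes resubmission gaps would need to be checked against the original text.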