Thesis development and status

Thesis objective

Google's comparison of the 2011 and 2020 cluster traces

We redo the same thing, not in general terms but from the point of view of failures. Take the Rosa 2015 paper (analysis part, paper “Understanding the Dark Side of Big Data Clusters An Analysis beyond Failures - Rosa Chen Binder.pdf”) and redo its analyses on the 2020 data. Then compare the 2015 analysis with the 2020 analysis (as in the Google paper).

Start the thesis with a general part that describes the traces and general statistics in 2-3 pages. In the second part, redo the analyses (cite the Google comparison as the inspiration).

Differentiate the analysis by data center (there are now 8)

Replicate the analysis for each data center

Motivation of the paper: failures are many, why?

Project deadline: report it via the Google Drive document as soon as it is known from pezze.

Analysis from Rosa/Chen Paper

  • [ machine_configs] Table of distinct CPU/Memory configurations of machines and their distrib. (%) (Table I)
  • [ machine_time_waste] III-A: Temporal impact: machine time waste: Stacked histogram
    • Y-axis: normalized (%) aggregated machine time
    • X-axis: event type
    • Three series:
      • Resubmission time: sum of all (subm. time - previous compl. time)
      • Queue time: sum of all (sched. time - subm. time)
      • Running time: sum of all (compl. time - sched. time)
  • (%) total wasted time per unsuccessful event type
  • (mins.) avg. wasted time per number of events for each event type
  • breakdown of wasted time per submission, scheduling, queue
  • [ task_slowdown] III-A-I: Average slowdown per task (Table II): for FINISH type tasks, compute the slowdown, i.e. the mean (ask Rosa) of the response times of all the task's events over the response time of its last event (which is by definition FINISH). Response time is defined as queue time + execution time. Table II shows:
    • % of finish tasks
    • mean response time (all events)
    • mean response time (last event for each task)
    • mean slowdown
  • III-B: Spatial impact: resource waste: Normalized % (y-axis) partition of resource demand (CPU, DISK, RAM, x-axis) used per task event type (distributions)
    • resource demand: UoM defined as RES (NCU/NMU) / s
  • IV-A-1 Table III: Mean number of events and their distribution per task type: Mean and 95 %-ile number of events per each task type and mean number of events of each type
  • IV-A-2 Figure 5: Cond. probability of task success given # of unsuccessful evts of each type observed: X-axis is # evts, Y-axis is the probability that the task will succeed. Three distributions, one each for EVICT, FAIL, and KILL (# evts refers to events of that specific type).
  • IV-B Table IV: Mean number of tasks and evt. distribution per job type: like Table III but for jobs (mean # of tasks + 95 %-ile, then avg. # of evts. of each type)
  • IV-B-1 Figure 6: Job Inter-Type Times: Inter-Type time is defined as the time between completions of jobs of the same evt. type. Empirical CDF of the job inter-type times for each evt. type. Curve fitting with Weibull, Exp., Gamma, Normal and Log-normal + KS test.
  • IV-C Table V: Dependencies between jobs and events: Probability that a job terminates with a given evt. type if an event of another evt. type is observed ("probability matrix")
  • V-A Figure 7: Event rates vs. task priority, event execution time, machine concurrency: 3 graphs whose x-axes are classes of priority, exec. time intervals, and concurrency intervals; the y-axis is the event rate (i.e. # of evts of that type / tot. # evts). 4 series per graph, one for each event type.
    • Note: priority classes are based on FREE, LOW, HIGH, PROD Borg "tiers"
    • concurrency is defined as # tasks running on the machine when the event is logged
    • evt. execution time: time between submission and completion of the execution associated with the "event" (queue time included)
  • V-B Figure 8: Event rates vs. requested resources, resource reservation, resource utilization: 6 graphs, one for [CPU, RAM] X [requested, reserved, utilized]. X, Y, and series like Fig. 7
    • reservation is sum of reserved resources by all tasks executed on the machine at event time / resources on the machine
    • utilization is sum of used resources by all tasks executed on the machine at event time / resources on the machine
    • task-requested is the amount of resources requested by the event's task
  • V-C Figure 9: Job rates vs job size, job execution time and machine locality: Like Fig 7/8, but for jobs
    • job rate = # of jobs of given type / tot. # jobs
    • job size = # of tasks in job
    • machine locality = machines needed / job size (as listed in the remarks from the 2015 paper)
    • job exec. time includes queue time, like evt. exec. time
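
The time-based metrics of III-A and III-A-I can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Execution` record, its field names and the sample values are hypothetical. Running time is assumed to be completion time minus scheduling time, and wasted time is assumed to count the full response time (queue plus running) of every unsuccessful execution. Since the notes still mark the exact slowdown aggregate as to be confirmed, it is left as a parameter that defaults to the arithmetic mean.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Optional, Sequence


@dataclass
class Execution:
    """One execution attempt of a task (hypothetical record layout)."""
    submit: float    # submission time
    schedule: float  # scheduling time
    complete: float  # completion time
    event: str       # terminating event: FINISH, FAIL, EVICT or KILL


def response_time(e: Execution) -> float:
    """Queue time plus execution time of one attempt."""
    return e.complete - e.submit


def time_breakdown(execs: Sequence[Execution]) -> Dict[str, float]:
    """Sum resubmission, queue and running time over a task's attempts.

    `execs` must be ordered by submission time.
    """
    resub = queue = run = 0.0
    prev_complete: Optional[float] = None
    for e in execs:
        if prev_complete is not None:
            resub += e.submit - prev_complete
        queue += e.schedule - e.submit
        run += e.complete - e.schedule  # assumption: running starts at scheduling
        prev_complete = e.complete
    return {"resubmission": resub, "queue": queue, "running": run}


def slowdown(execs: Sequence[Execution],
             aggregate: Callable = mean) -> Optional[float]:
    """Aggregate response time of all attempts over that of the last one.

    Only defined for tasks whose last event is FINISH; returns None otherwise.
    """
    if not execs or execs[-1].event != "FINISH":
        return None
    return aggregate([response_time(e) for e in execs]) / response_time(execs[-1])


def wasted_time(execs: Sequence[Execution]) -> float:
    """Response time spent on attempts that did not end with FINISH."""
    return sum(response_time(e) for e in execs if e.event != "FINISH")


# Made-up task: evicted once, then resubmitted and finished.
execs = [Execution(0, 2, 10, "EVICT"), Execution(12, 13, 20, "FINISH")]
print(time_breakdown(execs))
print(slowdown(execs), slowdown(execs, aggregate=sum))
print(wasted_time(execs))
```

Passing `aggregate=sum` gives the alternative reading in which the slowdown compares the total response time of all attempts with the response time of the final one.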

Remarks from 2015 paper

  • Event types are lingo for (FAIL, EVICT, FINISH, KILL)
  • A task's (event) type is the type of its last event
  • The task life cycle has three timestamps:
    • Submission time: when the task enters the cluster
    • Scheduling time: when the task is loaded on a machine
    • Completion time: when the task produces an event. After completion a task may be resubmitted (e.g. if it was evicted)
  • Metrics measured are:
    • requested, used, machine capacity: resources for CPU, RAM, DISK
    • Priority (priorities are 0-11 in the traces used by the 2015 paper; use the conversion table)
    • Execution time for jobs/tasks/events
    • Machine locality (machines needed/job size)
  • Job data is sanitized:
    • Exclude jobs with no tasks
    • Exclude jobs with missing information
    • Exclude jobs outside the trace bounds (started before the trace begins or ended after it ends)
  • "Wasted time" and "Wasted resources" are time and resources spent on unsuccessful executions of tasks (i.e. executions without a FINISH event)
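
Two of the remarks above, typing a task by its last event and sanitizing job data, can be sketched as follows. This is a hedged illustration only: the tuple and dictionary layouts, the key names `tasks`, `start` and `end`, and the inclusive handling of the trace bounds are assumptions, not the format of the actual traces.

```python
from typing import Optional, Sequence, Tuple

# The four event types used in the 2015 paper.
TERMINATION_TYPES = {"FINISH", "FAIL", "EVICT", "KILL"}


def task_type(events: Sequence[Tuple[float, str]]) -> Optional[str]:
    """Return the type of the task's last termination event.

    `events` holds (timestamp, event_type) pairs in any order; events that
    are not one of the four termination types are ignored.
    """
    terminations = [(t, ev) for t, ev in events if ev in TERMINATION_TYPES]
    if not terminations:
        return None
    return max(terminations, key=lambda pair: pair[0])[1]


def keep_job(job: dict, trace_start: float, trace_end: float) -> bool:
    """Apply the three sanitization rules to one job record (hypothetical layout)."""
    if not job.get("tasks"):              # exclude jobs with no tasks
        return False
    start, end = job.get("start"), job.get("end")
    if start is None or end is None:      # exclude jobs with missing information
        return False
    # exclude jobs outside the trace bounds (inclusive bounds are an assumption)
    return trace_start <= start and end <= trace_end


# Made-up data for illustration.
events = [(1.0, "SCHEDULE"), (5.0, "EVICT"), (9.0, "FINISH")]
print(task_type(events))

jobs = [
    {"tasks": ["t1"], "start": 10.0, "end": 50.0},
    {"tasks": [], "start": 10.0, "end": 50.0},
    {"tasks": ["t2"], "start": 10.0, "end": None},
    {"tasks": ["t3"], "start": 10.0, "end": 500.0},
]
print([keep_job(j, trace_start=0.0, trace_end=100.0) for j in jobs])
```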