
Report
Cluster Scheduler
Ytao 2013.5.15
References:
• Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. NSDI 2011.
• Multi-agent Cluster Scheduling for Scalability and
Flexibility. Berkeley technical report EECS-2012-273 (doctoral dissertation).
• Omega: Flexible, Scalable Schedulers for Large Compute Clusters. EuroSys 2013.
Cluster Scheduler (Intro)
• Cloud computing frameworks vary.
• New frameworks will likely continue to emerge,
and no single framework will be optimal for all
applications.
• Running multiple frameworks on one cluster improves
utilization and data sharing.
Problem Statement:
• Demand for cluster resources from diverse cluster computing
applications keeps increasing, and so does the number of
machines in a typical cluster.
• The challenge is to design cluster schedulers
that provide flexible, scalable, and efficient
resource allocation.
Cluster Scheduler
Monolithic State Scheduling (MSS)
• Traditional and popular examples: LSF, Condor,
Hadoop
• Concept: a single scheduling agent process
that makes all scheduling decisions
sequentially
• Usage: The agent takes input about
framework requirements, resource availability,
and organizational policies, and computes a
global schedule for all tasks
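The single-agent idea can be sketched in a few lines of Python. This is only an illustration: the first-fit placement rule, the CPU-only resource model, and the names `monolithic_schedule`, `tasks`, and `free_cpus` are assumptions, not taken from LSF, Condor, or Hadoop.

```python
def monolithic_schedule(tasks, free_cpus):
    """Single agent: place every task sequentially using one global view.

    tasks: list of (task_id, cpus_needed); free_cpus: dict machine -> free CPUs.
    First-fit is an assumed placement rule, chosen only for illustration.
    """
    free = dict(free_cpus)          # the agent's global view of the cluster
    schedule = {}
    for task_id, need in tasks:     # decisions are made one at a time
        for machine in sorted(free):
            if free[machine] >= need:
                free[machine] -= need
                schedule[task_id] = machine
                break
    return schedule


plan = monolithic_schedule([("t1", 2), ("t2", 4), ("t3", 2)], {"m1": 4, "m2": 4})
```

Every framework's requirements would have to be encoded inside this one function, which is where the complexity of the monolithic approach comes from.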
Cluster Scheduler
Monolithic State Scheduling (MSS 2)
• Advantage: a global view, so scheduling can be globally optimal.
• Challenges:
– Complexity: the scheduler must capture the requirements of every framework
– Scalability: new frameworks keep emerging
– Frameworks lose their own scheduling optimizations
Cluster Scheduler (update)
• Scalability (response time, number of machines)
• Flexibility (heterogeneous mix of jobs)
• Usability and maintainability (adapt easily to new
types of jobs and frameworks)
• Fault isolation (minimize dependencies between
unrelated jobs)
• Utilization (achieve high cluster resource
utilization, e.g., CPU utilization, memory
utilization)
Partitioned State Scheduling (PSS)
• PSS: cluster state is divided between multiple
scheduling agents as non-overlapping scheduling domains.
• Statically Partitioned State Scheduling (SPS): cluster resources
are statically assigned to particular frameworks.
• Dynamically Partitioned State Scheduling (DPS): Mesos,
NSDI 2011
Replicated State Scheduling (RSS)
• In RSS, scheduling domains may overlap, and
optimistic concurrency control is used to
resolve conflicting transactions.
• Example: Omega, EuroSys 2013
Cluster Environment
• Use of commodity servers
• Tens to hundreds of thousands of servers
• Heterogeneous resources
• Use of commodity networks
Mixed workloads
• Service jobs vs. terminating jobs
• Service jobs consist of a set of service tasks that are conceptually intended
to run forever and are interacted with through request-response interfaces,
e.g., a set of web servers or relational database servers.
• Terminating jobs, on the other hand, are given a set of inputs, perform
some work as a function of those inputs, and are intended to terminate
eventually (traditional HPC cluster management considers only this kind).
Mesos
• Goals:
– Support and demonstrate multi-agent scheduling
– Support a fair-sharing meta-scheduling policy
– Increase overall cluster utilization
– Scale to tens of thousands of machines and hundreds of jobs
• Mesos aims to provide a scalable and resilient core that
enables various frameworks to share clusters efficiently.
• Approach: define a minimal interface that enables efficient
resource sharing across frameworks, and
push control of task scheduling and execution to the
frameworks.
Mesos (resource allocation)
• Resource offer strategy: Mesos offers spare (available)
resources to frameworks according to
– Fairness
– Priority
• A framework rejects an offer that does not satisfy
its requirements and waits for a better one.
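A rough Python sketch of one resource-offer round follows. It is not the Mesos API: the function `offer_round`, the `min_cpus` threshold, and the "least-allocated framework goes first" fairness rule are all assumptions made for illustration.

```python
def offer_round(spare_cpus, frameworks):
    """One offer round: offer spare CPUs to frameworks, least-allocated first.

    frameworks: list of dicts with 'name', 'allocated', and 'min_cpus'
    (hypothetical field: the smallest offer the framework will accept).
    """
    log = []
    # Fairness (assumed rule): the framework holding the least goes first.
    for fw in sorted(frameworks, key=lambda f: f["allocated"]):
        if spare_cpus == 0:
            break
        if spare_cpus >= fw["min_cpus"]:
            # The framework accepts and takes only what it needs.
            fw["allocated"] += fw["min_cpus"]
            spare_cpus -= fw["min_cpus"]
            log.append((fw["name"], "accept"))
        else:
            # The offer does not satisfy the framework: reject and wait.
            log.append((fw["name"], "reject"))
    return spare_cpus, log


frameworks = [
    {"name": "hadoop", "allocated": 8, "min_cpus": 2},
    {"name": "mpi", "allocated": 0, "min_cpus": 6},
]
left, log = offer_round(4, frameworks)
```

Here the hypothetical `mpi` framework is offered the 4 spare CPUs first, rejects them because it wants 6, and the offer then goes to `hadoop`. The placement decision stays with the framework; the allocator only decides who is offered what.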
Mesos (summary)
• Takes advantage of short tasks to increase
cluster utilization.
• Two-level scheduling (resource offers plus framework
schedulers) makes it scalable.
• Good at batch jobs. How about service jobs?
RSS Omega
• One of the major drawbacks of Partitioned
State Scheduling is that scheduling domains
must be selected before the scheduling agent
performs its task-resource assignments,
thereby potentially restricting the “goodness
of fit” that might be achieved by the
scheduling agent in its task-resource
assignments.
Role of job manager
• 1. If the job queue is not empty, remove the next job from the job queue.
• 2. Sync: begin a transaction by synchronizing private cluster state
with common cluster state.
• 3. Schedule: engage the scheduling agent to attempt to create task-resource
assignments for all tasks in the job, modifying private cluster
state in the process.
• 4. Submit: attempt to commit the job transaction (i.e., all task-resource
assignments for the job) from private cluster state back to common
cluster state. The job transaction can succeed or fail.
• 5. Record which task-resource assignments were successfully
committed to common cluster state.
• 6. If any tasks in the job remain unscheduled, either because no
suitable resources were found for the task during the “schedule”
stage or because the task-resource assignment experienced a conflict, insert the job
back into the job queue to be handled again in a future transaction.
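The six steps can be sketched as a small Python loop. This is a simplified, single-threaded illustration, not Omega's implementation: the CPU-only cluster state, first-fit placement, the incremental `commit` helper, and all function names are assumptions, and task names are assumed to be unique.

```python
from collections import deque


def commit(common, assignments):
    """Assumed incremental commit: apply each assignment that still fits."""
    committed = []
    for task, machine, cpus in assignments:
        if common[machine] >= cpus:
            common[machine] -= cpus
            committed.append(task)
    return committed


def run_job_manager(job_queue, common, max_transactions=10):
    placed = {}
    for _ in range(max_transactions):
        if not job_queue:                       # step 1: take the next job
            break
        job = job_queue.popleft()
        private = dict(common)                  # step 2: sync private state
        assignments = []
        for task, cpus in job:                  # step 3: schedule on private state
            machine = next((m for m in sorted(private) if private[m] >= cpus), None)
            if machine is not None:
                private[machine] -= cpus
                assignments.append((task, machine, cpus))
        done = commit(common, assignments)      # step 4: submit the transaction
        for task, machine, _ in assignments:    # step 5: record what committed
            if task in done:
                placed[task] = machine
        remaining = [(t, c) for t, c in job if t not in placed]
        if remaining:                           # step 6: requeue unscheduled tasks
            job_queue.append(remaining)
    return placed


common = {"m1": 2, "m2": 2}
queue = deque([[("a", 2), ("b", 2), ("c", 2)]])
placed = run_job_manager(queue, common, max_transactions=3)
```

In this run tasks `a` and `b` are placed, while `c` finds no free resources and stays in the queue for a future transaction. With several job managers committing against the same `common` state, step 4 is where conflicts would surface.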
Role of meta scheduling agent
• Attempt to execute transactions submitted by
job managers according to transaction mode
settings
• Detect conflicts according to conflict detection
semantics and policies
• Enforce meta-scheduling policies
For each submitted job
• 1. Reject task-resource assignments that would
violate policies.
• 2. Reject task-resource assignments that
conflict with previously accepted transactions.
• 3. Reject all task-resource assignments in the
job transaction if all-or-nothing
transaction semantics are in use and at least one task-resource
assignment was rejected.
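The three checks can be sketched in Python as follows. The policy used here (a per-task CPU cap named `cpu_cap`), the CPU-only state, and the function name `process_transaction` are hypothetical; a real meta-scheduling agent would plug in its own policies and conflict detection.

```python
def process_transaction(common, assignments, cpu_cap, all_or_nothing):
    """Return the accepted assignments and apply them to the common state.

    common: dict machine -> free CPUs.
    assignments: list of (task, machine, cpus).
    cpu_cap: assumed policy, the largest CPU request a single task may make.
    """
    trial = dict(common)
    accepted = []
    for task, machine, cpus in assignments:
        if cpus > cpu_cap:              # 1. violates policy
            continue
        if trial[machine] < cpus:       # 2. conflicts with what is already committed
            continue
        trial[machine] -= cpus
        accepted.append((task, machine, cpus))
    # 3. all-or-nothing: one rejection rejects the whole job transaction
    if all_or_nothing and len(accepted) < len(assignments):
        return []
    common.update(trial)
    return accepted


common = {"m1": 4}
txn = [("a", "m1", 2), ("b", "m1", 4)]
strict = process_transaction(dict(common), txn, cpu_cap=8, all_or_nothing=True)
loose = process_transaction(common, txn, cpu_cap=8, all_or_nothing=False)
```

With `all_or_nothing=True` the whole transaction is rejected because task `b` no longer fits; with `all_or_nothing=False` task `a` is still committed.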
Conflict mode
• Machine granularity:
• Reject a task-resource assignment if the target machine's
sequence number has changed since the job manager synced.
• Resource granularity:
• Reject a task-resource assignment only if it would put the cluster state
into an invalid state (e.g., CPU or memory overcommitted).
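The difference between the two modes can be illustrated with two small Python predicates. This is a sketch under assumptions: it takes the machine-granularity check to be a per-machine sequence number comparison against the value seen at sync time, and models resources as CPU and memory only. The function names are illustrative.

```python
def machine_conflict(seq_at_sync, seq_now):
    """Machine granularity: conflict if the machine changed at all since sync."""
    return seq_now != seq_at_sync


def resource_conflict(free_cpus, free_mem, need_cpus, need_mem):
    """Resource granularity: conflict only if the machine would be overcommitted."""
    return need_cpus > free_cpus or need_mem > free_mem


# Another job manager placed a small task on the machine after our sync,
# so its sequence number moved from 7 to 8 but plenty of capacity remains.
coarse = machine_conflict(seq_at_sync=7, seq_now=8)
fine = resource_conflict(free_cpus=6, free_mem=16, need_cpus=2, need_mem=4)
```

In this scenario the machine-granularity check reports a conflict while the resource-granularity check does not, so the coarser mode is cheaper to evaluate but rejects assignments that would actually have fit.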
Transaction Granularity Semantics
• all-or-nothing
– if a single task-resource assignment conflicts,
then none of the task-resource assignments in the
transaction are applied to common cluster state
• incremental
– each task-resource assignment can fail
independent of all others
Omega (summary)
• There is a trade-off between whole-cluster resource
utilization and conflicts.
• With a view of the whole cluster, dynamically
choosing the number of tasks based on a global optimum is
feasible.
• The meta-scheduler needs more modification.
Thanks
