Mesos

Apache Mesos
Design Decisions
mesos.apache.org
@ApacheMesos
Benjamin Hindman – @benh
this is not
a talk about YARN
at least not explicitly!
this talk is about Mesos!
a little history
Mesos started as a research project at Berkeley
in early 2009 by Benjamin Hindman, Andy
Konwinski, Matei Zaharia, Ali Ghodsi, Anthony
D. Joseph, Randy Katz, Scott Shenker, Ion
Stoica
our motivation
increase performance
and utilization of
clusters
our intuition
① static partitioning
considered harmful
static partitioning considered harmful
datacenter
faster!
static partitioning considered harmful
higher utilization!
our intuition
② build new
frameworks
“Map/Reduce is a big hammer,
but not everything is a nail!”
Apache Mesos is a
distributed system
for running and building
other distributed systems
Mesos is a cluster manager
Mesos is a resource manager
Mesos is a resource negotiator
Mesos replaces static partitioning
of resources to frameworks with
dynamic resource allocation
Mesos is a distributed system
with a master/slave architecture
masters
slaves
frameworks register with the
Mesos master in order to run
jobs/tasks
frameworks
masters
slaves
Mesos @Twitter in early 2010
goal: run long-running
services elastically on Mesos
Apache Aurora (incubating)
Aurora is a Mesos framework that
makes it easy to launch services
written in Ruby, Java, Scala,
Python, Go, etc!
masters
Storm, Jenkins, …
masters
a lot of interesting
design decisions
along the way
many appear (IMHO)
in YARN too
design decisions
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++
frameworks get allocated
resources from the masters
framework
offer
hostname
4 CPUs
4 GB RAM
masters
resources are allocated via
resource offers
a resource offer represents a
snapshot of available resources
(one offer per host) that a
framework can use to run tasks
frameworks use these resources
to decide what tasks to run
framework
task
3 CPUs
2 GB RAM
masters
a task can use a subset of an offer
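A minimal sketch of that subset relationship, using the numbers from the slides (a 4 CPU / 4 GB offer and a 3 CPU / 2 GB task). The `Resources` type and its method names are hypothetical, chosen only for illustration; they are not part of the Mesos API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource bundle: CPUs and RAM in GB."""
    cpus: int
    mem_gb: int

    def fits_in(self, other: "Resources") -> bool:
        # A task fits if it needs no more of any resource than is offered.
        return self.cpus <= other.cpus and self.mem_gb <= other.mem_gb

    def minus(self, other: "Resources") -> "Resources":
        # What is left of this bundle after carving out `other`.
        return Resources(self.cpus - other.cpus, self.mem_gb - other.mem_gb)


offer = Resources(cpus=4, mem_gb=4)   # offer from the slide
task = Resources(cpus=3, mem_gb=2)    # task from the slide

fits = task.fits_in(offer)
leftover = offer.minus(task)          # the unused part of the offer
print(fits, leftover)
```

The leftover is what the framework did not use from this offer.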
Mesos challenged
the status quo
of cluster managers
cluster manager status quo
application
specification
cluster manager
the specification includes as much
information as possible to assist
the cluster manager in scheduling
and execution
cluster manager status quo
application
cluster manager
wait for task
to be executed
cluster manager status quo
application
result
cluster manager
problems with specifications
① hard to specify certain desires or constraints
② hard to update specifications dynamically as
tasks execute and finish or fail
an alternative model
framework
request
3 CPUs
2 GB RAM
masters
a request is a purposely simplified
subset of a specification, mainly
including the required resources
question: what should Mesos
do if it can’t satisfy a request?
① wait until it can …
② offer the best it can immediately
an alternative model
framework
offer
hostname
4 CPUs
4 GB RAM
masters
an alternative model
framework
offers (several stacked, one per host), each:
hostname
4 CPUs
4 GB RAM
masters
framework uses the offers to
perform its own scheduling
an analogue:
non-blocking sockets
application
write(s, buffer, size);
kernel
an analogue:
non-blocking sockets
application
42 of 100 bytes written!
kernel
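The analogue can be tried directly. This is a small sketch, assuming a local socket pair and a payload far larger than the kernel's socket buffer; the payload size is an arbitrary choice. A non-blocking `send` returns at once with however many bytes were accepted, much as an offer returns whatever resources are available now.

```python
import socket

# A connected pair of local sockets; nobody reads from `b`.
a, b = socket.socketpair()
a.setblocking(False)

# Assumed to be much larger than the kernel socket buffer.
payload = b"x" * (16 * 1024 * 1024)

# Non-blocking send: returns immediately with the number of bytes
# the kernel accepted, which may be fewer than requested.
sent = a.send(payload)
print(f"{sent} of {len(payload)} bytes written!")

a.close()
b.close()
```

The application then decides what to do with the remainder, rather than waiting inside the call.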
resource offers address
asynchrony in resource
allocation
IIUC, even YARN allocates
“the best it can” to an
application when it can’t
satisfy a request
requests are complementary
(but not necessary)
offers represent
the currently available
resources a framework can use
question: should resources
within offers be disjoint?
framework1
framework2
offer
hostname
4 CPUs
4 GB RAM
offer
hostname
4 CPUs
4 GB RAM
masters
concurrency control
pessimistic
optimistic
concurrency control
pessimistic
optimistic
all offers overlap with one
another, thus causing
frameworks to “compete”
first-come-first-served
concurrency control
pessimistic
offers made to different
frameworks are disjoint
optimistic
Mesos semantics:
assume overlapping offers
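The two ends of that spectrum can be sketched as follows. Everything here is hypothetical (the function, the `Cluster` class, and the host names); it only illustrates disjoint versus overlapping offers and the first-come-first-served outcome when offers overlap.

```python
def make_offers(frameworks, hosts, optimistic):
    """Return {framework: [hosts offered]} (illustrative only)."""
    if optimistic:
        # Optimistic: every framework sees every host.
        return {fw: list(hosts) for fw in frameworks}
    # Pessimistic: each host is offered to exactly one framework.
    offers = {fw: [] for fw in frameworks}
    for i, host in enumerate(hosts):
        offers[frameworks[i % len(frameworks)]].append(host)
    return offers


class Cluster:
    """Tracks which framework, if any, has claimed each host."""

    def __init__(self, hosts):
        self.owner = {h: None for h in hosts}

    def launch(self, framework, host):
        # First come, first served: a second claim on the same host fails.
        if self.owner[host] is not None:
            return False
        self.owner[host] = framework
        return True


frameworks = ["framework1", "framework2"]
hosts = ["host1", "host2"]
pessimistic = make_offers(frameworks, hosts, optimistic=False)
optimistic = make_offers(frameworks, hosts, optimistic=True)

cluster = Cluster(hosts)
first = cluster.launch("framework1", "host1")
second = cluster.launch("framework2", "host1")
print(pessimistic, optimistic, first, second)
```

With disjoint offers no launch can conflict; with overlapping offers the losing framework has to retry with other resources.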
design comparison:
Google’s Omega
the Omega model
framework
snapshot
database
a framework gets a snapshot of
the cluster state from a database
(note, does not make a request!)
the Omega model
framework
transaction
database
a framework submits a transaction
to the database to “acquire”
resources (which it can then use to
run tasks)
failed transactions occur when
another framework has already
acquired sought resources
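A rough sketch of the snapshot-then-transaction flow, assuming a per-host version counter is used to detect conflicts. That conflict check is an assumption made for illustration, not a description of how Omega actually does it, and the `CellState` class is hypothetical.

```python
class CellState:
    """Hypothetical shared database of free CPUs per host."""

    def __init__(self, free_cpus):
        self.free = dict(free_cpus)
        self.version = {host: 0 for host in free_cpus}

    def snapshot(self):
        # Frameworks read a copy; no request is made and nothing is locked.
        return dict(self.free), dict(self.version)

    def commit(self, host, cpus, seen_version):
        # The transaction fails if another framework changed this host
        # since the snapshot was taken, or if too little is left.
        if self.version[host] != seen_version or self.free[host] < cpus:
            return False
        self.free[host] -= cpus
        self.version[host] += 1
        return True


db = CellState({"host1": 4})
_, versions_a = db.snapshot()   # framework A's view
_, versions_b = db.snapshot()   # framework B's view, taken concurrently

ok_a = db.commit("host1", 3, versions_a["host1"])
ok_b = db.commit("host1", 3, versions_b["host1"])  # stale snapshot
print(ok_a, ok_b, db.free)
```

Framework B would take a fresh snapshot and try again.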
isomorphism?
observation:
snapshots are optimistic
offers
Omega and Mesos
framework
framework
offer
hostname
4 CPUs
4 GB RAM
snapshot
database
masters
Omega and Mesos
framework
framework
task
3 CPUs
2 GB RAM
transaction
database
masters
thought experiment:
what’s gained by exploiting
the continuous spectrum of
pessimistic to optimistic?
pessimistic
optimistic
design decisions
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++
Mesos allocates resources to
frameworks using a
fair-sharing algorithm
we created called Dominant
Resource Fairness (DRF)
DRF, born of static partitioning
datacenter
static partitioning across teams
team
promotions
trends
recommendations
static partitioning across teams
team
promotions
trends
fairly shared!
recommendations
goal: fairly share the resources
without static partitioning
partition utilizations
team             utilization
promotions       45% CPU, 100% RAM
trends           75% CPU, 100% RAM
recommendations  100% CPU, 50% RAM
observation: a dominant resource
bottlenecks each team from
running any more jobs/tasks
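The dominant resource of each team can be read straight off the utilization numbers on the slides: it is whichever resource the team uses the largest fraction of. A small sketch (the dictionary layout is just for illustration):

```python
# Utilization of each team's static partition, from the slides.
utilization = {
    "promotions": {"cpu": 0.45, "ram": 1.00},
    "trends": {"cpu": 0.75, "ram": 1.00},
    "recommendations": {"cpu": 1.00, "ram": 0.50},
}

# The dominant resource is the one with the highest utilization.
bottleneck = {
    team: max(usage, key=usage.get) for team, usage in utilization.items()
}
print(bottleneck)
```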
dominant resource bottlenecks
team             utilization          bottleneck
promotions       45% CPU, 100% RAM    RAM
trends           75% CPU, 100% RAM    RAM
recommendations  100% CPU, 50% RAM    CPU
insight: allocating a fair share of
each team’s dominant resource
guarantees they can run at least
as many jobs/tasks as with static
partitioning!
… if my team gets at least 1/N of
my dominant resource I will do
no worse than if I had my own
cluster, but I might do better
when resources are available!
DRF in Mesos
framework
masters
① frameworks specify a role
when they register (i.e., the
team to charge for the
resources)
② master calculates each role’s
dominant resource
(dynamically) and allocates
appropriately
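A simplified sketch of the DRF idea: repeatedly give one more task's worth of resources to the role with the smallest dominant share. The cluster size and per-task demands below are illustrative assumptions, and this is not the Mesos allocator's code. In particular, skipping a role whose next task no longer fits is a simplification chosen here.

```python
from fractions import Fraction


def dominant_share(allocated, total):
    # Largest fraction of any single resource held by this role.
    return max(Fraction(allocated[r], total[r]) for r in total)


def drf(total, demands):
    """Return {role: number of tasks} under simplified DRF."""
    allocated = {role: {r: 0 for r in total} for role in demands}
    used = {r: 0 for r in total}
    tasks = {role: 0 for role in demands}
    while True:
        # Roles ordered by current dominant share, smallest first.
        order = sorted(
            demands, key=lambda role: dominant_share(allocated[role], total)
        )
        for role in order:
            need = demands[role]
            if all(used[r] + need[r] <= total[r] for r in total):
                for r in total:
                    used[r] += need[r]
                    allocated[role][r] += need[r]
                tasks[role] += 1
                break
        else:
            # No role's next task fits: the cluster is full.
            return tasks


total = {"cpus": 9, "mem_gb": 18}           # assumed cluster size
demands = {
    "trends": {"cpus": 1, "mem_gb": 4},     # memory-heavy tasks (assumed)
    "promotions": {"cpus": 3, "mem_gb": 1}, # CPU-heavy tasks (assumed)
}
result = drf(total, demands)
print(result)
```

Both roles end up holding two thirds of their own dominant resource, even though they run different numbers of tasks.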
$tep 4: Profit
(statistical multiplexing)
in practice,
fair sharing is insufficient
weighted fair sharing
team             weight
promotions       0.17
trends           0.5
recommendations  0.33
Mesos implements weighted DRF
masters
masters can be configured with
weights per role
resource allocation decisions
incorporate the weights to
determine dominant fair shares
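One common way to fold weights in, sketched here as an assumption rather than as the Mesos implementation, is to divide each role's dominant share by its weight and serve the role with the smallest result next. The weights come from the slides; the current dominant shares are made up for the example.

```python
# Weights per role, from the slides.
weights = {"promotions": 0.17, "trends": 0.5, "recommendations": 0.33}

# Hypothetical current dominant shares of the cluster.
dominant_shares = {"promotions": 0.20, "trends": 0.40, "recommendations": 0.30}


def next_role(dominant_shares, weights):
    # The role furthest below its weighted entitlement is served next.
    return min(
        dominant_shares, key=lambda role: dominant_shares[role] / weights[role]
    )


unweighted = min(dominant_shares, key=dominant_shares.get)
weighted = next_role(dominant_shares, weights)
print(unweighted, weighted)
```

Without weights the role with the smallest share would be served next; with weights, the role with the largest share can still be next in line because its entitlement is larger.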
in practice,
weighted fair sharing
is still insufficient
a non-cooperative framework
(i.e., has long tasks or is buggy)
can get allocated too many
resources
Mesos provides reservations
framework (trends)
offer
hostname
4 CPUs
4 GB RAM
role: trends
masters
slaves can be configured with
resource reservations for
particular roles (dynamic, time
based, and percentage based
reservations are in development)
resource offers include the
reservation role (if any)
reservations provide guarantees,
but at the cost of utilization
reservations
[chart: reservations by role: trends 20%, recommendations 40%, promotions 40%; segments labeled unused 30% and used 10%]
revocable resources
framework (promotions)
offer
hostname
4 CPUs
4 GB RAM
role: trends
masters
reserved resources that are unused
can be allocated to frameworks
from different roles but those
resources may be revoked at any
time
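A sketch of how an offer might mark such resources, matching the slide where a promotions framework receives an offer with "role: trends". The data layout, the `revocable` flag, and the function name are hypothetical.

```python
def build_offer(host, framework_role):
    """Offer the unused part of each reservation on a host (illustrative).

    host = {"reserved": {role: cpus}, "used": {role: cpus}}
    """
    offer = []
    for role, reserved in host["reserved"].items():
        unused = reserved - host["used"].get(role, 0)
        if unused <= 0:
            continue
        offer.append({
            "cpus": unused,
            "role": role,
            # Borrowed from another role's reservation: may be revoked.
            "revocable": role != framework_role,
        })
    return offer


host = {"reserved": {"trends": 4}, "used": {"trends": 1}}
to_promotions = build_offer(host, "promotions")
to_trends = build_offer(host, "trends")
print(to_promotions, to_trends)
```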
preemption via revocation
… my tasks will not be killed
unless I’m using revocable
resources!
design decisions
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++
high-availability and fault-tolerance: a prerequisite @twitter
machine failure
① framework failover
process failure (bugs!)
② master failover
upgrades
③ slave failover
① framework failover
framework
framework
framework re-registers with
master and resumes operation
all tasks keep running across
framework failover!
masters
high-availability and fault-tolerance: a prerequisite @twitter
machine failure
① framework failover
process failure (bugs!)
② master failover
upgrades
③ slave failover
② master failover
framework
after a new master is elected all
frameworks and slaves connect to
the new master
masters
all tasks keep running across master
failover!
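A sketch of why tasks can keep running: a newly elected master can rebuild its view from what reconnecting slaves tell it. This assumes slaves report their running tasks when they reconnect; the class and method names are hypothetical.

```python
class Master:
    """Hypothetical master that starts with no knowledge of tasks."""

    def __init__(self):
        self.tasks = {}  # task id -> slave id

    def reregister_slave(self, slave_id, running_tasks):
        # The slave, not the master, is the source of truth for its tasks.
        for task_id in running_tasks:
            self.tasks[task_id] = slave_id


new_master = Master()  # freshly elected, empty state
new_master.reregister_slave("slave1", ["web-1", "web-2"])
new_master.reregister_slave("slave2", ["cache-1"])
print(new_master.tasks)
```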
high-availability and fault-tolerance: a prerequisite @twitter
machine failure
① framework failover
process failure (bugs!)
② master failover
upgrades
③ slave failover
③ slave failover
[diagram sequence: a slave runs mesos-slave and two tasks; in the middle frame mesos-slave is absent while both tasks remain; in the final frames mesos-slave is back alongside the same tasks]
③ slave failover @twitter
mesos-slave
slave
(large in-memory services,
expensive to restart)
design decisions
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++
execution
framework
frameworks launch fine-grained
tasks for execution
task
3 CPUs
2 GB RAM
masters
if necessary, a framework can
provide an executor to handle the
execution of a task
executor
mesos-slave
executor
slave
task
task
executor
mesos-slave
executor
slave
task
task
task
executor
mesos-slave
executor
slave
task
goal: isolation
isolation
mesos-slave
executor
slave
task
task
isolation
mesos-slave
executor
task
slave
task
containers
executor + task design means
containers can have changing
resource allocations
isolation
mesos-slave
executor
slave
task
task
making the task first-class
gives us true fine-grained
resource sharing
requirement:
fast task launching (i.e.,
milliseconds or less)
virtual machines
an anti-pattern
operating-system virtualization
containers
(zones and projects)
control groups (cgroups)
namespaces
isolation support
tight integration with cgroups
CPU (upper and lower bounds)
memory
network I/O (traffic controller, in development)
filesystem (using LVM, in development)
statistics too
rarely does allocation == usage (humans are bad
at estimating the amount of resources they’re
using)
used @twitter for capacity planning (and
oversubscription in development)
CPU upper bounds?
in practice,
determinism trumps utilization
design decisions
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++
requirements:
① performance
② maintainability (static typing)
③ interfaces to low-level OS (for isolation, etc)
④ interoperability with other languages (for
library bindings)
garbage collection
a performance anti-pattern
consequences:
① antiquated libraries (especially around
concurrency and networking)
② nascent community
github.com/3rdparty/libprocess
concurrency via futures/actors,
networking via message passing
github.com/3rdparty/stout
monads in C++,
safe and understandable utilities
but …
scalability simulations
to 50,000+ slaves
@twitter we run
multiple Mesos clusters
each with 3500+ nodes
design decisions
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++
final remarks
frameworks
• Hadoop (github.com/mesos/hadoop)
• Spark (github.com/mesos/spark)
• DPark (github.com/douban/dpark)
• Storm (github.com/nathanmarz/storm)
• Chronos (github.com/airbnb/chronos)
• MPICH2 (in mesos git repository)
• Marathon (github.com/mesosphere/marathon)
• Aurora (github.com/twitter/aurora)
write your next
distributed system with
Mesos!
port a framework to Mesos
write a “wrapper”
~100 lines of code to write a wrapper (the more
lines, the more you can take advantage of
elasticity or other mesos features)
see http://github.com/mesos/hadoop
Thank You!
mesos.apache.org
mesos.apache.org/blog
@ApacheMesos
② master failover
framework
after a new master is elected all
frameworks and slaves connect to
the new master
master
all tasks keep running across master
failover!
stateless master
to make master failover fast, we chose to
make the master stateless
state is stored in the leaves, at the frameworks
and the slaves
makes sense for frameworks that don’t want to
store state (i.e., can’t actually failover)
consequences: slaves are fairly complicated
(need to checkpoint), frameworks need to save
Apache Mesos is a
distributed system
for running and building
other distributed systems
origins
Berkeley research project including Benjamin
Hindman, Andy Konwinski, Matei Zaharia, Ali
Ghodsi, Anthony D. Joseph, Randy Katz, Scott
Shenker, Ion Stoica
mesos.apache.org/documentation
ecosystem
operators
framework
developers
mesos
developers
a tour of mesos from
different perspectives of
the ecosystem
the operator
the operator
People who run and manage
frameworks (Hadoop, Storm, MPI,
Spark, Memcache, etc)
Tools: virtual machines, Chef,
Puppet (emerging: PAAS, Docker)
“ops” at most companies (SREs at
Twitter)
the static partitioners
for the operator,
Mesos is a cluster manager
for the operator,
Mesos is a resource manager
for the operator,
Mesos is a resource negotiator
for the operator,
Mesos replaces static partitioning
of resources to frameworks with
dynamic resource allocation
for the operator,
Mesos is a distributed system
with a master/slave architecture
masters
slaves
frameworks/applications register
with the Mesos master in order to
run jobs/tasks
masters
slaves
frameworks can be required to
authenticate as a principal*
framework
SASL
CRAM-MD5 secret mechanism
(Kerberos in development)
SASL
masters
masters initialized with secrets
Mesos is highly-available
and fault-tolerant
the framework developer
the framework developer
…
Mesos uses Apache ZooKeeper
for coordination
Apache
ZooKeeper
masters
slaves
increase utilization with revocable
resources and preemption
framework1
framework2
framework3
reservations
framework1
framework2
framework3
masters
hostname:
4 CPUs
4 GB RAM
role: -
15%
24%
61%
optimistic vs pessimistic
what to say here …
authorization*
principals can be used for:
authorizing allocation roles
authorizing operating system users (for execution)
authorization
agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies
I’d love to answer some
questions with the help of my
data!
I think I’ll try Hadoop.
your datacenter
+ Hadoop
happy?
Not exactly …
… Hadoop is a big hammer,
but not everything is a nail!
I’ve got some iterative
algorithms, I want to try Spark!
datacenter management
datacenter management
datacenter management
static partitioning
static partitioning
static partitioning
considered harmful
static partitioning
considered harmful
(1) hard to share data
(2) hard to scale elastically (to exploit statistical
multiplexing)
(3) hard to fully utilize machines
(4) hard to deal with failures
Hadoop …
(map/reduce)
(distributed file system)
HDFS
HDFS
HDFS
Could we just give Spark its
own HDFS cluster too?
HDFS x 2
HDFS x 2
HDFS x 2
HDFS x 2
tee incoming data
(2 copies)
HDFS x 2
tee incoming data
(2 copies)
periodic copy/sync
That sounds annoying … let’s
not do that. Can we do any
better though?
HDFS
HDFS
HDFS
HDFS
static partitioning
considered harmful
(1) hard to share data
(2) hard to scale elastically (to exploit statistical
multiplexing)
(3) hard to fully utilize machines
(4) hard to deal with failures
During the day I’d rather give
more machines to Spark but at
night I’d rather give more
machines to Hadoop!
datacenter management
datacenter management
datacenter management
datacenter management
static partitioning
considered harmful
(1) hard to share data
(2) hard to scale elastically (to exploit statistical
multiplexing)
(3) hard to fully utilize machines
(4) hard to deal with failures
datacenter management
datacenter management
datacenter management
static partitioning
considered harmful
(1) hard to share data
(2) hard to scale elastically (to exploit statistical
multiplexing)
(3) hard to fully utilize machines
(4) hard to deal with failures
datacenter management
datacenter management
datacenter management
I don’t want to deal with this!
the datacenter …
rather than think about the datacenter like this …
… is a computer
think about it like this …
datacenter computer
applications
resources
filesystem
mesos
applications
kernel
resources
filesystem
mesos
applications
kernel
resources
filesystem
mesos
frameworks
kernel
resources
filesystem
Step 1: filesystem
Step 2: mesos
run a “master” (or multiple for high availability)
Step 2: mesos
run “slaves” on the rest of the machines
Step 3: frameworks
$tep 4: profit
$tep 4: profit
(statistical multiplexing)
$tep 4: profit
(statistical multiplexing)
reduces CapEx and OpEx!
$tep 4: profit
(statistical multiplexing)
reduces latency!
$tep 4: profit (utilize)
$tep 4: profit (failures)
agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies
mesos
frameworks
kernel
resources
filesystem
mesos
frameworks
kernel
resources
resource allocation
resource allocation
reservations
can reserve resources per slave to provide
guaranteed resources
requires human participation (ops) to determine
which roles should be reserved which resources
kind of like thread affinity, but across many
machines (and not just for CPUs)
resource allocation
resource allocation
resource allocation
(1) allocate reserved resources to
frameworks authorized for a
particular role
(2) allocate unused reserved resources
and unused unreserved resources
fairly amongst all frameworks
according to their weights
preemption
if a framework runs tasks outside of its
reservations, those tasks can be preempted (i.e., the task
killed and its resources revoked) in favor of a framework
running a task within its reservation
agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies
mesos
frameworks
kernel
framework
≈
distributed system
framework commonality
run processes/tasks simultaneously (distributed)
handle process failures (fault-tolerant)
optimize performance (elastic)
framework commonality
run processes/tasks simultaneously (distributed)
coordinate execution
handle process failures (fault-tolerant)
optimize performance (elastic)
frameworks
are
execution coordinators
frameworks
are
execution schedulers
end-to-end principle
“application-specific functions ought to
reside in the end hosts of a network
rather than intermediary nodes”
i.e., frameworks want to coordinate
their tasks' execution and they should
be able to
framework anatomy
frameworks
framework anatomy
frameworks
scheduling API
scheduling
scheduling
i’d like to run some tasks!
scheduling
here are some resource offers!
resource offers
an offer represents the snapshot of available
resources on a particular machine that a
framework can use to run tasks
foo.bar.com:
4 CPUs
4 GB RAM
schedulers pick which resources to use to run
their tasks
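A sketch of what "picking" can look like inside a scheduler: a first-fit pass that packs pending tasks into the offers it received. The function name and the dictionary fields are hypothetical and do not mirror the real Mesos scheduler API. First-fit is just one possible policy; frameworks are free to use any other.

```python
def resource_offers(offers, pending):
    """First-fit placement. Returns (launches, tasks still pending)."""
    launches = []
    for offer in offers:
        cpus, mem_gb = offer["cpus"], offer["mem_gb"]
        remaining = []
        for task in pending:
            if task["cpus"] <= cpus and task["mem_gb"] <= mem_gb:
                # Use part of this offer for the task.
                launches.append((offer["hostname"], task["name"]))
                cpus -= task["cpus"]
                mem_gb -= task["mem_gb"]
            else:
                remaining.append(task)
        pending = remaining
    return launches, pending


offers = [{"hostname": "foo.bar.com", "cpus": 4, "mem_gb": 4}]
pending = [
    {"name": "t1", "cpus": 3, "mem_gb": 2},
    {"name": "t2", "cpus": 3, "mem_gb": 2},
    {"name": "t3", "cpus": 1, "mem_gb": 1},
]
launches, still_pending = resource_offers(offers, pending)
print(launches, [t["name"] for t in still_pending])
```

Tasks that do not fit stay pending until a later offer arrives.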
“two-level scheduling”
mesos: controls resource allocations to
schedulers
schedulers: make decisions about what to run
given allocated resources
concurrency control
the same resources may be offered to different
frameworks
pessimistic
no overlapping offers
optimistic
all overlapping offers
tasks
the “threads” of the framework, a consumer of
resources (cpu, memory, etc)
either a concrete command line or an opaque
description (which requires an executor)
tasks
here are some resources!
tasks
launch these tasks!
tasks
tasks
status updates
status updates
status updates
task status update!
status updates
status updates
status updates
task status update!
more scheduling
more scheduling
i’d like to run some tasks!
agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies
high-availability
high-availability (master)
high-availability (master)
high-availability (master)
high-availability (master)
high-availability (master)
high-availability (master)
task status update!
high-availability (master)
i’d like to run some tasks!
high-availability (master)
high-availability (framework)
high-availability (framework)
high-availability (framework)
high-availability (framework)
high-availability (slave)
high-availability (slave)
high-availability (slave)
agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies
resource isolation
leverage Linux control groups (cgroups)
CPU (upper and lower bounds)
memory
network I/O (traffic controller, in progress)
filesystem (lvm, in progress)
resource statistics
rarely does allocation == usage (humans are bad
at estimating the amount of resources they’re
using)
per task/executor statistics are collected (for all
fork/exec’ed processes too!)
can help with capacity planning
agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies
security
Twitter recently added SASL support, default
mechanism is CRAM-MD5, will support
Kerberos in the short term
agenda
motivation and overview
resource allocation
frameworks, schedulers, tasks, status updates
high-availability
resource isolation and statistics
security
case studies
framework commonality
run processes/tasks simultaneously (distributed)
handle process failures (fault-tolerant)
optimize performance (elastic)
framework commonality
as a “kernel”, mesos provides a lot of primitives
that make writing a new framework easier such
as launching tasks, doing failure detection, etc,
why re-implement them each time!?
case study: chronos
distributed cron with dependencies
developed at airbnb
~3k lines of Scala!
distributed, highly available, and fault tolerant
without any network programming!
http://github.com/airbnb/chronos
analytics
analytics + services
analytics + services
analytics + services
case study: aurora
“run 200 of these, somewhere, forever”
developed at Twitter
highly available (uses the mesos replicated log)
uses a python DSL to describe services
leverages service discovery and proxying (see
Twitter commons)
http://github.com/twitter/aurora
frameworks
• Hadoop (github.com/mesos/hadoop)
• Spark (github.com/mesos/spark)
• DPark (github.com/douban/dpark)
• Storm (github.com/nathanmarz/storm)
• Chronos (github.com/airbnb/chronos)
• MPICH2 (in mesos git repository)
• Marathon (github.com/mesosphere/marathon)
• Aurora (github.com/twitter/aurora)
write your next
distributed system with
mesos!
port a framework to mesos
write a “wrapper” scheduler
~100 lines of code to write a wrapper (the more
lines, the more you can take advantage of
elasticity or other mesos features)
see http://github.com/mesos/hadoop
conclusions
datacenter management is a pain
conclusions
mesos makes running frameworks on your
datacenter easier as well as increasing utilization
and performance while reducing CapEx and
OpEx!
conclusions
rather than build your next distributed system
from scratch, consider using mesos
conclusions
you can share your datacenter between analytics
and online services!
Questions?
mesos.apache.org
@ApacheMesos
aurora
framework commonality
run processes simultaneously (distributed)
handle process failures (fault-tolerance)
optimize execution (elasticity, scheduling)
primitives
scheduler – distributed system “master” or
“coordinator”
(executor – lower-level control of task
execution, optional)
requests/offers – resource allocations
tasks – “threads” of the distributed system
…
scheduler
Apache
Hadoop
Chronos
scheduler
(1) brokers for resources
(2) launches tasks
(3) handles task termination
brokering for resources
(1) make resource requests
2 CPUs
1 GB RAM
slave *
(2) respond to resource offers
4 CPUs
4 GB RAM
slave foo.bar.com
offers: non-blocking resource allocation
exist to answer the question:
“what should mesos do if it can’t satisfy a request?”
(1) wait until it can
(2) offer the best allocation it can immediately
resource allocation
request
Apache
Hadoop
Chronos
resource allocation
request
Apache
Hadoop
Chronos
allocator
dominant resource fairness
resource reservations
resource allocation
request
Apache
Hadoop
Chronos
allocator
dominant resource fairness
resource reservations
pessimistic
optimistic
resource allocation
request
Apache
Hadoop
Chronos
allocator
dominant resource fairness
resource reservations
pessimistic
no overlapping offers
optimistic
all overlapping offers
resource allocation
offer
Apache
Hadoop
Chronos
allocator
dominant resource fairness
resource reservations
“two-level scheduling”
mesos: controls resource allocations to
framework schedulers
schedulers: make decisions about what to run
given allocated resources
end-to-end principle
“application-specific functions ought to
reside in the end hosts of a network
rather than intermediary nodes”
tasks
either a concrete command line or an opaque
description (which requires a framework
executor to execute)
a consumer of resources
task operations
launching/killing
health monitoring/reporting (failure detection)
resource usage monitoring (statistics)
resource isolation
cgroup per executor or task (if no executor)
resource controls adjusted dynamically as
tasks come and go!
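A sketch of that dynamic adjustment, assuming the container's limits are simply the executor's own resources plus the sum of its running tasks. No real cgroup calls are made here, and the `Container` class is hypothetical.

```python
class Container:
    """Tracks the resource limits of one executor's container."""

    def __init__(self, executor_cpus, executor_mem_mb):
        self.executor = (executor_cpus, executor_mem_mb)
        self.tasks = {}  # task id -> (cpus, mem_mb)

    def limits(self):
        # Limits follow whatever tasks are currently running.
        cpus = self.executor[0] + sum(c for c, _ in self.tasks.values())
        mem_mb = self.executor[1] + sum(m for _, m in self.tasks.values())
        return cpus, mem_mb

    def add_task(self, task_id, cpus, mem_mb):
        self.tasks[task_id] = (cpus, mem_mb)
        return self.limits()

    def remove_task(self, task_id):
        del self.tasks[task_id]
        return self.limits()


container = Container(executor_cpus=1, executor_mem_mb=256)
after_first = container.add_task("t1", cpus=3, mem_mb=2048)
after_second = container.add_task("t2", cpus=1, mem_mb=512)
after_finish = container.remove_task("t1")
print(after_first, after_second, after_finish)
```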
case study: chronos
distributed cron with dependencies
built at airbnb by @flo
before chronos
before chronos
single point of failure (and AWS was unreliable)
resource starved (not scalable)
chronos requirements
fault tolerance
distributed (elastically take advantage of
resources)
retries (make sure a command eventually
finishes)
dependencies
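The last two requirements can be sketched in a few lines. This is not Chronos code. The job names, the retry limit, and the helper functions are all hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def run_with_retries(command, max_attempts=3):
    """Re-run a command until it succeeds; return the attempts used."""
    for attempt in range(1, max_attempts + 1):
        if command():
            return attempt
    raise RuntimeError("command never succeeded")


def make_flaky(failures):
    """Build a command that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def command():
        if state["left"] > 0:
            state["left"] -= 1
            return False
        return True

    return command


# Each job maps to the set of jobs that must finish before it.
deps = {"report": {"extract", "transform"}, "transform": {"extract"}}
order = list(TopologicalSorter(deps).static_order())

attempts = run_with_retries(make_flaky(2))
print(order, attempts)
```

A job runs only after everything it depends on has finished, and a failing command is retried until it succeeds or the attempt limit is hit.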
chronos
leverages the primitives of mesos
~3k lines of scala
highly available (uses Mesos state)
distributed / elastic
no actual network programming!
after chronos
after chronos + hadoop
case study: aurora
“run 200 of these, somewhere, forever”
built at Twitter
before aurora
static partitioning of machines to services
hardware outages caused site outages
puppet + monit
ops couldn’t scale as fast as engineers
aurora
highly available (uses mesos replicated log)
uses a python DSL to describe services
leverages service discovery and proxying (see
Twitter commons)
after aurora
power loss to 19 racks, no lost services!
more than 400 engineers running services
largest cluster has >2500 machines
Mesos
Hadoop
Spark
MPI
Storm
Chronos
Mesos
Node Node Node Node Node Node Node Node Node Node Node
Mesos
Hadoop
Spark
MPI
…
Mesos
Node Node Node Node Node Node Node Node Node Node Node
Mesos
Hadoop
Spark
MPI
Storm
…
Mesos
Node Node Node Node Node Node Node Node Node Node Node
Mesos
Hadoop
Spark
MPI
Storm
Chronos
…
Mesos
Node Node Node Node Node Node Node Node Node Node Node
$tep 4: Profit
(statistical multiplexing)
