The Datacenter Needs an Operating System

Matei Zaharia, Benjamin Hindman, Andy
Konwinski, Ali Ghodsi, Anthony Joseph,
Randy Katz, Scott Shenker, Ion Stoica
• Clusters of commodity servers have become a
major computing platform in industry and academia
• Driven by data volumes outpacing the
processing capabilities of single machines
• Democratized by cloud computing
• Some have declared that “the datacenter is
the new computer”
• Claim: this new computer increasingly needs
an operating system
• Not necessarily a new host OS, but a common
software layer that manages resources and
provides shared services for the whole
datacenter, like an OS does for one host
Why Datacenters Need an OS
• Growing number of applications
– Parallel processing systems: MapReduce, Dryad,
Pregel, Percolator, Dremel, MR Online
– Storage systems: GFS, BigTable, Dynamo, SCADS
– Web apps and supporting services
• Growing number of users
– 200+ for Facebook’s Hadoop data warehouse,
running near-interactive ad hoc queries
What Operating Systems Provide
• Resource sharing across applications & users
• Data sharing between programs
• Programming abstractions (e.g. threads, IPC)
• Debugging facilities (e.g. ptrace, gdb)
Result: OSes enable a highly interoperable
software ecosystem that we now take for granted
An Analogy
• Today, a scientist analyzing data on a single
machine can pipe it through a variety of tools,
write new tools that interface with these through
standard APIs, and trace across the stack
• In the future, the scientist should be able to fire
up a cloud on EC2 and do the same thing:
– Intermix a variety of apps & programming models
– Write new parallel programs that talk to these
– Get a unified interface for managing the cluster
– Debug and trace across all these components
Today’s Datacenter OS
• Hadoop MapReduce as common execution
and resource sharing platform
• Hadoop InputFormat API for data sharing
• Abstractions for productivity programmers,
but not for system builders
• Very challenging to debug across all the layers
Tomorrow’s Datacenter OS
• Resource sharing:
– Lower-level interfaces for fine-grained sharing
(Mesos is a first step in this direction)
– Optimization for a variety of metrics (e.g. energy)
– Integration with network scheduling mechanisms
(e.g. Seawall [NSDI '11], NOX, Orchestra)
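To make "fine-grained sharing" concrete, here is a minimal sketch of one way a low-level interface could work: the cluster offers each node's free CPUs to the frameworks in turn, and each framework decides how much of the offer to use. The class and function names (Node, Framework, offer_round) are hypothetical and chosen for illustration; this is not the Mesos API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpus: int


@dataclass
class Framework:
    """A hypothetical framework (e.g. a MapReduce-like or MPI-like system)."""
    name: str
    cpus_per_task: int
    pending_tasks: int
    launched: list = field(default_factory=list)

    def consider(self, node_name, cpus_offered):
        """Launch as many pending tasks as fit in the offer; return CPUs used."""
        n = min(self.pending_tasks, cpus_offered // self.cpus_per_task)
        self.pending_tasks -= n
        self.launched.extend([node_name] * n)
        return n * self.cpus_per_task


def offer_round(nodes, frameworks):
    """Offer each node's free CPUs to the frameworks, one after another."""
    for node in nodes:
        for fw in frameworks:
            if node.free_cpus == 0:
                break
            node.free_cpus -= fw.consider(node.name, node.free_cpus)


nodes = [Node("n1", 4), Node("n2", 4)]
mr = Framework("mapreduce-like", cpus_per_task=1, pending_tasks=3)
mpi = Framework("mpi-like", cpus_per_task=2, pending_tasks=2)
offer_round(nodes, [mr, mpi])
```

In this toy run both frameworks end up sharing the same two nodes at task granularity, rather than each being given a static partition of the cluster.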
Tomorrow’s Datacenter OS
• Data sharing:
– Standard interfaces for cluster file systems, key-value stores, etc.
– In-memory data sharing (e.g. Spark, DFS cache),
and a unified system to manage this memory
– Streaming data abstractions (analogous to pipes)
– Lineage instead of replication for reliability (RDDs)
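The idea of "lineage instead of replication" can be illustrated with a toy example: a dataset remembers which parent it was derived from and with what function, so a lost partition is recomputed rather than restored from a replica. The LineageDataset class below is a hypothetical sketch and not Spark's RDD API; for brevity it computes partitions eagerly and supports only map.

```python
class LineageDataset:
    """A partitioned dataset that remembers how it was derived."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self.parent = parent
        self.fn = fn
        if parent is not None:
            # Derived dataset: apply fn to every element of every parent partition.
            partitions = [[fn(x) for x in p] for p in parent.partitions]
        self.partitions = partitions

    def map(self, fn):
        """Return a new dataset whose lineage is (self, fn)."""
        return LineageDataset(parent=self, fn=fn)

    def lose(self, i):
        """Simulate losing partition i, e.g. because its node failed."""
        self.partitions[i] = None

    def get(self, i):
        """Return partition i, recomputing it from lineage if it was lost."""
        if self.partitions[i] is None:
            if self.parent is None:
                raise RuntimeError("source partition lost; no lineage to replay")
            self.partitions[i] = [self.fn(x) for x in self.parent.get(i)]
        return self.partitions[i]


source = LineageDataset(partitions=[[1, 2], [3, 4]])
squares = source.map(lambda x: x * x)
plus_one = squares.map(lambda x: x + 1)

# Lose the same partition in both derived datasets, then recover it.
squares.lose(1)
plus_one.lose(1)
recovered = plus_one.get(1)
```

Only the lost partition is recomputed, and only from the data it depends on; the untouched partitions are never replayed.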
Tomorrow’s Datacenter OS
• Programming abstractions:
– Tools that can be used to build the next
MapReduce / BigTable in a week (e.g. BOOM)
– Efficient implementations of communication
primitives (e.g. shuffle, broadcast)
– New distributed programming models
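As an illustration of what a shared communication primitive might look like, the sketch below implements a single-process shuffle: each key from every map output is routed to one reducer by hashing. The function names are hypothetical, and a real implementation would move data over the network; the point is only the contract that all values for a key arrive at the same reducer.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_reducers):
    """Pick a reducer for a key using a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Group (key, value) pairs from all map outputs by destination reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            buckets[partition_for(key, num_reducers)][key].append(value)
    return [dict(b) for b in buckets]


map_outputs = [
    [("a", 1), ("b", 1)],  # output of map task 0
    [("a", 1), ("c", 1)],  # output of map task 1
]
reduce_inputs = shuffle(map_outputs, num_reducers=2)
```

zlib.crc32 is used instead of the built-in hash() because Python randomizes string hashes per process, which would send the same key to different reducers on different machines.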
Tomorrow’s Datacenter OS
• Debugging facilities:
– Tracing and debugging tools that work across the
cluster software stack (e.g. X-Trace, Dapper)
– Replay debugging that takes advantage of limited
languages / computational models
– Unified monitoring infrastructure and APIs
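Cross-stack tracing generally relies on passing a small piece of metadata along with each request so that events logged by different layers can be stitched together afterwards. The sketch below shows that idea with three hypothetical layers (scheduler, executor, storage) in one process; the function names and event format are assumptions for illustration, not the X-Trace or Dapper APIs.

```python
import itertools

EVENTS = []                 # stand-in for a cluster-wide trace collector
_ids = itertools.count(1)   # event ids, unique within this process


def record(trace_id, parent_id, layer, message):
    """Log one event and return its id so callees can name it as their parent."""
    event_id = next(_ids)
    EVENTS.append({"trace": trace_id, "id": event_id, "parent": parent_id,
                   "layer": layer, "msg": message})
    return event_id


def storage_read(trace_id, parent_id, block):
    record(trace_id, parent_id, "storage", f"read {block}")
    return f"data:{block}"


def run_task(trace_id, parent_id, block):
    event_id = record(trace_id, parent_id, "executor", f"task on {block}")
    return storage_read(trace_id, event_id, block)


def submit_job(trace_id, blocks):
    root = record(trace_id, None, "scheduler", "job submitted")
    return [run_task(trace_id, root, b) for b in blocks]


results = submit_job("job-42", ["b0", "b1"])
```

Because every event carries the trace id and its parent's id, the events can be reassembled into a tree per job no matter which layer or node emitted them.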
Putting it All Together
• A successful datacenter OS might let users:
– Build a Hadoop-like software stack in a week
using the OS’s abstractions, while gaining other
benefits (e.g. cross-stack replay debugging)
– Share data efficiently between independently
developed programming models and applications
– Understand cluster behavior without having to
log into individual nodes
– Dynamically share the cluster with other users
• Datacenters need an OS-like software stack
for the same reasons single computers did:
manageability, efficiency & programmability
• An OS is already emerging in an ad-hoc way
• Researchers can help by taking a long-term
approach towards these problems
How Researchers Can Help
• Focus on paradigms, not performance
– Industry is tackling performance but lacks the luxury
of taking a long-term view of abstractions
• Explore clean-slate approaches
– Likelier to have impact here than in a “real” OS
because datacenter software changes quickly!
• Bring cluster computing to non-experts
– Much harder and more rewarding than serving big users
