Berkley_Data_Analysis_Stack_(BDAS).

Report
Berkley Data Analysis Stack
(BDAS)
Mesos, Spark, Shark, Spark Streaming
Current Data Analysis Open Stack
Application
Data Processing
Storage
Infrastructure
Characteristics:
• Batch Processing on on-disk data..
• Not very efficient with “Interactive” and “Streaming” computations.
Goal
Berkeley Data Analytics Stack (BDAS)
New apps: AMP-Genomics, Carat, …
Application
Data Processing
• in-memory processing
• trade between time, quality, and cost
Data Management
Storage
Efficient data sharing across frameworks
Resource
Infrastructure
Management
Share infrastructure across frameworks
(multi-programming for datacenters)
BDAS Components
Mesos
• A platform for sharing commodity clusters between diverse
computing frameworks.
B. Hindman et. al, Mesos: A Platform for Fine-Grained
Resource Sharing in the Data Center, tech report, UCB,
2010
Mesos
• “Resource Offers” to publish available resources
• Has to deal with “framework” specific constraints
(without knowing the specific constraints).
• e.g. data locality.
• Allows the framework scheduler to “reject
offer” if constraints are not met.
B. Hindman et. al, Mesos: A Platform for Fine-Grained
Resource Sharing in the Data Center, tech report, UCB,
2010
Mesos
• Other Issues:
• Resource Allocation Strategies: Pluggable
• fair sharing plugin implemented
• Revocation
• Isolation:
• “Existing OS isolation techniques: Linux Containers”.
• Fault Tolerance:
• Master: stand by master nodes and zoo keeper
• Slaves: Reports task/slave failures to the framework, the latter handles.
• Framework scheduler failure: Replicate
B. Hindman et. al, Mesos: A Platform for Fine-Grained
Resource Sharing in the Data Center, tech report, UCB,
2010
Spark
Current popular programming models for clusters transform data flowing from
stable storage to stable storage
• Spark: In-Memory Cluster Computing for Iterative and Interactive Applications.
– Assumption
• Same working data set is used across iterations or for a number of interactive queries.
• Commodity Cluster
• Local Data partitions fit in Memory
Some Slides taken from presentation:
Spark
• Acyclic data flows – powerful abstractions
– But not efficient for Iterative/interactive applications that
repeatedly use the same “working data set”.
Solution: augment data flow model with in-memory
“resilient distributed datasets” (RDDs)
RDDs
• An RDD is an immutable, partitioned, logical collection of
records
– Need not be materialized, but rather contains information to rebuild a
dataset from stable storage (lazy-loading and lineage)
– can be rebuilt if a partition is lost (Transform once read many)
• Partitioning can be based on a key in each record (using hash or
range partitioning)
• Created by transforming data in stable storage using data flow operators (map,
filter, group-by, …)
• Can be cached for future reuse
Generality of RDDs
• Claim: Spark’s combination of data flow with RDDs
unifies many proposed cluster programming models
– General data flow models: MapReduce, Dryad, SQL
– Specialized models for stateful apps: Pregel (BSP), HaLoop
(iterative MR), Continuous Bulk Processing
• Instead of specialized APIs for one type of app, give
user first-class control of distributed datasets
Programming Model
Transformations
(define a new RDD)
map
filter
sample
union
groupByKey
reduceByKey
join
cache
…
Parallel operations
(return a result to driver)
reduce
collect
count
save
lookupKey
…
Example: Log Mining
• Load error messages from a log into memory,
then interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
messages = errors.map(_.split(‘\t’)(2))
Worker
results
errors = lines.filter(_.startsWith(“ERROR”))
cachedMsgs = messages.cache()
Cache 1
BaseTransformed
RDD
RDD
tasks
Block 1
Driver
Cached RDD
Parallel operation
cachedMsgs.filter(_.contains(“foo”)).count
Cache 2
cachedMsgs.filter(_.contains(“bar”)).count
Worker
. . .
Cache 3
Result: full-text search of Wikipedia in <1 sec (vs
20 sec for on-disk data)
Worker
Block 3
Block 2
RDD Fault Tolerance
• RDDs maintain lineage information that can be
used to reconstruct lost partitions
• Ex: cachedMsgs
= textFile(...).filter(_.contains(“error”))
.map(_.split(‘\t’)(2))
.cache()
HdfsRDD
FilteredRDD
MappedRDD
path: hdfs://…
func: contains(...)
func: split(…)
CachedRDD
Benefits of RDD Model
• Consistency is easy due to immutability
• Inexpensive fault tolerance (log lineage rather than
replicating/checkpointing data)
• Locality-aware scheduling of tasks on partitions
• Despite being restricted, model seems applicable to a broad
variety of applications
Example: Logistic Regression
• Goal: find best line separating two sets of
points
random initial line
target
Logistic Regression Code
val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
val gradient = data.map(p =>
(1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
).reduce(_ + _)
w -= gradient
}
println("Final w: " + w)
Logistic Regression Performance
127 s / iteration
first iteration 174 s
further iterations 6 s
Page Rank: Scala Implementation
val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs
for (i <- 1 to ITERATIONS) {
val contribs = links.join(ranks).flatMap {
case (url, (links, rank)) =>
links.map(dest => (dest, rank/links.size))
}
ranks = contribs.reduceByKey(_ + _)
.mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)
Spark Summary
• Fast, expressive cluster computing system compatible with
Apache Hadoop
– Works with any Hadoop-supported storage system (HDFS, S3, Avro, …)
• Improves efficiency through:
Up to 100× faster
– In-memory computing primitives
– General computation graphs
• Improves usability through:
– Rich APIs in Java, Scala, Python
– Interactive shell
Often 2-10× less code
Spark Streaming
• Framework for large scale stream processing
–
–
–
–
–
Scales to 100s of nodes
Can achieve second scale latencies
Integrates with Spark’s batch and interactive processing
Provides a simple batch-like API for implementing complex algorithm
Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.
Requirements
 Scalable to large clusters
 Second-scale latencies
 Simple programming model
 Integrated with batch & interactive processing

Stateful Stream Processing
 Traditional streaming systems have a eventdriven record-at-a-time processing model
- Each node has mutable state
- For each record, update state & send new
records
 State is lost if node dies!
 Making stateful stream processing be faulttolerant is challenging
mutable state
input
records
node 1
node 3
input
records
node 2
24
Discretized Stream Processing
Run a streaming computation as a series of very
small, deterministic batch jobs
live data stream

Chop up the live stream into batches of X seconds

Spark treats each batch of data as RDDs and
processes them using RDD operations

Finally, the processed results of the RDD
operations are returned in batches
Spark
Streaming
batches of X seconds
processed
results
25
Spark
Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
DStream: a sequence of RDD representing a stream of data
Twitter Streaming API
batch @ t
batch @ t+1
batch @ t+2
tweets DStream
stored in memory as an RDD
(immutable, distributed)
Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
new DStream
transformation: modify data in one Dstream to create another DStream
batch @ t
batch @ t+1
batch @ t+2
tweets DStream
hashTags Dstream
[#cat, #dog, … ]
flatMap
flatMap
…
flatMap
new RDDs created for
every batch
Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
output operation: to push data to external storage
batch @ t
batch @ t+1
batch @ t+2
tweets DStream
hashTags DStream
flatMa
p
flatMa
p
flatMa
p
save
save
save
every batch saved
to HDFS
Fault-tolerance
 RDDs remember the sequence of
operations that created it from the
original fault-tolerant input data
 Batches of input data are replicated in
memory of multiple worker nodes,
therefore fault-tolerant
 Data lost due to worker failure, can be
recomputed from input data
tweets
RDD
input data
replicated
in memory
flatMap
hashTags
RDD
lost partitions
recomputed on
other workers
Key concepts
• DStream – sequence of RDDs representing a stream of data
– Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
• Transformations – modify data from on DStream to another
– Standard RDD operations – map, countByValue, reduce, join, …
– Stateful operations – window, countByValueAndWindow, …
• Output Operations – send data to external entity
– saveAsHadoopFiles – saves to HDFS
– foreach – do anything with each batch of results
Higher throughput than Storm
 Spark Streaming: 670k records/second/node
Throughput per node
(MB/s)
Comparison with Storm and S4
 Storm: 115k records/second/node
120
Grep
Spark
80
40
Storm
0
100
1000
Record Size (bytes)
Throughput per node
(MB/s)
 Apache S4: 7.5k records/second/node
30
WordCount
Spark
20
10
Storm
0
100
1000
Record Size (bytes)
31

similar documents