Hadoop MapReduce
and Iterative (Graph) Algorithms
NETS 212: Scalable & Cloud Computing
Fall 2014
Z. Ives
University of Pennsylvania
© 2013 A. Haeberlen, Z. Ives
Last Time
• One-pass algorithms in MapReduce
• Filtering (heavily dependent on the mapper)
• Aggregation (heavily dependent on the reducer)
• Can also have a combiner that pre-aggregates data before it’s sent on the network
• Join (items of dissimilar types; make sure they have the same Reduce key)
• Sort (use shuffle)
Join
(see Lin & Dyer Chapter 3 for more detail)
Two main ways (not counting loading into RAM on every mapper) to do this:
1. Reduce-side join: Roughly the same as intersection
• send data from all source tables to the same reducer, by the “join key”
• in the reducer, compare all pairs of items in the set
• if they are of dissimilar types and satisfy the predicate, emit them
2. Map-side join: Need a way of ensuring both sources are partitioned the same way
• typically requires that we directly access files from within Map
• e.g., the outputs of prior Map / Reduce runs
• as Map gets called with an argument, merge it with the contents of the (hopefully local) file
Which should be more efficient?
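A minimal sketch of the reduce-side join, simulated in plain Java without any Hadoop classes. The table names ("users", "orders"), the Tagged helper, and the sample rows are assumptions made for illustration; the equality of join keys plays the role of the predicate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a reduce-side join; no Hadoop classes are used.
class ReduceSideJoinSketch {

    // A record tagged with the (hypothetical) name of the table it came from.
    static final class Tagged {
        final String table;
        final String payload;
        Tagged(String table, String payload) { this.table = table; this.payload = payload; }
    }

    // "Map" + "shuffle": route every record of both tables to its join key.
    static Map<String, List<Tagged>> shuffle(String[][] users, String[][] orders) {
        Map<String, List<Tagged>> groups = new TreeMap<>();
        for (String[] u : users) {
            groups.computeIfAbsent(u[0], k -> new ArrayList<>()).add(new Tagged("users", u[1]));
        }
        for (String[] o : orders) {
            groups.computeIfAbsent(o[0], k -> new ArrayList<>()).add(new Tagged("orders", o[1]));
        }
        return groups;
    }

    // "Reduce": within one key group, compare all pairs and emit those of dissimilar types.
    static List<String> reduce(String key, List<Tagged> values) {
        List<String> out = new ArrayList<>();
        for (Tagged a : values) {
            for (Tagged b : values) {
                if (a.table.equals("users") && b.table.equals("orders")) {
                    out.add(key + "," + a.payload + "," + b.payload);
                }
            }
        }
        return out;
    }

    // Runs the simulated shuffle and then the reducer once per join key.
    static List<String> join(String[][] users, String[][] orders) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<Tagged>> e : shuffle(users, orders).entrySet()) {
            result.addAll(reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] users = { {"1", "alice"}, {"2", "jose"} };
        String[][] orders = { {"1", "book"}, {"1", "pen"}, {"3", "lamp"} };
        for (String row : join(users, orders)) {
            System.out.println(row);
        }
    }
}
```

Keys that appear in only one table (2 and 3 here) produce no output, which is why the slide calls this roughly the same as intersection.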
Sorting
• Goal: Sort input
• Examples:
• Return all the domains covered by Google's index and the number of pages in each, ordered by the number of pages
• The programming model does not support this per se, but the implementations do
• Let’s take a look at what happens in the Shuffle stage
Let’s Roll up Our Sleeves
• We saw “abstract MapReduce” largely based on Google’s original design
• In reality we’ll be using Hadoop MapReduce, which has a few variations
• Goal #1: Be able to write and run simple MapReduce programs on a local Hadoop installation
• Mappers, reducers, drivers
• Running Hadoop locally
• Goal #2: Understand how a distributed Hadoop works internally
• HDFS; internal dataflow; distributed operation
What is Hadoop?
History of Hadoop – 2002-2004: Lucene and Nutch
• Early 2000s: Doug Cutting develops two open-source search projects:
• Lucene: Search indexer
• Used e.g., by Wikipedia
• Nutch: A spider/crawler (with Mike Cafarella, now a Prof. at UMich)
• Nutch
• Goal: Web-scale, crawler-based search
• Written by a few part-time developers
• Distributed, 'by necessity'
• Demonstrated 100M web pages on 4 nodes, but true 'web scale' still very distant
2004-2006: GFS and MapReduce
• 2003/04: GFS, MapReduce papers published
• Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung: "The Google File System", SOSP 2003
• Jeffrey Dean and Sanjay Ghemawat: "MapReduce: Simplified Data Processing on Large Clusters", OSDI 2004
• Directly addressed Nutch's scaling issues
• GFS & MapReduce added to Nutch
• Two part-time developers over two years (2004-2006)
• Crawler & indexer ported in two weeks
• Ran on 20 nodes at IA and UW
• Much easier to program and run, scales to several 100M web pages, but still far from web scale
2006-2008: Yahoo
• 2006: Yahoo hires Cutting
• Provides engineers, clusters, users, ...
• Big boost for the project; Yahoo spends tens of M$
• Not without a price: Yahoo has a slightly different focus (e.g., security) than the rest of the project; delays result
• Hadoop project split out of Nutch
• Finally hit web scale in early 2008
• Cutting is now at Cloudera
• Startup; started by three top engineers from Google, Facebook, Yahoo, and a former executive from Oracle
• Has its own version of Hadoop; software remains free, but company sells support and consulting services
• Cutting was elected chairman of the Apache Software Foundation
Who uses Hadoop?
• Hadoop is running search on some of the Internet's largest sites (see Chapter 16 of your textbook):
• Amazon Web Services: Elastic MapReduce
• AOL: Variety of uses, e.g., behavioral analysis & targeting
• EBay: Search optimization (532-node cluster)
• Facebook: Reporting/analytics, machine learning (1100 machines)
• Fox Interactive Media: MySpace, Photobucket, Rotten Tomatoes
• Last.fm: Track statistics and charts
• IBM: Blue Cloud Computing Clusters
• LinkedIn: People You May Know (2x50 machines)
• Rackspace: Log processing
• Twitter: Store + process tweets, log files, other data
• Yahoo: >36,000 nodes; biggest cluster is 4,000 nodes
Writing Hadoop Programs (Jobs)
Simplified scenario
• Let’s start with Hadoop in standalone mode
• Useful for development and debugging (NOT for production)
• Single node (e.g., your laptop computer)
• No jobtrackers or tasktrackers
• Data in local file system, not in HDFS
• This is how the Hadoop installation in your virtual machine works by default
• Later: Fully-distributed mode
• Used when running Hadoop on actual clusters
Recall the Basic Dataflow
[Figure: Input data → Mappers → intermediate (key,value) pairs → "The Shuffle" → Reducers → Output data]
What do we need to write?
• A mapper
• Accepts (key,value) pairs from the input
• Produces intermediate (key,value) pairs, which are then shuffled
• A reducer
• Accepts intermediate (key,value) pairs
• Produces final (key,value) pairs for the output
• A driver
• Specifies which inputs to use, where to put the outputs
• Chooses the mapper and the reducer to use
• Hadoop takes care of the rest!!
• Default behaviors can be customized by the driver
Hadoop I/O
• Your mapper and reducer need to read and write data
• They do this through an object called the Context
• Instead of simply using ints/Integers, Strings, etc. we need to do something else…
Hadoop has its own data types
Name            Description             JDK equivalent
IntWritable     32-bit integers         Integer
LongWritable    64-bit integers         Long
DoubleWritable  Floating-point numbers  Double
Text            Strings                 String
• Hadoop uses its own serialization
• Java serialization is known to be very inefficient
• Result: A set of special data types
• All implement the 'Writable' interface
• Most common types shown above; also has some more specialized types (SortedMapWritable, ObjectWritable, ...)
• Caution: Behavior somewhat unusual (e.g., Writable objects are mutable and their content can change between iterations)
The Mapper
import java.io.IOException;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.io.*;
// Input format: (file offset, line). Intermediate format can be freely chosen.
public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Write the key/value to the context
    context.write(new Text("key"), value);
  }
}
• Extends abstract 'Mapper' class
• Input/output types are specified as type parameters
• Implements a 'map' function
• Accepts (key,value) pair of the specified type
• Writes output pairs by calling 'write' method on context
• Mixing up the types will cause problems at runtime (!)
The Reducer
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.io.*;
// Input: intermediate format (same as mapper output); output format: (IntWritable, Text)
public class MyReducer extends Reducer<Text, Text, IntWritable, Text> {
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws java.io.IOException, InterruptedException
  {
    // Note: We may get multiple values for the same key!
    for (Text value : values)
      context.write(new IntWritable(4711), value);
  }
}
• Extends abstract 'Reducer' class
• Must specify types again (must be compatible with mapper!)
• Implements a 'reduce' function
• Values are passed in as an ‘Iterable’ of the appropriate type
• Caution: These are NOT normal Java classes. Do not store them in collections - content can change between iterations!
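A small plain-Java illustration of why storing the values in a collection goes wrong. MutableText is a hypothetical stand-in for a Writable, and the reusing() helper is an assumption that mimics an iterable handing back the same object on every iteration; it is not Hadoop's actual implementation.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Plain-Java illustration of object reuse; MutableText stands in for a Writable.
class ReuseSketch {

    // Hypothetical stand-in for a mutable Writable such as Text.
    static final class MutableText {
        private String s = "";
        void set(String s) { this.s = s; }
        @Override public String toString() { return s; }
    }

    // Iterable that hands out the SAME object every time, refilled with the next value.
    static Iterable<MutableText> reusing(final String... data) {
        final MutableText shared = new MutableText();
        return new Iterable<MutableText>() {
            @Override public Iterator<MutableText> iterator() {
                return new Iterator<MutableText>() {
                    private int i = 0;
                    @Override public boolean hasNext() { return i < data.length; }
                    @Override public MutableText next() { shared.set(data[i++]); return shared; }
                };
            }
        };
    }

    // Wrong: stores references; every list element is the same object.
    static List<String> storeReferences(Iterable<MutableText> values) {
        List<MutableText> kept = new ArrayList<>();
        for (MutableText v : values) kept.add(v);
        List<String> out = new ArrayList<>();
        for (MutableText v : kept) out.add(v.toString());
        return out;
    }

    // Right: copies the content out of the object before the next iteration.
    static List<String> storeCopies(Iterable<MutableText> values) {
        List<String> out = new ArrayList<>();
        for (MutableText v : values) out.add(v.toString());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(storeReferences(reusing("a", "b", "c"))); // prints [c, c, c]
        System.out.println(storeCopies(reusing("a", "b", "c")));     // prints [a, b, c]
    }
}
```

If values must be kept across iterations, copy their content first (as storeCopies does) instead of keeping the reference.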
The Driver
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MyDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf);
    // Mapper & Reducer are in the same JAR as MyDriver
    job.setJarByClass(MyDriver.class);
    // Input and output paths
    FileInputFormat.addInputPath(job, new Path("in"));
    FileOutputFormat.setOutputPath(job, new Path("out"));
    job.setMapperClass(MyMapper.class);
    job.setReducerClass(MyReducer.class);
    // Format of the intermediate (key,value) pairs output by the mapper
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    // Format of the (key,value) pairs output by the reducer;
    // must match MyReducer's declared output types (IntWritable, Text)
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
• Specifies how the job is to be executed
• Input and output directories; mapper & reducer classes
Ways of Running Your Hadoop Task
• Compile / build / run the Driver via ToolRunner from Eclipse (this is done in your HW2)
• Your driver implements Tool, main() calls it via the ToolRunner
• (see HW2 GeocodeDriver)
• More advanced modes require you to create a JAR file:
• Compile / build via javac
• Compile / build via ant
• Ultimately the JAR file is what gets shared with multiple machines in a cluster!
Manual compilation
• Goal: Produce a JAR file that contains the classes for mapper, reducer, and driver
• This can be submitted to the Job Tracker, or run directly through Hadoop
• Step #1: Put hadoop-core-1.0.3.jar into classpath:
export CLASSPATH=$CLASSPATH:/path/to/hadoop/hadoop-core-1.0.3.jar
• Step #2: Compile mapper, reducer, driver:
javac MyMapper.java MyReducer.java MyDriver.java
• Step #3: Package into a JAR file:
jar cvf My.jar *.class
• Alternative: "Export..."/"Java JAR file" in Eclipse
Optional: Compilation with Ant
(Needs build.xml file)
<project name="my" default="jar" basedir="./">
  <target name="init">
    <mkdir dir="classes"/>
  </target>
  <target name="compile" depends="init">
    <!-- srcdir: directory where source files are kept -->
    <javac srcdir="src" destdir="classes" includes="*.java" debug="true"/>
  </target>
  <!-- Makes the JAR file -->
  <target name="jar" depends="compile">
    <jar destfile="my.jar">
      <fileset dir="classes" includes="**/*.class"/>
    </jar>
  </target>
  <!-- Clean up any derived files -->
  <target name="clean">
    <delete dir="classes"/>
    <delete file="my.jar"/>
  </target>
</project>
• Apache Ant: A build tool for Java (~"make")
• Run "ant jar" to build the JAR automatically
• Run "ant clean" to clean up derived files (like make clean)
Standalone mode installation
• What is standalone mode?
• Installation on a single node
• Hadoop runs as an 'ordinary' Java program
• Used for debugging
• How to install Hadoop in standalone mode?
• See Textbook Appendix A
• Already done in your VM image
Running a job in standalone mode
• Step #1: Create & populate input directory
• Configured in the Driver via addInputPath()
• Put input file(s) into this directory (ok to have more than 1)
• Output directory must not exist yet
• Step #2: Run Hadoop
• As simple as this: hadoop jar <jarName> <driverClassName>
• Example: hadoop jar foo.jar upenn.nets212.MyDriver
• In verbose mode, Hadoop will print statistics while running
• Step #3: Collect output files
Recap: Writing simple jobs for Hadoop
• Write a mapper, reducer, driver
• Custom serialization → Must use special data types (Writable)
• Explicitly declare all three (key,value) types
• Package into a JAR file
• Must contain class files for mapper, reducer, driver
• Create manually (javac/jar) or automatically (ant)
• Running in standalone mode
• hadoop jar foo.jar FooDriver
• Input and output directories in local file system
• More details: Chapters 2,4,5 of your textbook
Wait a second...
• Wasn't Hadoop supposed to be very scalable?
• Work on Petabytes of data, run on thousands of machines
• Some more puzzle pieces are needed
• Special file system that can a) hold huge amounts of data, and b) feed them into MapReduce efficiently
→ Hadoop Distributed File System (HDFS)
• Framework for distributing map and reduce tasks across many nodes, coordination, fault tolerance...
→ Fully distributed mode
• Mechanism for customizing dataflow for particular applications (e.g., non-textual input format, special sort...)
→ Hadoop data flow
Hadoop over Distributed Files
What is HDFS?
• HDFS is a distributed file system
• Makes some unique tradeoffs that are good for MapReduce
• What HDFS does well:
• Very large read-only or append-only files (individual files may contain Gigabytes/Terabytes of data)
• Sequential access patterns
• What HDFS does not do well:
• Storing lots of small files
• Low-latency access
• Multiple writers
• Writing to arbitrary offsets in the file
HDFS versus NFS (and SMB)
Network File System (NFS)
• Single machine makes part of its file system available to other machines
• Sequential or random access
• PRO: Simplicity, generality, transparency
• CON: Storage capacity and throughput limited by single server
Hadoop Distributed File System (HDFS)
• Single virtual file system spread over many machines
• Optimized for sequential read and local accesses
• PRO: High throughput, high capacity
• "CON": Specialized for particular types of applications
How data is stored in HDFS
[Figure: The client asks the name node (which maps foo.txt: 3,9,6 and bar.data: 2,4) for block #2 of foo.txt and gets the answer 9; the client then reads block 9 directly from one of the data nodes, each of which stores replicas of several blocks]
• Files are stored as sets of (large) blocks
• Default block size: 64 MB (ext4 default is 4kB!)
• Blocks are replicated for durability and availability
• What are the advantages of this design?
• Namespace is managed by a single name node
• Actual data transfer is directly between client & data node
• Pros and cons of this decision?
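A quick arithmetic sketch of what the 64 MB block size means for a single file. The class and method names are made up for illustration; the example size is the 212,895,587-byte input3 file from the HDFS listing later in these slides.

```java
// Arithmetic sketch: how many fixed-size blocks a file of a given length occupies.
class BlockMathSketch {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB = 67,108,864 bytes

    // Number of blocks, rounding up so that a partial last block is counted.
    static long blockCount(long fileSize) {
        return (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Size of the last block (equal to BLOCK_SIZE when the file divides evenly).
    static long lastBlockSize(long fileSize) {
        long rest = fileSize % BLOCK_SIZE;
        return (rest == 0 && fileSize > 0) ? BLOCK_SIZE : rest;
    }

    public static void main(String[] args) {
        long size = 212895587L; // example file size in bytes
        // prints "4 blocks, last one 11568995 bytes"
        System.out.println(blockCount(size) + " blocks, last one " + lastBlockSize(size) + " bytes");
    }
}
```

Three full blocks of 67,108,864 bytes plus one partial block of 11,568,995 bytes add up to the file size; these match the block file sizes visible in the data node's directory listing.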
The Namenode
[Figure: The name node keeps the file-to-block mapping (foo.txt: 3,9,6; bar.data: 2,4; blah.txt: 17,18,19,20; xyz.img: 8,5,1,11) in fsimage, and a log of changes (Created abc.txt; Appended block 21 to blah.txt; Deleted foo.txt; Appended block 22 to blah.txt; Appended block 23 to xyz.img; ...) in edits]
• State stored in two files: fsimage and edits
• fsimage: Snapshot of file system metadata
• edits: Changes since last snapshot
• Normal operation:
• When namenode starts, it reads fsimage and then applies all the changes from edits sequentially
• Pros and cons of this design?
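The startup procedure can be sketched in plain Java as replaying a log over a snapshot. The operation format ("create"/"append"/"delete" string arrays) is a hypothetical encoding chosen for this sketch, not the real on-disk format; the data is the example from the figure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of snapshot-plus-log recovery; the op format is hypothetical.
class EditLogSketch {

    // Apply each logged change, in order, to a copy of the snapshot.
    static Map<String, List<Integer>> replay(Map<String, List<Integer>> fsimage,
                                             List<String[]> edits) {
        Map<String, List<Integer>> state = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : fsimage.entrySet()) {
            state.put(e.getKey(), new ArrayList<Integer>(e.getValue()));
        }
        for (String[] op : edits) {
            switch (op[0]) {
                case "create": state.put(op[1], new ArrayList<Integer>()); break;
                case "append": state.get(op[1]).add(Integer.parseInt(op[2])); break;
                case "delete": state.remove(op[1]); break;
                default: throw new IllegalArgumentException("unknown op " + op[0]);
            }
        }
        return state;
    }

    // The snapshot and the log shown on the slide.
    static Map<String, List<Integer>> example() {
        Map<String, List<Integer>> fsimage = new TreeMap<>();
        fsimage.put("foo.txt", Arrays.asList(3, 9, 6));
        fsimage.put("bar.data", Arrays.asList(2, 4));
        fsimage.put("blah.txt", Arrays.asList(17, 18, 19, 20));
        fsimage.put("xyz.img", Arrays.asList(8, 5, 1, 11));
        List<String[]> edits = Arrays.asList(
            new String[] {"create", "abc.txt"},
            new String[] {"append", "blah.txt", "21"},
            new String[] {"delete", "foo.txt"},
            new String[] {"append", "blah.txt", "22"},
            new String[] {"append", "xyz.img", "23"});
        return replay(fsimage, edits);
    }

    public static void main(String[] args) {
        System.out.println(example());
    }
}
```

The longer the edits log, the longer this replay takes at startup, which is one of the "cons" to think about.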
The Secondary Namenode
• What if the state of the namenode is lost?
• Data in the file system can no longer be read!
• Solution #1: Metadata backups
• Namenode can write its metadata to a local disk, and/or to a remote NFS mount
• Solution #2: Secondary Namenode
• Purpose: Periodically merge the edit log with the fsimage to prevent the log from growing too large
• Has a copy of the metadata, which can be used to reconstruct the state of the namenode
• But: State lags behind somewhat, so data loss is likely if the namenode fails
Your Files in HDFS
Accessing data in HDFS
[[email protected] ~]$ ls -la /tmp/hadoop-ahae/dfs/data/current/
total 209588
drwxrwxr-x 2 ahae ahae     4096 2013-10-08 15:46 .
drwxrwxr-x 5 ahae ahae     4096 2013-10-08 15:39 ..
-rw-rw-r-- 1 ahae ahae 11568995 2013-10-08 15:44 blk_-3562426239750716067
-rw-rw-r-- 1 ahae ahae    90391 2013-10-08 15:44 blk_-3562426239750716067_1020.meta
-rw-rw-r-- 1 ahae ahae        4 2013-10-08 15:40 blk_5467088600876920840
-rw-rw-r-- 1 ahae ahae       11 2013-10-08 15:40 blk_5467088600876920840_1019.meta
-rw-rw-r-- 1 ahae ahae 67108864 2013-10-08 15:44 blk_7080460240917416109
-rw-rw-r-- 1 ahae ahae   524295 2013-10-08 15:44 blk_7080460240917416109_1020.meta
-rw-rw-r-- 1 ahae ahae 67108864 2013-10-08 15:44 blk_-8388309644856805769
-rw-rw-r-- 1 ahae ahae   524295 2013-10-08 15:44 blk_-8388309644856805769_1020.meta
-rw-rw-r-- 1 ahae ahae 67108864 2013-10-08 15:44 blk_-9220415087134372383
-rw-rw-r-- 1 ahae ahae   524295 2013-10-08 15:44 blk_-9220415087134372383_1020.meta
-rw-rw-r-- 1 ahae ahae      158 2013-10-08 15:40 VERSION
[[email protected] ~]$
• HDFS implements a separate file namespace
• Files in HDFS are not visible in the normal file system
• Only the blocks and the block metadata are visible
• HDFS cannot be (easily) mounted
• Some FUSE drivers have been implemented for it
Accessing data in HDFS
[[email protected] ~]$ /usr/local/hadoop/bin/hadoop fs -ls /user/ahae
Found 4 items
-rw-r--r--   1 ahae supergroup       1366 2013-10-08 15:46 /user/ahae/README.txt
-rw-r--r--   1 ahae supergroup          0 2013-10-08 15:35 /user/ahae/input
-rw-r--r--   1 ahae supergroup          0 2013-10-08 15:39 /user/ahae/input2
-rw-r--r--   1 ahae supergroup  212895587 2013-10-08 15:44 /user/ahae/input3
[[email protected] ~]$
• File access is through the hdfs dfs command; its subcommands are named after UNIX commands, with a dash in front
• Examples:
• hdfs dfs -put [file] [hdfsPath]: Stores a file in HDFS
• hdfs dfs -ls [hdfsPath]: List a directory
• hdfs dfs -get [hdfsPath] [file]: Retrieves a file from HDFS
• hdfs dfs -rm [hdfsPath]: Deletes a file in HDFS
• hdfs dfs -mkdir [hdfsPath]: Makes a directory in HDFS
More Details
• Your home directory in HDFS is /user/username instead of /home/username
• Two important commands:
• hdfs dfs -copyFromLocal [path] [hdfsPath]: Copies data to HDFS from local filesystem
• hdfs dfs -copyToLocal [hdfsPath] [path]: Does the opposite
Alternatives to the command line
• Getting data in and out of HDFS through the command-line interface is a bit cumbersome
• Alternatives have been developed:
• FUSE file system: Allows HDFS to be mounted under Unix
• WebDAV share: Can be mounted as filesystem on many OSes
• HTTP: Read access through namenode's embedded web server
• FTP: Standard FTP interface
• ...
Accessing HDFS directly from Java
• Programs can read/write HDFS files directly
• Not needed in MapReduce; I/O is handled by the framework
• Files are represented as URIs
• Example: hdfs://localhost/user/nets212/example.txt
• Access is via the FileSystem API
• To get access to the file: FileSystem.get()
• For reading, call open() -- returns InputStream
• For writing, call create() -- returns OutputStream
What about permissions?
• Since 0.16.1, Hadoop has rudimentary support for POSIX-style permissions
• rwx for users, groups, 'other' -- just like in Unix
• 'hdfs dfs' has support for chmod, chgrp, chown
• But: POSIX model is not a very good fit
• Many combinations are meaningless: Files cannot be executed, and existing files cannot really be written to
• Permissions were not really enforced
• Hadoop does not verify whether user's identity is genuine
• Useful more to prevent accidental data corruption or casual misuse of information
Where are things today?
• Since v0.20.20x, Hadoop has some security
• Kerberos RPC (SASL/GSSAPI)
• HTTP SPNEGO authentication for web consoles
• HDFS file permissions actually enforced
• Various kinds of delegation tokens
• Network encryption
• For more details, see:
https://issues.apache.org/jira/secure/attachment/12428537/securitydesign.pdf
• Big changes are coming
• Project Rhino (e.g., encrypted data at rest)
Recap: HDFS
• HDFS: A specialized distributed file system
• Good for large amounts of data, sequential reads
• Bad for lots of small files, random access, non-append writes
• Architecture: Blocks, namenode, datanodes
• File data is broken into large blocks (64MB default)
• Blocks are stored & replicated by datanodes
• Single namenode manages all the metadata
• Secondary namenode: Housekeeping & (some) redundancy
• Usage: Special command-line interface
• Example: hadoop fs -ls /path/in/hdfs
• More details: Chapter 3 of your textbook
How Hadoop Works
A Familiar Diagram: High-level dataflow
[Figure: Input data → Mappers → intermediate (key,value) pairs → "The Shuffle" → Reducers → Output data]
Detailed dataflow in Hadoop
[Figure: On each node (Node 1, Node 2): Files in the local HDFS store → InputFormat → Splits → RecordReaders (RR) → map → Combine → Partition → Sort → Reduce → OutputFormat → Files in the local HDFS store]
Input Format
• Defines which input files should be read, and how
• Defaults provided, e.g., TextInputFormat, DBInputFormat, KeyValueTextInputFormat...
• Defines InputSplits
• InputSplits break file into separate tasks
• Example: one task for each 64MB block (why?)
• Provides a factory for RecordReaders
• RecordReaders actually read the file into (key,value) pairs
• Default format, TextInputFormat, uses byte offset in file as the key, and line as the value
• KeyValueTextInputFormat reads (key,value) pairs from the file directly; key is everything up to the first tab character
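A plain-Java sketch of the two record shapes described above; it imitates the behavior rather than using Hadoop's classes. It assumes UTF-8 text with "\n" line endings, and treating a line without a tab as a key with an empty value is an assumption of this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the (key,value) records that text-based record readers produce.
class LineRecordSketch {

    // TextInputFormat style: key = byte offset of the line, value = the line itself.
    // Each record is returned as "offset<TAB>line" for easy printing.
    static List<String> records(String fileContent) {
        List<String> out = new ArrayList<>();
        if (fileContent.isEmpty()) return out;
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            out.add(offset + "\t" + line);
            // Advance by the line's length in bytes plus one byte for the newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    // KeyValueTextInputFormat style: key = everything up to the first tab, value = the rest.
    static String[] keyValue(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) return new String[] {line, ""}; // no tab: whole line as key (assumption)
        return new String[] {line.substring(0, tab), line.substring(tab + 1)};
    }

    public static void main(String[] args) {
        for (String r : records("hello world\nfoo\n")) {
            System.out.println(r);
        }
        String[] kv = keyValue("alice\tfan-of\tFacebook");
        System.out.println(kv[0] + " => " + kv[1]);
    }
}
```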
Combiners
• Optional component that can be inserted after the mappers
• Input: All data emitted by the mappers on a given node
• Output passed to the partitioner
• Why is this useful?
• Suppose your mapper counts words by emitting (xyz, 1) pairs for each word xyz it finds
• If a word occurs many times, it is much more efficient to pass (xyz, k) to the reducer, than passing k copies of (xyz,1)
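The word-count case can be sketched in plain Java to show how much a combiner shrinks the data before the shuffle. This is a simulation under the assumption that one input line stands for all the data seen on one node; no Hadoop classes are involved.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a word-count mapper followed by a combiner.
class CombinerSketch {

    // "Map": emit one (word, 1) pair for each word found.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : line.trim().split("\\s+")) {
            if (!w.isEmpty()) pairs.add(new AbstractMap.SimpleEntry<String, Integer>(w, 1));
        }
        return pairs;
    }

    // "Combine": pre-aggregate the pairs emitted on this node into one (word, k) pair per word.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return new ArrayList<Map.Entry<String, Integer>>(sums.entrySet());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapped = map("to be or not to be");
        List<Map.Entry<String, Integer>> combined = combine(mapped);
        System.out.println(mapped.size() + " pairs before combining");   // 6 pairs
        System.out.println(combined.size() + " pairs after: " + combined); // 4 pairs
    }
}
```

This works because summing is associative and commutative, so the reducer gets the same totals whether or not the combiner ran.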
Partitioner
• Controls which intermediate key-value pairs should go to which reducer
• Defines a partition on the set of KV pairs
• Number of partitions is the same as the number of reducers
• Default partitioner (HashPartitioner) assigns partition based on a hash of the key
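A minimal sketch of hash partitioning in plain Java. It uses String keys and String.hashCode() as a stand-in; Hadoop keys such as Text have their own hashCode(), so the actual partition numbers will differ.

```java
// Plain-Java sketch of hash partitioning: same key, same reducer.
class PartitionSketch {

    // Hash of the key, made non-negative, modulo the number of reducers.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 4; // example value
        for (String key : new String[] {"Alice", "Sunita", "Jose", "Alice"}) {
            System.out.println(key + " -> reducer " + partition(key, numReducers));
        }
    }
}
```

Masking with Integer.MAX_VALUE clears the sign bit, so keys with a negative hash code still land in the range 0 to numReducers - 1.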
Output Format
• Counterpart to InputFormat
• Controls where output is stored, and how
• Provides a factory for RecordWriter
• Several implementations provided
• TextOutputFormat (default)
• DBOutputFormat
• MultipleTextOutputFormat
• ...
Recap: Dataflow in Hadoop
• Hadoop has many components that are usually hidden from the developer
• Many of these can be customized:
• InputFormat: Defines how input files are read
• InputSplit: Defines how data portions are assigned to tasks
• RecordReader: Reads actual KV pairs from input files
• Combiner: Mini-reduce step on each node, for efficiency
• Partitioner: Assigns intermediate KV pairs to reducers
• Comparator: Controls how KV pairs are sorted after shuffle
• More details: Chapter 7 of your textbook
Hadoop in Distributed Mode
Hadoop daemons
• TaskTracker
• Runs maps and reduces. One per node.
• JobTracker
• Accepts jobs; assigns tasks to TaskTrackers
• DataNode
• Stores HDFS blocks
• NameNode
• Stores HDFS metadata
• SecondaryNameNode
• Merges edits file with snapshot; "backup" for NameNode
• A single node can run more than one of these!
An example configuration
[Figure: Example daemon placement for a small cluster and a medium cluster, showing JobTracker, NameNode, and Secondary NameNode, plus worker nodes running TaskTracker and DataNode]
Fault tolerance
• What if a node fails during a job?
• JobTracker notices that the node's TaskTracker no longer responds; re-executes the failed node's tasks
• What specifically should be re-executed?
• Depends on the phase the job was in
• Mapping phase: Re-execute all maps assigned to failed node
• Reduce phase: Re-execute all reduces assigned to the node
• Is this sufficient?
• No! Failed node may also have completed map tasks, and other nodes may not have finished copying out the results
• Need to re-execute map tasks on the failed node as well!
Speculative execution
• What if some tasks are much harder, or some nodes much slower, than the others?
• Entire job is delayed!
• Solution: Speculative execution
• If task is almost complete, schedule a few redundant tasks on nodes that have nothing else to do
• Whichever one finishes first becomes the definitive copy; the others' results are discarded to prevent duplicates
Placement and locality
[Figure: Block replicas and a task placed across Rack 1 and Rack 2 of Datacenter A and Rack 1 and Rack 2 of Datacenter B]
• Which of the replicated blocks should be read?
• If possible, pick the closest one (reduces network load)
• Distance metric takes into account: Nodes, racks, datacenters
• Where should the replicas be put?
• Tradeoff between fault tolerance and locality/performance
Recap: Distributed mode
• Five important daemons:
• MapReduce daemons: JobTracker, TaskTracker
• HDFS daemons: DataNode, NameNode, Secondary NameNode
• Workers run TaskTracker+DataNode
• Special features:
• Transparently re-executes tasks if nodes fail
• Speculatively executes tasks to limit impact of stragglers
• Rack-aware placement to keep traffic local
• More details: Chapter 9 of your textbook
Beyond Single Hadoop/MapReduce Jobs:
Processing Graphs
Beyond average/sum/count
• Much of the world is a network of relationships and shared features
• Members of a social network can be friends, and may have shared interests / memberships / etc.
• Customers might view similar movies, and might even be clustered by interest groups
• The Web consists of documents with links
• Documents are also related by topics, words, authors, etc.
Goal: Develop a toolbox
• We need a toolbox of algorithms useful for analyzing data that has both relationships and properties
• For the next ~2 lectures we’ll start to build this toolbox
• Some of the problems are studied in courses you may not have taken yet:
• CIS 320 (algorithms), CIS 391/520 (AI), CIS 455 (Web Systems)
• So we’ll see both the traditional solution and the MapReduce one
Encoding Data as Graphs
(Should Be Familiar after Seeing DBpedia)
Images by Jojo Mendoza, Creative Commons licensed
Thinking about related objects
[Figure: Labeled, directed graph over Alice, Sunita, Jose, Mikhail, Facebook, and Magna Carta, connected by fan-of and friend-of edges]
• We can represent related objects as a labeled, directed graph
• Entities are typically represented as nodes; relationships are typically edges
• Nodes all have IDs, and possibly other properties
• Edges typically have values, possibly IDs and other properties
Encoding the data in a graph
[Figure: The same graph over Facebook, Mikhail, Magna Carta, Alice, Sunita, and Jose, without edge labels]
• Recall basic definition of a graph:
• G = (V, E) where V is vertices, E is edges of the form (v1,v2) where v1,v2 ∈ V
• Assume we only care about connected vertices
• Then we can capture a graph simply as the edges
• ... or as an adjacency list: vi goes to [vj, vj+1, … ]
Graph encodings: Set of edges
[Figure: The graph over Facebook, Mikhail, Magna Carta, Alice, Sunita, and Jose]
(Alice, Facebook)
(Alice, Sunita)
(Jose, Magna Carta)
(Jose, Sunita)
(Mikhail, Facebook)
(Mikhail, Magna Carta)
(Sunita, Facebook)
(Sunita, Alice)
(Sunita, Jose)
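The edge set above can be regrouped into the adjacency-list encoding with a single pass. A minimal plain-Java sketch; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Converts an edge set into an adjacency list: source vertex -> list of target vertices.
class AdjacencySketch {

    // The edge set from the slide, as (source, target) pairs.
    static final String[][] EDGES = {
        {"Alice", "Facebook"}, {"Alice", "Sunita"},
        {"Jose", "Magna Carta"}, {"Jose", "Sunita"},
        {"Mikhail", "Facebook"}, {"Mikhail", "Magna Carta"},
        {"Sunita", "Facebook"}, {"Sunita", "Alice"}, {"Sunita", "Jose"}
    };

    // Group the edges by their source vertex.
    static Map<String, List<String>> toAdjacency(String[][] edges) {
        Map<String, List<String>> adj = new TreeMap<>();
        for (String[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
        }
        return adj;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, List<String>> e : toAdjacency(EDGES).entrySet()) {
            System.out.println(e.getKey() + " goes to " + e.getValue());
        }
    }
}
```

Vertices without outgoing edges (Facebook, Magna Carta) do not get an entry of their own, which is fine under the slides' assumption that we only care about connected vertices.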
Graph encodings: Adding edge types
[Figure: The graph with each edge labeled fan-of or friend-of]
(Alice, fan-of, Facebook)
(Alice, friend-of, Sunita)
(Jose, fan-of, Magna Carta)
(Jose, friend-of, Sunita)
(Mikhail, fan-of, Facebook)
(Mikhail, fan-of, Magna Carta)
(Sunita, fan-of, Facebook)
(Sunita, friend-of, Alice)
(Sunita, friend-of, Jose)
Graph encodings: Adding weights
[Figure: The graph with each edge labeled fan-of or friend-of and annotated with a weight]
(Alice, fan-of, 0.5, Facebook)
(Alice, friend-of, 0.9, Sunita)
(Jose, fan-of, 0.5, Magna Carta)
(Jose, friend-of, 0.3, Sunita)
(Mikhail, fan-of, 0.8, Facebook)
(Mikhail, fan-of, 0.7, Magna Carta)
(Sunita, fan-of, 0.7, Facebook)
(Sunita, friend-of, 0.9, Alice)
(Sunita, friend-of, 0.3, Jose)
Recap: Related objects
• We can represent the relationships between related objects as a directed, labeled graph
• Vertices represent the objects
• Edges represent relationships
• We can annotate this graph in various ways
• Add labels to edges to distinguish different types
• Add weights to edges
• ...
• We can encode the graph in various ways
• Examples: Edge set, adjacency list
Graph Algorithms in MapReduce
A computation model for graphs
[Figure: The weighted, labeled graph over Alice, Sunita, Jose, Mikhail, Facebook, and Magna Carta]
• Once the data is encoded in this way, we can perform various computations on it
• Simple example: Which users are their friends' best friend?
• More complicated examples (later): Page rank, adsorption, ...
• This is often done by
• annotating the vertices with additional information, and
• propagating the information along the edges
• "Think like a vertex"!
A computation model for graphs
[Figure: The weighted, labeled graph over Alice, Sunita, Jose, Mikhail, Facebook, and Magna Carta]
Slightly more technical: How many of my friends have me as their best friend?
• Example: Am I my friends' best friend?
• Step #1: Discard irrelevant vertices and edges
A computation model for graphs
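[Figure: Only the friend-of edges remain (Alice–Sunita: 0.9, Sunita–Jose: 0.3; Mikhail has none). Each vertex is annotated with its own friend list: Alice {alice→sunita: 0.9}; Sunita {sunita→alice: 0.9, sunita→jose: 0.3}; Jose {jose→sunita: 0.3}]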
• Example: Am I my friends' best friend?
• Step #1: Discard irrelevant vertices and edges
• Step #2: Annotate each vertex with list of friends
• Step #3: Push annotations along each edge
A computation model for graphs
[Figure: After pushing the annotations along each friend-of edge, Alice holds {alice→sunita: 0.9} plus Sunita's list {sunita→alice: 0.9, sunita→jose: 0.3}; Sunita holds her own list plus {alice→sunita: 0.9} and {jose→sunita: 0.3}; Jose holds {jose→sunita: 0.3} plus Sunita's list {sunita→alice: 0.9, sunita→jose: 0.3}; Mikhail holds nothing]
• Example: Am I my friends' best friend?
• Step #1: Discard irrelevant vertices and edges
• Step #2: Annotate each vertex with list of friends
• Step #3: Push annotations along each edge
A computation model for graphs
[Figure: the annotated subgraph from the previous slide, unchanged.]
• Example: Am I my friends' best friend?
• Step #1: Discard irrelevant vertices and edges
• Step #2: Annotate each vertex with list of friends
• Step #3: Push annotations along each edge
• Step #4: Determine result at each vertex
Can we do this in MapReduce?
map(key: node, value: [<otherNode, relType, strength>])
{
}
reduce(key: ________, values: list of _________)
{
}
• Using adjacency list representation?
Can we do this in MapReduce?
map(key: node, value: <otherNode, relType, strength>)
{
}
reduce(key: ________, values: list of _________)
{
}
• Using single-edge data representation?
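One possible way to fill in the adjacency-list template is sketched below in Python. This is a small in-memory simulation, not Hadoop code: `run_mapreduce` is a hypothetical helper that stands in for one job, the friend-of strengths come from the example graph, and the fan-of edges and their weights are placeholders. It also assumes that ties all count as "best friend".

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Tiny in-memory stand-in for one MapReduce job (not the Hadoop API)."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)      # the "shuffle": group by key
    result = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            result[out_key] = out_value
    return result

def best_friend_map(node, edges):
    # edges: list of (otherNode, relType, strength), as in the template above
    friends = [(other, s) for other, rel, s in edges if rel == "friend-of"]
    yield node, None                 # make sure every vertex reaches a reducer
    if friends:
        top = max(s for _, s in friends)
        for other, s in friends:
            if s == top:             # ties all count as "best friend"
                yield other, node    # tell `other` that `node` likes them best

def best_friend_reduce(node, values):
    admirers = [v for v in values if v is not None]
    yield node, len(admirers)        # how many friends have me as best friend

graph = [
    ("alice",   [("sunita", "friend-of", 0.9), ("facebook", "fan-of", 0.5)]),
    ("sunita",  [("alice", "friend-of", 0.9), ("jose", "friend-of", 0.3)]),
    ("jose",    [("sunita", "friend-of", 0.3)]),
    ("mikhail", [("facebook", "fan-of", 0.7)]),
]
counts = run_mapreduce(graph, best_friend_map, best_friend_reduce)
print(counts)
```

In this formulation the mapper does Steps #1 to #3 (filter, annotate, push) and the reducer does Step #4. A single-edge representation would likely need an extra pass to find each vertex's strongest edge first.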
A real-world use case
• A variant that is actually used in social networks today: "Who are the friends of multiple
of my friends?"
• Where have you seen this before?
• Friend recommendation!
• Maybe these people should be my friends too!
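A minimal sketch of how "friends of multiple of my friends" could be computed in one map/reduce pass, again simulated in plain Python. The names (`recommend_map`, `recommend_reduce`), the four-person graph, and the threshold of two common friends are all illustrative assumptions.

```python
from collections import Counter, defaultdict
from itertools import permutations

def recommend_map(person, friends):
    for f in friends:
        yield person, ("friend", f)        # remember existing friendships
    for a, b in permutations(friends, 2):
        yield a, ("candidate", b)          # b is a friend of a's friend `person`

def recommend_reduce(person, values):
    direct = {v for tag, v in values if tag == "friend"}
    counts = Counter(v for tag, v in values if tag == "candidate")
    # keep people reachable through at least two different friends
    return sorted(c for c, n in counts.items() if n >= 2 and c not in direct)

friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann", "dan"],
    "dan": ["bob", "cat"],
}

groups = defaultdict(list)                  # simulated shuffle
for person, friend_list in friends.items():
    for key, value in recommend_map(person, friend_list):
        groups[key].append(value)

recommendations = {p: recommend_reduce(p, vs) for p, vs in groups.items()}
print(recommendations)
```

Here ann and dan are not friends but share two friends (bob and cat), so each is recommended to the other.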
Generalizing…
• Now suppose we want to go beyond direct friend relationships
• Example: How many of my friends' friends (distance-2 neighbors) have me as their best friend's
best friend?
• What do we need to do?
• How about distance k>2?
• To compute the answer, we need to run multiple iterations of MapReduce!
Iterative MapReduce
• The basic model:
copy files from input dir → staging dir 1
(optional: do some preprocessing)
while (!terminating condition) {
  map from staging dir 1
  reduce into staging dir 2
  move files from staging dir 2 → staging dir 1
}
(optional: postprocessing)
move files from staging dir 2 → output dir
• Note that reduce output must be compatible with the map
input!
• What can happen if we filter out some information in the mapper or in the reducer?
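The loop above can be sketched in Python with dictionaries standing in for the staging directories. The job chosen here (marking every node reachable from an already reached node) and the names `one_round` and `iterate` are assumptions for illustration only. Note how the reducer re-emits each node's successor list so that its output has the same shape as the map input; if it filtered that out, the next iteration would have no edges to follow.

```python
from collections import defaultdict

def one_round(state):
    """One map + reduce pass. state: node -> (reached, successors)."""
    groups = defaultdict(list)
    for node, (reached, succs) in state.items():           # map phase
        groups[node].append(("self", reached, succs))      # pass the record along
        if reached:
            for s in succs:
                groups[s].append(("msg", True, None))       # push along each edge
    new_state = {}
    for node, values in groups.items():                     # reduce phase
        succs = next((v[2] for v in values if v[0] == "self"), [])
        reached = any(v[1] for v in values)
        new_state[node] = (reached, succs)                  # same shape as the input
    return new_state

def iterate(state, max_rounds=100):
    for rounds in range(1, max_rounds + 1):
        new_state = one_round(state)
        if new_state == state:       # terminating condition: nothing changed
            return state, rounds
        state = new_state            # "move staging dir 2 -> staging dir 1"
    return state, max_rounds

start = {
    "s": (True, ["a"]),
    "a": (False, ["b"]),
    "b": (False, []),
    "x": (False, ["s"]),
}
final, rounds = iterate(start)
print(final, rounds)
```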
Graph algorithms and MapReduce
• A centralized algorithm typically traverses a tree or a graph one
item at a time (there’s only one “cursor”)
• You’ve learned breadth-first and depth-first traversals
• Most algorithms that are based on graphs make use of multiple
map/reduce stages processing one “wave” at a time
• Sometimes iterative MapReduce, other times chains of map/reduce
Recap: MapReduce on graphs
• Suppose we want to:
• compute a function for each vertex in a graph...
• ... using data from vertices at most k hops away
• We can do this as follows:
• "Push" information along the edges
• "Think like a vertex"
• Finally, perform the computation at each vertex
• May need more than one MapReduce phase
• Iterative MapReduce: Outputs of stage i → inputs of stage i+1
Basic Graph Algorithms:
Single-Source Shortest Path
Path-based algorithms
• Sometimes our goal is to compute information about the paths (sets
of paths) between nodes
• Edges may be annotated with cost, distance, or similarity
• Examples of such problems (see CIS 121+320):
• Shortest path from one node to another
• Minimum spanning tree (minimal-cost tree connecting all vertices in a graph)
• Steiner tree (minimal-cost tree connecting certain nodes)
• Topological sort (node in a DAG comes before all nodes it points to)
Single-Source Shortest Path (SSSP)
Given a directed graph G = (V, E) in which each edge e has a cost c(e):
• Compute the cost of reaching each node from the source node s in the most efficient way (potentially after multiple 'hops')
[Figure: directed graph with source s (cost 0) and vertices a, b, c, d whose costs are still unknown ("?"). Edge costs: s→a 10, s→c 5, a→b 1, a→c 2, c→a 3, c→b 9, c→d 2, b→d 4, d→b 6, d→s 7.]
SSSP: Intuition
• We can formulate the problem using induction
• The shortest path follows the principle of optimality: the last step (u,v)
makes use of the shortest path to u
• We can then express this as follows:
bestDistanceAndPath(v) {
  if (v == source) {
    return <distance 0, path [v]>
  } else {
    find argmin_u (bestDistanceAndPath[u] + dist[u,v])
    return <bestDistanceAndPath[u] + dist[u,v], path[u] + v>
  }
}
SSSP: CIS 320-style solution
• Traditional approach: Dijkstra's algorithm
V: vertices, E: edges, S: start node
// Initialize length and last step of path to default values
foreach v in V
  dist_S_To[v] := infinity
  predecessor[v] := nil
dist_S_To[S] := 0
spSet := {}
Q := V
while (Q not empty) do
  u := Q.removeNodeClosestTo(S)
  spSet := spSet + {u}
  // Update length and path based on edges radiating from u
  foreach v in V where (u,v) in E
    if (dist_S_To[v] > dist_S_To[u] + cost(u,v)) then
      dist_S_To[v] := dist_S_To[u] + cost(u,v)
      predecessor[v] := u
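For reference, here is a runnable Python sketch of the same idea. It uses a binary heap (`heapq`) with lazy deletion of stale entries in place of `Q.removeNodeClosestTo(S)`, which is an implementation choice rather than something the pseudocode prescribes. The edge list is an assumption transcribed from the CLR example graph traced on the following slides.

```python
import heapq

def dijkstra(edges, source):
    """edges: node -> {successor: cost}. Returns (dist, predecessor)."""
    dist = {v: float("inf") for v in edges}
    pred = {v: None for v in edges}
    dist[source] = 0
    done = set()                      # plays the role of spSet
    queue = [(0, source)]
    while queue:
        d, u = heapq.heappop(queue)   # closest node not yet finalized
        if u in done:
            continue                  # stale queue entry, skip it
        done.add(u)
        for v, cost in edges[u].items():
            if dist[v] > d + cost:    # found a shorter path to v through u
                dist[v] = d + cost
                pred[v] = u
                heapq.heappush(queue, (dist[v], v))
    return dist, pred

graph = {
    "s": {"a": 10, "c": 5},
    "a": {"b": 1, "c": 2},
    "b": {"d": 4},
    "c": {"a": 3, "b": 9, "d": 2},
    "d": {"b": 6, "s": 7},
}
dist, pred = dijkstra(graph, "s")
print(dist, pred)
```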
Example from CLR 2nd ed. p. 528
SSSP: Dijkstra in Action
[Figure: the example graph with current cost estimates s = 0, a = ∞, b = ∞, c = ∞, d = ∞.]
Q = {s,a,b,c,d}
spSet = {}
dist_S_To: {(a,∞), (b,∞), (c,∞), (d,∞)}
predecessor: {(a,nil), (b,nil), (c,nil), (d,nil)}
Example from CLR 2nd ed. p. 528
SSSP: Dijkstra in Action
[Figure: the example graph with current cost estimates s = 0, a = 10, b = ∞, c = 5, d = ∞.]
Q = {a,b,c,d}
spSet = {s}
dist_S_To: {(a,10), (b,∞), (c,5), (d,∞)}
predecessor: {(a,s), (b,nil), (c,s), (d,nil)}
Example from CLR 2nd ed. p. 528
SSSP: Dijkstra in Action
[Figure: the example graph with current cost estimates s = 0, a = 8, b = 14, c = 5, d = 7.]
Q = {a,b,d}
spSet = {c,s}
dist_S_To: {(a,8), (b,14), (c,5), (d,7)}
predecessor: {(a,c), (b,c), (c,s), (d,c)}
Example from CLR 2nd ed. p. 528
SSSP: Dijkstra in Action
[Figure: the example graph with current cost estimates s = 0, a = 8, b = 13, c = 5, d = 7.]
Q = {a,b}
spSet = {c,d,s}
dist_S_To: {(a,8), (b,13), (c,5), (d,7)}
predecessor: {(a,c), (b,d), (c,s), (d,c)}
Example from CLR 2nd ed. p. 528
SSSP: Dijkstra in Action
[Figure: the example graph with current cost estimates s = 0, a = 8, b = 9, c = 5, d = 7.]
Q = {b}
spSet = {a,c,d,s}
dist_S_To: {(a,8), (b,9), (c,5), (d,7)}
predecessor: {(a,c), (b,a), (c,s), (d,c)}
Example from CLR 2nd ed. p. 528
SSSP: Dijkstra in Action
[Figure: the example graph with final costs s = 0, a = 8, b = 9, c = 5, d = 7.]
Q = {}
spSet = {a,b,c,d,s}
dist_S_To: {(a,8), (b,9), (c,5), (d,7)}
predecessor: {(a,c), (b,a), (c,s), (d,c)}
SSSP: How to parallelize?
• Dijkstra traverses the graph along a single route at a time,
prioritizing its traversal to the next step based on total path length
(and avoiding cycles)
• No real parallelism to be had here!
• Intuitively, we want something that “radiates” from the origin, one “edge hop distance” at a time
[Figure: source s (cost 0) surrounded by vertices whose costs are still unknown ("?").]
• Each step outwards can be done in parallel, before another iteration occurs - or we are done
• Recall our earlier discussion: Scalability depends on the algorithm, not
(just) on the problem!
SSSP: Revisiting the inductive definition
bestDistanceAndPath(v) {
if (v == source) then {
return <distance 0, path [v]>
} else {
find argmin_u (bestDistanceAndPath[u] + dist[u,v])
return <bestDistanceAndPath[u] + dist[u,v], path[u] + v>
}
}
• Dijkstra’s algorithm carefully considered each u in a way that allowed us
to prune certain points
• Instead we can look at all potential u’s for each v
• Compute iteratively, by keeping a “frontier set” of u nodes i edge-hops from the
source
SSSP: MapReduce formulation
• init:
  • For each node, node ID → <∞, -, {<succ-node-ID,edge-cost>}> (the source itself starts at distance 0)
  • The three fields are: the length of the shortest path we have found so far from the source to nodeID, the next hop on that path, and the adjacency list for nodeID
• map:
  • take node ID → <dist, next, {<succ-node-ID,edge-cost>}>
  • For each succ-node-ID:
    • emit succ-node ID → {<node ID, distance+edge-cost>}
    • This is a new path from the source to succ-node-ID that we just discovered (not necessarily shortest)
  • emit node ID → distance,{<succ-node-ID,edge-cost>}
    • Why is this necessary?
• reduce:
  • distance := min cost from a predecessor; next := that predec.
  • emit node ID → <distance, next, {<succ-node-ID,edge-cost>}>
• Repeat until no changes
• Postprocessing: Remove adjacency lists
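The formulation above can be simulated in a few lines of Python. This is a sketch under assumptions, not Hadoop code: the shuffle is a dictionary, the tags `"node"` and `"path"` are made-up markers to tell the two kinds of map output apart, and the graph is the CLR example used in the Dijkstra trace.

```python
from collections import defaultdict

INF = float("inf")

def sssp_map(node, record):
    dist, nxt, adj = record
    yield node, ("node", dist, nxt, adj)        # keep the adjacency list alive
    if dist < INF:
        for succ, cost in adj.items():
            yield succ, ("path", dist + cost, node)   # newly discovered path

def sssp_reduce(node, values):
    adj = next(v[3] for v in values if v[0] == "node")
    best = min(values, key=lambda v: v[1])      # smallest distance wins
    return best[1], best[2], adj                # same shape as the map input

def sssp(graph, source):
    state = {n: (0 if n == source else INF, None, adj) for n, adj in graph.items()}
    rounds = 0
    while True:
        groups = defaultdict(list)              # simulated shuffle
        for node, record in state.items():
            for key, value in sssp_map(node, record):
                groups[key].append(value)
        new_state = {n: sssp_reduce(n, vs) for n, vs in groups.items()}
        rounds += 1
        if new_state == state:                  # repeat until no changes
            # postprocessing: drop the adjacency lists
            return {n: (d, nxt) for n, (d, nxt, _) in state.items()}, rounds
        state = new_state

graph = {
    "s": {"a": 10, "c": 5},
    "a": {"b": 1, "c": 2},
    "b": {"d": 4},
    "c": {"a": 3, "b": 9, "d": 2},
    "d": {"b": 6, "s": 7},
}
result, rounds = sssp(graph, "s")
print(result, rounds)
```

The final distances agree with the Dijkstra trace. Without the pass-through record emitted first in `sssp_map`, the reducer would have no adjacency list to write back, and the next round could not run.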
Iteration 0: Base case
mapper: (a,<s,10>) (c,<s,5>) edges
reducer: (a,<10, ...>) (c,<5, ...>)
[Figure: the example graph with the current "wave" marked; s = 0, a = ∞, b = ∞, c = ∞, d = ∞.]
Iteration 1
mapper: (a,<s,10>) (c,<s,5>) (a,<c,8>) (c,<a,12>) (b,<a,11>) (b,<c,14>) (d,<c,7>) edges
reducer: (a,<8, ...>) (c,<5, ...>) (b,<11, ...>) (d,<7, ...>)
[Figure: the example graph with the current "wave" marked; s = 0, a = 10, b = ∞, c = 5, d = ∞.]
Iteration 2
mapper: (a,<s,10>) (c,<s,5>) (a,<c,8>) (c,<a,10>) (b,<a,9>) (b,<c,14>) (d,<c,7>) (b,<d,13>) (d,<b,15>) edges
reducer: (a,<8>) (c,<5>) (b,<9>) (d,<7>)
[Figure: the example graph with the current "wave" marked; s = 0, a = 8, b = 11, c = 5, d = 7.]
Iteration 3
mapper: (a,<s,10>) (c,<s,5>) (a,<c,8>) (c,<a,10>) (b,<a,9>) (b,<c,14>) (d,<c,7>) (b,<d,13>) (d,<b,13>) edges
reducer: (a,<8>) (c,<5>) (b,<9>) (d,<7>)
No change! Convergence!
[Figure: the example graph; s = 0, a = 8, b = 9, c = 5, d = 7.]
Question: If a vertex's path cost is the same in two consecutive rounds, can we be sure that this vertex has converged?
Summary: SSSP
• Path-based algorithms typically involve iterative map/reduce
• They are typically formulated in a way that traverses in “waves” or
“stages”, like breadth-first search
• This allows for parallelism
• They need a way to test for convergence
• Example: Single-source shortest path (SSSP)
• Original Dijkstra formulation is hard to parallelize
• But we can make it work with the "wave" approach
Simple Clustering:
k-means
Learning (clustering / classification)
• Sometimes our goal is to take a set of entities, possibly related, and group them
• If the groups are based on similarity, we call this clustering
• If the groups are based on putting them into a semantically meaningful class, we call this
classification
• Both are instances of machine learning
The k-clustering Problem
[Figure: items plotted by Expenses and Age, grouped into clusters.]
• Given: A set of items in an n-dimensional feature space
• Example: data points from survey, people in a social network
• Goal: Group the items into k “clusters”
• What would be a 'good' set of clusters?
Approach: k-Means
• Let m1, m2, …, mk be representative points for each of our k
clusters
• Specifically: the centroid of the cluster
• Initialize m1, m2, …, mk to random values in the data
• For t = 1, 2, …:
• Map each observation to the closest mean
$$S_i^{(t)} = \left\{ x_j : \left\| x_j - m_i^{(t)} \right\| \le \left\| x_j - m_{i^*}^{(t)} \right\| \ \text{for all } i^* = 1, \dots, k \right\}$$
• Assign the mi to be a new centroid for each set
$$m_i^{(t+1)} = \frac{1}{\left| S_i^{(t)} \right|} \sum_{x_j \in S_i^{(t)}} x_j$$
A simple example (1/4)
[Figure: scatter plot with axes Expenses and Age. Points: (10,10), (15,12), (11,16), (18,20), (20,21), (30,21).]
A simple example (2/4)
[Figure: the same scatter plot with two randomly chosen initial centers marked. Points: (10,10), (15,12), (11,16), (18,20), (20,21), (30,21).]
A simple example (3/4)
[Figure: the same scatter plot after the first update. Points: (10,10), (15,12), (11,16), (18,20), (20,21), (30,21). Centroids: (19.75,19.5) and (12.5,11).]
A simple example (4/4)
[Figure: the same scatter plot after the second update. Points: (10,10), (15,12), (11,16), (18,20), (20,21), (30,21). Centroids: (22.67,20.67) and (12,12.67).]
Stable!
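The example can be replayed with a short Python sketch of the k-means loop. The slides do not say which initial centers were picked; starting from the data points (11,16) and (10,10) is an assumption that happens to reproduce the centroids shown. `math.dist` requires Python 3.8 or later.

```python
import math

def kmeans(points, centers, max_iter=100):
    for _ in range(max_iter):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # update step: each center moves to the mean of its cluster
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:    # stable: nothing moved
            return centers, clusters
        centers = new_centers
    return centers, clusters

points = [(20, 21), (18, 20), (30, 21), (11, 16), (10, 10), (15, 12)]
centers, clusters = kmeans(points, [(11, 16), (10, 10)])
rounded = [tuple(round(c, 2) for c in m) for m in centers]
print(rounded)
```

With this starting point, the first update gives (19.75, 19.5) and (12.5, 11), the second gives roughly (22.67, 20.67) and (12, 12.67), and the third changes nothing.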
k-Means in MapReduce
• Map #1:
  • Input: node ID → <position, centroid ID, [centroid IDs and positions]>
  • Compute nearest centroid; emit centroid ID → <node ID, position>
• Reduce #1:
  • Recompute centroid position from positions of nodes in it
  • Emit centroidID → <node IDs, positions> and for all other centroid IDs, emit otherCentroidID → centroid(centroidID,X,Y)
    • Each centroid will need to know where all the other centroids are
• Map #2:
  • Pass through values to Reducer #2
• Reduce #2:
  • For each node in the current centroid, emit node ID → <position, centroid ID, [centroid IDs and positions]>
    • Input for the next map iteration
  • Also, emit <X, <centroid ID, position>>
    • This will be the 'result' (remember that we wanted the centroids!)
• Repeat until no change
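A simplified Python sketch of one such iteration follows. It deliberately deviates from the two-job scheme above: instead of embedding the centroid list in every record and using a second job to redistribute it, the driver passes the current centroids to every mapper call, much like a small side file. The function names, node IDs, centroid IDs, and starting centroids are assumptions; the points are those of the earlier example.

```python
import math
from collections import defaultdict

def assign_map(node_id, position, centroids):
    # emit centroid ID -> <node ID, position> for the nearest centroid
    nearest = min(centroids, key=lambda cid: math.dist(position, centroids[cid]))
    yield nearest, (node_id, position)

def centroid_reduce(centroid_id, members):
    # recompute the centroid from the positions of the nodes assigned to it
    xs, ys = zip(*(pos for _, pos in members))
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def kmeans_round(nodes, centroids):
    groups = defaultdict(list)                 # simulated shuffle
    for node_id, position in nodes.items():
        for key, value in assign_map(node_id, position, centroids):
            groups[key].append(value)
    updated = dict(centroids)                  # empty centroids keep their position
    for cid, members in groups.items():
        updated[cid] = centroid_reduce(cid, members)
    return updated

nodes = {1: (20, 21), 2: (18, 20), 3: (30, 21), 4: (11, 16), 5: (10, 10), 6: (15, 12)}
centroids = {"A": (11, 16), "B": (10, 10)}
for _ in range(100):                           # repeat until no change
    updated = kmeans_round(nodes, centroids)
    if updated == centroids:
        break
    centroids = updated
final = {cid: tuple(round(c, 2) for c in pos) for cid, pos in centroids.items()}
print(final)
```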
Simple Classification:
Naïve Bayes
Classification
• Suppose we want to learn what is spam (or interesting, or …)
• Predefine a set of classes with semantic meaning
• Train an algorithm to look at data and assign a class
  • Based on giving it some examples of data in each class
  • … and the sets of features they have
• Many probabilistic techniques exist
• Each class has probabilistic relationships with others
  • e.g., p (spam | isSentLocally), p (isSentLocally | fromBob), …
  • Typically represented as a graph(ical model)! See CIS 520
• But we’ll focus on a simple, “flat” model: Naïve Bayes
A simple example
• Suppose we just look at the keywords in the email's title:
Message(1, “Won contract”)
Message(2, “Won award”)
Message(3, "Won the lottery")
Message(4, “Unsubscribe”)
Message(5, "Millions of customers")
Message(6, "Millions of dollars")
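• What is the probability that the message "Won Millions" is spam?
$$p(\text{spam} \mid \text{containsWon}, \text{containsMillions}) = \frac{p(\text{spam}) \, p(\text{containsWon}, \text{containsMillions} \mid \text{spam})}{p(\text{containsWon}, \text{containsMillions})} \quad \text{(Bayes' Theorem)}$$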
Classification using Naïve Bayes
• Basic assumption: Probabilities of events are independent
  • This is why it is called 'naïve'
• Under this assumption,
$$\frac{p(\text{spam}) \, p(\text{containsWon}, \text{containsMillions} \mid \text{spam})}{p(\text{containsWon}, \text{containsMillions})} = \frac{p(\text{spam}) \, p(\text{containsWon} \mid \text{spam}) \, p(\text{containsMillions} \mid \text{spam})}{p(\text{containsWon}) \, p(\text{containsMillions})}$$
= 0.5 * 0.67 * 0.33 / (0.5 * 0.33) = 0.67
• So how do we “train” a learner (compute the above probabilities) using MapReduce?
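The arithmetic above can be checked with exact fractions. The values for p(containsWon) and p(containsMillions) follow from the six message titles; reading 0.5, 0.67 and 0.33 as 3/6, 2/3 and 1/3 is an assumption about the underlying counts, since the spam labels themselves are not listed in the text.

```python
from fractions import Fraction

p_spam = Fraction(3, 6)                  # 0.5
p_won_given_spam = Fraction(2, 3)        # 0.67
p_millions_given_spam = Fraction(1, 3)   # 0.33
p_won = Fraction(3, 6)                   # messages 1-3 contain "Won"
p_millions = Fraction(2, 6)              # messages 5-6 contain "Millions"

posterior = (p_spam * p_won_given_spam * p_millions_given_spam) / (p_won * p_millions)
print(posterior, round(float(posterior), 2))
```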
What do we need to train the learner?
• p(spam)
  • Count how many spam emails there are (easy)
  • Count total number of emails (easy)
• p(containsXYZ | spam)
  • Count how many spam emails contain XYZ (MapReduce pass 1)
  • Count how many spam emails there are (easy)
• p(containsXYZ)
  • Count how many emails contain XYZ overall (MapReduce pass 2)
  • Count total number of emails (easy)
Training a Naïve Bayes Learner
• map 1:
  • takes messageId → <class, {words}>
  • emits <word, class> → 1
• reduce 1:
  • emits <word, class> → <count>
  • Count how many emails in the class contain the word (modified WordCount)
• map 2:
  • takes messageId → <class, {words}>
  • emits word → 1
• reduce 2:
  • emits word → <totalCount>
  • Count how many emails contain the word overall (WordCount)
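The two passes can be sketched in Python as follows. The class labels are an assumption: the text does not say which of the six messages are spam, so the labels below were chosen to reproduce the probabilities of the worked example. `count_reduce` is a hypothetical helper that stands in for the shuffle plus a summing reducer.

```python
from collections import Counter
from fractions import Fraction

# messageId -> <class, {words}>; the labels are assumed, not given in the text
messages = {
    1: ("ham",  {"won", "contract"}),
    2: ("spam", {"won", "award"}),
    3: ("spam", {"won", "the", "lottery"}),
    4: ("ham",  {"unsubscribe"}),
    5: ("ham",  {"millions", "of", "customers"}),
    6: ("spam", {"millions", "of", "dollars"}),
}

def map1(message_id, record):
    cls, words = record
    for w in words:
        yield (w, cls), 1          # emits <word, class> -> 1

def map2(message_id, record):
    _, words = record
    for w in words:
        yield w, 1                 # emits word -> 1

def count_reduce(pairs):
    totals = Counter()
    for key, one in pairs:
        totals[key] += one         # sum the 1s per key (WordCount style)
    return totals

word_class_counts = count_reduce(kv for m, r in messages.items() for kv in map1(m, r))
word_counts = count_reduce(kv for m, r in messages.items() for kv in map2(m, r))
class_counts = Counter(cls for cls, _ in messages.values())   # the "easy" counts
total = len(messages)

def p_spam_given(words):
    p = Fraction(class_counts["spam"], total)
    for w in words:
        p *= Fraction(word_class_counts[(w, "spam")], class_counts["spam"])
        p /= Fraction(word_counts[w], total)
    return p

print(p_spam_given(["won", "millions"]))
```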
Summary: Learning and MapReduce
• Clustering algorithms typically have multiple aggregation stages or iterations
• k-means clustering repeatedly computes centroids,
maps items to them
• Fixpoint computation
• Classification algorithms can be quite complex
• In general: need to capture conditional probabilities
• Naïve Bayes assumes everything is independent
• Training is a matter of computing probability distribution
  • Can be accomplished using two Map/Reduce passes
Stay tuned
Next time you will learn about:
PageRank and Adsorption