Hierarchical data - University of Pennsylvania

Report
NETS 212: Scalable and Cloud Computing
Hierarchical data
November 21, 2013
© 2013 A. Haeberlen, Z. Ives
University of Pennsylvania
1
Announcements

How is the PennBook project going?



Second midterm exam on December 10th



Code is (technically) due on December 10th
Getting Started guide is now available (see Piazza post)
Comparable to first midterm (open-book, closed-Google, ...)
Covers all material up to, and including, December 5 lecture
Please schedule a check/review session!!!


© 2013 A. Haeberlen, Z. Ives
Each team should already have contacted their grader
If you have not done this yet, please do so right after class!
University of Pennsylvania
2
Data = Records… or not?

To this point, we have assumed record-oriented
data throughout most of the semester:





(key, value) pairs
(from, to) edges
Comma or tab-delimited files
Relational tuples
Except implicitly:



objects with collection-typed member variables
sets of objects to be reduced
XML with repeated sub-elements (e.g., in AJAX)
3
© 2013 A. Haeberlen, Z. Ives
Goals for today: Beyond relations!


Representational capacity
Pig Latin



NEXT
Language and examples
Implementation
XQuery


© 2013 A. Haeberlen, Z. Ives
Basic constructs
Applicability to distributed settings
University of Pennsylvania
4
Representational capacity

We know that a Turing-complete language just
needs:


What does a complete data representation need?



sequence, selection, recursion (or iteration)
Typed binary relations: pred(src,target)
OR equivalently, triples: edge(src,type,target)
But in general, it’s not convenient to be limited
to these!
5
© 2013 A. Haeberlen, Z. Ives
Towards Pig #1: Beyond relations?

The relational data model allows us to have
arbitrary numbers of relations


But: No nested tables!



Each with its own schema that includes arbitrary numbers of
attributes
These would be converted into multiple tables by 1NF
normalization
Hence SQL has no nested collections at all, (sets, lists, bags...)
Can we add support for these?
6
© 2013 A. Haeberlen, Z. Ives
Towards Pig #2: Programming model

Hadoop MapReduce:









file-oriented, procedural
opacity
regularized “pipeline” – map, combine, shuffle, reduce
custom code
arbitrary Java functions at each step
even for very
SQL:

rigid dataflow
common operations
random access-storage-oriented (DBMS controls storage)
compositional, tuple-collection-oriented query model
what about
declarative queries are automatically optimized
"procedural
programmers"?
can accommodate Java functions, but not naturally
Hive: SQL queries  file-oriented Map/Reduce
Is there something in between?


Declarative is nice, but many data analysts are 'entrenched'
procedural programmers...
Pig and Pig Latin!
7
© 2013 A. Haeberlen, Z. Ives
Goals for today: Beyond relations!


Representational capacity
Pig Latin
NEXT



Language and examples
Implementation
XQuery


© 2013 A. Haeberlen, Z. Ives
Basic constructs
Applicability to distributed settings
University of Pennsylvania
8
Pig Latin and Pig

Pig Latin: a compositional, collectionsoriented dataflow language


Oriented towards parallel data processing & analysis
Think of it as a more procedural SQL-like language with
nested collections



By Chris Olston et al. at Yahoo! Research


Emphasizes user-defined functions, esp. those that have nice algebraic
properties (unlike SQL)
Supports external data from files (like Hive)
http://www.tomkinshome.com/site_media/papers/papers/ORS+08.pdf
Pig: the runtime system
9
© 2013 A. Haeberlen, Z. Ives
Pig Latin: Basic constructs

Collection-valued expressions whose results get
assigned to variables


A program does a series of assignments in a dataflow
It gets compiled down to a sequence of MapReduces


Similar to Hive, but Pig Latin has its own query language (not SQL)
Basic SQL-like operations are explicitly specified:







load … as
Remapping: foreach … generate
Filtering: filter by
Intersecting: join
Aggregating: group by
Sorting: order
store
[HDFS scan]
[Map]
[Map]
[Reduce]
[Reduce]
[Shuffle]
[HDFS store]
10
© 2013 A. Haeberlen, Z. Ives
Simple example: Face detection

Each expression creates a named collection



load collections from files
process them (e.g., per tuple) using a user-defined function
store the results into files
I = load ‘/mydata/images’ using ImageParser()
as (id, image);
F = foreach I generate id, detectFaces(image);
store F into ‘/mydata/faces’;
11
© 2013 A. Haeberlen, Z. Ives
Example: Session classification

Goal: Find web sessions that end on the 'best' page
(i.e., the page with the highest PageRank)

We need to join two tables, and then compare the final rank in the
sequence to the other ranks
Visits
Pages
User
URL
Time
URL
Alice
www.cnn.com
7:00
www.cnn.com
0.9
Alice
www.digg.com
7:20
www.flickr.com
0.9
Alice
www.social.com
10:00
www.social.com
0.7
Alice
www.flickr.com
10:05
www.digg.com
0.2
Joe
www.cnn.com/index.htm 12:00
...
...
© 2013 A. Haeberlen, Z. Ives
PageRank
12
The computation in Pig Latin
Visits = load ‘/data/visits’ as (user, url, time);
Visits = foreach Visits generate user,
Canonicalize(url), time;
Pages = load ‘/data/pages’ as (url, pagerank);
VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
Sessions = foreach UserVisits generate
flatten(FindSessions(*));
HappyEndings = filter Sessions by BestIsLast(*);
store HappyEndings into '/data/happy_endings';
13
© 2013 A. Haeberlen, Z. Ives
What does this query compile to?

Parallel evaluation is really a
Map-Map/Reduce/Reduce chain:





⋈
⋈
⋈
⋈
⋈
Parallel group-by () session /
choose best
Parallel joins (⋈)
…
…
© 2013 A. Haeberlen, Z. Ives
Visit lists (filesystem)
Rank lists (filesystem)
Pig Latin features

Record-oriented transformations
Can work over, create nested collections
(Resembles Nested Relational variants of SQL)




Basic operators expose parallelism; userdefined operators may not
Operations are explicit, not declarative
Unlike SQL

•
operators:
• FILTER
FOREACH … GENERATE
• GROUP
binary operators:
• JOIN
• COGROUP
• UNION
15
© 2013 A. Haeberlen, Z. Ives
Nesting: COGROUP & FLATTEN

Cogrouping: nesting groups into columns

Flattening: unnesting groups
16
© 2013 A. Haeberlen, Z. Ives
Pig Latin vs. MapReduce

MapReduce combines 3 primitives:
process records  create groups  process groups
a = FOREACH input GENERATE flatten(Map(*));
b = GROUP a BY $0;
c = FOREACH b GENERATE Reduce(*);

In Pig, these primitives are:
– explicit
– independent
– fully composable
© 2013 A. Haeberlen, Z. Ives

Pig adds primitives for:
– filtering tables
– projecting tables
– combining 2 or more tables
Recap: Pig Latin

A dataflow language that compiles to
MapReduce



Borrows many of the elements of SQL, but eliminates the
reliance on declarative optimization
Incorporates primitives for nested collections
Quite successful:


As of 2008: 25% of Yahoo Map/Reduce jobs from Pig
Part of the Hadoop standard distribution
18
© 2013 A. Haeberlen, Z. Ives
Pig system implementation

Let’s briefly look at the Pig implementation, and how it can do a bit
more because of the higher-level language:
parsed
program
Parser
Pig compiler
Pig Latin
program
cross-job
optimizer
execution
plan
MapReduce
jobs
© 2013 A. Haeberlen, Z. Ives
User
output
f( )
MR Compiler
join
MapReduce
filter
cluster
X
Y
Key issue: Minimizing redundancy

Popular tables



Popular transformations




web crawl
search log
eliminate spam pages
group pages by host
join web crawl with search log
Goal: Minimize redundant work
© 2013 A. Haeberlen, Z. Ives
Work-sharing techniques
execute similar
jobs together
cache data
transformations
cache data moves
Job 2
Job 1
Job 2
Job 1
Join
A&B
Op3
Op2
Op1
A
© 2013 A. Haeberlen, Z. Ives
A
Op2
Op1
A
A
A
D
B
C
Worker 1 Worker 2
Recap: Pig and Pig Latin




Somewhere between a programming
language and a DBMS
Allows distributed programming with explicit
parallel dataflow operators
Supports explicit management of nested
collections
Runtime system does caching and batching
22
© 2013 A. Haeberlen, Z. Ives
Goals for today: Beyond relations!


Representational capacity
Pig Latin



Language and examples
Implementation
XQuery


© 2013 A. Haeberlen, Z. Ives
NEXT
Basic constructs
Applicability to distributed settings
University of Pennsylvania
23
One step further


Pig Latin gives us nested collections
Suppose we want to take a single hierarchical
data model for everything!


XML, perhaps?
A functional language for XML: XQuery
24
© 2013 A. Haeberlen, Z. Ives
XQuery: Basic form



Has an analogous form to SQL’s
SELECT..FROM..WHERE..GROUP BY..ORDER BY
Builds upon XPath expressions (traversals from the root)
The model:




Bind nodes (or node sets) to variables
Operate over each legal combination of bindings
Produce a set of nodes
“FLWOR” statement [note case sensitivity!]:
for {iterators that bind variables}
let {collections}
where {conditions}
order by {order-conditions}
return {output constructor}
25
© 2013 A. Haeberlen, Z. Ives
Example XML Data
Root
mastersthesis
1992…
article
element
university
key
author title year school
PRPL…
name loc
1982
Bob S….
key
year
PLDI82
WISC
PL…
WISC
proceedings
author title year conf
1992
Kurt P….
p-i
text
key
ms/Brown92
attribute
dblp
?xml
mdate
root
USA
Wisconsin
1999
PLDI82
26
© 2013 A. Haeberlen, Z. Ives
XQuery: "Iterations"

A series of (possibly nested) FOR statements assigning the
results of XPaths to variables
for $root in document(“http://my.org/dblp.xml”)
for $sub in $root/dblp,
$sub2 in $sub/mastersthesis, …



Something like a template that pattern-matches, produces a
“binding tuple”
For each of these, we evaluate the where and possibly output
the return template
document() or doc() function specifies an input file as a URI
27
© 2013 A. Haeberlen, Z. Ives
Two XQuery examples
<root-tag> {
for $p in document(“dblp.xml”)/dblp/proceedings,
$yr in $p/yr
where $yr = “1999”
return <proc> {$p} </proc>
} </root-tag>
for $i in document(“dblp.xml”)/dblp/mastersthesis[author/text()
= "John Doe"]
return <johns-theses>
<title>{ $i/title/text() }</title>
<key>{ [email protected] }</key>
</johns-theses>
28
© 2013 A. Haeberlen, Z. Ives
XQuery: Nesting

Nesting XML trees is perhaps the most common
operation

In XQuery, this is easy – put a subquery in the return clause
where you want things to repeat!
for $u in document(“dblp.xml”)/dblp/university
where $u/loc = “USA”
return <ms-theses-99>
{ $u/name } {
for $mt in $u/../mastersthesis
where $mt/year/text() = “1999”
and $mt/school = [email protected]
return $mt/title }
</ms-theses-99>
© 2013 A. Haeberlen, Z. Ives
29
XQuery: Collections & Aggregation

In XQuery, many operations return collections



XPaths, sub-XQueries, functions over these, …
The let clause assigns the results to a variable
Aggregation simply applies a function over a
collection, where the function returns a value
let $allpapers := document(“dblp.xml”)/dblp/*[author]
return <article-author-counts>
<count> { fn:count(fn:distinct-values($allpapers/author)) }
</count>
{ for $paper in allPapers
let $pauth := $paper/author
return <paper> {$paper/title}
<count> { fn:count($pauth) } </count>
</paper>
} </article-author-counts>
© 2013 A. Haeberlen, Z. Ives
30
XQuery: Defining and querying metadata

Can get a node’s name by querying node-name():
for $x in document(“dblp.xml”)/dblp/*
return node-name($x)

Can construct elements and attributes using
computed constructors:
for $x in document(“dblp.xml”)/dblp/*,
$year in $x/year,
element name {
attribute name
$title in $x/title/text(),
element node-name($x) {
attribute {“year-” + $year} { $title }
}

contents }
{ value }
Can't do this in SQL!
31
© 2013 A. Haeberlen, Z. Ives
XQuery: Functions
A function can return arbitrary XML…
calls
declare function V() as element(content)* {
for $r in doc(“R”)/root/tree,
$a in $r/a, $b in $r/b, $c in $r/c
where $a = “123”
return <content>{$a, $b, $c}</content>
}
declare function fact($n as xs:integer) as
xs:integer {
if ($n > 1) then $n * fact($n – 1)
else 1
}
It can be queried over like a
document…
for $v in V()/content,
$r in doc(“r”)/root/tree
where $v/b = $r/b
return
<result>{$v, fact(1)}
</result>
calls
32
© 2013 A. Haeberlen, Z. Ives
Summary: XQuery

Flexible and powerful language for XML



Supports arbitrary hierarchy, and conversion between
different structures
Turing-complete: has been proposed as Web programming
language
Turing-complete, but…

No “cloud” XQuery engines yet, though all major commercial
DBMSs support centralized XQuery


It’s harder to parallelize hierarchical data!
A worthy alternative to SQL/Pig Latin?
33
© 2013 A. Haeberlen, Z. Ives
http://www.flickr.com/photos/timbodon/2328484294/
Stay tuned
Next time you will learn about:
Peer-to-peer systems
© 2013 A. Haeberlen, Z. Ives
University of Pennsylvania
34

similar documents