Programming Parallel N-Body Codes with the BSP Model

Report
Part IV - Advanced Techniques
- High-performance crawling
- Recrawling and focused crawling
- Link-based ranking (Pagerank, HITS)
- Structural analysis of the web graph
- Optimizing query execution
- Parallel search engines and scaling
- Meta search engines
- Document clustering and duplicate detection
1- High-performance crawling:
100 million pages or more
• experience from research project at Poly
• need high-performance crawler: >100 pages/sec
• robust operation over several weeks
- crawler will crash
- system will have to be modified
• controlled operation
- other users on campus
- remember the 30-second rule
- things will happen
Networking performance
• server/DNS/network latency
(0.2-1 seconds minimum)
• must open hundreds of connections simultaneously
• 150 pages = 2 MB/s = 40% of max. T-3 capacity
• DNS can become bottleneck
• 10-20% additional accesses for robots.txt
• data must be streamed to disk
• OS limits and overheads: networking, files
Crawler Architectures
• Google crawler
“backRub” (see WWW’98 paper)
- python downloaders on several machines
- up to 100 pages per second
• Internet Archive Crawler
• Mercator/Atrax (DEC/Altavista)
(see Mercator paper)
- 2 GB machine with RAID
- implemented in Java
(many performance issues)
- up to 200 pages/sec
- detailed discussion of data structure size
- optimized DNS caching
• PolyBot (Shkapenyuk/Suel
ICDE 2002)
Polybot crawler:
(ICDE 2002, with V. Shkapenyuk)
• distributed implementation in C++
• manager handles several request streams with priorities
• manager handles DNS, exclusion, and frontier
• 300 pages/sec (and more)
Crawling Strategy and Download Rate:
• crawling strategy: “What page to download next?”
• download rate: “How many pages per second?”
• different scenarios require different strategies
• lots of recent work on crawling strategy
• little published work on optimizing download rate
(main exception: Mercator from DEC/Compaq/HP?)
• somewhat separate issues
• building a slow crawler is (fairly) easy ...
System Requirements:
• flexibility (different crawling strategies)
• scalabilty (high performance at low cost)
• robustness
(odd server content/behavior, crashes)
• crawling etiquette and speed control
(robot exclusion, 30 second intervals, domain level
throttling, speed control for other users)
• manageable and reconfigurable
(interface for statistics and control, system setup)
PolyBot System Architecture
Structure:
• separation of crawling strategy and basic system
• collection of scalable distributed services
(DNS, downloading, scheduling, strategy)
• for clusters and wide-area distributed
• optimized per-node performance
• no random disk accesses (no per-page access)
Basic Architecture, revisited:
• application issues
requests to manager
• manager does DNS
and robot exclusion
• manager schedules
URL on downloader
• downloader gets file
and puts it on disk
• application is notified
of new files
• application parses new
files for hyperlinks
• application sends data
to storage component
(indexing done later)
System components:
• downloader: optimized HTTP client written in Python
(everything else in C++)
• DNS resolver: uses asynchronous DNS library
• manager uses Berkeley DB and STL for external and
internal data structures
• manager does robot exclusion by generating requests
to downloaders and parsing files
• application does parsing and handling of URLs
(has this page already been downloaded?)
Scaling the system:
• small system on previous pages:
3-5 workstations and 250-400 pages/sec peak
• can scale up by adding downloaders and DNS resolvers
• at 400-600 pages/sec, application becomes bottleneck
• at 8 downloaders manager becomes bottleneck
need to replicate application and manager
• hash-based technique (Internet Archive crawler)
partitions URLs and hosts among application parts
• data transfer via file system (NFS)
Scaling up:
• 20 machines
• 1500 pages/s?
• depends on
crawl strategy
• hash to nodes
based on site
(b/c robot ex)
Data Structures and Techniques
Crawling Application
• parsing using pcre library
• NFS eventually bottleneck
• URL-seen problem:
- need to check if file has been parsed or downloaded before
- after 20 million pages, we have “seen” over 100 million URLs
- each URL is 50 to 75 bytes on average
• Options: compress URLs in main memory, or use disk
- prefix+huffman coding (DEC, JY01) or Bloom Filter (Archive)
- disk access with caching (Mercator)
- we use lazy/bulk operations on disk
• Implementation of URL-seen check:
- while less than a few million URLs seen, keep in main memory
- then write URLs to file in alphabetic, prefix-compressed order
- collect new URLs in memory and periodically reform bulk
check by merging new URLs into the file on disk
• When is a newly a parsed URL downloaded?
• Reordering request stream
- want to space ot requests from same subdomain
- needed due to load on small domains and due to security tools
- sort URLs with hostname reversed (e.g., com.amazon.www),
and then “unshuffle” the stream
provable load balance
Challenges and Techniques: Manager
• large stream of incoming URL request files
• goal: schedule URLs roughly in the order that they
come, while observing time-out rule (30 seconds)
and maintaining high speed
• must do DNS and robot excl. “right before”download
• keep requests on disk as long as possible!
- otherwise, structures grow too large after few million pages
(performance killer)
Manager Data Structures:
• when to insert new URLs into internal structures?
URL Loading Policy
• read new request file from disk whenever less than x
hosts in ready queue
• choose x > speed * timeout (e.g., 100 pages/sec * 30 sec)
• # of current host data structures is
x + speed * timeout + n_down + n_transit
which is usually < 2x
• nice behavior for BDB caching policy
• performs reordering only when necessary!
Experimental Results
• crawl of 120 million pages over 19 days
161 million HTTP request
16 million robots.txt requests
138 million successful non-robots requests
17 million HTTP errors (401, 403, 404 etc)
121 million pages retrieved
• slow during day, fast at night
• many downtimes due to attacks, crashes, revisions
• “slow tail” of requests at the end (4 days)
Experimental Results ctd.
bytes in
bytes out
Poly T3 connection over 24 hours of 5/28/01
(courtesy of AppliedTheory)
frames out
Experimental Results ctd.
• sustaining performance:
- will find out when data structures hit disk
- I/O-efficiency vital
• speed control tricky
- vary number of connections based on feedback
- also upper bound on connections
- complicated interactions in system
- not clear what we should want
• other configuration: 140 pages/sec sustained
on 2 Ultra10 with 60GB EIDE and 1GB/768MB
• similar for Linux on Intel
Other Work on Parallel Crawlers:
• Atrax: recent distributed extension to Mercator
- combines several Mercators
- URL hashing, and off-line URL check (as we do)
• Ubicrawler, Nutch crawler
• P2P crawlers (grub.org and others)
• Cho/Garcia-Molina (WWW 2002)
- study of overhead/quality tradeoff in paral. crawlers
- difference: we scale services separately, and focus on
single-node performance
- in our experience, parallel overhead low
2 - Optimized Query Processing:
How to optimize query throughput in large search engines.
The ranking function is often a combination of term-based
ranking and a global ordering such as Pagerank”
Approaches:
• index compression: use fast and good codes to decrease index size
• list caching: keep inverted lists in main memory as much as possible
• result caching: if same query has recently been issued, use result
• query pruning: compute best k results without scanning entire lists
Query processing in search engines
(single node case)
• inverted index
- a data structure for supporting text queries
- like index in a book
indexing
disks with
documents
• Boolean queries:
aalborg
.
.
.
.
.
arm
armada
armadillo
armani
.
.
.
.
.
zz
(zebra AND armadillo) OR alligator
unions/intersections of lists
3452, 11437, …..
4, 19, 29, 98, 143, ...
145, 457, 789, ...
678, 2134, 3970, ...
90, 256, 372, 511, ...
602, 1189, 3209, ...
inverted index
Ranking in search engines:
• scoring function: assigns score to each document with respect
to a given query
• top-k queries: return k documents with highest scores
• example cosine measure for query with terms t 0 to t m-1
• can be implemented by computing score for all documents
that contain any of the query words (union of inverted lists)
• in case of search engines: often intersection instead of union
• in large collections, lists are many MB for average queries
Combining Term-Based Methods and Pagerank
• recall the cosine measure:
• Pagerank assigns a fixed score to each page
• naïve way of integrating Pagerank value:
(independent of query)
• works for any global ordering of page scores (e.g., based on traffic)
• but some more details remain
Using Pagerank in ranking:
• how to combine/add pagerank score and cosine? (addition)
• use PR or log(PR) ?
• normalize using mean of top-100 in list (Richardson/Domingo)
Query Processing in Parallel Search Engines
• low-cost cluster architecture (usually with additional replication)
Cluster with global
index organization
query
broadcasts each query
integrator and combines the results
LAN
index
index
index
index
index
pages
pages
pages
pages
pages
• local index: every node stores and indexes subset of pages
• every query broadcast to all nodes by query integrator (QI)
• every node supplies top-10, and QI computes global top-10
• note: we don’t really need top-10 from all, maybe only top-2
Related Work on top-k Queries
• IR: optimized evaluation of cosine measures (since 1980s)
• DB: top-k queries for multimedia databases (Fagin 1996)
• does not consider combinations of term-based and global scores
• Brin/Page 1998: fancy lists in Google
Related Work (IR)
• basic idea: “presort entries in each inverted list by contribution to cosine”
• also process inverted lists from shortest to longest list
• various schemes, either reliable or probabilistic
• most closely related:
- Persin/Zobel/Sacks-Davis 1993/96
- Anh/Moffat 1998, Anh/deKretzer/Moffat 2001
• typical assumptions: many keywords/query, OR semantics
Related Work (DB)
(Fagin 1996 and others)
• motivation: searching multimedia objects by several criteria
• typical assumptions: few attributes, OR semantics, random access
• FA (Fagin’s algorithm), TA (Threshold algorithm), others
• formal bounds:
N
m 1
m
k
1
m
for k lists if lists independent
• term-based ranking: presort each list by contribution to cosine
Related Work (Google)
(Brin/Page 1998)
• “fancy lists” optimization in Google
• create extra shorter inverted list for “fancy matches”
(matches that occur in URL, anchor text, title, bold face, etc.)
chair
fancy list
rest of list with other matches
table
fancy list
rest of list with other matches
• note: fancy matches can be modeled by higher
weights in the term-based vector space model
• no details given or numbers published
Throughput and Latency for 16 Nodes
• top-10 queries on 16 machines with 120 million pages
• up to 10 queries/sec with reliable pruning
• up to 20 queries per second with first-30 scheme
Note: reliable pruning not implemented in purely incremental manner
3 - Link-based ranking techniques
• Ragerank (Brin&Page/Google)
“significance of a page depends on
significance of those referencing it”
• HITS (Kleinberg/IBM)
“Hubs and Authorities”
0.2
Pagerank
1/2
1/2
1
0.2
0.2
1/2
1/2
1/2
0.2
1
1/2
0.2
• initialize the rank value of each node to 1/n
(0.2 for 5 nodes)
• a node with k outgoing links transmits a 1/k fraction of
its current rank value over that edge to its neighbor
• iterate this process many times until it converges
• NOTE: this is a random walk on the link graph
• Pagerank: stationary distribution of this random walk
•Attempts to computationally determine hubs and
authorities on a particular topic through analysis of a
relevant subgraph of the web.
• Based on mutually recursive facts:
Hubs point to lots of authorities.
Authorities are pointed to by lots of hubs.
Hubs
Authorities
0.2
0.2
1/2
1/2
1/2
1
0.2
0.2
1/2
1
0.3
0.1
1/2
1/2
1/2
1/2
1/2
1/2
1
1/2
0.2
0.2
0.2
1
1/2
0.2
0.3
1/2
0.286
1/2
1
0.25
1/2
0.1
1/2
..
1/2
1
0.286
0.143
1/2
1/2
1/2
0.2
1/2
1
1/2
1/2
0.15
0.143
1
1/2
0.143
Convergence
Leak
• dealing with leaks:
- pruning, or
- adding back links
• dealing with sinks
- add a random jump to escape sink
- with prob. b = 0.15 jump to random node
• assume: if in a leak, always take random jump
• in this case, always converges to unique solution!
Sink
Matrix notation
A
0.2
1/2
1/2
1
0.2
0.2
1/2
1/2
1/2
0.2
1
1/2
0.2
0 1/2
0
0
1/2
0
0
1/2 0
1/2
0
0
0
0
0
1/2 0
1
0
0
• stationary distribution: vector
x
with
1
0
0
1/2
0
xA = x
• A is primitive, and x Eigenvector of A
• computed using Jacobi or Gauss Seidel iteration
(4) Mathematical Techniques
Other iterative techniques for link analysis
• “HITS” (Jon Kleinberg)
• query-dependent: first get 100 docs using term-based techniques
• build a subgraph on these nodes and their neighbors
• run iterative process on this subgraph
• each node has hub score and authority score
(4) Mathematical Techniques
• “HITS” (Jon Kleinberg)
•Authorities are pages that are recognized as
providing significant, trustworthy, and useful
information on a topic.
•In-degree (number of pointers to a page) is one
simple measure of authority.However in-degree treats
all links as equal.
•Hubs are index pages that provide lots of useful
links to relevant content pages (topic authorities).
(4) Mathematical Techniques
• For a query Q, let the set of documents returned by a standard
search engine be called the root set R.
• Initialize base set S to R.
• Add to S all pages pointed to by any page in R.
• Add to S all pages that point to any page in R.
S
R
(4) Mathematical Techniques
•To limit computational expense:
– Limit number of root pages to the top 100 pages retrieved
for the query.
•To eliminate purely navigational links:
– Eliminate links between two pages on the same host.
•To eliminate “non-authority-conveying” links:
– Allow only m (m  48) pages from a given host as
pointers to any individual page.
ap 
h
q:q p
q
hp 
a
q: pq
q
(4) Mathematical Techniques
Pagerank vs. HITS
• Pagerank does not attempt to capture the distinction
between hubs and authorities, but ranks just by authority
• Pagerank is applied to the entire web rather than a local
neighborhood of result pages
• HITS is sensitive to local topology
• HITS is noisy!
3
1
3
1
5
2
4
2
4
Authority scores of nodes 2 and 4 is lost, and node 4’s authority score vanishes
to zero compared to those of nodes 2 and 5.
(4) Mathematical Techniques
Other iterative techniques for link analysis
• many other techniques based on techniques such as
markov chains, SVD, principal component analysis
• web graph result of billions of decisions by millions
of independent web page creators (or not!)
• data mining of web graphs and web usage patterns
• techniques not really new
- analysis of social networks (Katz 1953)
- citation analysis (Garfield)
• security-related applications: call pattern analysis

similar documents