CS257_125_Ch.15.8 - Department of Computer Science

15.8 Algorithms using more than two passes
Presented By: Seungbeom Ma (ID 125)
Professor: Dr. T. Y. Lin
Computer Science Department
San Jose State University
Multipass Algorithms
 Previously , most of algorithms are required two passes.
 There is a case that we need more than two passes.
 Case : Data is too big to store in main memory.
 We have to hash or sort the relation with multipass algorithms.
 1. Multipass Sort-Based Algorithm
 2. Multipass Hash-Based Algorithm
Multipass sort-based algorithm.
 M: Number of Memory Buffers
 R: Relation
 B(R) : Number of blocks for holding relation.
 1. If R fits in M block (B (R) <= M).
 2. Reading R into main memory.
 3. Sorting R in the main memory with any sorting algorithm.
 4. Write the sorted relation to disk.
Multipass sort-based algorithm.
 1. If R does not fit into main memory then partitioning the
blocks hold R into M groups, which call R1, R2, …, RM
 2.Recursively sorting Ri from i =1 to M
 3.Once sorting is done, the algorithm merges the M sorted sublists.
Performance: Multipass Sort-Based Algorithms
1) Each pass of a sorting algorithm:
1.Reading data from the disk.
2. Sorting data with any sorting algorithms
3. Writing data back to the disk.
2-1) (k)-pass sorting algorithm needs
2k B(R) disk I/O’s
2-2)To calculate (Multipass)-pass sorting algorithm needs
= > A+ B
A: 2(K-1 ) (B(R) + B(S) ) [ disk I/O operation to sort the sublists]
B: B(R) + B(S)[ disk I/O operation to read the sorted the sublists in the
final pass]
Total: (2k-1)(B(R)+B(S)) disk I/O’s
Multipass Hash-Based Algorithms
1. Hashing the relations into M-1 buckets, where M is number of memory buffers.
2. Unary case:
It applies the operation to each bucket individually.
1.Duplicate elimination (δ) and grouping (γ).
1) Grouping: Min, Max, Count , Sum , AVG , which can group the data in the table
2) Duplicate elimination: Distinct
If the relation fits in M memory block,
-> Reading relation into memory and perform the operations.
3. Binary case: It applies the operation to each corresponding pair of buckets.
Query operations: union, intersection, difference , and join
If either relations fits in M-1 memory blocks,
-> Reading that relation into main memory M-1 blocks
-> Reading next relation to 1 block at a time into the Mth block
Then performing the operations.
 If Unary and Binary relation does not fit into the main memory
Hashing each relation into M-1 buckets.
Recursively performing the operation on each bucket or
corresponding pair of buffers.
Accumulating the output from each buckets or pair.
Hash-Based Algorithms : Unary Operatiors
Perfermance: Hash-Based Algorithms
 R: Realtion.
 Operations are like δ and γ
 M: Buffers
 U(M, k): Number of blocks in largest relation with k-pass hashing
Performance: Induction
1. Assuming that the first step divides relation R into M-1 equal buckets.
2. The buckets for the next pass must be small enough to handle in k-1
3.Since R is divided into M-1 buckets , we need to have (M-1)u(M, k-1).
Sort-Based VS Hash-Based
1. Sort-based can produce output in sorted order. It might be
helpful to reduce rotational latency or seek time
2. Hash-based depends on buckets being of equal size. For
binary operations, hash-based only limits size of smaller
relation. Therefore, hash-based can be faster than sort-based
for small size of relation.

similar documents