Report

15.8 Algorithms using more than two passes Presented By: Seungbeom Ma (ID 125) Professor: Dr. T. Y. Lin Computer Science Department San Jose State University Multipass Algorithms Previously , most of algorithms are required two passes. There is a case that we need more than two passes. Case : Data is too big to store in main memory. We have to hash or sort the relation with multipass algorithms. Agenda 1. Multipass Sort-Based Algorithm 2. Multipass Hash-Based Algorithm Multipass sort-based algorithm. M: Number of Memory Buffers R: Relation B(R) : Number of blocks for holding relation. BASIS: 1. If R fits in M block (B (R) <= M). 2. Reading R into main memory. 3. Sorting R in the main memory with any sorting algorithm. 4. Write the sorted relation to disk. Multipass sort-based algorithm. INDUCTION: (B(R)> M) 1. If R does not fit into main memory then partitioning the blocks hold R into M groups, which call R1, R2, …, RM 2.Recursively sorting Ri from i =1 to M 3.Once sorting is done, the algorithm merges the M sorted sublists. Performance: Multipass Sort-Based Algorithms 1) Each pass of a sorting algorithm: 1.Reading data from the disk. 2. Sorting data with any sorting algorithms 3. Writing data back to the disk. 2-1) (k)-pass sorting algorithm needs 2k B(R) disk I/O’s 2-2)To calculate (Multipass)-pass sorting algorithm needs = > A+ B A: 2(K-1 ) (B(R) + B(S) ) [ disk I/O operation to sort the sublists] B: B(R) + B(S)[ disk I/O operation to read the sorted the sublists in the final pass] Total: (2k-1)(B(R)+B(S)) disk I/O’s Multipass Hash-Based Algorithms 1. Hashing the relations into M-1 buckets, where M is number of memory buffers. 2. Unary case: It applies the operation to each bucket individually. 1.Duplicate elimination (δ) and grouping (γ). 1) Grouping: Min, Max, Count , Sum , AVG , which can group the data in the table 2) Duplicate elimination: Distinct Basis: If the relation fits in M memory block, -> Reading relation into memory and perform the operations. 3. Binary case: It applies the operation to each corresponding pair of buckets. Query operations: union, intersection, difference , and join If either relations fits in M-1 memory blocks, -> Reading that relation into main memory M-1 blocks -> Reading next relation to 1 block at a time into the Mth block Then performing the operations. INDUCTION If Unary and Binary relation does not fit into the main memory buffers. 1. Hashing each relation into M-1 buckets. 2. Recursively performing the operation on each bucket or corresponding pair of buffers. 3. Accumulating the output from each buckets or pair. Hash-Based Algorithms : Unary Operatiors Perfermance: Hash-Based Algorithms R: Realtion. Operations are like δ and γ M: Buffers U(M, k): Number of blocks in largest relation with k-pass hashing algorithm. Performance: Induction Induction: 1. Assuming that the first step divides relation R into M-1 equal buckets. 2. The buckets for the next pass must be small enough to handle in k-1 passes 3.Since R is divided into M-1 buckets , we need to have (M-1)u(M, k-1). Sort-Based VS Hash-Based 1. Sort-based can produce output in sorted order. It might be helpful to reduce rotational latency or seek time 2. Hash-based depends on buckets being of equal size. For binary operations, hash-based only limits size of smaller relation. Therefore, hash-based can be faster than sort-based for small size of relation. THANKS