### notes

```Implementation of Relational
Operators/Estimated Cost
1. Select
2. Join
Cost of Selection Operators
 R. A op value ( R)
Sailors(sid, sname, rating, age)
– Each Sailors tuple is 50 bytes long (fixed length record
format)
– A page size is 4K bytes
– #tuples: 40,000
– All pages are full, unpacked bitmap; 96 bytes are reserved
for slot directory
– How many pages for Sailors?
• One page can contain at most
• Sailors occupies
 40000 
 80   500
 4096  96 
 80


50


pages
tuples
Factors to Consider
 R. A op value ( R)
•No index
•unsorted data
•sorted data
•Index
•tree index
•hash-based index
No index, unsorted data
 R. A = value (R)
Suppose R is Sailors
Best access path: File Scan
I/O Cost: 500 pages
I/O time cost: 500 * time to access each page
Complexity: O(|R|)
Notation: |R| is the number of pages in R
No Index, sorted file on R.A
 R. A = value (R)
Suppose R is Sailors
Sorted on R.A
Best Access Path:
•Binary search to locate the first tuple with
R.A=Value
•Scan the remaining records
I/O Cost:
log2(|R|)+Cost of scan for remaining tuples (0 ~ |R|)
Tree Index on R.A
 R. A = value (R)
Selection Cost =
cost of traversing from the root to the leaf
+
cost of retrieving the pages in the sequence set +
cost of retrieving pages containing the data records.
• Need to know
– Format of the data entries in the leaf node
– Clustered or unclustered
– Dense or sparse
Index entries
Data entries
Data Records
Page (node)
Alternatives for Data Entry k* in Index

A
40
B
20
C
33
D
51
E
63
F
10
G
15
H
27
I
33
J
40
K
55
L
97
Alternative 1: Data entry is the actual tuple
o
Tuples organized according to the search key value
o
Allow only one Alternative 1 index per relation
Root
A,
40
B,
20
F,
10
G,
15
H,
27
C,
33
D,
51
I,
33
J,
40
K,
55
E,
63
L,
97
Alternatives for Data Entry k* in Index

Alternative 2: Data entry is <k, rid of data record with search key value k>
o
Ex. <50, (1, 10)>: search key value = 50, rid tells that the data is in page 1,
slot 10
Root
40
10*
15*
20
33
20*
27*
51
33*
33*
40*
40*
51*
63
55*
63*
97*
A
40
C
33
E
63
G
15
I
33
K
55
B
20
D
51
F
10
H
27
J
40
L
97
Alternatives for Data Entry k* in Index

Alternative 3: Data entry is <k, list of rids with search key k>
o
Ex: <50, (1,10), (2,20), (3,1)>
search key value = 50; three record IDs
Root
40
10*
15*
20
33
20*
27*
51
40*
33*
51*
63
55*
63*
97*
A
40
C
33
E
63
G
15
I
33
K
55
B
20
D
51
F
10
H
27
J
40
L
97
Pros and Cons



Data entries are typically much smaller than data records. So,
Alternatives 2 and 3 are better than Alternative 1 with large data
records, especially if search keys are small.
If more than one index is required on a given file, at most one index can
use Alternative 1; the rest must use Alternatives 2 or 3.
Alternative 3 is more compact than Alternative 2, but leads to variable
sized data entries even if search keys are of fixed length.
Index entries
Formats of Data entries
Data Records
Dense or Sparse
DENSE INDEX
SPARSE INDEX
Index entries
direct search
for
data entries
Data entries
Data
entries
(Index File)
(Data file)
Data Records
•
•
Data Records
Dense: At least one data entry per data record
Sparse: At least one data entry per block/page
Pros and Cons:
• Dense: less space-efficient, but great for both equality and range search
• Sparse: more space-efficient, but need sequential search within a page
Dense or Sparse
Root
40
10*
15*
20
33
20*
27*
51
33*
33*
40*
40*
51*
63
55*
63*
97*
A
40
C
33
E
63
G
15
I
33
K
55
B
20
D
51
F
10
H
27
J
40
L
97
Dense or Sparse
Root
40
20
F
10
G
15
B
H
51
33
20
27
I
33
C
33
63
J
40
D
51
E
63
A
40
K
55
L
97
Clustered or Unclustered
CLUSTERED INDEX
UNCLUSTERED INDEX
Index entries
direct search
for
data entries
Data entries
Data
entries
(Index File)
(Data file)
Data Records
Data Records
Clustered Index :
• The ordering of data records is organized the same as or close
to the ordering of data entries in the index
• Sparse index is always clustered (why?)
• A clustered index does not have to be sparse (why?)
Pros and Cons:
• Clustered: maintenance cost high, but great for range search
• Unclustered: low maintenance cost, but high retrieval cost
•Retrieving one record may need to load one page
Index Classification
•
Primary vs. secondary: If search key contains
primary key, then called primary index.
–
Unique index: Search key contains a candidate key.
–
Non-unique index: Search key is a non-candidate attribute
o
Multiple records may be associated with a same
attribute value
B+Tree: the Most Widely Used Index for RDBMS
• Support equality and range-searches efficiently
• Balanced tree
• Search/Insert/delete at log F N cost (F = fanout, N = # leaf pages)
• F => 100
• The cost of traversing from root to data entries is usually regarded as 4
• Minimum 50% occupancy (except for root).
• Each node except root contains d <= m <= 2d entries. The root
node contains 1<= m <= 2d entries.
• The parameter d is called the order of the tree.
Index Entries
(Direct search)
Data Entries
("Sequence set")
B+ Tree Example
•
•
Search begins at root, and key comparisons direct it to a leaf.
Search for tuples whose search key = 15
• Follow the left pointer if the desired value is less than the
value in the node
• Otherwise, follow the right pointer
Root
13
2*
3*
5*
7*
14* 16*
17
24
19* 20* 22*
30
24* 27* 29*
33* 34* 38* 39*
B+ Tree Example
Root
13
2*
3*
5*
7*
14* 16*
dense
17
24
19* 20* 22*
30
24* 27* 29*
sparse
33* 34* 38* 39*
Dense/Sparse
Clustered/Unclustered
Root
13
2*
3*
5*
7*
14* 16*
17
24
19* 20* 22*
30
24* 27* 29*
33* 34* 38* 39*
B+Tree Index on R.A
 R. A = value (R)
–
–
–
–
–
Format of data entry: alternative 2
Size of data entry = 20 bytes;
Page size=4K bytes; 96 bytes are reserved
Total number of records = 100,000; record size = 40 bytes
Reduction Factor = 0.1
• #matching entries/#total entries
Total Cost =
Cost of traversing from the root to the leaf (assume 4 I/Os) +
Cost of retrieving the pages in the sequence set +
the cost of retrieving pages containing the data records
•
•
•
•
•
Format of data entry: alternative 2
Size of data entry = 20 bytes;
Page size=4K bytes; 96 bytes are reserved
Total number of records = 100,000; record size = 40 bytes
Reduction Factor = 0.1
• B+tree, Alternative 2, dense, unclustered
• I/O cost of retrieving pages of qualifying data
entries
– Matching data entries: 0.1*100000=10,000 entries
 4096  96 
– #Date entries per page:  20   200
– Pages of matching data entries = 10000/200 = 50
pages
• I/O cost of retrieving qualifying tuples
– 10,000 pages since the index is unclustered, the
UNCLUSTERED INDEX
qualifying tuples are not always in the same order
as the data entries.
– In the worst case, for each qualifying data entry,
one I/O is needed
Data entries
• Total I/O Cost = 4+ 50+10,000 pages
Data Records
• B+tree, Alternative 2, dense, clustered
• I/O cost of retrieving pages of qualifying data
entries
– Matching data entries: 0.1*100000=10000 entries
 4096  96 
– #Date entries per page:  20   200
10000 
– #Pages of matching data entries = 
  50
 200 
• I/O cost of retrieving qualifying tuples
– #Matching tuples: 10000
– Since the index is dense and clustered, the qualifying
tuples are also clustered
– # pages: 10000/100=100 due to (4096-96)/40=100
tuples per page
• Total I/O Cost = 4+ 50+100 pages
CLUSTERED INDEX
Data entries
Data Records
• B+tree, Alternative 2, sparse (must be clustered)
• I/O cost of retrieving qualifying tuples
– #Matching tuples: 0.1*100,000=10,000
– Since the index is clustered, the qualifying tuples are also
clustered
10000 
– # pages:  100  due to 100 tuples per page
• I/O cost of retrieving pages of qualifying data entries
– Matching data pages: 100
– #Data entries per page:  4096  96   200
20


 100 
– #Pages of matching data entries =  200  =1 page
• Total I/O Cost = 4+ 1+100 pages
CLUSTERED INDEX
SPARSE INDEX
Each entry links to one page
Data entries
Data Records
Hash-based Index
 R. A = value (R)
• Hash-based index on A is good for
equality search on attribute A
– Usually cannot support range search
Hashed index
Data records
Ashby, 25, 3000
Basu, 33, 4003
Bristow, 30, 2007
Cass, 50, 5004
Daniels, 22, 6003
Data entries
3000
3000
5004
5004
4003
2007
h2(sal)=00
h2
sal
h2(sal)=11
6003
6003
Jones, 40, 6003
Smith, 44, 3000
Tracy, 44, 5004
Data File
I/O costs
1) Cost for retrieving the matching
data entries
2) Cost for retrieving the qualifying
tuples
Total number of records: 100,000
Reduction Factor: 0.1
Cost for searching the matching data entry: 1.2 I/O
Assume dense and unclustered
I/O cost = cost for retrieving the matching data entries
+ cost for retrieving the qualifying tuples
• I/O cost of retrieving pages of matching data entries
– Matching data entries: 0.1*100000=10000 entries
– I/O cost = 10000*1.2 = 12000 I/Os
• I/O cost for retrieving the qualifying tuples = 12000 I/Os
• Total cost = 12000+12000 = 24000 I/Os
Factors to Consider
 R. A op value ( R)
•No index
•unsorted data
•sorted data
•Index
•tree index
•hash-based index
```