*printable version*

# Test 3 Review A

[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21]

### Problem R3a.1.

Programs normally store sorted data in a red-black tree (as
with Java's TreeSet class) or some other form of balanced binary
tree. Why is this approach unsuitable for database management systems
to store their indexes (which leads to their using B^{+}-trees
instead)?

A node in a red-black tree has relatively little data,
taking up far less than a block of disk storage. As a result,
a lookup in the index will require reading many disk blocks to
reach a leaf; by contrast, a B^{+}-tree, which has a higher
branching factor with each block, requires many fewer disk
blocks to reach a leaf.

### Problem R3a.2.

For the below B^{+}-tree, each node contains up to three
keys and four pointers. Show the
B^{+}-tree that would result by inserting 4 as
described in class.

### Problem R3a.3.

For the below B^{+}-tree, each
node contains up to three keys and four pointers. Show the
B^{+}-tree that would result by inserting 5
as described in class.

### Problem R3a.4.

For the below B^{+}-tree, each
node contains up to three keys and four pointers. Show the
B^{+}-tree that would result by inserting 5
as described in class.

### Problem R3a.5.

For the below B^{+}-tree, each node contains up to three
keys and four pointers. Show the
B^{+}-tree that would result by inserting 4 as
described in class.

### Problem R3a.6.

For the below B^{+}-tree, each
node has up to three keys and four pointers. Show the
B^{+}-tree that would result by inserting 10
as described in class.

### Problem R3a.7.

For the below B^{+}-tree, each node contains up to three
keys and four pointers. Show the
B^{+}-tree that would result by deleting 11 as
described in class.

### Problem R3a.8.

Suppose that each node of a B^{+}-tree has up
to five keys and six pointers. If this tree has 100 keys represented
among its leaves, what is the maximum number of levels it might have
(that is, the maximum number of disk accesses to find any individual
key among its leaves)? Explain your answer.

Overall, the tree will have at most four levels. To get the deepest tree possible, we will need to maximize the number of nodes at each level.

Let us start from the bottom level — the leaves. Each leaf must have at least three keys, so the largest number of leaves possible is 100/ 3 = 33. (It does not divide exactly, but 33 is the largest possible; 34 leaves would require at least 34 ⋅ 3 > 100 keys.)

At the level above the leaves, each node must point to at least three leaves as children, so the largest number of nodes in this levels is 33 / 3 = 11.

The level above these will have at most 11 / 3 = 3 nodes.

The level above these will have only one node, the root.

### Problem R3a.9.

Suppose we have a B^{+}-tree indexing a table of 1,000,000
records, where each node can hold a maximum of 19 keys and 20 pointers.
How many disk accesses (at most) would be required to load a record with a
given key value into memory? Justify your answer.

There would be at most 8 block reads with each access. The root node will have at least two pointers, so reading it and traversing a pointer will go into a half of the tree with at most 500,000 records. The next node will have at least 10 pointers, so traversing a pointer will go into a subtree with at most one-tenth of the records, or 50,000. Likewise, reading the third node will reduce the number of records to 5,000; the fourth to 500; the fifth to 50; the sixth to 5; and the seventh read, which will be of a leaf, will give us the location of the block within the table. Once we get the location within the table, we then must look into the actual table to retrieve the record, and that will require an eighth disk read.

### Problem R3a.10.

The following pseudocode describes an algorithm that requires
`B`(`R`) + (`T`(`R`) `B`(`S`))
disk accesses. Explain how it can be changed to require only
`B`(`R`) + (`B`(`R`) `B`(`S`) / (`M` − 1))
accesses.

foreach tuplerinR:

foreach blockUinS:

foreach tuplesinU:

ifrandshave the same join key:

outputr,s

We use the other blocks of memory so we
read many blocks from `R`, and join all those tuples
with each tuple from `S`. We consequently end up going
through `S` many fewer times — in fact, we go through it
`B`(`R`) / (`M` − 1)
times. Each time requires
`B`(`S`) block reads,
so the total number of accesses is
`B`(`R`) + (`B`(`R`) `B`(`S`) / (`M` − 1)).

**while** more blocks remain in `R`**:**

read next `M` − 1 blocks from `R`

**for** each block `U` in `S`**:**

**for** each tuple `r` in the `M` − 1 blocks**:**

**for** each tuple `s` in `U`**:**

**if** `r` and `s` have the same join key**:**

output `r`, `s`

### Problem R3a.11.

We saw in class that a nested-loop join of two relations
`R` and `S` takes
`B`(`R`) + (`B`(`R`) `B`(`S`) / (`M` − 1))
disk accesses. But we also saw algorithms that take
`c` (`B`(`R`) + `B`(`S`))
disk accesses for some constant `c` like 3 or 5.
Select one such algorithm to describe; describe it, and explain
why we approximate the number of disk accesses by
`c` (`B`(`R`) + `B`(`S`)).

[We saw a few ways in class. I'll describe the simplest.]
Do a two-phase multiway merge-sort of `R` based on the join key;
this will require two passes through `R`.
Similarly, do a two-phase multiway merge-sort of `S` based on
the join key.
Then go through one more pass of the two sorted relations in parallel,
as if we were merging them; whenever we find two rows sharing
the same join key values, we output a row containing the
attributes of those two rows.

### Problem R3a.12.

Suppose our DBMS's algorithm for joining `R` and `S` to
get `R` × `S`
takes
`B`(`R`) + `B`(`R`) `B`(`S`) disk accesses. Suppose, moreover, that we wish
to compute `A` × `B` × `C`, where
the number of disk blocks in each relation (and possible joins of
relations) is as follows:

A | 100 | A × B | 100 | B × C | 40 | ||

B | 40 | A × C | 1,000 | A × B × C | 100 | ||

C | 10 |

How many disk accesses will it take for each of the following?

**a.** Compute `A` × `B`,
then join the result with `C`.

**b.** Compute `A` × `C`,
then join the result with `B`.

**c.** Compute `B` × `C`,
then join the result with `A`.

**a.**
Joining `A` and `B`
will take 100 + 100 ⋅ 40 = 4,100 accesses,
and joining this with `C`
will take 100 + 100 ⋅ 10 = 1,100 accesses,
for a total of 4,100 + 1,100 = 5,200.

**b.**
Joining `A` and `C`
will take 100 + 100 ⋅ 10 = 1,100 accesses,
and joining this with `B`
will take 1,000 + 1,000 ⋅ 40 = 41,000 accesses,
for a total of 1,100 + 41,000 = 42,100.

**c.**
Joining `B` and `C`
will take 40 + 40 ⋅ 10 = 440 accesses,
and joining this with `A`
will take 40 + 40 ⋅ 100 = 4,040 accesses,
for a total of 440 + 4,040 = 4,480.

### Problem R3a.13.

Suppose we want to sort a very large file of fixed-length records.
Let `L` be the number of blocks in the file, and let
`M` be the number of blocks available in memory.
Describe an algorithm that can sort the file using at most
4`L` disk accesses, assuming that
`M` < `L` ≤ `M`² − `M`.

runsof

`M`blocks each, and we do an in-memory sort on each run, storing the sorted run onto disk. This will require 2

`L`accesses: We'll access each block of the file to load it into memory, and then we'll store the sorted runs onto the disk. There will be at most

`M`− 1 runs.

In the second phase, we merge all the runs by loading the
first block of each run into memory and merging them together
into an additional block. As we merge together and fill this
destination block, we'll write that into the overall sorted
output; and as we deplete the first block from each run, we
replace that block of memory with the following block in the
run. This second phase will again load each block of each run,
requiring another `L` loads; and it will store the
sorted file, taking another `L` stores.

### Problem R3a.14.

Let `M` be the number of blocks available in memory. To use
two-phase multiway merge sort on a file, what is the maximum
size of the file? Explain your answer.

The file can have at most `M`² − `M` blocks,
so that it can be broken up into
`M` − 1 chunks of `M` blocks each.
In the first phase, we sort each chunk independently.
In the second phase, we will need to load the a block of
each chunk into memory as we merge the `M` − 1
chunks together (saving the remaining block for accumulating the
sorted output).

### Problem R3a.15.

Describe the concept of a *transaction* in database
management systems.

A transaction is a group of related database operations that should be completed as a group. That is, if any of the operaions in the transaction modify the databases, then all of the operations in the transaction must be completed.

### Problem R3a.16.

Identify and explain what the *A* stands for among the
ACID database properties.

* Atomicity* is the property that the operations requested by a
transaction will either all be completed or all have no effect.

### Problem R3a.17.

Identify and explain what the *I* stands for among the
ACID database properties.

* Isolation* means that transactions should be
completed as if they were in isolation, behaving as if one transaction
starts and finishes before another is allowed to start.

### Problem R3a.18.

Identify and explain what the *D* stands for among the
ACID database properties.

* Durability* means for each committed transaction, its
results will persist no matter what happens in the future. Even if the
computer crashes, the transaction's results should still endure. (This is
typically accomplished by writing enough information onto disk before
the transaction completes.)

### Problem R3a.19.

Suppose we have two transactions:

ID action operations 1. Decrease Aby 100.r_{1}(A),w_{1}(A)2. Increase Aby 100.r_{2}(A),w_{2}(A)

Suppose `A` starts at 500.
Write three different valid schedules for these transactions,
each resulting in a different final value for `A`,
and identify the final value for `A` for that
schedule.

`r_1`(`A`),`w_1`(`A`),`r_2`(`A`),`w_2`(`A`) (or`r_2`(`A`),`w_2`(`A`),`r_1`(`A`),`w_1`(`A`))`A`ends up at 500.`r_1`(`A`),`r_2`(`A`),`w_1`(`A`),`w_2`(`A`) (or`r_2`(`A`),`r_1`(`A`),`w_1`(`A`),`w_2`(`A`))`A`ends up at 600.`r_1`(`A`),`r_2`(`A`),`w_2`(`A`),`w_1`(`A`) (or`r_2`(`A`),`r_1`(`A`),`w_2`(`A`),`w_1`(`A`))`A`ends up at 400.

### Problem R3a.20.

Draw a precedence graph for determining whether the schedule below is conflict-serializable, labeling each edge with the operations that lead you to include that edge.

w_{1}(A),w_{2}(B),w_{3}(C),w_{1}(B),w_{2}(C),w_{3}(A)

With specific reference to your graph, explain whether you can conclude that the schedule is conflict-serializable.

Because the precedence graph contains a cycle
(1 → 3 → 2 → 1),
we can conclude that the schedule must **not** be
conflict-serializable.

### Problem R3a.21.

Draw a precedence graph for determining whether the schedule below is conflict-serializable.

w_{1}(A),r_{2}(A),w_{2}(B),r_{3}(A),w_{4}(B),w_{4}(C),r_{3}(C),r_{1}(C)

With specific reference to your graph, explain whether you can conclude that the schedule is conflict-serializable.

This graph indicates that the schedule is not conflict-serializable, due to the cycle 1 → 2 → 4 → 1. (There are no other cycles in this graph.)