Logging

Sections:
1. Buffer management
2. Undo logging
3. Redo logging

1. Buffer management

Permanent storage (disk) is much, much slower than temporary storage (memory). Thus, to get reasonable performance, a DBMS tries to do as much within memory as possible. However, any work it does in memory will be lost as soon as power is lost, so to ensure the durability ACID property, it must also be careful to ensure that all its committed work gets onto disk.

A DBMS includes a buffer manager module that deals with managing the disk-memory interface. The buffer manager breaks most of the DBMS memory into buffers, and it accepts a number of commands:

For example, if we have a transaction to transfer $100 from account A to account B, it might translate to the following:

READ(A, v)
v ← v − 100
WRITE(A, v)
READ(B, v)
v ← v + 100
WRITE(B, v)

However, if neither A nor B were in memory before the transaction, the buffer manager would need to insert LOAD(A) and LOAD(B) operations to make this viable — most likely, these would need to happen before the corresponding READ operations. For the transaction to endure after it is committed, the buffer manager would also need to insert FLUSH(A) and FLUSH(B) operations upon the transaction's completion; however, it might choose to do the FLUSH(A) operation earlier if it were running short on buffers.

The buffer manager basically has a fixed number of buffers that it manages. Eventually they will all become occupied, and it will need to choose an already-occupied buffer to replace with the next LOAD request. In that case, it will use an algorithm like LRU (least recently used) to choose which buffer should be thrown out.

2. Undo logging

There are two types of logs, undo logs and redo logs. We'll look at undo logs first.

In its basic form, an undo log contains a list of records, each falling into one of these forms:

START T
Transaction T has begun.
WRITE T, X, old
Transaction T has written a value to element X. Value old is the value that was replaced.
COMMIT T
Transaction T has committed.
ABORT T
Transaction T has aborted.

For efficiency, we will avoid storing the log on disk as much as viable, so we instead write into a memory buffer. Still, if the log is to have any use after the system crashes, we will have to store it to disk: This may happen because we want to ensure that whatever is in the log will be available should the power go out, or it may happen simply because the memory buffer has become full.

To provide durability, the DBMS must follow two basic rules:

  1. Before any database element is flushed to disk, any WRITE record in the log relating to that element must be flushed to disk.
  2. When a transaction requests to commit, all database elements that it has changed must be written to disk, and only then is a COMMIT record inserted into the log, and finally the log buffer is flushed to disk.

The following table provides an example.

operation v memory A memory B disk A disk B log record
transaction starts 500800 START T
LOAD(A) 500
READ(A, v) 500
v ← v − 100 400
WRITE(A, v) 400 WRITE T, A, 500
LOAD(B) 800
READ(B, v) 800
v ← v + 100 900
WRITE(B, v) 900 WRITE T, B, 800
commit request:
   FLUSH(LOG)
   FLUSH(A) 400
   FLUSH(B) 900
   log commit COMMIT T
   FLUSH(LOG)

The computer may crash at any point in this process. When the DBMS restarts, it will execute a recovery process to ensure that the database is restored to its state prior to the T starting. Following is pseudocode summarizing how this can work.

F ← empty set (will be finished transactions)
U ← empty set (will be unfinished transactions)
for each record in log, going in reverse order:
   if record is COMMIT T or ABORT T, then:
      Add T to F.
   if record is WRITE TXv and T ∉ F, then:
      Change value of X to v.
      Add T to U.
for each transaction T in U:
   Add ABORT T into log.
Flush log buffer.

One problem with this is that the log grows progressively larger, which causes the recovery time to get increasingly long — and since the DBMS cannot accept operations during the recovery, this means that a system failure becomes increasingly catastrophic.

We can get around this by periodically executing a checkpoint that essentially marks that all current transactions have completed, so the recovery process need not consider the portion of the log preceding that point. A simple implementation of checkpoints is the following:

  1. Temporarily stop accepting new transactions, while continuing to execute current transactions.
  2. Once all current transactions are complete, add CKPT record into log and flush the log.
  3. Resume accepting new transactions.

In the recovery process, then, we can stop going through the log as soon as we reach a CKPT record, since it is guaranteed that all transactions prior to that in the log have completed.

A downside to this simple checkpoint is that it requires refusing new transactions for a period of time. This leads to a more complex technique called nonquiescent checkpoints.

  1. Add BEGIN CKPT(T1 … Tn) to log, where T1 … Tn is a list of all active transactions. Flush this to disk.
  2. Continue executing transactions, including accepting new transactions.
  3. Once T1 … Tn have all completed, add END CKPT to log and flush to disk.

During recovery, we still start from the end of the log and work backwards. Eventually we will reach a checkpoint record (BEGIN CKPT or END CKPT). What we do depends on which we reach first.

3. Redo logging

A redo log takes the same basic form of an undo log. We have the same START, WRITE, COMMIT, and ABORT log records, but the WRITE record has an important difference: The value included in the record is the newly written value for the element (rather than the element's value previous to the write, as with an undo log).

The redo log also has a different set of rules:

  1. A database element may not be flushed to disk until all transactions that have modified the element have had their COMMIT records flushed to disk.

With a redo log, then, a transaction is allowed to modify the memory buffers as it likes, but none of these modified buffers are allowed to flush to disk until the transaction commits (and its COMMIT record has been written to disk).

operation v memory A memory B disk A disk B log record
transaction starts 500800 START T
LOAD(A) 500
READ(A, v) 500
v ← v − 100 400
WRITE(A, v) 400 WRITE T, A, 400
LOAD(B) 800
READ(B, v) 800
v ← v + 100 900
WRITE(B, v) 900 WRITE T, B, 900
commit request:
   log commit COMMIT T
   FLUSH(LOG)
FLUSH(A) 400
FLUSH(B) 900

For recovery, we use the following process.

  1. Scan through the log, determining the set C of transactions that were committed.
  2. Scan through the log again; for each WRITE record from a transaction in C, we re-write the logged value into the database element.
  3. For each transaction T in the log that was never committed or aborted, insert ABORT T into the log. Then flush the log.

Checkpointing must also be done differently.

  1. Insert into the log BEGIN CKPT(T1 … Tn) where T1 … Tn are the currently active transactions.
  2. Ensure that each buffer that was modified by a transaction that had committed before the BEGIN CKPT record is flushed to disk.
  3. Insert END CKPT into the log and flush the log.

In recovering, we look for the last END CKPT record. (If we find a BEGIN CKPT record that appears following the last END CKPT, we simply ignore it — we really need a corresponding pair of BEGIN CKPT and END CKPT records.) Once we find the last END CKPT record, we find the preceding BEGIN CKPT(T1 … Tn) record. We only need to process records in the log pertaining to a transaction that started since the BEGIN CKPT record or are among T1 … Tn. However, we must go back to the first START Ti and process the log from there.