Serializability

Sections:
1. ACID properties
2. Schedules
3. Conflict serializability
3.1. Definition
3.2. Testing for conflict-serializability

1. ACID properties

We think of the database as executing transactions: Sequences of operations that are packaged together and must be executed as a whole. We want the DBMS to provide four properties, called the ACID properties:

A. Atomicity: Either all operations of the transaction complete, or none of them do. We particularly want to avoid a transaction that changes some of the entries it intends to change, with the DBMS failing before it changes the rest.
C. Consistency: The information in the database must be kept in a consistent state. This is largely the responsibility of the database administrator writing the queries, but the DBMS is also sometimes responsible: For example, if we have one table B that includes keys from another table A (such as bank account transactions that have account IDs referencing rows in a table of bank accounts), then the DBMS shouldn't allow deleting a row from A without deleting those rows from B referring to the account being deleted.
I. Isolation: Transactions shouldn't interfere with one another in complex ways: Conceptually, transactions should have the same effect as if they were done one at a time (though in fact the DBMS will need to do them concurrently).
D. Durability: If the DBMS reports that it successfully completed a transaction, then all effects of that transaction should be permanent — even if the DBMS crashes (maybe due to a power outage) immediately after the transaction completes.

For the moment, we'll concentrate on isolation.

2. Schedules

Consider the following two transactions for a bank: The first is meant to deduct $100 from an account, while the second adds 0.5% interest to every account in the bank.

Transaction 1		Transaction 2
`UPDATE accounts SET balance = balance - 100 WHERE acct_id = 31414`		`UPDATE accounts SET balance = balance * 1.005`

We'll want to talk about these transactions in the abstract. We'll summarize their interactions with the DBMS in the following form:

Transaction 1: r₁(A), w₁(A)
Transaction 2: r₂(A), w₂(A), r₂(B), w₂(B)

The notation r_i(X) indicates that transaction i reads the value for database element X, and w_i(X) indicates that transaction i writes a new value for database element X. [We're being intentionally vague about what is meant by “database element.” You can think of it as being one row of a table, but a DBMS might think in terms of each individual cell in the table being an element, or it might think of the entire table as being an element. It might group several rows together into just one element.]

A schedule is some interleaving of the operations from the two transactions (without violating the order of operations within any individual transaction). So we can meaningfully ask: What is the outcome of the following order of operations:

Schedule S: r₁(A), r₂(A), w₁(A), w₂(A), r₂(B), w₂(B)

If account A starts with $200, and account B starts with $100, we can trace what would happen with Schedule S.

	`A`	`B`
(initial:)	200.00	100.00
`r`₁(`A`):
`r`₂(`A`):
`w`₁(`A`):	100.00
`w`₂(`A`):	201.00
`r`₂(`B`):
`w`₂(`B`):		100.50

Schedule S is very bad! (At least, it's bad if you're the bank!) We withdrew $100 from account A, but somehow the database has recorded that our account now holds $201.
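
The trace above can be reproduced mechanically. The following is a minimal sketch, assuming each transaction keeps a private copy of whatever it last read and computes its write from that copy. The names `run_schedule` and `actions`, and the tuple encoding of operations, are illustrative choices and not part of the notation above.

```python
from decimal import Decimal

def run_schedule(schedule, actions, initial):
    """Trace a schedule and return the final database state.

    schedule: list of (kind, txn, element) tuples, kind is "r" or "w".
    actions:  maps (txn, element) to the function the transaction applies
              to the value it read before writing the result back.
    initial:  dict mapping each element to its starting value.
    """
    db = dict(initial)
    local = {}  # (txn, element) -> value that transaction last read
    for kind, txn, elem in schedule:
        if kind == "r":
            local[(txn, elem)] = db[elem]
        else:
            db[elem] = actions[(txn, elem)](local[(txn, elem)])
    return db

# Transaction 1 withdraws $100 from A; Transaction 2 adds 0.5% interest.
actions = {
    (1, "A"): lambda v: v - 100,
    (2, "A"): lambda v: v * Decimal("1.005"),
    (2, "B"): lambda v: v * Decimal("1.005"),
}

# Schedule S: r1(A), r2(A), w1(A), w2(A), r2(B), w2(B)
S = [("r", 1, "A"), ("r", 2, "A"), ("w", 1, "A"),
     ("w", 2, "A"), ("r", 2, "B"), ("w", 2, "B")]

result = run_schedule(S, actions, {"A": Decimal("200"), "B": Decimal("100")})
print(result)
```

Because Transaction 2 computes its write from the value it read before Transaction 1 wrote, the withdrawal is lost and A ends at 201.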

What's a good schedule look like? Well, our ideal is a serial schedule, in which all operations by a transaction are grouped together. For our two transactions, there are only two ways to arrange their operations to get a serial schedule:

Serial schedule 1: r₁(A), w₁(A), r₂(A), w₂(A), r₂(B), w₂(B)
Serial schedule 2: r₂(A), w₂(A), r₂(B), w₂(B), r₁(A), w₁(A)

In practice, a serial schedule isn't realistic, because it means we must wait for one transaction to complete before starting another. We would really prefer to interleave them — but we need to interleave the transactions so that they work the same as some serial schedule.

We call a schedule serializable if it has the same effect as some serial schedule regardless of the specific information in the database. That last clause “…regardless of the specific information…” comes from peculiarities that may arise based on precisely what the database contains. As an example, consider Schedule T, which has swapped the third and fourth operations from S:

Schedule T: r₁(A), r₂(A), w₂(A), w₁(A), r₂(B), w₂(B)

We can try tracing this on two different examples.

A is $100 initially:

	`A`	`B`
(initial:)	100.00	100.00
`r`₁(`A`):
`r`₂(`A`):
`w`₂(`A`):	100.50
`w`₁(`A`):	0.00
`r`₂(`B`):
`w`₂(`B`):		100.50

A is $200 initially:

	`A`	`B`
(initial:)	200.00	100.00
`r`₁(`A`):
`r`₂(`A`):
`w`₂(`A`):	201.00
`w`₁(`A`):	100.00
`r`₂(`B`):
`w`₂(`B`):		100.50

Looking just at the first example, we see that the outcome is the same as the serial schedule where the withdrawal happens first and then the interest is credited. But that's just a peculiarity of the data, as revealed by the second example, where the final value of A can't be the consequence of either of the possible serial schedules.

So neither S nor T is serializable. What's a non-serial example of a serializable schedule? We could credit interest to A first, then withdraw the money, then credit interest to B:

Schedule U: r₂(A), w₂(A), r₁(A), w₁(A), r₂(B), w₂(B)
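
One way to explore these claims is to run each schedule on several starting balances and compare against both serial schedules. The sketch below does that under the same assumption as the traces above (each write is computed from the value the transaction last read). The helper names `parse`, `run`, and `agrees_with_a_serial_schedule` are hypothetical. Testing finitely many balances can refute serializability but cannot prove it, so a `True` result is only evidence.

```python
from fractions import Fraction

RATE = Fraction(1005, 1000)  # 0.5% interest, kept exact
ACTIONS = {
    (1, "A"): lambda v: v - 100,
    (2, "A"): lambda v: v * RATE,
    (2, "B"): lambda v: v * RATE,
}

def parse(text):
    """Turn "r1(A) w2(B)" into [("r", 1, "A"), ("w", 2, "B")]."""
    ops = []
    for token in text.split():
        kind, rest = token[0], token[1:]
        txn, elem = rest.rstrip(")").split("(")
        ops.append((kind, int(txn), elem))
    return ops

def run(schedule, initial):
    """Return the final database state after executing the schedule."""
    db, local = dict(initial), {}
    for kind, txn, elem in schedule:
        if kind == "r":
            local[txn, elem] = db[elem]
        else:
            db[elem] = ACTIONS[txn, elem](local[txn, elem])
    return db

SERIAL = [parse("r1(A) w1(A) r2(A) w2(A) r2(B) w2(B)"),
          parse("r2(A) w2(A) r2(B) w2(B) r1(A) w1(A)")]

def agrees_with_a_serial_schedule(schedule, samples):
    """True if one serial schedule gives the same result on every sample."""
    return any(
        all(run(schedule, init) == run(serial, init) for init in samples)
        for serial in SERIAL
    )

samples = [{"A": Fraction(a), "B": Fraction(100)} for a in (100, 200, 500)]
S = parse("r1(A) r2(A) w1(A) w2(A) r2(B) w2(B)")
T = parse("r1(A) r2(A) w2(A) w1(A) r2(B) w2(B)")
U = parse("r2(A) w2(A) r1(A) w1(A) r2(B) w2(B)")

for name, schedule in (("S", S), ("T", T), ("U", U)):
    print(name, agrees_with_a_serial_schedule(schedule, samples))
```

With these samples, S and T are rejected and U agrees with Serial schedule 2. T agrees with Serial schedule 1 only on the sample where A starts at $100.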

3. Conflict serializability

Our definition of serializability is a bit difficult to handle: How can we test for the same effect regardless of data? To come up with an answer, we'll create a stricter definition of serializability, called conflict-serializability.

3.1. Definition

First, though, we'll define conflict-equivalence: Two schedules are conflict-equivalent if one can be reached from the other through a series of swaps of adjacent operations, where no swap falls into one of the following patterns:

the operations are by the same transaction
the operations use the same database element, and at least one is a write

For example, Schedule U is conflict-equivalent to Serial Schedule 2, as shown by the following series of swaps.

Schedule U: r₂(A), w₂(A), r₁(A), w₁(A), r₂(B), w₂(B)
swap w₁(A) and r₂(B): r₂(A), w₂(A), r₁(A), r₂(B), w₁(A), w₂(B)
swap w₁(A) and w₂(B): r₂(A), w₂(A), r₁(A), r₂(B), w₂(B), w₁(A)
swap r₁(A) and r₂(B): r₂(A), w₂(A), r₂(B), r₁(A), w₂(B), w₁(A)
swap r₁(A) and w₂(B): r₂(A), w₂(A), r₂(B), w₂(B), r₁(A), w₁(A)

A schedule is conflict-serializable if it is conflict-equivalent to some serial schedule. We've just shown that Schedule U is conflict-serializable.
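
The swap rule is easy to check by program. The sketch below encodes the two forbidden patterns and replays the four swaps that turn Schedule U into Serial schedule 2. The function names `swap_forbidden` and `swap`, and the 0-based positions, are illustrative assumptions.

```python
def swap_forbidden(a, b):
    """True if adjacent operations a and b match a forbidden pattern."""
    (kind_a, txn_a, elem_a), (kind_b, txn_b, elem_b) = a, b
    same_transaction = txn_a == txn_b
    same_element_with_write = elem_a == elem_b and "w" in (kind_a, kind_b)
    return same_transaction or same_element_with_write

def swap(schedule, i):
    """Return a copy with positions i and i+1 exchanged, if that is allowed."""
    if swap_forbidden(schedule[i], schedule[i + 1]):
        raise ValueError(f"cannot swap {schedule[i]} and {schedule[i + 1]}")
    result = list(schedule)
    result[i], result[i + 1] = result[i + 1], result[i]
    return result

# Schedule U: r2(A), w2(A), r1(A), w1(A), r2(B), w2(B)
U = [("r", 2, "A"), ("w", 2, "A"), ("r", 1, "A"),
     ("w", 1, "A"), ("r", 2, "B"), ("w", 2, "B")]
# Serial schedule 2: r2(A), w2(A), r2(B), w2(B), r1(A), w1(A)
serial_2 = [("r", 2, "A"), ("w", 2, "A"), ("r", 2, "B"),
            ("w", 2, "B"), ("r", 1, "A"), ("w", 1, "A")]

schedule = U
for position in (3, 4, 2, 3):  # the four swaps listed above, 0-based
    schedule = swap(schedule, position)

print(schedule == serial_2)
```

Trying `swap(U, 1)` raises `ValueError`, because w2(A) and r1(A) use the same element and one of them is a write.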

You may wonder: Are all serializable schedules conflict-serializable? The answer is no. Consider the following schedule for a set of three transactions.

w₁(A), w₂(A), w₂(B), w₁(B), w₃(B)

We can perform no swaps on this schedule: The first two operations are both on A and at least one is a write; the second and third operations are by the same transaction; the third and fourth are both on B and at least one is a write; and so are the fourth and fifth. So this schedule is not conflict-equivalent to any other schedule, and certainly not to any serial schedule.

However, since nobody ever reads the values written by the w₁(A), w₂(B), and w₁(B) operations, the schedule has the same outcome as the serial schedule:

w₁(A), w₁(B), w₂(A), w₂(B), w₃(B)
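
Since neither schedule contains a read, the final database state depends only on which transaction wrote each element last. The following sketch compares the two schedules on that basis. The function `last_writers` is a hypothetical helper, and the shortcut is only valid for schedules made up entirely of writes.

```python
def last_writers(schedule):
    """Map each element to the transaction whose write survives at the end."""
    final = {}
    for kind, txn, elem in schedule:
        if kind == "w":
            final[elem] = txn  # later writes overwrite earlier ones
    return final

# w1(A), w2(A), w2(B), w1(B), w3(B)
interleaved = [("w", 1, "A"), ("w", 2, "A"), ("w", 2, "B"),
               ("w", 1, "B"), ("w", 3, "B")]
# w1(A), w1(B), w2(A), w2(B), w3(B)
serial = [("w", 1, "A"), ("w", 1, "B"), ("w", 2, "A"),
          ("w", 2, "B"), ("w", 3, "B")]

print(last_writers(interleaved))
print(last_writers(interleaved) == last_writers(serial))
```

In both schedules the surviving value of A comes from transaction 2 and the surviving value of B comes from transaction 3.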

3.2. Testing for conflict-serializability

Using the definition of conflict-serializability to show that a schedule is conflict-serializable is quite cumbersome. There's a much more efficient algorithm:

Build a directed graph, with a vertex for each transaction.
Go through each operation of the schedule.
- If the operation is of the form w_i(X), find each subsequent operation in the schedule also operating on the same data element X by a different transaction: that is, anything of the form r_j(X) or w_j(X). For each such subsequent operation, add a directed edge in the graph from T_i to T_j.
- If the operation is of the form r_i(X), find each subsequent write to the same data element X by a different transaction: that is, anything of the form w_j(X). For each such subsequent write, add a directed edge in the graph from T_i to T_j.
The schedule is conflict-serializable if and only if the resulting directed graph is acyclic. Moreover, we can perform a topological sort on the graph to discover the serial schedule to which the schedule is conflict-equivalent.
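
The algorithm above can be sketched in a few lines. This version uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) for the cycle check and the topological sort. The function names `precedence_edges` and `serial_order` are illustrative, and operations are encoded as `(kind, transaction, element)` tuples by assumption.

```python
from graphlib import CycleError, TopologicalSorter

def precedence_edges(schedule):
    """Return the set of edges (i, j), meaning T_i must come before T_j."""
    edges = set()
    for pos, (kind, txn, elem) in enumerate(schedule):
        for later_kind, later_txn, later_elem in schedule[pos + 1:]:
            if later_txn == txn or later_elem != elem:
                continue
            # A write conflicts with any later access to the same element;
            # a read conflicts only with a later write.
            if kind == "w" or later_kind == "w":
                edges.add((txn, later_txn))
    return edges

def serial_order(schedule):
    """Return a serial transaction order, or None if the graph has a cycle."""
    predecessors = {txn: set() for _, txn, _ in schedule}
    for before, after in precedence_edges(schedule):
        predecessors[after].add(before)
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return None

# Schedule S: r1(A), r2(A), w1(A), w2(A), r2(B), w2(B)
S = [("r", 1, "A"), ("r", 2, "A"), ("w", 1, "A"),
     ("w", 2, "A"), ("r", 2, "B"), ("w", 2, "B")]
# Schedule U: r2(A), w2(A), r1(A), w1(A), r2(B), w2(B)
U = [("r", 2, "A"), ("w", 2, "A"), ("r", 1, "A"),
     ("w", 1, "A"), ("r", 2, "B"), ("w", 2, "B")]

print(serial_order(S))
print(serial_order(U))
```

For Schedule S the graph contains both T₁ → T₂ and T₂ → T₁, so the function returns `None`. For Schedule U the only edge is T₂ → T₁, which gives the order of Serial schedule 2.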

As an example, consider the following schedule:

w₁(A), r₂(A), w₁(B), w₃(C), r₂(C), r₄(B), w₂(D), w₄(E), r₅(D), w₅(E)

We start with an empty graph with five vertices labeled T₁, T₂, T₃, T₄, T₅.

We go through each operation in the schedule:

`w`₁(`A`):	`A` is subsequently read by `T`₂, so add edge `T`₁ → `T`₂
`r`₂(`A`):	no subsequent writes to `A`, so no new edges
`w`₁(`B`):	`B` is subsequently read by `T`₄, so add edge `T`₁ → `T`₄
`w`₃(`C`):	`C` is subsequently read by `T`₂, so add edge `T`₃ → `T`₂
`r`₂(`C`):	no subsequent writes to `C`, so no new edges
`r`₄(`B`):	no subsequent writes to `B`, so no new edges
`w`₂(`D`):	`D` is subsequently read by `T`₅, so add edge `T`₂ → `T`₅
`w`₄(`E`):	`E` is subsequently written by `T`₅, so add edge `T`₄ → `T`₅
`r`₅(`D`):	no subsequent writes to `D`, so no new edges
`w`₅(`E`):	no subsequent operations on `E`, so no new edges

We end up with a directed graph containing the following edges: T₁ → T₂, T₁ → T₄, T₃ → T₂, T₂ → T₅, T₄ → T₅.

This graph has no cycles, so the original schedule must be conflict-serializable (and therefore serializable). Moreover, since one way to topologically sort the graph is T₃–T₁–T₄–T₂–T₅, one serial schedule that is conflict-equivalent is

w₃(C), w₁(A), w₁(B), r₄(B), w₄(E), r₂(A), r₂(C), w₂(D), r₅(D), w₅(E)
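
The claimed equivalence can be double-checked without performing swaps. The sketch below relies on a standard characterization that is not stated in the text above: two schedules are conflict-equivalent when they contain the same operations, keep each transaction's own operations in the same order, and order every conflicting pair of operations the same way. The helper names are hypothetical, and the code assumes no operation appears twice in a schedule.

```python
def parse(text):
    """Turn "r1(A) w2(B)" into [("r", 1, "A"), ("w", 2, "B")]."""
    ops = []
    for token in text.split():
        kind, rest = token[0], token[1:]
        txn, elem = rest.rstrip(")").split("(")
        ops.append((kind, int(txn), elem))
    return ops

def program_order(schedule):
    """Map each transaction to its own operations, in schedule order."""
    order = {}
    for op in schedule:
        order.setdefault(op[1], []).append(op)
    return order

def conflict_pairs(schedule):
    """Ordered pairs (earlier, later) of conflicting operations."""
    pairs = set()
    for i, a in enumerate(schedule):
        for b in schedule[i + 1:]:
            if a[1] != b[1] and a[2] == b[2] and "w" in (a[0], b[0]):
                pairs.add((a, b))
    return pairs

def conflict_equivalent(first, second):
    return (program_order(first) == program_order(second)
            and conflict_pairs(first) == conflict_pairs(second))

original = parse("w1(A) r2(A) w1(B) w3(C) r2(C) r4(B) w2(D) w4(E) r5(D) w5(E)")
serial = parse("w3(C) w1(A) w1(B) r4(B) w4(E) r2(A) r2(C) w2(D) r5(D) w5(E)")

# Each conflicting pair contributes one edge of the precedence graph.
edges = {(a[1], b[1]) for a, b in conflict_pairs(original)}
print(sorted(edges))
print(conflict_equivalent(original, serial))
```

The printed edges are the five edges found in the walkthrough, and the final check confirms that the serial schedule orders every conflicting pair the same way as the original.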
