Persistent data structures

by Carl Burch, Hendrix College, August 2012

Persistent data structures by Carl Burch is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.
Based on a work at www.cburch.com/books/persist/.

Contents

1. Stacks
2. Binary search trees
3. Queues
4. Random-access structures

A persistent data structure is one in which no operations result in permanent changes to the underlying structure. It's called “persistent” because as the structure goes through successive operations, all versions of the structure persist over time. The notion of a structure with no permanent changes may sound like an oxymoron: The simplest example, a stack, will illustrate.

(Many of the structures described in this chapter are selected from Chris Okasaki's book Purely Functional Data Structures (1998). The book is an adaptation of his Ph.D. dissertation.)

1. Stacks

A classic stack has four operations, and each happens to map precisely to Haskell operations on lists. Following are the Java calls and the Haskell equivalents.

Java method          Haskell equivalent
stack.peekTop();     head stack
stack.isEmpty();     null stack
stack.push(value);   value : stack
stack.pop();         tail stack

Of course, none of the Haskell functions actually changes the stack; but if we have a Java program using a stack, we can represent each of its stack operations with the corresponding Haskell expression.
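
For instance, here is a small sketch (the names s0, s1, and demo are ours, purely for illustration) showing that “pushing” onto a list-based stack leaves the old version intact:

demo :: ([Int], [Int], Int)
demo = (s0, s1, head s1)     -- s0 is still [2,3]; s1 is [1,2,3]; the top of s1 is 1
  where s0 = [2, 3]          -- an existing stack
        s1 = 1 : s0          -- "push 1" builds a new stack without touching s0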

Consider, for example, the problem of writing a method to check whether a string contains parentheses and braces that match: That is, we want the string “{ab(c)}” to pass, but not “{ab(c})” or “{ab(c))”. Such a program is easy to write in Java using a stack that tracks which unmatched opening brackets we've found so far.

public boolean bracketsMatch(String str) {
    java.util.Stack<String> stack = new java.util.Stack<String>();  // unmatched opening brackets
    for (int i = 0; i < str.length(); i++) {
        char c = str.charAt(i);
        if (c == '(' || c == '{') {
            stack.push("" + c);
        } else if (c == ')') {
            if (stack.isEmpty() || !stack.pop().equals("(")) return false;
        } else if (c == '}') {
            if (stack.isEmpty() || !stack.pop().equals("{")) return false;
        }
    }
    return stack.isEmpty();
}

The Haskell translation is as follows. You can see that the places in the code that correspond to stack operations use the Haskell functions illustrated earlier.

bracketsMatch str = sub str []
  where sub []         stack        = null stack
        sub ('(':rest) stack        = sub rest ('(':stack)
        sub ('{':rest) stack        = sub rest ('{':stack)
        sub (')':rest) ('(':bottom) = sub rest bottom
        sub (')':rest) _            = False
        sub ('}':rest) ('{':bottom) = sub rest bottom
        sub ('}':rest) _            = False
        sub ( _ :rest) stack        = sub rest stack
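
As a quick check, here are a few hypothetical test expressions (not part of the original program) applying bracketsMatch to the example strings from above:

tests :: [Bool]
tests = [ bracketsMatch "{ab(c)}"   -- True
        , bracketsMatch "{ab(c})"   -- False
        , bracketsMatch "{ab(c))"   -- False
        ]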

The value of stack technically never changes: we never modify any pointers within the stack, which is in any case impossible to accomplish within Haskell. Instead, the stack parameter refers to a different version of the stack at each level of the recursion.

Since functional programming doesn't allow changes to pointers, persistent data structures are essential to writing substantive programs in functional languages. But the fact that functional languages need the concept shouldn't lead us to believe that its usefulness is restricted to them. In imperative languages, it's sometimes useful for a program to keep several different versions of a structure available simultaneously. Persistent structures are ideal for this.

It's easy to translate the persistent-stack concept into Java.

public class Stack {
    private Object value;
    private Stack rest;

    public Stack(Object value, Stack rest) { this.value = value; this.rest = rest; }
    public Stack()                         { this.rest = this; }

    public Stack push(Object value)        { return new Stack(value, this); }
    public Stack pop()                     { return rest; }
    public boolean isEmpty()               { return rest == this; }
    public Object peekTop()                { return value; }
}

2. Binary search trees

A tree is a somewhat more complex example of a data structure that adapts conveniently into a persistent structure. First we define the data type and a couple of query functions.

data Tree a = Empty | Node a (Tree a) (Tree a)

isEmpty Empty = True
isEmpty _     = False

contains value Empty = False
contains value (Node nval left right)
    | value == nval = True
    | value  < nval = contains value left
    | value  > nval = contains value right

The traditional implementation of insertion, where we search down to where the inserted value belongs along the tree's fringe and change the pointer there to point to a new node, won't work in a persistent structure, because of that “change the pointer” step. Insertion, then, is written differently: Instead of changing a pointer, we create a new leaf and re-create the nodes on the path leading down to it, while sharing the subtrees that the insertion never visits. After an insertion, then, both the original tree and the new tree remain available (though reached through different roots).

The code to implement tree insertion is as follows.

insert value Empty = Node value Empty Empty
insert value (Node nval left right)
    | value < nval = Node nval (insert value left) right
    | otherwise    = Node nval left (insert value right)
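
As a small illustration (hypothetical values, assuming the definitions above), both versions of the tree remain usable after an insertion:

shared :: (Bool, Bool)
shared = (contains 5 t0, contains 5 t1)   -- (False, True)
  where t0 = insert 3 (insert 8 Empty)    -- the original tree
        t1 = insert 5 t0                  -- the new tree; t0 is unchanged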

Removing a node is slightly more complicated, as it is with non-persistent binary search trees. Following is an implementation.

remove value Empty = Empty
remove value (Node nval left right)
    | value < nval  = Node nval (remove value left) right
    | value > nval  = Node nval left (remove value right)
    | isEmpty right = left
    | isEmpty left  = right
    | otherwise     = Node sval left newright
        where (sval, newright) = removeSmallest right
              removeSmallest (Node mval Empty mright) = (mval, mright)
              removeSmallest (Node mval mleft mright) =
                  (val', Node mval left' mright)
                where (val', left') = removeSmallest mleft
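
The same sharing holds for removal; here is another hypothetical check (again assuming the definitions above):

removedDemo :: (Bool, Bool)
removedDemo = (contains 8 t0, contains 8 t2)   -- (True, False)
  where t0 = insert 3 (insert 8 Empty)         -- the original tree
        t2 = remove 8 t0                       -- t0 still contains 8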

If the tree maintains an O(log n) depth, then insertion and deletion in this persistent tree take O(log n) time, which matches the time taken by a non-persistent tree (although the constant factor will admittedly be larger). (If the assumption of O(log n) depth disturbs you, it should. But adapting the techniques to a balanced binary search tree (e.g., red-black trees) is not difficult.) However, there is a loss in terms of memory consumption: Whereas a standard tree requires O(1) memory for insertion and deletion, the addition of persistence comes at the price of O(log n) memory per operation.

3. Queues

Representing a queue is a more significant challenge than a stack or a tree. With a queue, we need to keep the elements in a linear structure while still being able to work with both ends. We cannot do this through a straightforward translation of the non-persistent structure.

One approach is to “buffer” items added onto the end of the queue until they are needed. In this scheme, the queue consists of two lists: a list of items found at the front of the queue, and a list of the items found after them in the queue, listed in reverse order. In such a scheme, the front of the queue is the first thing in the first list (or if that list is empty, the last thing in the second list); and the rear of the queue is the first thing in the second list (or if that list is empty, the last thing in the first list).

type Queue a = ([a], [a])

emptyQueue = ([], [])

isEmpty ([], []) = True
isEmpty (_,  _)  = False

peekFront ((val:_), _)      = val
peekFront ([],      second) = last second

Finding the front of the queue is problematic when the first list is empty: Note that we call last, which will end up stepping through the entire list to find whatever is at the end. So that this doesn't happen, in what follows we'll be careful to maintain the invariant that the first of the two lists is never empty unless the entire queue is empty. This means that the second case of peekFront will never occur, and so the function will always take O(1) time.

Enqueuing a value in this structure is simple: Since the value at the front of the second list is the queue's rear, we want simply to add the enqueued value there. The first case is to guarantee the invariant (that the first list is empty only when the entire queue is empty).

enqueue value ([],    [])     = ([value], [])
enqueue value (first, second) = (first, (value:second))

In dequeuing, we want to simply remove the first item from the first list. If this empties that list, then we need to refill it from the second to maintain the invariant.

dequeue ([],      _)      = error "queue is empty"
dequeue (_:[],    second) = (reverse second, [])
dequeue (_:first, second) = (first, second)

Note, in the first case, that the invariant implies that an empty first list means the entire queue is empty.
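
To see the representation in action, here is a hypothetical trace of a few operations; the comments show the pair of lists each step produces.

queueDemo :: [([Int], [Int])]
queueDemo = [q1, q2, q3]
  where q1 = enqueue 1 emptyQueue   -- ([1], [])
        q2 = enqueue 2 q1           -- ([1], [2])
        q3 = dequeue q2             -- ([2], []): the rear list is reversed into the front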

Given the invariant, peekFront, isEmpty, and enqueue all take O(1) time. The dequeue function takes O(n) time in the case that the first list contains only one item. One would hope that this case happens rarely enough that dequeue takes O(1) time on average. Such a guarantee would be possible if any given version of the queue could have only one operation performed on it. But such is not the case: If we had a queue in which the first list held only one element, we could perform two successive dequeues starting from that same version over and over, and each time the second dequeue would take O(n) time. (It may seem that the first dequeue in such a situation is the one taking O(n) time, since that's the call that builds the reverse; but Haskell is lazy, and the reverse isn't actually computed until its result is needed, during the second dequeue.)

We could guarantee that dequeue takes O(1) time on average if we could maintain the invariant that the second list is never longer than the first list. Developing structures that maintain this invariant is left as an exercise to the reader.

4. Random-access structures

A random-access structure is the abstract data type corresponding to an array: It is an indexed structure supporting two operations, one to retrieve the value at a given integer index, and the other to produce a version of the structure in which the value at a given integer index has been replaced.

get :: RandomAccess a -> Int -> a
set :: RandomAccess a -> Int -> a -> RandomAccess a

The frequent usage of arrays in imperative languages would lead one to believe that random-access structures are extraordinarily important. It's easy to exaggerate this importance, though: Often, imperative programs use arrays only to iterate through their elements, and this can be done as easily and as efficiently by iterating through lists. (We're talking about efficiency here in big-O terms: there would likely be a constant-factor slowdown.) Still, there are those other times when we genuinely want a random-access structure, and lists turn out not to be an adequate substitute.

From imperative programming, we are accustomed to using arrays to implement the random-access structure, which gives O(1) time both for reading and writing. The array concept, however, does not carry over cleanly as a persistent structure, as the O(1) set function relies on being able to change existing memory. The straightforward way of adding persistence is to have the set function create a clone of the array (with the one requested position altered), but the cloning would take O(n) time.
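
In Haskell terms, the straightforward approach might look like the following sketch, using the immutable arrays of Data.Array (the name cloneSet is ours); the (//) operator builds a fresh copy of the array, so each write costs O(n).

import Data.Array

-- Naive persistent "set": clone the whole array with one position changed.
cloneSet :: Array Int a -> Int -> a -> Array Int a
cloneSet arr i v = arr // [(i, v)]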

One way to reduce the set time is to add a “buffer” accumulating the changes in an auxiliary list structure. We'll clone the array only once this list reaches a length of O(√n). As a result, we'll pay the O(n) clone cost only once every O(√n) times, and other calls to set will take O(1) time, for an average cost of O(√n). Unfortunately, the get function would also need to be modified to search through the list before looking into the array, and searching the list will take O(√n) time. This buffer approach ends up trading off set time for get time.
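
Here is a sketch of this buffered variant (the names and the exact threshold are ours, again assuming Data.Array underneath): reads consult the buffer before the array, and writes rebuild the array once the buffer holds roughly √n entries.

import Data.Array
import Data.Function (on)
import Data.List (nubBy)

type Buffered a = ([(Int, a)], Array Int a)   -- buffered writes, newest first

bufGet :: Buffered a -> Int -> a
bufGet (buf, arr) i = maybe (arr ! i) id (lookup i buf)

bufSet :: Buffered a -> Int -> a -> Buffered a
bufSet (buf, arr) i v
    | length buf' * length buf' >= n = ([], arr // newest)   -- O(n) rebuild, roughly every sqrt n writes
    | otherwise                      = (buf', arr)           -- O(1) buffered write
  where buf'   = (i, v) : buf
        newest = nubBy ((==) `on` fst) buf'                  -- keep only the newest write for each index
        n      = rangeSize (bounds arr)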

However, trees provide a very different approach that works even better. In this approach, we'll maintain a nearly-complete binary tree, and individual array elements can be maintained at the leaves. We define the array type as follows.

data ArrayNode a = Leaf a | Node (ArrayNode a) (ArrayNode a)
type Array a = (Int, ArrayNode a)

An array here is a tuple holding the array length followed by the root of a tree. Within the tree, each leaf holds a value, and each internal node has two subtrees.

When we create an array, we'll pass in the array size and a function that maps each array index to an array value. In this way, a call to “makeArray 4 (\i -> 3*i)” generates the array represented by the list [0,3,6,9].

makeArray size f = (size, sub 0 size)
  where sub i 1 = Leaf (f i)
        sub i n = Node (sub i lsz) (sub (i + lsz) (n - lsz))
          where lsz = n `div` 2

The tree generated by the sub subfunction here will be complete except possibly for the bottom level. For example, a tree holding 13 elements will have 6 leaves in its left subtree and 7 in its right subtree.
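
The split can be checked directly; a tiny helper (ours, purely for illustration) reproduces the sizes that sub hands to its two recursive calls:

splitSizes :: Int -> (Int, Int)
splitSizes n = (lsz, n - lsz)
  where lsz = n `div` 2
-- splitSizes 13 == (6, 7)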

The get function simply traverses down this tree. As it traverses down, it will need to track the index desired within the current subtree, as well as the size of the current subtree.

get (size, root) index = sub index size root
  where sub 0 _ (Leaf x) = x
        sub _ _ (Leaf x) = error "index out of bounds"
        sub i n (Node left right) =
            if i < lsz then sub i lsz left
                       else sub (i - lsz) (n - lsz) right
          where lsz = n `div` 2
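
For instance, a hypothetical check (assuming the definitions above) reads back the example array from earlier:

readBack :: [Int]
readBack = map (get arr) [0 .. 3]    -- [0, 3, 6, 9]
  where arr = makeArray 4 (\i -> 3 * i)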

Finally, the set function will do the same, except that it will have to sew the tree back up, just as we saw with binary search trees previously.

set (size, root) index value = (size, sub index size root)
  where sub 0 _ (Leaf x) = Leaf value
        sub _ _ (Leaf _) = error "index out of bounds"
        sub i n (Node left right) =
            if i < lsz then Node (sub i lsz left) right
                       else Node left (sub (i - lsz) (n - lsz) right)
          where lsz = n `div` 2
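
And, as with the trees earlier, the old version survives a write; another hypothetical check:

setDemo :: (Int, Int)
setDemo = (get a0 2, get a1 2)       -- (6, 99): the original array is unchanged
  where a0 = makeArray 4 (\i -> 3 * i)
        a1 = set a0 2 99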

Since the tree is nearly complete, it will have ⌈lg n⌉ levels, and so both reading and writing in this tree-based implementation take O(log n) time.

We can summarize the performance of the three techniques we've examined here in a table.

structure                    reading     writing
array                        O(1)        O(n)
array with buffered writes   O(√n)       O(√n)
tree                         O(log n)    O(log n)

There's no real reason to use the array with buffered writes, as the tree-based implementation beats it on both operations. The regular array implementation is still useful, though, in cases involving many more reads than writes.

This is not the end of the story with persistent random-access structures, however. How close one can come to O(1) time is an active research question. One technique, too complex to describe here, claims O(1) time for writing but still O(log n) time for reading [O'Neill and Burton, “A new method for functional arrays,” Journal of Functional Programming, 1997, 487–513]. An even more complex technique improves the read time further to O(log log n) while maintaining O(1) time for writing [Dietz, “Fully persistent arrays (extended abstract),” Lecture Notes in Computer Science vol. 382, Springer-Verlag, 1989, 67–74].