Lambda calculus

by Carl Burch, Hendrix College, September 2012

Lambda calculus by Carl Burch is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.
Based on a work at www.cburch.com/books/lambda/.

Contents

1. Fundamentals of lambda calculus
1.1. Notation and definitions
1.2. Theoretical results
2. Evaluation order
2.1. Implementing lazy parameters
2.2. Strictness analysis
2.3. Lazy lists
2.4. Comparison
3. Defining mathematics
3.1. Recursion
3.2. Numerals

The lambda calculus, developed in the 1930's by Alonzo Church and predating the first computers of the 1940's, is arguably the oldest programming language of all. As a logician, Church was looking for a way to talk about computation because he wanted to investigate the limits of what could be “computed” in mathematics.

1. Fundamentals of lambda calculus

1.1. Notation and definitions

The term lambda calculus is a daunting term to describe what is in the end a very simple concept: It refers to a simple way of describing a function without having to give the function a name. Traditionally, a mathematician defines a function by giving it a name, saying for example that f(x) is the function where f(x) = x + 1. In the lambda calculus, though, we write “λx.x + 1” to describe a function that takes a parameter x and returns the value of x + 1. (The Greek letter lambda (λ) is not a function name: It serves as a mathematical symbol, something like the summation symbol ∑ or the “for all” symbol ∀.) The lambda calculus allows us to talk about an unnamed function, though of course we could give a function a name by saying that we define f as the function λx.x².
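For readers who know a modern functional language, the same idea appears as an anonymous function. Here is a minimal sketch in Haskell, the language used for the code examples later in this document (the name increment is just illustrative):

increment = \x -> x + 1    -- the lambda-calculus expression λx.x + 1
-- increment 3 evaluates to 4; the unnamed (\x -> x + 1) 3 works just as well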

(It may strike you as weird to use the term calculus for this — isn't calculus the study of limits, differentials, and integrals, the course one normally takes at the beginning of college or the end of high school? Historically, though, the term calculus refers to any system for calculation, and the system of calculating with differentials is traditionally called the differential calculus, and likewise there is the integral calculus.)

Note that the lambda/period operation has the lowest order of precedence: When we write λx.x + 1, we mean λx.(x + 1), not (λx.x) + 1 (which is good, since adding one to a function wouldn't make sense).

Each variable appearing in an expression is called either free or bound. A bound variable is a symbol serving as a parameter, while a free variable is another symbol that has meaning outside the function. For example, in the lambda expression λx.x + y, the variable x is bound and the variable y is free.

Conventionally in the lambda calculus, when we want to talk about passing a value into a function, we simply write the function next to the argument, as in “f 3” — as opposed to the parentheses used in traditional mathematics when writing “f(3)”. With f defined as λx.x + 1, we could simplify the expression f 3.

f 3 ⇒ (λx.x + 1) 3 ⇒ 3 + 1 ⇒ 4

The rule we use to simplify an application, as in transforming (λx.x + 1) 3 into 3 + 1, is called a β-reduction (beta-reduction).

Function application, as in f 3, has the highest order of precedence, so that f 3 × 2 is the same as (f 3) × 2 (not f (3 × 2)). Function application associates left to right: The expression f g 3 is the same as (f g) 3 rather than f (g 3). Or to examine another example:

(λfg.g f) (λz.z) i

We would begin simplifying this by passing λz.z into λfg.g f to arrive at λg.g (λz.z), and then we'd pass i into this to conclude that our original expression is the same as i (λz.z). We cannot start by passing i into λz.z to arrive at i and then pass this result into λfg.g f to get λg.g i; as you can see, going in this direction ends up with a different result.

The lambda calculus defines three “reduction” rules (α-conversion, β-reduction, and η-reduction) for mechanically transforming lambda expressions without changing their value.

If we want to talk about a function g taking multiple arguments, then we would write “g 3 4.” This allows us to define a multi-argument function as a function that returns a function: For example, we might define g as λxy.(x² + y²)½. This says that g takes a parameter x and returns a function that takes a parameter y and returns (x² + y²)½. We see this when we try to simplify:

g 3 4 ⇒ (λxy.(x² + y²)½) 3 4
⇒ (λy.(3² + y²)½) 4
⇒ (3² + 4²)½ = 5

Note that, if we have g defined thus, then g 1 is the function that takes a parameter y and returns (1 + y²)½. We could easily define h to be g 1, which would refer to the function λy.(1 + y²)½.
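This idea of a multi-argument function as a function returning a function carries over directly to Haskell; a small sketch, with the illustrative names g and h:

g x y = sqrt (x ^ 2 + y ^ 2)   -- a “two-argument” function: really a function returning a function
h = g 1                        -- partial application: h is the function λy.(1 + y²)½
-- g 3 4 evaluates to 5.0; h 0 evaluates to 1.0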

The lambda calculus is particularly useful when we want to talk about functions whose parameters are functions or which return functions. For example, suppose we define s according to the following:

s = λhz.h (h z).

Here, s takes a function as a parameter and returns the function that applies that parameter twice to its argument. Thus, if f is λx.x + 1, then we can try to determine what function s f represents:

s f ⇒ (λhz.h (h z)) (λx.x + 1)
⇒ λz.(λx.x + 1) ((λx.x + 1) z)
⇒ λz.(λx.x + 1) (z + 1)
⇒ λz.(z + 1) + 1
⇒ λz.z + 2

Thus, s f is a function that returns two more than its argument.
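The same function is easy to write in Haskell; a brief sketch (twice and addTwo are illustrative names):

twice h = \z -> h (h z)        -- the lambda-calculus s = λhz.h (h z)
addTwo = twice (\x -> x + 1)   -- s applied to λx.x + 1
-- addTwo 5 evaluates to 7, matching the reduced form λz.z + 2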

1.2. Theoretical results

These definitions lead to a few important facts that can be proven concerning the lambda calculus, which we'll glance at without proof. The most fundamental is the Church-Rosser theorem.

(Church-Rosser Theorem) Suppose an expression A can be reduced by a sequence of reductions to an expression B, and it can be reduced by another sequence of reductions to another expression C. Then there exists some expression D that can be reached from a sequence of reductions from B and also from a sequence of reductions from C.

Essentially, this theorem says that no reduction will ever be a wrong turn. As long as we can find a reduction to perform, then it will still be possible to reach whatever destination somebody else can find.

We call an expression irreducible if there are no reductions that can be performed on it, such as 1 or λx.x or λf.f (λy.y), but not (λy.y) f, which can be reduced to f. (We aren't counting α-reductions as reductions here.) Not all expressions can be reduced to irreducible form. One of the simplest is

(λx.x x) (λx.x x).

An application of beta-reduction to this expression simply returns us to the expression we already have. Even worse is the expression

(λx.x x x) (λx.x x x),

which will get longer each time we try to reduce it. An irreducible expression is sometimes said to be in normal form.

The Church-Rosser Theorem implies that an expression cannot have two different irreducible forms. After all, if A could be reduced to two distinct irreducible forms, B and C, then the theorem says there would be some expression D reachable by reductions from both B and C; but since B and C admit no reductions, D would have to be both B and C, contradicting the assumption that they are distinct.

A natural question to ask is: Is there a technique for always reaching irreducible form when it exists? One important evaluation order is eager evaluation (or sometimes applicative order of evaluation or strict evaluation), in which an argument is always reduced before it is applied to a function. This is the ordering used in most programming languages, where we evaluate the value of an argument before passing it into a function.

(λx.x + 1) ((λy.2 × y) 3) ⇒ (λx.x + 1) (2 × 3) ⇒ (λx.x + 1) 6
⇒ 6 + 1 ⇒ 7

Unfortunately, eager evaluation does not always reach irreducible form when it exists. Consider the expression

(λx.1) ((λx.x x) (λx.x x)).

Using eager evaluation, we would first try to reduce the argument, but that simply reduces to itself. (Before trying to reduce (λx.x x) (λx.x x), we'd first have to examine its argument, λx.x x; but there are no reductions to perform there.) Yet this expression can reduce to irreducible form, for if we immediately pass the argument into λx.1, we reach 1 without ever needing to reduce the argument. Eager evaluation, though, would never get us there.
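Haskell's laziness lets us see this behavior directly; a minimal sketch (loop and constOne are illustrative names standing in for (λx.x x) (λx.x x) and λx.1):

loop :: Integer
loop = loop                -- an expression with no irreducible form

constOne :: Integer -> Integer
constOne _ = 1             -- like λx.1, it ignores its argument

-- constOne loop evaluates to 1 under lazy evaluation;
-- an eager language would try to evaluate the argument first and never finish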

Alternatively, lazy evaluation order (sometimes called the normal order of evaluation) has us always pass an argument into a function unsimplified, only reducing the argument when needed.

(λx.x + 1) ((λy.2 × y) 3) ⇒ ((λy.2 × y) 3) + 1
⇒ (2 × 3) + 1 ⇒ 6 + 1 ⇒ 7

It turns out that mathematicians have proven that lazy evaluation does guarantee that we reach irreducible form when possible.

If an expression can be reduced to an irreducible expression, then lazy evaluation order will reach it.

Due to this theorem, this evaluation order is sometimes called normal order (since an irreducible expression is said to be in normal form).

(Technically, we'll subtly distinguish the terms lazy evaluation and normal evaluation, as described in Section 2.1.)

2. Evaluation order

One of the central issues in functional language design is how to handle parameters. That is, should arguments be reduced to simplest terms before being passed into functions (called eager, strict, or applicative evaluation), or should they be passed into functions and be reduced only once that function needs the value (called normal evaluation)?

This was never a question that was seriously asked until the 1980s. Although normal evaluation has been understood since the introduction of the lambda calculus, languages simply used eager evaluation. For example, Scheme uses eager evaluation, being as it is a creature of the 1970s. (Not that I have anything against creatures of the 1970s. I happen to be one myself.) There are two reasons for this: First, eager evaluation is straightforward to implement well, whereas lazy evaluation is technically more difficult. And second, early functional languages included some imperative constructs (particularly for I/O), which meant that programmers needed to understand the evaluation order of a program, and applicative order is more intuitive.

During the 1980s, though, language researchers began thinking about lazy evaluation more carefully. They developed ways to implement argument passing well, and they also worked on avoiding cases where imperative constructs were useful, through the introduction of such concepts as monads and persistent data structures. These developments have made lazy evaluation more feasible, and it has been incorporated into some modern functional languages, including Haskell — though modern eager-evaluation languages like Caml are also available.

2.1. Implementing lazy parameters

One of the issues surrounding normal-order evaluation is that it can lead to some duplicated computation. The following reduction sequence, for example, illustrates normal-order evaluation as it would properly proceed.

(λx.x × x × x) ((λy.y + 1) 2) ⇒ ((λy.y + 1) 2) × ((λy.y + 1) 2) × ((λy.y + 1) 2)
⇒ (2 + 1) × ((λy.y + 1) 2) × ((λy.y + 1) 2)
⇒ (2 + 1) × (2 + 1) × ((λy.y + 1) 2)
⇒ (2 + 1) × (2 + 1) × (2 + 1)
⇒ 3 × (2 + 1) × (2 + 1)
⇒ 3 × 3 × (2 + 1)
⇒ 9 × (2 + 1)
⇒ 9 × 3
⇒ 27

Notice that this normal-order evaluation ended up adding 1 to 2 three times, when an applicative-order evaluation would end up performing this arithmetic only once.

(λx.x × x × x) ((λy.y + 1) 2) ⇒ (λx.x × x × x) (2 + 1)
⇒ (λx.x × x × x) 3
⇒ 3 × 3 × 3
⇒ 9 × 3
⇒ 27

Notice that normal-order evaluation required 9 steps, while applicative-order evaluation required only 5. One might validly look at this and think, “What a waste! Why would I ever want to use normal-order evaluation?”

The answer is that it is a waste, and that's why no good functional language systems use proper normal-order evaluation: Instead, they use lazy evaluation, where a parameter passed into a function is stored in a single location in memory so that the parameter need only be reduced once. That single memory location is called a thunk.

The following is a rough illustration of this process. Notice that it contains only 5 reduction steps — just like applicative order.

(λx.x × x × x) ((λy.y + 1) 2) ⇒ a × a × a   where a = (λy.y + 1) 2
⇒ a × a × a   where a = b + 1 and b = 2
⇒ a × a × a   where a = 3 and b = 2
⇒ 9 × a   where a = 3 and b = 2
⇒ 27

As a result, with lazy evaluation, every parameter is evaluated at most once — maybe zero times, if it happens that the parameter's value is never needed.

The implementation of lazy parameter passing bears some consideration, since it is a little tricky. In a lazy-evaluation language, each parameter is passed as a two-element record: The first portion refers to the code to compute the parameter's value, and the second portion contains the address of the variable mappings to use when executing that code. The second element is important: Without it, we would have just one list of variable bindings, and we could run into the following situation.

(λx.(λxy.x × y) 3 ((λz.x + z) 2)) 1 ⇒ (λxy.x × y) 3 ((λz.x + z) 2)   where x = 1
⇒ (λy.x × y) ((λz.x + z) 2)   where x = 3
⇒ x × y   where x = 3 and y = (λz.x + z) 2
⇒ x × y   where x = 3 and y = x + z and z = 2
⇒ x × y   where x = 3 and y = 5 and z = 2
⇒ 15

In fact, though, y should attain the value 3 rather than 5, since the first beta-reduction should plug 1 into x's spot in the argument expression. For this reason, a lazy parameter must preserve the variable context current at the time it is created, remembering in this case that x = 3 within the expression λy.x × y while maintaining the fact that x = 1 within the argument (λz.x + z) 2.

When the system completes the reduction of a parameter, it overwrites the record referring to the parameter calculation procedure with the calculated value, so that subsequent requests for the same value will result in no additional computation.
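A rough sketch of this machinery, written in Haskell for concreteness (Haskell provides this behavior automatically, so the Thunk type and force function below are purely illustrative; the “variable mappings” are represented by whatever environment the stored computation closes over):

import Data.IORef

-- A thunk is either a suspended computation or a value that has already been computed.
data Thunk a = Unevaluated (IO a) | Evaluated a

-- force evaluates a thunk at most once: the first call runs the suspended computation
-- and overwrites the record with the result, so later calls just return the stored value.
force :: IORef (Thunk a) -> IO a
force ref = do
  t <- readIORef ref
  case t of
    Evaluated v     -> return v
    Unevaluated act -> do
      v <- act
      writeIORef ref (Evaluated v)
      return v

A caller would build a parameter with newIORef (Unevaluated …) and read it with force; only the first force does any real work.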

2.2. Strictness analysis

This process ensures that no parameter is reduced more than once, which is a dramatic improvement over a naïve implementation of normal evaluation order. But how does it compare to applicative evaluation order?

It doesn't compare favorably. One point in favor of lazy parameters is that a parameter will not be evaluated unless its value is truly needed; but in practice programmers rarely write code that passes arguments whose values go unused, so this isn't a huge advantage for lazy evaluation. Lazy evaluation, though, has a notable performance disadvantage: When a function's code accesses a parameter, it cannot simply use the value directly as is possible with eager evaluation. It must follow the reference to the parameter's thunk, check whether the thunk has already been evaluated (evaluating it if necessary), and only then load the parameter's value. This process is significantly slower than a direct access. We could call this the “lazy parameter penalty.”

The lazy parameter penalty is a significant concern as we talk about lazy evaluation systems. Parameter access happens quite frequently in functional programming — as often as variable accesses in imperative programming — and being several times slower hurts performance markedly. How to reduce the lazy parameter penalty is a significant issue for compiler writers to worry about.

Luckily, there is something for compiler writers to do. One simple strategy for reducing the lazy parameter penalty is to determine which parameter accesses are always preceded by another access to the same parameter. For these accesses, there is no need for the generated code to waste time verifying that the parameter has already been evaluated: It will have been. For example, we might write the following simple function to compute the absolute value of the parameter.

abs x = if x >= 0 then x else -x

The generated code for abs may incur the lazy parameter penalty once (when testing whether x ≥ 0), but there is no need for the code to incur the penalty again in the then or else: The generated code can simply retrieve the value of x directly in those cases, for the compiler knows that x will already have been reduced during the test condition.

But good compilers for lazy languages go further: They use an optimization called strictness analysis to determine for each function which parameters are “strict.” A strict parameter is one that will always be evaluated. Consider the following somewhat contrived example.

mult m n = if m == 0 then 0 else m * n

This function is strict in m but not in n: We know m will always be evaluated, but it is possible that n may not.

When a compiler determines that a parameter is strict, it can compile the function using applicative evaluation order for that parameter without sacrificing the correctness of the program. That is, code calling that function first reduces the argument before entering the function, and it passes the value itself rather than the code-context pair used for lazy parameters. The receiving function then need not incur the lazy parameter penalty for that parameter. To see that this approach is valid only for strict parameters, consider the following.

let infinity = infinity + 1 in mult 0 infinity

In a lazy system, this expression should return 0, but were the system to attempt to reduce the second argument before calling mult, it would become stuck in the process of trying to compute infinity.
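In Haskell we can observe this contrast directly. The standard operator $! forces its argument before the call, so the second expression below simulates eager parameter passing:

infinity :: Integer
infinity = infinity + 1       -- a computation that never finishes

-- mult 0 infinity      evaluates to 0: the non-strict parameter n is never needed
-- mult 0 $! infinity   never finishes, because $! evaluates the argument first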

Good strictness analysis systems go farther than the simple one-function analysis above: They will go across function calls. Consider the following.

hyp a b = sqrt (mult a a + mult b b)

Looking at hyp alone, we cannot tell whether either of its parameters is strict: Without looking at mult, we can't tell whether the value of a or b will ever be computed. But a good strictness analysis system would look more deeply to observe that mult is strict in its first parameter, and therefore hyp must be strict in both parameters.

An interesting case arises when we consider the strictness analysis of recursive functions.

fold f b lst = if null lst then b else f (head lst) (fold f b (tail lst))

This function is obviously strict in the third argument lst. Equally obvious is that fold is not strict in its first argument f, since when lst is empty, f is unnecessary. But what about its second argument? The second case does not obviously require b's value, but if you think about it, the recursion will eventually reach the base case where the value is required — and hence we might say that this function is strict in its second argument as well as its third.

(This analysis is not completely correct: As we will soon see, the list could be infinite, and then fold will never reach its base case. In such a case, however, evaluating b cannot hurt, because in a purely functional language the only ill effect of evaluating b unnecessarily would be to enter an infinite computation, and fold is going to enter an infinite computation anyway.)
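For reference, this fold behaves like Haskell's built-in foldr; a couple of example uses (the names sumExample and prodExample are just illustrative):

sumExample  = fold (+) 0 [1, 2, 3]      -- evaluates to 6
prodExample = fold (*) 1 [1, 2, 3, 4]   -- evaluates to 24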

Identifying all strict parameters perfectly is technically impossible, as this would imply a solution to the halting problem. The best compilers can do is to analyze a program to see which parameters it can prove are strict and to treat all others as non-strict parameters. In practice, this identifies nearly all parameters that are in fact strict.

And in practice, a large proportion of parameters in real functional programs are strict, and so strictness analysis goes a long way toward eliminating the lazy parameter penalty.

2.3. Lazy lists

One of the interesting things about lazy evaluation is that it allows one to talk about infinite data structures. For example, we can create a function nums that takes a parameter i and returns the list [i, i + 1, i + 2, …] of all the numbers starting at i.

nums i = i : nums (i + 1)

Notice that this is a recursive function with no base case. This is not a problem, though, as long as we don't attempt to perform any operations that require reaching the list's end.
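This definition works as written in Haskell; under lazy evaluation we can observe any finite prefix of the infinite list (firstFive is an illustrative name):

firstFive = take 5 (nums 2)   -- evaluates to [2, 3, 4, 5, 6]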

Suppose, for example, that we have a contains function defined as follows.

contains val lst | null lst          = False
                 | val == (head lst) = True
                 | otherwise         = contains val (tail lst)

Now the computation of contains 4 (nums 2) proceeds as follows.

contains 4 (nums 2) = contains 4 (2 : nums (2 + 1)) (according to definition of nums)
= contains 4 (nums (2 + 1)) (according to definition of contains)
= contains 4 (3 : nums (3 + 1)) (according to definition of nums)
= contains 4 (nums (3 + 1)) (according to definition of contains)
= contains 4 (4 : nums (4 + 1)) (according to definition of nums)
= True (according to definition of contains)

Note, however, that if we computed contains 2 (nums 4), the computation would never complete, as it would never reach the end of the list. (If the language used a fixed number of bits for integers, with no errors on wraparound, then the list would eventually “wrap around” and reach 2 — and the answer would be True.)

The following is a more interesting example of an infinite list. It creates a list of all the prime numbers by applying a filter to the list of all integers.

primes = 2 : filter isPrime (nums 3)

isPrime n = sub primes
  where sub (p:ps) = p * p > n || (n `rem` p /= 0 && sub ps)

Note how isPrime works: To determine whether n is prime, it tests all the prime numbers p for which p ≤ √n to verify that none of them divide n. To iterate through the prime numbers p, it uses the primes list that isPrime itself helps to build! This works because determining whether n belongs in primes requires only elements of that list that precede n.
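As a quick check in Haskell (firstPrimes is an illustrative name), we can demand a prefix of this infinite list:

firstPrimes = take 10 primes   -- evaluates to [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]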

You might look at the primes definition and wonder why the initial 2 is explicitly added to the list. Why not simplify it to the following?

primes = filter isPrime (nums 2)

The function definition would still seem to make sense, but then when we tried to extract the first element of the list, the system would evaluate isPrime 2 and from there sub primes, which requires extracting the first element of primes, which after all is why we called isPrime 2 in the first place. The computation would not terminate.

2.4. Comparison

Eager evaluation has two major advantages over lazy evaluation: Variables can always be accessed directly, resulting in greater efficiency of access, and the evaluation order is more intuitive. In a purely functional language, having an intuitive evaluation order is not important; but if the language contains constructs that intentionally produce side effects, like output or variable assignment, then understanding the order of those side effects is important.

Yet, for a language with no side effects (i.e., a purely functional language), lazy evaluation has several benefits. One is that programs will never evaluate an expression until its value is needed, which may never happen. This can lead to a simpler implementation of some algorithms where the “intuitive” implementation involves wasted computation: For example, if we want to find the 10 smallest items in a list, then the intuitive algorithm would run Quicksort to sort the list in increasing order and then extract the first 10 items in the list. If this were implemented in an eager-evaluation language, the procedure would take O(n log n) time to sort the list. However, if the same program were executed using lazy evaluation, the recursive calls to sort elements beyond the tenth would in fact never need to be executed, and the process would take O(n) time.
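A small Haskell sketch of this scenario, using the textbook list quicksort (qsort and smallest10 are illustrative names; this assumes an ordinary recursive quicksort rather than a production sort):

qsort :: Ord a => [a] -> [a]
qsort []     = []
qsort (p:xs) = qsort [x | x <- xs, x < p] ++ [p] ++ qsort [x | x <- xs, x >= p]

smallest10 :: Ord a => [a] -> [a]
smallest10 xs = take 10 (qsort xs)   -- under lazy evaluation, only enough sorting is done to produce these 10 elements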

In addition, as we've seen, lazy evaluation enriches the expressiveness of the language by allowing us to talk about infinite structures. Such structures are never actually generated in full; only the portions that are needed are ever computed. Though in practice infinite lists are not all that common, they are something we can talk about with lazy evaluation that cannot feasibly be expressed without it.

Another minor advantage of lazy languages is that less about the language needs to be “special.” In a language using applicative order, the if expression must be categorically different from other operations, since the “then” and “else” blocks must not both be executed before choosing which is the value of the if expression. This is true even if the language is purely functional, since one of the blocks could conceivably require infinite computation. Consider the factorial function again.

fact n = if n == 0 then 1 else n * fact (n - 1)

If if were a regular function in an eager-evaluation language, then the system would first evaluate the “else” parameter before entering the if function. In the case of fact, this evaluation of the “else” parameter is recursive, and so fact would be infinitely recursive. Similarly, the && and || operations automatically provide short-circuit behavior.
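In a lazy language, though, a conditional can be an ordinary function and the recursion still terminates. A brief Haskell sketch (myIf and fact' are illustrative names; fact' mirrors fact above):

myIf :: Bool -> a -> a -> a
myIf True  x _ = x
myIf False _ y = y

fact' :: Integer -> Integer
fact' n = myIf (n == 0) 1 (n * fact' (n - 1))
-- fact' 5 evaluates to 120; the untaken branch is never evaluated, so the recursion stops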

You should not read the above analysis and conclude, based on all the words spent explaining the advantages of lazy evaluation, that lazy evaluation has “more” advantages than eager evaluation. It is simply that the advantages of eager evaluation are easy to understand, while the advantages of lazy evaluation are less obvious and take more explaining.

We can conclude, then, that if we must use a language with imperative features, eager evaluation is probably necessary because programmers would find the execution order of lazy evaluation confusing. But for a purely functional language, lazy evaluation is a viable alternative, as Haskell demonstrates.

3. Defining mathematics

3.1. Recursion

It should strike you as somewhat peculiar that we can describe the lambda calculus as a programming language. These are just straightforward mathematical functions. How can we describe any interesting procedures? As an example, suppose we want to describe a factorial function in the lambda calculus. Here we go:

λn.if n = 0 then 1 else n × ?? (n − 1).

We're lost, though, at what we can put in for ??, since functions in the lambda calculus do not have names that we can use to refer to them.

The approach to dealing with this is a bit obscure, based on the notion of a fixed point. A fixed point of a function f is a value x for which f x = x. For example, the cosine function (cos) has a fixed point close to 0.739085. (This is easy to compute on a calculator, where you can start at any number and repeatedly press the cosine button (in radians). The calculator will quickly converge on the fixed point.) Some functions can have several fixed points, such as λx.x² with its two numerical fixed points at 0 and 1, or λx.x with all values being fixed points. Others have no numerical fixed points (like λx.x + 1).

But, it turns out, all functions have a fixed point in the lambda calculus! What's more, the fixed point is easy to write down: Any function f has a fixed point P at

(λx.f (x x)) (λx.f (x x))

To confirm this, we simply apply a beta-reduction to P and observe that we arrive at f P, and so f P = P.

P = (λx.f (x x)) (λx.f (x x))
⇒ f ((λx.f (x x)) (λx.f (x x)))
= f P

Thus, while λx.x + 1 may have no numerical fixed points, it still has a fixed point, namely

(λx.(x x) + 1) (λx.(x x) + 1).

This lambda expression doesn't simplify to normal form, and we can't figure out what it means, but it is indeed a fixed point of the function.

We can define a function, traditionally named Y, that takes a function as a parameter and returns a fixed point of that function.

Y = λf.(λx.f (x x)) (λx.f (x x))

This is called the Y combinator or the fixed-point combinator. (The term combinator refers to an expression with no free variables.) We'll use the Y combinator to compute the factorial function.

Doing this is a two-step process: First, we'll add a new parameter to our first attempt at writing the recursive function, and we'll write that parameter in places where we want to make a recursive call. Let us use g to refer to the result when we do this for our factorial function.

g = λfn.if n = 0 then 1 else n × f (n − 1)

Then, we'll apply the Y combinator, and that will end up yielding the recursive function.

fact = Y g

Notice that this is an entirely mechanical process that we can apply to write down any recursive function in the lambda calculus.
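The self-application inside Y cannot be typed directly in Haskell, but the same two-step recipe works with the library fixed-point function fix from Data.Function, which satisfies fix g = g (fix g). A sketch mirroring the definitions above:

import Data.Function (fix)

g :: (Integer -> Integer) -> Integer -> Integer
g f n = if n == 0 then 1 else n * f (n - 1)

fact :: Integer -> Integer
fact = fix g      -- the fixed point of g
-- fact 5 evaluates to 120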

We can confirm that this gives us the correct value by induction. For our base case, where we want to show that fact 0 = 1, we use the fact that Y g computes a fixed point of g, and so g (Y g) = Y g.

fact 0 = Y g 0
= g (Y g) 0
= (λn.if n = 0 then 1 else n × (Y g) (n − 1)) 0
⇒ if 0 = 0 then 1 else 0 × (Y g) (0 − 1)
⇒ 1

Now we can verify that for the subsequent cases, where m > 0, we have fact m = m × fact (m − 1).

fact m = Y g m
= g (Y g) m
= (λn.if n = 0 then 1 else n × (Y g) (n − 1)) m
⇒ if m = 0 then 1 else m × (Y g) (m − 1)
⇒ m × (Y g) (m − 1)
= m × fact (m − 1)

By induction, then, we have proven that fact indeed matches the factorial function.

The fact that we can implement recursion is an important one for the lambda calculus: It means that we can now talk about functions that involve iteration, and so we can conceivably talk about being able to do anything in the lambda calculus that we can do in regular programs. In fact, this is possible: In the 1930's, logicians proved that the lambda calculus is computationally equivalent to Turing machines, and computer scientists showed early in the development of computers that Turing machines can compute anything that computers can compute. Thus, the lambda calculus can represent any program that can be executed on a computer, although it may be that the lambda-calculus translation would be too unwieldy for a human to interpret.

3.2. Numerals

Until now, we have behaved as if the lambda calculus included basic arithmetic and even if expressions. Such a lambda calculus is called an applied lambda calculus. This distinguishes it from the pure lambda calculus, which does not include anything except for the most basic elements: Lambda expressions consist only of variables, of lambda abstractions, and of function applications.

Church noticed that the pure lambda calculus is strong enough to represent all of arithmetic. That is, technically, we don't need anything as complicated as numbers to be a built-in concept. All we need are functions.

Naturally, since numbers are a part of arithmetic, we need some way of talking about numbers within the lambda calculus if it will be able to represent all of arithmetic. Church proposed the following system, called the Church numerals:

0 = λsz.z
1 = λsz.s z
2 = λsz.s (s z)
3 = λsz.s (s (s z))
⋮

Each “numeral” n is in fact a function that takes two arguments, and it will apply the first argument to the second argument n times. We can think of the first argument as representing a “successor” function, and the second argument as representing “zero,” although we would only think of them this way to understand why such a definition would make these numerals correspond to the whole numbers to which we are accustomed.
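A rough Haskell transcription of the numerals may make this concrete (zero, one, two, three, and toInt are illustrative names; we omit type signatures, since the fully general type would require a language extension):

zero  s z = z               -- λsz.z
one   s z = s z             -- λsz.s z
two   s z = s (s z)         -- λsz.s (s z)
three s z = s (s (s z))     -- λsz.s (s (s z))

toInt n = n (+ 1) (0 :: Integer)   -- read s as “add one” and z as 0
-- toInt three evaluates to 3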

Once we have our way of representing numbers, we can ask how we can define common operations on them. One simple problem is defining the succ function, which takes a numeral and returns the succeeding numeral (that is, the numeral representing the number that is one more than the number represented by the argument numeral). This is easy enough to define:

succ = λmsz.s (m s z)

As an example, we can compute succ 2 and verify that 3 comes out.

succ 2 = (λmsz.s (m s z)) (λsz.s (s z))
⇒ λsz.s ((λsz.s (s z)) s z)
⇒ λsz.s ((λz.s (s z)) z)
⇒ λsz.s (s (s z)) = 3

This is not the only possible definition of succ. In some sense, the above definition works by pasting an additional s onto the beginning of the applications of s to z. We can instead paste the s onto the end.

succ = λmsz.m s (s z)

We use the same example to see this working.

succ 2 = (λmsz.m s (s z)) (λsz.s (s z))
⇒ λsz.(λsz.s (s z)) s (s z)
⇒ λsz.(λz.s (s z)) (s z)
⇒ λsz.s (s (s z)) = 3
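Both versions are easy to check in Haskell against the numeral sketch above (succ1 and succ2 are illustrative names, avoiding the built-in succ):

succ1 m s z = s (m s z)   -- paste the extra s at the front
succ2 m s z = m s (s z)   -- paste the extra s at the end
-- toInt (succ1 two) and toInt (succ2 two) both evaluate to 3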

Using similar techniques, it's not too difficult to define a + function that takes two numerals and computes their sum, or a × function that takes two numerals and computes their product. Defining these is an exercise left to the reader: The key is to remember that a numeral is indeed a function taking a function s and a start value z as arguments, and you can choose whatever s and z you want.

Accomplishing subtraction is more difficult: We would like some inverse to succ, which we'll call pred. The numerals aren't set up to easily find the preceding value. We want to somehow “peel” off one of the applications of s. Here is one definition; it involves quite a bit of insight, and it's not easy to understand… but it works.

pred = λmsz.m (λfg.g (f s)) (λg.z) (λx.x)

We can look at an example of it working, but it's not very illustrative.

pred 2 = (λmsz.m (λfg.g (f s)) (λg.z) (λx.x)) (λsz.s (s z))
⇒ λsz.(λsz.s (s z)) (λfg.g (f s)) (λg.z) (λx.x)
⇒ λsz.(λz.(λfg.g (f s)) ((λfg.g (f s)) z)) (λg.z) (λx.x)
⇒ λsz.(λfg.g (f s)) ((λfg.g (f s)) (λg.z)) (λx.x)
⇒ λsz.(λg.g ((λfg.g (f s)) (λg.z) s)) (λx.x)
⇒ λsz.(λx.x) ((λfg.g (f s)) (λg.z) s)
⇒ λsz.(λfg.g (f s)) (λg.z) s
⇒ λsz.(λg.g ((λg.z) s)) s
⇒ λsz.s ((λg.z) s)
⇒ λsz.s z = 1

We might also want to define Boolean values. We'll decide to represent Boolean values as functions that take two arguments; true will return the first, while false will return the second.

true = λxy.x
false = λxy.y

This definition makes defining an if function simple, since the function should take a Boolean value and two other values, and return the first of the other two if the Boolean is true (which is what true does anyway) and the second if the Boolean is false (which is what false does anyway).

if = λbxy.b x y

Now suppose we want to represent the factorial function in the pure lambda calculus. The one missing component is an is0 function that takes a numeral and returns true if it is zero and false otherwise. We can do this by using the numeral where the function applied repeatedly (s) is the constant-false function, and the zero value (z) is true.

is0 = λn.n (λx.false) true

Since all numerals except 0 apply s at least once, the result will be false for them; but for 0, the z value of true will result.
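Continuing the Haskell sketch from the numerals above (true, false, ifC, and isZero are illustrative names; ifC avoids clashing with Haskell's built-in if syntax):

true  x y = x                     -- λxy.x
false x y = y                     -- λxy.y
ifC b x y = b x y                 -- “if” as an ordinary function

isZero n = n (\_ -> false) true   -- is0: any application of s produces false

-- ifC (isZero zero) "yes" "no" evaluates to "yes"
-- ifC (isZero two)  "yes" "no" evaluates to "no"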

Having this, we can now write our entire factorial function in the pure lambda calculus. Naturally, you would substitute in all the definitions of the free variables here.

fact = Y (λfn.if (is0 n) 1 (× n (f (pred n))))