ARM subroutines & program stack

by Carl Burch, Hendrix College, October 2012

Creative Commons License
ARM subroutines & program stack by Carl Burch is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.
Based on a work at


1. The link register
2. The program stack
3. Calling conventions
4. Case study: Linked lists

To make large programs more manageable, programmers break programs into smaller pieces. In C and Python, these pieces are called functions; in Java, they are called methods; and in assembly language, they are called subroutines. We'll now turn to seeing how to write subroutines for ARM's ISA. Through learning about subroutines, we'll also learn more about how methods/functions in higher-level languages are actually executed.

1. The link register

We've seen already that each processor has a register called its program counter to track the address of the instruction it is about to execute. The ARM processor uses R15 for this purpose. In fact, assembly programs refer to R15 as PC instead, though the two are synonymous.

In invoking a subroutine, a program must store where the processor should return after completing the subroutine. The ARM processor uses R14 for this purpose; it is called the link register, and is usually referenced as LR in programs.

Below is an example illustrating how this works, using our strcpy example from Section 3.2. By the end of this current section, we'll see that one uses the BL opcode rather than the two-instruction sequence used below, but first we'll look at how calling a subroutine works without using that special-purpose instruction.

        ; Note: This way of calling subroutines is stylistically poor.
        ADD LRPC#after  ; place into LR the address to return to
        B strcpy
after   ; main code continues its work here after subroutine completes

strcpy  LDRB R2, [R1], #1
        STRB R2, [R0], #1
        TST R2R2          ; repeat if R2 is nonzero
        BNE strcpy
        MOV PCLR          ; return back into the code calling strcpy

This fragment first loads into LR the address of the instruction following the call to strcpy, labeled after in the above program. In the next instruction, it branches into strcpy, and the subroutines starts its work. Once complete, the subroutine copies LR back into PC. Since PC is used to determine which instruction should be loaded and executed next, changing its value leads the processor to execute its next instruction at the after label.

The process of calling subroutines is so common that the ARM designers felt that the two-instruction process in the above illustration was too cumbersome. So they created a new opcode explicitly for this purposes, called BL for Branch and Link. Thus, rather than the first two instructions above, we'd write simply BL strcpy. This instruction loads into LR the address of the instruction following the BL instruction, and it sets PC so that the next instruction executed is the one at strcpy.

2. The program stack

Often our subroutines will themselves call subroutines. But that brings up a problem: How should we avoid the situation where one subroutine needs the values of some registers, but another subroutine that it calls will change those same registers?

The natural solution is to save registers' values into memory before the called subroutine does its work, and then to restore the values from memory once that subroutine has completed. In fact, we'll do this with something called the program stack. When a subroutine starts, it will allocate a new block of memory on the top of the stack; and when it returns, the subroutine will release the block from the top, restoring the stack to its older state.

We implement the stack using a a register called the stack pointer, which holds the address of where the top value in the stack can be found. In the ARM processor, R12 is conventionally used for the stack pointer, which we can reference this as SP.

The stack pointer will start at a large address, but as we add things into the stack, the stack pointer will decrease; thus, SP points to the top of the stack, while SP + 4 is the address of the next-to-top value, while SP + 8 points to the third-from-top value, and so on. For the ARM, we'll be able to push several things onto the stack using the STMDB instruction; for example, the instruction STMDB SP!, {R4,R5} decreases SP by 8, effectively growing the stack by 8 bytes, and R4 and R5 are saved into those 8 bytes. When we want to remove values from the top of the stack, we can use the LDMIA instruction, as with STMIA SP!, {R4,R5}.

An example of such a subroutine is below. This subroutine, an adaptation of our earlier fragment for adding the numbers of an array, needs two additional registers beyond the registers containing the parameters. Since any code that calls this subroutine may not be able to afford those registers, we instead opt to write the subroutine so that it saves both registers onto the stack as soon as it is called; and just before returning, it restores the registers to their previous values.

; sumArray: Places sum of entries in array into R0. On entry, R0
; should be address of first array element, R1 should be array length.
sumArray STMDB SP!, { R4R5 }  ; push R4 and R5 onto stack
         MOV R4#0
sumLoop  LDR R5, [R0], #1
         ADD R4R4R5
         SUBS R1R1#1
         BNE sumLoop
         MOV R0R4
         LDMIA SP!, { R4R5 }  ; restore R4 and R5 from stack
         MOV PCLR             ; return back to after sumArray call

One of the most common registers a subroutine will want to save is the link register, since the subroutine will often want to modify the link register itself as it calls other subroutines. There is a handy trick involving this: When we restore the registers at the subroutine's end, we can easily restore the link register LR's saved value into the program counter PC instead. As a result, we won't need the MOV PCLR instruction.

subName STMDB SP!, { R4-R5,LR }
        ; code within subroutine goes here, with perhaps some calls
        ; to other subroutines (thus changing LR)
        LDMIA SP!, { R4-R5,PC } ; loading into PC returns out of subroutine

3. Calling conventions

To write large assembly programs, we need a standard system for passing parameters, returning values, and allocating registers between subroutines. After all, if each subroutine created its own system of using registers, things would quickly get very confusing as we try to remember each subroutine's system. A standard system is called a calling convention, and it's important enough to have this convention that ISA designers typically define one even though it is not technically something that is built into a manufactured processor.

For the ARM processor, we'll follow the standard calling convention that parameters are passed by placing the parameter values into registers R0 through R3 before calling the subroutine, and a subroutine returns a value by placing it into R0 before returning. In the rare situation that a subroutine wants more than four parameters, we'd place any additional parameters onto the stack before entering the subroutine (with the earlier parameters pushed last onto the stack, so that the fifth parameter is on the stack's top (referenced by SP).

Each subroutine is allowed to alter R0 through R3 as it wishes; but if it uses R4 through R12, it must restore them to their previous values. It must also restore the stack pointer R13, effectively removing everything from the stack. It may change the link pointer R14.

The registers are thus divided into caller-save registers and callee-save registers. Caller-save registers are those that the subroutine may change, such as R0 through R3 in the ARM convention described above: They are caller-save because since a caller of a subroutine must save the registers' values if it wants the values after the subroutine completes. Callee-save registers are those that a subroutine must leave unchanged, like R4 through R12 is the convention described above: Upon being called, the subroutine (the callee) must save the registers' values if it wishes to use them.

It's beneficial for a calling convention to designate both caller-save registers and callee-save registers. If the convention designated all registers as callee-save, then subroutines would not be able to use any registers at all without saving them onto the stack first — which would be a waste, since some of the saved registers would be transient values that the calling subroutine did not care about long-term. And if the convention designated all registers as caller-save, then programmers would be forced to save many registers before every call to a subroutine and to restore them afterwards, lengthening the amount of time to call a subroutine.

4. Case study: Linked lists

Let's get a bit more practice with ARM assembly programming by looking at some more examples. In particular, we'll look at examples for processing a linked list of integers. In C, we'd define this as a struct with two fields, an int field named value and a struct node* field named next. In assembly language, if R0 contains an address of a node of the linked list, then the node's integer value (node->value) would be at address R0 and the address of the following node (node->next) would be at address R0 + 4.

We'll begin by writing a subroutine for adding up all the values in a linked list. In C, we'd write a function to do this as follows.

int sumList(struct node *head) {
    struct node *cur = head;
    int sum = 0;
    while (cur != NULL) {
        sum += cur->value;
        cur = cur->next;
    return sum;

Following is a translation of this into ARM assembly language.

sumList      MOV R3R0       ; R3 holds "cur"
             MOV R0#0       ; R0 holds "sum"
sumList_loop TST R3R3
             MOVEQ PCLR     ; if cur is NULL, return: R0 already holds sum
             LDR R2, [R3]     ; load cur->value into R2, add it to R0
             ADD R0R0R2
             LDR R3, [R3#4; update cur to be cur->next
             B sumList_loop

In this translation, we were somewhat clever in deciding conveniently to use R0 for holding the accumulated sum: By the time we're ready to return from the subroutine, the return value is already in R0, so we simply need to return using the MOVEQ instruction.

Now let's look at a subroutine that adds a new node at the front of a linked list. The C function is below.

struct nodeaddFront(struct node *headint val) {
    struct node *n = (struct node*) malloc(sizeof(struct node));
    n->value = val;
    n->next = head;
    return n;

This will be harder to translate into assembly language since the addFront subroutine will need to call another subroutine (malloc). What's more, the value of the R0, R1, and LR registers will need to be saved before we enter malloc, since these are all caller-save registers and addFront will need their values after malloc returns. (We need to save LR since the instruction BL malloc modifies it.)

In the below translation, we end up pushing these registers onto the stack and later popping these values into different registers; thus, the head argument originally in R0 ends up in R2, and the val argument originally in R1 ends up in R3. We do this because malloc places its return value into R0, and we wish to keep it in R0 for the return value of addFront.

addFront     STMDB SP!, {R0R1LR}  ; push R0, R1, LR onto stack
             MOV R0#8               ; call malloc(8)
             BL malloc
             LDMIA SP!, {R2R3LR}  ; pop stack into R2, R3, LR
             STR R3, [R0]             ; n->value = val (pushed R1 popped into R3)
             STR R2, [R0#4]         ; n->next = head (pushed R0 popped into R2)
             MOV PCLR               ; return from subroutine

Finally, an example that's complex enough to involve some real manipulation of the stack: Removing a number from the linked list.

struct node *remove(struct node *headint val) {
    struct node *cur = head;
    struct node *prev = NULL;
    while (cur != NULL) {
        if (cur->value == val) {
            if (cur == head) {
                head = cur->next;
            } else {
                prev->next = cur->next;
            return head;
        prev = cur;
        cur = cur->next;
    return head;

Our implementation ends up needing more than just the registers R0 through R3, so we'll need to use a callee-save register. In our implementation below, you can see that the callee-save register R4 is saved onto the stack initially and restored when we return from the subroutine. It also saves R0 onto the stack, since the value of the head parameter is needed following the free subroutine. However, in the innermost if statement, we must modify head; we do this by storing the intended value into the stack so that the new value is restored after free completes.

remove      STMDB SP!, {R0R4LR; Save "head", R4, LR onto stack.
            MOV R2R0         ; Use R2 for "cur".
            MOV R3R0         ; Use R3 for "prev".
remove_loop TST R2R2         ; If cur == NULL, return with R0 unchanged.
            LDMEQIA SP!, {R0R4PC}
            LDR R4, [R2]       ; Load cur->value,
            CMP R4R1         ;   compare to "val" argument,
            BEQ remove_find    ;   and go to remove_find if they match.
            MOV R3R2         ; Otherwise prev = cur,
            LDR R2, [R2#4]   ;   cur = cur->next,
            B remove_loop      ;   and repeat.
remove_find LDR R4, [R2#4]   ; Load cur->next into R4.
            CMP R2R0         ; Check whether cur == head:
            STREQ R4, [SP]     ;   if so, replace "head" on stack with R4;
            STRNE R4, [R3#4;   if not, save R4 into prev->next.
            MOV R0R2         ; Now free(cur).
            BL free
            LDMIA SP!, {R0R4PC; Return "head" saved on stack.

The ARM processor has many more details than we have been able to discuss here. Notably missing from our discussion is any interaction with input or output devices. But one commonly writes subroutines for this (usually included as part of an operating system) without needing to think about this interaction again. But we've now seen all the pieces we need for writing computational code in the ARM assembly language.