Assembly Language - Part 8

Chris Johnson

Multiple load and store

In the previous part, we investigated how to transfer the contents of a single register to or from a memory location in RAM. There are many occasions when we need to carry out such an operation on more than one register at the same time. We could write a separate instruction for each register, using an appropriate addressing mode, but the ARM instruction set includes instructions that make such a task easier and also reduce the amount of code generated. These instructions are LDM (load multiple registers) and STM (store multiple registers). To understand the options available for these instructions, it is very helpful to be familiar with the operation of stacks.

Stacks

Stacks are very common, not only in computing, but in everyday life as well. Haven't you ever sat at a desk surrounded by piles of books or papers? If you are like me, the piles of paper seem to get ever higher and more jumbled, making the retrieval of specific items more difficult: a practical demonstration of increasing entropy, perhaps. Efficient use of a stack requires very careful management; otherwise the whole system will break down or crash.

In computer terms, in its simplest sense, a stack is an area of memory used for the temporary storage of register contents. Items are added to the stack, and are removed again in the reverse order. In other words, the last item added is always the first to be removed. Thus a stack is said to be LIFO, i.e. last in, first out. One should never add or remove items from the middle of a stack, although some clever programmers seem to find it necessary, usually resulting in much grief for the unfortunate users of the software at some critical time. You should always remove everything you put on, and not remove anything that is not yours.

Since gravity is an inescapable fact of life for us earthbound readers, piles of books start at the bottom and grow upwards. Storage in computer memory is not subject to the same constraint, so we can just as easily have a stack that grows downwards (towards lower addresses) as one that grows upwards (towards higher addresses). These are known as descending and ascending stacks respectively.

Every stack has, associated with it, a stack pointer, often shortened to SP, which tells us where to put the next item or where to retrieve the last one. Here again we have two choices. The pointer may point to the first unused (or empty) location. If this is so, then the stack is said to be of an empty type.

To store the next item, we store it in the location pointed at by the SP and then update the SP to point at the next empty slot. Alternatively, the SP could point at the last used location. The stack is then said to be of a full type. To store the next item, we must first change the SP to point to the first empty location, and then store the item. Readers who can recall part 7 of this series may relate these operations to the post-indexed and pre-indexed addressing modes, with positive or negative offsets. Similar considerations apply when removing items from the stack. Perhaps the following diagrams will help.

Thus we have four possible types of stack.

FA Full Ascending
FD Full Descending
EA Empty Ascending
ED Empty Descending

These four two-letter codes are used to specify the type of stack in the instruction. There is no inherent reason for preferring one of these types over the others; it is the programmer's choice. However, once we have chosen one type for a particular stack, we have to stick with that choice. If the stack is going to be 'public', and available to more than one process, then it is essential that all the processes assume the same type, otherwise there will be immediate disaster! Acorn has opted to use full descending (FD) stacks in its implementation of Basic, for example.
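To make the four types concrete, here is a small model written in Python rather than ARM code. It is only a sketch under simplifying assumptions: memory is a dictionary of 4-byte words keyed by address, and the class name `Stack` and its methods are purely illustrative. Notice that a full stack moves the pointer before storing, while an empty stack stores first and moves the pointer afterwards.

```python
class Stack:
    """A word-sized stack of one of the types FA, FD, EA or ED (a model only)."""

    def __init__(self, kind, sp):
        self.full = kind[0] == "F"                 # SP points at the last used slot
        self.step = 4 if kind[1] == "A" else -4    # ascending stacks grow upwards
        self.sp = sp
        self.mem = {}                              # address -> stored word

    def push(self, value):
        if self.full:
            self.sp += self.step                   # move to the next empty slot...
            self.mem[self.sp] = value              # ...then store the item
        else:
            self.mem[self.sp] = value              # store in the empty slot...
            self.sp += self.step                   # ...then move on to the next one

    def pop(self):
        if self.full:
            value = self.mem[self.sp]              # read the last used slot...
            self.sp -= self.step                   # ...then step back
        else:
            self.sp -= self.step                   # step back to the last used slot...
            value = self.mem[self.sp]              # ...then read it
        return value


fd = Stack("FD", 0x8000)
fd.push(1)
fd.push(2)
print(hex(fd.sp), fd.pop(), fd.pop(), hex(fd.sp))  # 0x7ff8 2 1 0x8000
```

Whatever the type, items come off in the reverse order to that in which they went on; only the way the pointer is handled differs.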

The instructions

The general form of these two instructions is as follows.

STM{condition}<type> <base>{!},<register list>
LDM{condition}<type> <base>{!},<register list>

The <type> will be a two-character code, such as those given above (but see later for some alternatives). The condition code is optional, and the possibilities have been covered in detail in an earlier part of this series. The <base> is the register used to contain the stack pointer. Acorn has generally standardised on R13 as the stack pointer, for Basic, for example. The ! added to the base register causes write back (see part 7) of the updated stack pointer; it will be required for all normal stack operations, since we want the pointer to be updated. The register list, enclosed in curly brackets {}, may be a simple list {R1, R4, R7}, a range {R1 - R6}, or a combination {R1 - R4, R7, R14}. For example:

STMFD    R13!, {R1 - R4}
LDMFD    R13!, {R2, R4, R5}
LDMEQEA  R13!, {R3-R5, PC}
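Inside the assembled instruction, the register list becomes a 16-bit field, one bit for each of R0 to R15. The Python sketch below turns a list written in the forms above into that bit mask. The function names are hypothetical, and as a simplifying assumption only the spellings R0 to R15 and PC are accepted. Because the result is just a set of bits, the order in which the registers are written cannot affect it.

```python
import re


def reg_number(name):
    """Register number for "R0".."R15" or "PC" (other aliases are not handled)."""
    name = name.strip().upper()
    if name == "PC":
        return 15
    if not re.fullmatch(r"R(1[0-5]|\d)", name):
        raise ValueError(f"not a register: {name!r}")
    return int(name[1:])


def register_mask(reg_list):
    """16-bit mask with bit n set when Rn appears in a list like "R1 - R4, R7, R14"."""
    mask = 0
    for item in reg_list.split(","):
        if "-" in item:                              # a range such as "R1 - R6"
            first, last = (reg_number(part) for part in item.split("-"))
            for r in range(first, last + 1):
                mask |= 1 << r
        else:                                        # a single register
            mask |= 1 << reg_number(item)
    return mask


print(hex(register_mask("R1 - R4, R7, R14")))                      # 0x409e
print(register_mask("R1, R3, R6") == register_mask("R3, R6, R1"))  # True
```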

The order registers are stored

Although not necessary for using these instructions, it is useful to know how the ARM actually stores the register contents in memory, since there is a trap for the unwary. The order in which we specify the registers in the list is not important. The assembler sorts the registers out and sets the appropriate flags in the assembled instruction. To the ARM, when executing the code, {R1, R3, R6} is exactly the same as {R3, R6, R1}. The ARM always stores the registers in the same order in memory, irrespective of the order in which the register names appear in the assembler listing, and irrespective of whether the stack is ascending or descending. The lowest numbered register is always stored at the lowest address, the next lowest numbered register at the next higher address, and so on. When retrieving data from the stack, the word at the lowest address goes into the lowest numbered register, and so on. Another diagram should help. Suppose we execute the instructions

STMFA R13!, {R5, R1, R7}
STMFD R13!, {R7, R1, R5}

the stacks might look as follows.

Note how, in both cases, the registers are stored with the lowest numbered ones in the lowest memory location, and so on. Therefore if we carried out the following sequence

STMFD R13!, {R1, R2, R3, R4, R8}
LDMFD R13!, {R4, R2, R8, R1, R3}

the registers would hold exactly the same contents as when we started; the contents would not be swapped around.
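The behaviour just described can be modelled in a few lines of Python. This is a sketch only: registers and memory are dictionaries, register lists are given as register numbers, and `stmfd` and `ldmfd` are hypothetical helpers mimicking STMFD and LDMFD with write back. Both sort the list before transferring anything, which is exactly why the order written in the source makes no difference.

```python
def stmfd(regs, mem, sp, reg_list):
    """Model STMFD sp!, {reg_list}; returns the written-back stack pointer."""
    order = sorted(reg_list)                 # source order is irrelevant
    sp -= 4 * len(order)                     # full descending: decrement before storing
    for i, r in enumerate(order):
        mem[sp + 4 * i] = regs[r]            # lowest register at the lowest address
    return sp


def ldmfd(regs, mem, sp, reg_list):
    """Model LDMFD sp!, {reg_list}; returns the written-back stack pointer."""
    order = sorted(reg_list)
    for i, r in enumerate(order):
        regs[r] = mem[sp + 4 * i]            # lowest address into the lowest register
    return sp + 4 * len(order)               # increment after loading


regs = {r: 100 + r for r in range(16)}       # made-up register contents
original = dict(regs)
mem = {}
sp = stmfd(regs, mem, 0x8000, [1, 2, 3, 4, 8])   # STMFD R13!, {R1, R2, R3, R4, R8}
sp = ldmfd(regs, mem, sp, [4, 2, 8, 1, 3])       # LDMFD R13!, {R4, R2, R8, R1, R3}
print(regs == original, hex(sp))                 # True 0x8000
```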

It is, of course, permissible to use different registers in corresponding STM and LDM instructions, as long as you remember the registers will be restored in order. For example,

STMFD R13!, {R1, R3, R7, R5, R9}
LDMFD R13!, {R4, R2, R8, R0, R6}

would result in R0 containing the original contents of R1, R2 containing the contents of R3, and so on.
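The pairing rule comes down to sorting both lists and matching them up in order. Here is a short Python sketch; `restore_mapping` is a hypothetical helper, and it assumes a balanced store and load of the same number of registers.

```python
def restore_mapping(stored, loaded):
    """For a balanced STM/LDM pair, map each loaded register to the register
    whose original value it receives (register numbers as integers)."""
    if len(stored) != len(loaded):
        raise ValueError("this sketch assumes equal-length register lists")
    # Both instructions work from the lowest register upwards,
    # so the n-th lowest loaded register gets the n-th lowest stored one.
    return dict(zip(sorted(loaded), sorted(stored)))


print(restore_mapping([1, 3, 7, 5, 9], [4, 2, 8, 0, 6]))
# {0: 1, 2: 3, 4: 5, 6: 7, 8: 9}
```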

Example of use

A common use for such instructions is to save register contents when entering a subroutine, and restore the contents when returning. For example,

.subroutine
STMFD R13!, {R2 - R5, R14}
  ; Save R2, R3, R4, R5 and the return
  ; address (R14) on the stack.
  ; Now carry out the subroutine, using
  ; (and corrupting) R2-R5.
  ; When the subroutine is finished, use
LDMFD R13!, {R2 - R5, PC}
  ; to restore the original contents of
  ; R2-R5 and load the return address
  ; into the PC (R15).

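Because each subroutine saves the return address in R14 on the stack when it is entered, it can call other subroutines (each BL overwrites R14) or make SWI calls and still return correctly. As the subroutines return, each restores its saved register contents in turn.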
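The way nested calls unwind can be simulated in Python. This is a sketch under stated assumptions: `push`, `pop` and `subroutine` are hypothetical helpers, the return addresses are made-up numbers standing in for the values BL would put in R14, and the recursion depth is arbitrary. Each level saves R2-R5 and R14 exactly as in the listing above and pops the saved R14 straight into the PC.

```python
def push(regs, mem, reg_list):
    """STMFD R13!, {reg_list}: decrement before, lowest register lowest."""
    regs[13] -= 4 * len(reg_list)
    for i, r in enumerate(sorted(reg_list)):
        mem[regs[13] + 4 * i] = regs[r]


def pop(regs, mem, reg_list):
    """LDMFD R13!, {reg_list}: lowest address into lowest register, increment after."""
    for i, r in enumerate(sorted(reg_list)):
        regs[r] = mem[regs[13] + 4 * i]
    regs[13] += 4 * len(reg_list)


def subroutine(regs, mem, level, depth, returns):
    """Hypothetical subroutine, entered as if by BL (R14 = return address)."""
    push(regs, mem, [2, 3, 4, 5, 14])          # STMFD R13!, {R2 - R5, R14}
    for r in (2, 3, 4, 5):
        regs[r] = -(10 * level + r)             # corrupt the work registers
    if level < depth:
        regs[14] = 0x1000 + 4 * level           # made-up address; a BL overwrites R14
        subroutine(regs, mem, level + 1, depth, returns)
    pop(regs, mem, [2, 3, 4, 5, 15])            # LDMFD R13!, {R2 - R5, PC}
    returns.append(regs[15])                    # address this level returns to


regs = {r: r for r in range(16)}
regs[13] = 0x8000                               # assumed initial stack pointer
regs[14] = 0x0F00                               # assumed caller's return address
mem, returns = {}, []
subroutine(regs, mem, 1, 3, returns)
print([hex(a) for a in returns])                # ['0x1008', '0x1004', '0xf00']
print(hex(regs[13]), [regs[r] for r in (2, 3, 4, 5)])  # 0x8000 [2, 3, 4, 5]
```

The innermost call returns first, the stack pointer ends where it started, and the caller's R2-R5 survive intact.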

Using the program counter in multiple register operations

When we do a multiple store which includes the program counter/status register R15 in the register list, the full 32-bit value is always written to the stack.

STMFD R13!, {R2 - R5, R15}

will write the full 32-bit value of R15, including program counter and status flags, to the stack.

However, the corresponding multiple load instruction

LDMFD R13!, {R2 - R5, R15}

would restore only the program counter, bits 2 - 25, into R15; the status flags would be unaffected. This is the usual way we use the instruction. If we specifically wish to restore the original values of the status bits, we must append the ^ character to the instruction, i.e.

LDMFD R13!, {R2 - R5, R15}^

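In this case, all 32 original bits will be loaded into R15, provided the processor is in a privileged mode. In user mode, only the condition flags (N, Z, C and V) are restored along with the program counter; the interrupt and mode bits are left unchanged.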
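The effect on R15 can be expressed with bit masks. The Python sketch below assumes the 26-bit R15 layout used by these processors: PC in bits 2-25, the N, Z, C and V flags in bits 28-31, the I and F bits in 26-27 and the mode in bits 0-1. The function `ldm_r15` and its parameters are hypothetical; it simply computes the value R15 ends up with after an LDM that includes R15.

```python
PC_MASK = 0x03FFFFFC      # bits 2-25: program counter
FLAGS_MASK = 0xF0000000   # bits 28-31: N, Z, C, V
# bits 26-27 hold I and F, bits 0-1 the processor mode


def ldm_r15(current, loaded, caret=False, privileged=False):
    """R15 after an LDM including R15 (26-bit model, not real hardware)."""
    if not caret:
        keep = ~PC_MASK & 0xFFFFFFFF                   # only the PC field changes
    elif privileged:
        keep = 0                                       # all 32 bits come from memory
    else:
        keep = ~(PC_MASK | FLAGS_MASK) & 0xFFFFFFFF    # user mode: PC and NZCV only
    return (current & keep) | (loaded & ~keep & 0xFFFFFFFF)


current = 0x80008000   # N set, PC field 0x8000, mode bits 0
loaded = 0x60001237    # Z and C set, PC field 0x1234, mode bits 3
print(hex(ldm_r15(current, loaded)))                               # 0x80001234
print(hex(ldm_r15(current, loaded, caret=True)))                   # 0x60001234
print(hex(ldm_r15(current, loaded, caret=True, privileged=True)))  # 0x60001237
```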

Alternative instruction types

When we are dealing with stacks, it makes sense to use the types I have described above, since the actual operations carried out are handled by the CPU without the programmer having to be concerned with the fine detail. However, there are many other operations that may be much more efficient when implemented with multiple load/store rather than single registers. For example, block transfers may be much faster when carried out using, say, eight registers (32 bytes) at a time, rather than a single register (4 bytes). Alternative assembler instructions are provided for such cases - the actual operations carried out by the CPU are the same, we just think of them in an alternative way.

To exemplify, let us consider what the CPU actually does when we carry out these two instructions:

STMFD R13!, {R0}
LDMFD R13!, {R0}

We are dealing here with a (F)ull (D)escending stack. When the operation is started, the SP is pointing at the last value stored. Therefore we must decrement the SP before using it to store the value held in R0. The new value of SP is then written back to the appropriate register (R13).

When we do the LDMFD, however, the SP is already pointing at the last value on the stack (the one we are about to remove). Therefore we increment the SP after we have loaded the value into R0. Perhaps the diagrams will help.

Of course, in practice, Item 3 is not actually deleted; it is just overwritten the next time we write to the stack at this location. We could work through similar scenarios for ascending and empty stacks. This leads to the alternative descriptions, where we use (D)ecrement/(I)ncrement and (B)efore/(A)fter to describe the type of transfer.
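The four increment/decrement, before/after combinations differ only in which block of addresses is used and where the base register ends up. A minimal Python sketch, where `transfer_addresses` is a hypothetical helper that returns the word addresses touched (lowest first) and the written-back base value:

```python
def transfer_addresses(base, count, mode):
    """Addresses used by LDM/STM with `count` registers in mode IA, IB, DA or DB,
    lowest first, plus the written-back base value."""
    size = 4 * count
    if mode == "IA":                      # increment after
        start, new_base = base, base + size
    elif mode == "IB":                    # increment before
        start, new_base = base + 4, base + size
    elif mode == "DA":                    # decrement after
        start, new_base = base - size + 4, base - size
    elif mode == "DB":                    # decrement before
        start, new_base = base - size, base - size
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [start + 4 * i for i in range(count)], new_base


addrs, sp = transfer_addresses(0x8000, 1, "DB")   # STMFD R13!, {R0}
print([hex(a) for a in addrs], hex(sp))           # ['0x7ffc'] 0x7ffc
addrs, sp = transfer_addresses(sp, 1, "IA")       # LDMFD R13!, {R0}
print([hex(a) for a in addrs], hex(sp))           # ['0x7ffc'] 0x8000
```

The store decrements before and the load increments after, so both use the same word and the pointer returns to its starting value.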

To finish then, let us list the full set of equivalent instructions to see how they relate. I have listed them as the pairs you would use for stores and loads, using one or other of the alternative descriptions.

Stack type            Store            Load
FD Full Descending    STMFD = STMDB    LDMFD = LDMIA
FA Full Ascending     STMFA = STMIB    LDMFA = LDMDA
ED Empty Descending   STMED = STMDA    LDMED = LDMIB
EA Empty Ascending    STMEA = STMIA    LDMEA = LDMDB

Note how, in the full/empty description, the type must be the same for loads and stores, e.g. FD, whereas in the alternative description, the exact opposite applies for corresponding loads and stores. In the first method, we simply specify the type of stack being used, whereas in the second method we actually specify the operation carried out.
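The pattern behind these equivalences can be captured in a few lines of Python. In this sketch (`equivalent` is a hypothetical helper, not an assembler feature), a store moves the pointer in the direction the stack grows, before storing for a full stack and after for an empty one; a load is the same operation run backwards, so both the direction and the before/after timing flip.

```python
def equivalent(op, stack_type):
    """Translate e.g. ("STM", "FD") into the increment/decrement form "STMDB"."""
    full = stack_type[0] == "F"
    ascending = stack_type[1] == "A"
    if op == "STM":
        up, before = ascending, full            # push: move into an empty slot
    elif op == "LDM":
        up, before = not ascending, not full    # pop: the push run backwards
    else:
        raise ValueError(f"unknown operation: {op}")
    return op + ("I" if up else "D") + ("B" if before else "A")


for t in ("FD", "FA", "ED", "EA"):
    print(t, equivalent("STM", t), equivalent("LDM", t))
# FD STMDB LDMIA
# FA STMIB LDMDA
# ED STMDA LDMIB
# EA STMIA LDMDB
```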

Another example

Here is a code snippet showing the use of different multiple instructions, both to temporarily stack registers and to do a simple copy operation of 32 bytes (it is not complete!).

.subroutine
STMFD  R13!, {R0 - R7, R14}
        ; save work registers and return
        ; address on stack
ADR    R0, source_data%
ADR    R1, destination_data%
LDMIA  R0!, {R3 - R6}
STMIA  R1!, {R3 - R6}
        ; this copies four words (16 bytes)
        ; in one step, with write back to
        ; the two pointers
LDMIA  R0, {R3 - R6}
STMIA  R1, {R3 - R6}
        ; this copies the next four words
        ; (16 bytes); we do not need write
        ; back to the "stack pointers" R0
        ; and R1 since this is the last
        ; transfer
LDMFD  R13!, {R0 - R7, PC}
        ; restore registers and transfer
        ; return address to PC

It does not take much effort to modify this skeleton to transfer a much larger block in chunks of 16 bytes, using a loop.
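As a rough model of such a loop, the Python sketch below repeats the LDMIA R0!, {R3 - R6} / STMIA R1!, {R3 - R6} pair until the whole block has been copied. It assumes memory is a dictionary of words and that the length is a multiple of 16 bytes; `block_copy` is a hypothetical name, and the loop counter an ARM version would need is left out.

```python
def block_copy(mem, src, dst, nbytes):
    """Copy nbytes (a multiple of 16) in four-word chunks; returns the
    final written-back source and destination pointers."""
    if nbytes % 16:
        raise ValueError("this sketch assumes a multiple of 16 bytes")
    regs = {}
    for _ in range(nbytes // 16):
        for i, r in enumerate(range(3, 7)):      # LDMIA R0!, {R3 - R6}
            regs[r] = mem[src + 4 * i]
        src += 16                                # write back to R0
        for i, r in enumerate(range(3, 7)):      # STMIA R1!, {R3 - R6}
            mem[dst + 4 * i] = regs[r]
        dst += 16                                # write back to R1
    return src, dst


mem = {0x1000 + 4 * i: i for i in range(16)}    # 64 bytes of made-up data
src, dst = block_copy(mem, 0x1000, 0x2000, 64)
print(hex(src), hex(dst), mem[0x203C])           # 0x1040 0x2040 15
```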
