This page demonstrates some of the optimisations which are possible when programming in ARM assembler. It's based on my experience of optimising other people's code, so these are real examples of tricks that people overlook. This page assumes familiarity with the ARM instruction set and with assembler programming in general.

Use of conditionals

Conditional execution of instructions is one of the best things about the ARM instruction set, and yet people overlook it so often. For example, returning with the V bit set from a function is often used to indicate an error occurred:

	STMFD	r13!, {r14}
	BL	func1
	LDMVSFD	r13!, {pc}
	ADD	r0, r1, r2
	BL	func2
	LDMVSFD	r13!, {pc}
	SUB	r1, r2, r0
	BL	func3
	LDMVSFD	r13!, {pc}
	LDMFD	r13!, {pc}^

can be rewritten as:

	STMFD	r13!, {r14}
	BL	func1
	ADDVC	r0, r1, r2
	BLVC	func2
	SUBVC	r1, r2, r0
	BLVC	func3
	LDMVCFD	r13!, {pc}^
	LDMFD	r13!, {pc}

A saving of 8 bytes in space, and 3 instructions in execution time for the case when no errors occur - which is hopefully the more common! I also find it more readable, and it's easier to trace the flow of execution. Naturally, if the code between the calls has to do comparisons of its own, the V flag is overwritten, so this is not possible and you'll have to either exit or branch over the conditional segment of code. Branching over it can be more efficient if you have a long procedure exit sequence, but normally exiting is merely a matter of unstacking the registers and writing the appropriate value to the program counter, which is one instruction.

Use of the correct comparison

The different types of comparison available often confuse the novice programmer. The most common error is using signed comparisons where unsigned ones are needed. This can be merely inefficient (the final example on this page shows such a case), but it often leads to bugs when memory ranges suddenly extend beyond 2GB.

  Unsigned   Signed
   HS (CS)    GE
   HI         GT
   LS         LE
   LO (CC)    LT

PL and MI directly test bit 31 of the result (i.e. the N flag is a copy of bit 31). They are subtly different from GE and LT, which also take the V flag into account, and I'd like someone to give me a case where they should be used instead of GE or LT.
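
As a sketch of where the two families part company (the values are chosen purely for illustration): when a comparison overflows, N no longer gives the sign of the true result, so LT is taken but MI is not.

	MOV	r0, #&80000000	; most negative signed value
	CMP	r0, #1		; r0 - 1 = &7FFFFFFF, so N = 0 and V = 1
	MOVLT	r1, #1		; executed: LT is true when N and V differ
	MOVMI	r2, #1		; not executed: MI looks at N alone

In the other direction, instructions such as MOVS and TST leave V unchanged, so after them MI and PL are the conditions that reflect the sign of the result; GE and LT would mix in a stale V.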

Don't misuse the stack

Obviously, don't stack registers that aren't required to be preserved. Which registers these are depends on whether you are writing APCS-conforming code, have your own Procedure Call Standard, or work on an ad hoc basis. It's also not necessary to stack r14 if the function is a leaf function (one which calls no other functions). However, if you do find yourself in need of more registers than you have available, then r14 should be the first register you stack, on the grounds of efficiency. Consider the two functions:

	STMFD	r13!, {r14}
	MUL	r0, r1, r2
	LDMFD	r13!, {pc}

	STMFD	r13!, {r1, r2}
	LDR	r1, [r12, #8]
	LDR	r2, [r12, #12]
	MUL	r0, r1, r2
	LDMFD	r13!, {r1, r2}
	MOVS	pc, r14

Better would be:

	MUL	r0, r1, r2
	MOVS	pc, r14

	STMFD	r13!, {r14}
	LDR	r14, [r12, #8]
	LDR	r0, [r12, #12]
	MUL	r0, r14, r0
	LDMFD	r13!, {pc}

In the second function, restoring registers is combined with returning from the procedure call. Since r14 must be corrupted by the function call, the caller is not expecting its value to be any particular value at the end of the function. This has been exploited in at least one program which I know of to return function values in r14, but you really need to be aware of exactly what you're doing with that technique, and I've never dared try it myself.

A further optimisation which can be used is to use a single store rather than a multiple store if you're only storing r14. The above code then becomes:

	STR	r14, [r13, #-4]!
	LDR	r14, [r12, #8]
	LDR	r0, [r12, #12]
	MUL	r0, r14, r0
	LDR	pc, [r13], #4

Beware that you can't tell LDR to restore the flags held in the program counter. This is not an issue if you are using 32-bit APCS-3, but most routines on 26-bit ARMs preserve the flags over a function call.

It can sometimes be a win to only stack registers conditionally. Here's an example:

	LDR	r2, [r12, #0]
	TEQ	r0, r2
	MOVEQ	pc, r14
	STMFD	r13!, {r14}
	BL	func
	SUB	r0, r0, #1
	LDMFD	r13!, {pc}

This can be more efficient if r0 is often equal to r2; for example if this were an error check.

Another way in which the stack can be abused is to store data _below_ the stack pointer. This will appear to work but you will get random crashes. This is because interrupt code is entitled to use space on the stack temporarily; if you have left the stack pointer in the wrong place, you will have problems.
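
A minimal sketch of the difference, assuming a full descending stack in r13 as used elsewhere on this page (the choice of r0 is arbitrary):

	STR	r0, [r13, #-4]	; wrong: r13 is unchanged, so the word is below it
	...			; an interrupt here may overwrite the word
	LDR	r0, [r13, #-4]

	STR	r0, [r13, #-4]!	; right: writeback moves r13 down first
	...			; the word is now protected
	LDR	r0, [r13], #4	; reload it and release the space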

References to data within a big program

When your program gets to more than 4k, you start to run the risk of not being able to address data. This is because LDR has a maximum range of +/-4095 bytes. ADR is a pseudo-instruction for ADD rN, pc, x (or SUB for addresses behind the program counter), where x takes into account the fact that the program counter is 2 instructions ahead of the currently executing instruction. It uses the normal scheme of 8 bits rotated by a multiple of 2, so it can reach further away, but at the cost of not being able to directly access every address. Many people have FNadrl macros and at least two assemblers that I know of have ADRL instructions. They are implemented as something like:

	ADR	rN, (offset AND &FF)
	ADD	rN, rN, ((offset - P%) AND &FF00)

with slight modifications to cope with a negative offset, and the cunning ones actually try various possibilities such as

	ADR	rN, (offset AND &3FC)
	ADD	rN, rN, ((offset - P%) AND &3FC00)

to gain them more distance.
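
As a worked illustration with made-up numbers: if the target lies &1234 bytes beyond the value the program counter holds at the first instruction, the simple scheme splits the offset into two immediates, each of which is 8 bits rotated by an even amount.

	ADD	r0, pc, #&34		; low byte of the offset
	ADD	r0, r0, #&1200		; next byte up: &12 shifted left by 8

If the target is known to be word aligned, the bottom two bits of the offset are zero, so the fields can be moved up by two bits. An offset of &23450 then splits into &50 and &23400 (&8D shifted left by 10), which the simple scheme could not reach in two instructions.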

Nevertheless, these pseudo-instructions take 2 instructions, and ADRX (for constants which can't be reached with ADRL) takes 3. It is better to avoid using these if possible. Try moving data around so that it's nearer the functions which reference it. If that's not possible, try moving functions around.

If you have data which are used throughout the program, consider dedicating a register to hold their address for the whole of the program's execution. In RISC OS modules, Acorn assists with this by passing r12 to each entry point as a pointer to the module's private word, which conventionally holds the address of a block of private workspace.
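
A sketch of what this looks like in use, assuming r12 already holds the workspace address and that offsets 8 and 12 are where two hypothetical variables live:

	LDR	r0, [r12, #8]		; first variable
	ADD	r0, r0, #1
	STR	r0, [r12, #8]
	LDR	r1, [r12, #12]		; second variable

Each access is a single instruction however far the code is from the data, so long as the offset is within LDR's range of 4095 bytes.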

Under no circumstances consider doing the following:

	BL	getvar
...
.getvar
	LDR	r0, var
	MOVS	pc, r14
.var
	EQUD	0

since this involves two pipeline flushes (one for the BL and one for the return) and wrecks most caching strategies, so it is guaranteed to be slower. It also goes against the principle of keeping data and code separate, which is more important on StrongARM processors with their split instruction/data caches.

Working with large constants

By large constants, I don't mean constants with a large magnitude like 2^31, I mean constants which don't fit in the ARM's scheme for immediate constants; ie 8 bits rotated by a multiple of 2. This example is from IscaFS:

	LDR	r0, [r6, #24]
	BIC	r0, r0, #&ff000000
	BIC	r0, r0, #&00ff0000
	MOV	r1, #&53
	ORR	r1, r1, #&ef00
	TEQ	r0, r1

I replaced this with:

	LDR	r0, [r6, #24]
	MOV	r0, r0, LSL #16
	EOR	r0, r0, #&ef000000
	TEQ	r0, #&00530000

Shifting r0 left by 16 automatically zeroes the bottom 16 bits and discards the previous top 16 bits, which is the desired effect. The crucial step is noticing that TEQ is the same as EORS without a destination register. So if the top byte of r0 is not &ef, the EOR leaves a non-zero top byte and the TEQ can never report equality. This trick saves 2 instructions and one register.
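
To see why this works, here is the sequence traced with two made-up values, one which matches and one whose &ef byte is wrong:

	; r0 = &1234ef53 after the LDR
	MOV	r0, r0, LSL #16		; r0 = &ef530000
	EOR	r0, r0, #&ef000000	; r0 = &00530000
	TEQ	r0, #&00530000		; Z set, so EQ

	; r0 = &1234ee53 after the LDR
	MOV	r0, r0, LSL #16		; r0 = &ee530000
	EOR	r0, r0, #&ef000000	; r0 = &01530000
	TEQ	r0, #&00530000		; Z clear, so NE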

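A similar technique can apply to other situations. For example, an unsigned comparison of a 16 bit value against a constant &xyza which doesn't fit in an immediate can be done by

	SUBS	r0, r0, #&xy00
	CMPHS	r0, #&00za

which sets the C and Z flags just as

	MOV	r1, #&xy00
	ORR	r1, r1, #&00za
	CMP	r0, r1

does, but takes one instruction fewer and uses one register fewer. The price is that r0 is corrupted; if you still need its value, put the result of the SUBS in a scratch register and compare that instead.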

If it's utterly unavoidable, you may need to put a large constant into a register. It is thought that there are no constants which cannot be produced in 3 instructions, though no-one has an algorithm for producing arbitrary constants (to the best of my knowledge). On a StrongARM, it is normally quicker to load the constant from memory with a single LDR than to use 3 instructions to synthesise it. On an ARM6, it is quicker to use 3 instructions. Take your pick.
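
The two options look like this for &0001ef53, a constant picked for illustration and synthesised here, in the obvious way, with three instructions (the label name is hypothetical):

	MOV	r0, #&00000053		; synthesise the constant...
	ORR	r0, r0, #&0000ef00
	ORR	r0, r0, #&00010000

	LDR	r0, const		; ...or load it from memory
	...
.const
	EQUD	&0001ef53

The LDR form only works if const is within 4095 bytes of the instruction.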

Strength reduction

This term covers a number of optimisations. One is to reduce the cost of per-iteration calculations.

	MOV	r0, #200
	MOV	r4, #8
.loop
	MUL	r1, r0, r4
	...
	SUBS	r0, r0, #1
	BNE	loop

becomes

	MOV	r1, #1600
.loop
	...
	SUBS	r1, r1, #8
	BNE	loop

This saves a MUL instruction in the loop, which is a slow operation on many variants of the ARM. In this case, it also saves 2 registers, though in practice this is not often achieved.

Count down, not up

The above example also illustrates that it's better to count down in a loop than up. Here's a worse example:

	MOV	r1, #0
.loop
	...
	ADD	r1, r1, #8
	CMP	r1, #1600
	BNE	loop

The SUBS used above combines the test-for-end with the loop-count, saving an instruction.

Unrolling loops

Since the instructions which control the loop (the SUBS and BNE above) are not doing useful work, it is often better to unroll the loop a little. This can add extra overhead in places, so you should be cautious. It may also pay to unroll the loop entirely, as the C compiler may sometimes do with memset. To reuse the example above:

	MOV	r1, #1600
.loop
	...
	SUB	r1, r1, #8
	...
	SUBS	r1, r1, #8
	BNE	loop

In this example, since I haven't specified what is in `...', I can't fold the first SUB into the surrounding code, which might be possible in real code. This optimisation saves half an instruction (and half a pipeline flush) per original iteration - 100 branches in total. Against that, it takes more cache lines, and more RAM in general. The only way to find out whether this is a win is to benchmark the code in question.

Dividing

There are already several good algorithms out there and I don't think I have anything to contribute myself. Please see this page and this one for good examples.

Some further examples

That's about all the coding tricks I can remember for the moment. If you want to see a real example of code which I've improved, try looking at IscaFS which was originally written by Phil Norman.

I'll leave you with some more fictional examples:

	STMFD	r13!, {r14}
	CMP	r0, #5
	BGT	not
	CMP	r0, #0
	BLT	not
	LDR	r0, [r12, #8]
	B	over
.not
	MOV	r0, #0
	SUB	r0, r0, #1
.over
	LDMFD	r13!, {pc}

Can be optimised to:

	CMP	r0, #5
	LDRLS	r0, [r12, #8]
	MVNHI	r0, #0
	MOV	pc, r14

This example demonstrates several points:

Use of unsigned instead of signed comparisons
Use of conditionals to avoid branches
Not storing the link register if avoidable
Remembering one of the more `esoteric' instructions
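
The unsigned trick extends to ranges which don't start at zero. A sketch, assuming the value is in r0, r1 is free as scratch, and the range wanted is 10 to 20 inclusive (all arbitrary choices):

	SUB	r1, r0, #10		; values below 10 wrap round to huge unsigned numbers
	CMP	r1, #10			; 20 - 10
	LDRLS	r0, [r12, #8]		; in range
	MVNHI	r0, #0			; out of range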

Thanks for reading. I appreciate feedback, and if you have any tricks you'd like to share with the ARM programming community at large then please send them to me.

You probably want to look at these pages for further tips: Robin Watts' ARM programming page

I'd like to thank Phil Norman and Peter Burwood for their help.

Matthew Wilcox