    STMFD   r13!, {r14}
    BL      func1
    LDMVSFD r13!, {pc}
    ADD     r0, r1, r2
    BL      func2
    LDMVSFD r13!, {pc}
    SUB     r1, r2, r0
    BL      func3
    LDMVSFD r13!, {pc}
    LDMFD   r13!, {pc}^

can be rewritten as:
    STMFD   r13!, {r14}
    BL      func1
    ADDVC   r0, r1, r2
    BLVC    func2
    SUBVC   r1, r2, r0
    BLVC    func3
    LDMVCFD r13!, {pc}^
    LDMFD   r13!, {pc}

A saving of 8 bytes in space, and 3 instructions in execution time for the case when no errors occur - which is hopefully the more common case! I also find it more readable, and it's easier to trace the flow of execution. Naturally, if you have to do comparisons then this is not possible, and you'll have to either exit early or branch over the conditional segment of code. It can be more efficient to do this if you have a long procedure exit sequence, but normally exiting is merely a matter of unstacking the registers and writing the appropriate value to the program counter, which is one instruction.
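For the case where a comparison gets in the way, here is a minimal sketch of the exit-early approach (func1, func2 and the labels are invented for illustration, and the error-in-V convention above is assumed):

    STMFD r13!, {r14}
    BL    func1
    BVS   exit              ; error from func1: give up straight away
    CMP   r0, #0            ; this comparison corrupts the flags, so
    BEQ   skip              ; conditionalising everything on VC no longer works
    BL    func2
    BVS   exit
.skip
    ADD   r0, r0, #1
.exit
    LDMFD r13!, {pc}

On the success path V is clear by the time .exit is reached, so the caller still sees the usual error convention.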
    Unsigned    Signed
    HS (CS)     GE
    HI          GT
    LS          LE
    LO (CC)     LT

PL and MI directly test bit 31 of the result (ie the N flag is a copy of bit 31). They are subtly different to GE and LT, and I'd like someone to give me a case where they should be used instead of GE or LT.
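To make the subtlety concrete (a sketch, with an invented label): the logical operations - TST, TEQ, MOVS, ANDS and friends - update N and Z but leave V untouched, so after one of them GE and LT test N against whatever stale value V happens to hold, whereas MI and PL still test bit 31 of the result:

    MOVS  r0, r1, LSL #4     ; logical operation: sets N and Z, never V
    BMI   negative           ; MI tests bit 31 of the shifted result
    ...                      ; (BLT here would compare N with a leftover V flag)
.negative
    ...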
    STMFD r13!, {r14}
    MUL   r0, r1, r2
    LDMFD r13!, {pc}

    STMFD r13!, {r1, r2}
    LDR   r1, [r12, #8]
    LDR   r2, [r12, #12]
    MUL   r0, r1, r2
    LDMFD r13!, {r1, r2}
    MOVS  pc, r14

Better would be:
    MUL   r0, r1, r2
    MOVS  pc, r14

    STMFD r13!, {r14}
    LDR   r14, [r12, #8]
    LDR   r0, [r12, #12]
    MUL   r0, r14, r0
    LDMFD r13!, {pc}

In the second function, restoring registers is combined with returning from the procedure call. Since r14 must be corrupted by the function call anyway, the caller is not expecting it to hold any particular value at the end of the function. This has been exploited in at least one program I know of to return function values in r14, but you really need to be aware of exactly what you're doing with that technique, and I've never dared try it myself.
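As a sketch of how that trick works (func, helper and the value 42 are invented; the caller has to be written with this private convention in mind):

    ; callee
.func
    STMFD r13!, {r14}
    BL    helper            ; r14 is free to corrupt once it has been stacked
    MOV   r14, #42          ; secondary result smuggled out in r14
    LDMFD r13!, {pc}        ; the return address comes from the stack

    ; caller
    BL    func
    MOV   r1, r14           ; collect the extra result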
A further optimisation is to use a single store rather than a multiple store if you're only storing r14. The above code then becomes:
    STR r14, [r13, #-4]!
    LDR r14, [r12, #8]
    LDR r0, [r12, #12]
    MUL r0, r14, r0
    LDR pc, [r13], #4

Beware that you can't tell LDR to restore the flags when it loads the program counter, as LDM with ^ does. This is not an issue if you are using the 32-bit APCS-3, but most routines on 26-bit ARMs preserve the flags over a function call.
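So the two return idioms are not interchangeable - a side-by-side sketch:

    LDR   pc, [r13], #4      ; fine when the caller doesn't care about the flags
    LDMFD r13!, {pc}^        ; 26-bit code: also restores the flags saved in r14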
It can sometimes be a win to only stack registers conditionally. Here's an example:
    LDR   r2, [r12, #0]
    TEQ   r0, r2
    MOVEQ pc, r14
    STMFD r13!, {r14}
    BL    func
    SUB   r0, r0, #1
    LDMFD r13!, {pc}

This can be more efficient if r0 is often equal to r2; for example, if this were an error check.
Another way in which the stack can be abused is to store data _below_ the stack pointer. This may appear to work, but you will get random crashes. Interrupt code is entitled to use space on the stack temporarily, and it assumes that everything below the stack pointer is free; anything you have stored there can be overwritten at any moment.
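A sketch of the difference (the surrounding code is immaterial):

    STR r0, [r13, #-4]       ; r0 now lives below the stack pointer...
    ...
    LDR r0, [r13, #-4]       ; ...and may have been trampled by an interrupt

    STR r0, [r13, #-4]!      ; safe: the stack pointer is moved first,
    ...
    LDR r0, [r13], #4        ; so the word is protected until it is popped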
    ADR rN, (offset AND &FF)
    ADD rN, rN, #((offset - P%) AND &FF00)

with slight modifications to cope with a negative offset, and the cunning ones actually try various possibilities such as
    ADR rN, (offset AND &3FC)
    ADD rN, rN, #((offset - P%) AND &3FC00)

to gain them more distance.
Nevertheless, these pseudo-instructions take 2 instructions, and ADRX (for addresses which can't be reached even with ADRL) takes 3. It is better to avoid them if possible. Try moving data around so that it's nearer the functions which reference it; if that's not possible, try moving functions around.
If you have data which is used throughout the program, consider dedicating a register to point to it for the whole of the program's execution. In RISC OS modules, Acorn helps with this by pointing r12 at a block of private workspace.
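For instance (a sketch, assuming r12 already points at the workspace and that its first word holds a counter - the layout is invented):

    LDR r0, [r12, #0]        ; counter kept in the first word of workspace
    ADD r0, r0, #1
    STR r0, [r12, #0]

No ADRL, no literal pool, and the same three instructions work from anywhere in the program.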
Under no circumstances consider doing the following:
    BL   getvar
    ...

.getvar
    LDR  r0, var
    MOVS pc, r14
.var
    EQUD 0

since this involves two pipeline flushes and wrecks most caching strategies, so it is guaranteed to be slower. It also goes against the principle of keeping data and code separate, which is more important on StrongARM processors with their split instruction/data caches.
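The straightforward alternative is simply to load the variable where it is needed - a sketch (ideally var would live in a separate data area, addressed through a register such as r12, rather than sitting among the code):

    LDR  r0, var             ; one PC-relative load: no call, no flushes
    ...
    MOV  pc, r14
.var
    EQUD 0                   ; within 4095 bytes of the LDR, out of the execution path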
    LDR r0, [r6, #24]
    BIC r0, r0, #&ff000000
    BIC r0, r0, #&00ff0000
    MOV r1, #&53
    ORR r1, r1, #&ef00
    TEQ r0, r1

I replaced this with:
    LDR r0, [r6, #24]
    MOV r0, r0, LSL #16
    EOR r0, r0, #&ef000000
    TEQ r0, #&00530000

Shifting r0 left by 16 discards the previous top 16 bits and fills the bottom 16 bits with zeroes, which is the desired effect. The crucial step is noticing that TEQ is the same as EORS without a destination register: if the top byte of the shifted value is not &ef, the EOR leaves it non-zero and the final TEQ can never give EQ. This trick saves 2 instructions and one register.
A similar technique can apply to other situations; for example, checking that a value is (unsigned) less than a 16-bit constant &xyza can be done by

    SUBS  r0, r0, #&xy00
    CMPHS r0, #&00za

which, as far as the unsigned conditions and EQ/NE are concerned, is the same as

    MOV r1, #&xy00
    ORR r1, r1, #&00za
    CMP r0, r1

but takes one instruction fewer and uses one register fewer, at the cost of corrupting r0.
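If r0 must be preserved, the subtraction can go into a scratch register instead - a sketch (still one instruction shorter than the MOV/ORR/CMP version, though it now ties up r1 in the same way):

    SUBS  r1, r0, #&xy00     ; r0 untouched, r1 is scratch
    CMPHS r1, #&00za         ; unsigned conditions as for CMP r0 against &xyza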
Sometimes it is utterly unavoidable and you have to put a large constant into a register. It is thought that there are no constants which cannot be produced in 3 instructions, though no-one has an algorithm for producing arbitrary constants (to the best of my knowledge). On a StrongARM it is normally quicker to load the constant with a single LDR than to use 3 instructions to synthesise it; on an ARM6 it is quicker to use the 3 instructions. Take your pick.
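Both approaches, sketched with an example constant (&00123456, chosen purely for illustration):

    ; synthesised in three data-processing instructions
    MOV r0, #&00120000
    ORR r0, r0, #&00003400
    ORR r0, r0, #&00000056

    ; loaded from a nearby literal - one instruction plus a word of data
    LDR r0, value
    ...
.value
    EQUD &00123456           ; within 4095 bytes of the LDR, out of the execution path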
    MOV  r0, #200
    MOV  r4, #8
.loop
    MUL  r1, r0, r4
    ...
    SUBS r0, r0, #1
    BNE  loop

becomes
    MOV  r1, #1600
.loop
    ...
    SUBS r1, r1, #8
    BNE  loop

This saves a MUL instruction in the loop, which is a slow operation on many variants of the ARM. In this case, it also saves 2 registers, though in practice this is not often achieved.
Counting up instead would need a separate comparison:

    MOV r1, #0
.loop
    ...
    ADD r1, r1, #8
    CMP r1, #1600
    BNE loop

The SUBS used above combines the test-for-end with the loop-count, saving an instruction.
Unrolling the loop once gives:

    MOV  r1, #1600
.loop
    ...
    SUB  r1, r1, #8
    ...
    SUBS r1, r1, #8
    BNE  loop

In this example, since I haven't specified what is in `...', it's not possible to combine the first SUB in with it, which might be possible in real code. This optimisation saves half an instruction (and half a pipeline flush from the taken branch) per iteration - about 100 instructions and 100 branches over the whole loop. Against that, it takes more cache lines, and more RAM in general. The only way to find out if this is a win or not is to benchmark the code in question.
I'll leave you with one more fictional example:
    STMFD r13!, {r14}
    CMP   r0, #5
    BGT   not
    CMP   r0, #0
    BLT   not
    LDR   r0, [r12, #8]
    B     over
.not
    MOV   r0, #0
    SUB   r0, r0, #1
.over
    LDMFD r13!, {pc}

Can be optimised to:
    CMP   r0, #5
    LDRLS r0, [r12, #8]
    MVNHI r0, #0
    MOV   pc, r14

This example demonstrates several points: a single unsigned comparison (LS/HI) performs the whole 0-to-5 range check, since a negative r0 looks like a very large unsigned number; conditional execution removes all the branches; MVN loads -1 in one instruction; and as the function no longer calls anything, there is no need to stack r14 at all.
You probably want to look at these pages for further tips: Robin Watts' ARM programming page
I'd like to thank Phil Norman and Peter Burwood for their help.