Cache programming
=====================
                                    Erik H. Bakke/Bakke SoftDev

How to program the cache on MC680x0 processors.

This text covers the 68020 processor, and contains details on the dual
caches in the 68030 processor. The 68040 caches are not described in
this text, but in general they can be viewed as an extension of the
68030 caches, only larger and with added functionality.

General:
==================
The cache registers are accessed as CPU control registers with the MOVEC
instruction. MOVEC is a supervisor (privileged) instruction that first
appeared on the 68010, where it was used to access the VBR, SFC, DFC,
and USP registers. On the 68020, the control registers MSP and ISP and
the cache registers CAAR and CACR were added.

Unused bits in CACR always read as zero, and MUST AT ALL TIMES be
written as ZEROS. Many bits that are unused on the 68020 are used on the
68030/40/60 processors. If you follow this guideline, you won't be in
for big surprises on different processors.

CACR register:
==================

68020:
------
The 68020 has an instruction cache of 256 bytes, organized as 64
longword entries.
The CACR (CAche Control Register) is a 32-bit register which on the
68020 processor looks as follows:

31...............3..2..1..0.
============================
...............| C|CE| F| E|
============================

Bit 0: E = Enable Cache
       This bit enables caching on the 68020.
       If this bit is cleared, the processor will use external memory
       for all instruction stream fetches.
       When the processor is reset, this bit is cleared and the cache
       is disabled. Most operating systems set this bit as a part of
       their initialization routine.

Bit 1: F = Freeze Cache
       This bit effectively locks the cache in its current state.
       If the bit is cleared, the cache operates in normal mode, loading
       data into the instruction cache whenever a cache miss occurs.
       When this bit is set, the cache is checked for hits as usual, but
       a miss will not load any new data into the cache.
       With intelligent use of this bit, you can keep a specific (short)
       routine in the cache even if the routine calls other functions.

Bit 2: CE = Clear Entry
       This is a write-only bit. It always reads as zero.
       Setting this bit invalidates one specific cache entry: the entry
       whose index is given in CAAR bits 2-7. Use MOVEC to initialize
       CAAR first.

Bit 3: C = Clear Cache
       This is a write-only bit. It always reads as zero.
       Setting this bit clears the whole cache. All subsequent cache
       tests will result in a miss, and new data will be loaded into
       the cache.

Example: How to clear entry no. 27 in the cache.

        move.l  #%1101100,d0    ;27<<2, the index goes in CAAR bits 2-7
        movec   d0,CAAR
        movec   CACR,d0         ;Keep the current E and F settings
        or.l    #%100,d0        ;Set CE to clear entry
        movec   d0,CACR

68030:
------
On the 68030 the cache functions are extended. The 68030 has dual
caches, one for instructions and one for data. The data cache is what
is called a write-through cache. This means that if a cache hit is
detected on a data write to memory, both the memory and the data in the
cache are updated. The instruction cache operates as on the 68020.

Cache loading is optimized on the 68030, and can be done in two ways:

1... Burst-fill: The cache is loaded with 4 contiguous longwords in
                 one operation.
2... Standard:   The cache is loaded one longword at a time, just as
                 with the 68020.

The caches are still 256 bytes each, but each one is organized as 16
lines of 4 longwords (16 bytes per line).

The bits for the instruction cache are identical to the 68020 bits, only
suffixed with an I to identify them as instruction cache bits. So, the
bits are EI, FI, CEI, and CI (see the paragraph about the 68020 for a
description).
In addition, each cache has a Burst-Enable (BE) bit.

BE : This bit controls the burst-fill mode of the cache. When set, the
     cache is filled 4 longwords at a time (one cache line).
     There is one such bit for each cache: IBE for the instruction
     cache, and DBE for the new data cache.

The final new bit is the write allocate (WA) bit. This bit controls the
operation of the caches upon a write.
If WA is clear and a cache miss occurs on a write to memory, the write
does not update the data in the cache. If WA is set, the cache is always
updated on a write to memory, regardless of whether the write causes a
cache hit or miss.
The WA bit really only applies to the data cache, as the processor
cannot write instructions, only data (which may later be read as
instructions).

The 68030 CACR register is laid out as follows:

31......14..13..12..11..10...9...8...7...6...5...4...3...2...1...0.
===================================================================
..........|.WA|DBE|.CD|CED|.FD|.ED|...........|IBE|.CI|CEI|.FI|.EI|
===================================================================
            |  =========*=========             =========*=========
            |           |                               |
   The write allocate   |            The control bits for instruction cache
                        |
         The control bits for data cache

WA  = Write Allocate
DBE = Data Cache Burst Enable
CD  = Clear Data Cache
CED = Clear Entry in Data Cache
FD  = Freeze Data Cache
ED  = Enable Data Cache
IBE = Instruction Cache Burst Enable
CI  = Clear Instruction Cache
CEI = Clear Entry in Instruction Cache
FI  = Freeze Instruction Cache
EI  = Enable Instruction Cache

How to use the cache effectively
================================
On a 68020, where instructions are loaded longword by longword into the
cache as the program is executed, it is rather easy to utilize the cache
to its fullest. Just remember to freeze the cache before you branch to a
subroutine outside of a rather busy loop. If the cache is not frozen,
the subroutine will be loaded into the cache, and your loop will have to
be reloaded every time the subroutine returns. A frozen cache will lead
to a slight slowdown in the subroutine, but there is always a tradeoff
when optimizing for speed.

On a 68030, there are two caches. Both of these are able to burst-read
information from memory. Such a burst read is done 4 longwords at a
time. This is a fact that can be used to optimize your code.
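
Burst reads only take place when the burst enable bits are set. As a
rough sketch, both 68030 caches could be cleared and switched on with
burst fill as shown here. It assumes supervisor mode and memory hardware
that actually answers burst requests. The CACR_xx names are hypothetical
labels for the bits in the 68030 CACR layout, not official equates.

```asm
; Hypothetical equates, taken from the 68030 CACR layout
CACR_EI   equ   $0001           ;bit 0:  Enable Instruction Cache
CACR_CI   equ   $0008           ;bit 3:  Clear Instruction Cache
CACR_IBE  equ   $0010           ;bit 4:  Instruction Cache Burst Enable
CACR_ED   equ   $0100           ;bit 8:  Enable Data Cache
CACR_CD   equ   $0800           ;bit 11: Clear Data Cache
CACR_DBE  equ   $1000           ;bit 12: Data Cache Burst Enable

; 68030 only, supervisor mode. Clears both caches and enables them
; with burst fill. All other bits (WA, freeze, unused) are written as 0.
        move.l  #CACR_EI+CACR_IBE+CACR_ED+CACR_DBE+CACR_CI+CACR_CD,d0
        movec   d0,CACR
```

WA is left clear in this sketch. Do not use this value on a 68020, where
every bit above bit 3 must be written as zero.
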
The clue to performance here is to maximize the hit/miss ratio of the
instruction cache. This means that the processor must find as many
instructions as possible in the cache. This leads to a problem with
branches, as the PC then points to another memory location and the cache
probably has to be reloaded from there. This causes more memory accesses
than are really necessary. (Memory access tends to slow down a
processor.) So, align your branches at the end of each 4-longword
segment.

Optimizing for the data cache:
The same kind of optimizations as for the 68020 instruction cache apply
to this cache if the bursting mode is disabled. Freezing the cache so
that it contains your heavily accessed memory structure increases
performance in your loop.
If bursting is enabled, remember that the processor reads 4 longwords of
data in each burst. Memory accesses slow the processor down, so you need
to minimize these. This is achieved if your code reads memory
contiguously.

        move.l  (a0),d0
        move.l  (4,a0),d1
        move.l  (8,a0),d2
        move.l  (12,a0),d3
        move.l  (16,a0),d4

can be much faster than

        move.l  (a0),d0
        move.l  (20,a0),d1
        move.l  (8,a0),d2
        move.l  (32,a0),d3
        move.l  (4,a0),d4

In the worst case, the first example needs only two bursts (10 cycles),
but the second one could need as many as 5 bursts (25 cycles).
Usually, both code fragments will only need 2 bursts, but always
optimize even for a worst case scenario.

Warnings on cache usage
=======================
When using caches, it is important to clear them when the task context
switches, or when virtual memory is used.
Just imagine: Two processes share the same processor. They run from
different physical addresses, but have the same logical address space.
When the operating system switches from one task to another, the caches
still contain data from the old process. This can lead to erroneous
processing if the caches are not flushed by the operating system.

...
...

E.H. Bakke
Bakke SoftDev 1994
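
The flush described under "Warnings on cache usage" could look roughly
like this on a 68030. This is only a sketch: it assumes supervisor mode,
and FlushCaches is a made-up label, not a routine from any particular
operating system.

```asm
; 68030 only. On a 68020 only bit 3 (C) exists, and bit 11 must be
; written as zero. Destroys d0.
FlushCaches:
        movec   CACR,d0         ;Current settings (CI and CD read as 0)
        or.l    #$0808,d0       ;Set CI (bit 3) and CD (bit 11)
        movec   d0,CACR         ;Both caches are now invalidated
        rts
```

Reading CACR first keeps the enable, freeze, burst, and WA settings as
they were. Because the data cache is write-through, clearing it does not
lose any data.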