          Cache programming
        =====================
     Erik H. Bakke/Bakke SoftDev


How to program the cache on MC680x0 processors.
This text covers the 68020 processor, and contains
details on the dual caches in the 68030 processor.
The 68040 caches are not described in this text, but
in general they can be viewed as an extension of the
68030 caches, only larger and with added
functionality.

General:
==================

The cache registers are accessed as cpu control registers with
the MOVEC instruction.
The MOVEC instruction is a supervisor instruction that first appeared
on the 68010, where it was used to access the VBR, SFC, DFC, and USP
registers.  On the 68020 the control registers for MSP, ISP, and
the cache registers CAAR and CACR were added.
Unused bits in CACR are always read as zero, and MUST AT ALL TIMES
be written as ZEROS.  Many of the bits that are unused on the 68020
are used in the 68030/40/60 processors.  If you follow this guideline,
you won't be in for big surprises on different processors.
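
A minimal sketch of this guideline, assuming 68020 bit positions and
code that already runs in supervisor mode.  CACR is read, only the
wanted bit is changed, and the result is written back, so the unused
bits are written as the zeros that were just read:

        movec  CACR,d0        ;Unused bits read as zero
        bset   #0,d0          ;Change only the bit you need (here E)
        movec  d0,CACR        ;Unused bits are written back as zero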


CACR register:
==================

68020:
------

The 68020 has an instruction cache of 256 bytes, organized as 64
longword entries.
The CACR (CAche Control Register) is a 32-bit register which on the
68020 processor looks as follows:

31...............3..2..1..0.
============================
...............| C|CE| F| E|
============================

Bit 0:  E = Enable Cache
        This bit enables caching on the 68020.
        If this bit is cleared, the processor will use external
        memory for all instruction stream fetches.
        When the processor is reset, this bit is cleared, and the
        cache is disabled.  Most operating systems set this bit as
        part of their initialization routine.

Bit 1:  F = Freeze Cache
        This bit effectively locks the cache in its current state.
        If the bit is cleared, the cache operates in normal mode,
        loading data into the instruction cache whenever a cache
        miss occurs.  When this bit is set, the cache is checked
        for hits as usual, but a miss will not load any new data
        into the cache.  With intelligent use of this bit, you can
        keep a specific (short) routine in the cache even if the
        routine calls other functions.

Bit 2:  CE= Clear entry
        This is a write-only bit.  It always reads as zero.
        By setting this bit a specific cache entry can be made
        invalid.  When the bit is set, the cache entry with the index
        specified in CAAR bits 2-7 is invalidated.  Use MOVEC to
        initialize CAAR before setting CE.

Bit 3:  C = Clear cache
        This is a write-only bit.  It always reads as zero.
        By setting this bit the whole cache is cleared.  Subsequent
        instruction fetches will miss until new data has been
        loaded into the cache.
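
A sketch of a typical initialization, assuming supervisor mode and
68020 bit positions.  Setting C and E in the same write empties the
cache and enables it, so no stale entries can be hit.  Note that
this write also clears F:

        moveq  #%1001,d0      ;C (bit 3) + E (bit 0)
        movec  d0,CACR        ;Clear the cache and enable it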


Example:

How to clear entry no. 27 in the cache.

        move.l #%1101100,d0   ;Index 27 in CAAR bits 2-7 (27<<2)
        movec  d0,CAAR
        movec  CACR,d0        ;Read CACR so that E and F are kept
        bset   #2,d0          ;Set CE to clear the entry
        movec  d0,CACR


68030:
------

On the 68030 the cache functions are extended.
The 68030 has dual caches, one for instruction and one for
data.  The data cache is what is called a write-through
cache.  This means that if a cache hit is detected on a
data write to memory, both the memory and the data in the
cache are updated.  The instruction cache operates as on the
68020.
Cache loading is optimized on the 68030, and can be done in
two ways:
1... Burst-fill:  The cache is loaded with 4 contiguous
                  long words in one operation.
2... Standard:    The cache is loaded one longword at a time,
                  just as with the 68020.

The caches are still 256 bytes each, but are organized as 16 lines
of 4 longwords (16 bytes) each.

The bits for the instruction cache are identical to the 68020
bits, only suffixed with an I to identify them as instruction
cache bits.  So, the bits are EI, FI, CEI, and CI (see the
paragraph about the 68020 for a description).
In addition, each cache has a Burst-Enable (BE) bit.

BE  :   This bit controls the burst-fill mode of the cache.
        When set, the cache is filled 4 longwords at a time
        (One cache line)

There is one such bit for each cache: IBE for the instruction cache,
and DBE for the new data cache.

The final new bit is the write allocate (WA) bit.  This bit controls
the operation of the data cache upon a write.
If WA is clear and a cache miss occurs on a write to memory cycle,
the write does not update the data in the cache.
If WA is set, the cache is always updated on a write to memory
cycle, regardless of whether the write causes a cache hit or miss.
The WA bit only applies to the data cache, as the processor
cannot write instructions, only data (which may later be read as
instructions).

The 68030 CACR register is laid out as follows:

31......14..13..12..11..10...9...8...7...6...5...4...3...2...1...0.
===================================================================
..........|.WA|DBE|.CD|CED|.FD|.ED|...........|IBE|.CI|CEI|.FI|.EI|
===================================================================
            |  =========*=========             =========*=========
            |           |                               |
   The write allocate   |    The control bits for instruction cache
                        |
          The control bits for data cache

WA  = Write Allocate
DBE = Data Cache Burst Enable
CD  = Clear Data Cache
CED = Clear Entry in Data Cache
FD  = Freeze Data Cache
ED  = Enable Data Cache
IBE = Instruction Cache Burst Enable
CI  = Clear Instruction Cache
CEI = Clear Entry in Instruction Cache
FI  = Freeze Instruction Cache
EI  = Enable Instruction Cache
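
The layout above can be written down as assembler constants.  This is
only a sketch: the symbol names are made up for this text, the EQU
syntax is assumed to be accepted by your assembler, and the code must
run in supervisor mode.

CACR_EI  equ    $0001         ;Bit 0
CACR_FI  equ    $0002         ;Bit 1
CACR_CEI equ    $0004         ;Bit 2
CACR_CI  equ    $0008         ;Bit 3
CACR_IBE equ    $0010         ;Bit 4
CACR_ED  equ    $0100         ;Bit 8
CACR_FD  equ    $0200         ;Bit 9
CACR_CED equ    $0400         ;Bit 10
CACR_CD  equ    $0800         ;Bit 11
CACR_DBE equ    $1000         ;Bit 12
CACR_WA  equ    $2000         ;Bit 13

;Both caches enabled, with burst and write allocate = $3111
CACR_ON  equ    CACR_WA+CACR_DBE+CACR_ED+CACR_IBE+CACR_EI

        move.l #CACR_ON+CACR_CD+CACR_CI,d0   ;$3919
        movec  d0,CACR        ;Clear and enable both caches

Setting the burst bits only allows bursting.  A burst fill takes
place only if the memory system supports it.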


How to use the cache effectively
================================

On a 68020, where instructions are loaded longword by longword into the
cache as the program is executed, it is rather easy to utilize
the cache to its fullest.  Just remember to freeze the cache
before you branch to a subroutine outside of a rather busy loop.
If the cache is not frozen, the subroutine will be loaded into the
cache, and your loop will have to be reloaded every time the
subroutine returns.  Frozen caches will lead to a slight slowdown
in the subroutine, but there is always a tradeoff when optimizing for
speed.
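
A sketch of the freeze technique, assuming supervisor mode and 68020
bit positions.  The labels Loop and Helper and the loop counter in d7
are placeholders, and d0 is used as a scratch register:

Loop:
        ;(loop body, loaded into the cache as it runs)
        movec  CACR,d0
        bset   #1,d0          ;Set F: nothing new is loaded
        movec  d0,CACR
        jsr    Helper         ;Runs without replacing the loop
        movec  CACR,d0
        bclr   #1,d0          ;Clear F: normal loading again
        movec  d0,CACR
        dbra   d7,Loop

The MOVEC instructions take time themselves, so measure whether the
freeze pays off for your loop.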

On a 68030, there are two caches.  Both of these are able to
burst-read information from memory.  Such a burst read is done 4
longwords at a time.  This is a fact that can be used to optimize your
code.  The key to performance here is to maximize the hit/miss ratio
of the instruction cache.  This means that the processor must find as
many instructions as possible in the cache.  Branches are a problem
here, as the PC then points to another memory location and the cache
probably has to be reloaded from there.  This causes more memory
accesses than are really necessary, and memory accesses slow the
processor down.  To limit this, place your branches at the end of a
4-longword segment, so that everything loaded by a burst is used
before the branch leaves it.

Optimizing for the data cache:
The same kind of optimization as for the 68020 instruction cache
applies to this cache if the bursting mode is disabled.  Freezing the
cache while it contains your heavily accessed memory structure
increases performance in your loop.
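
A sketch of such a freeze, assuming a 68030 in supervisor mode with
the data cache enabled.  Table is a hypothetical 64-byte structure,
and the local label syntax (.load) is assumed to be supported by your
assembler:

        lea    Table,a0
        moveq  #16-1,d1       ;16 longwords = 64 bytes
.load   tst.l  (a0)+          ;Reading a longword loads it
        dbra   d1,.load
        movec  CACR,d0
        bset   #9,d0          ;Set FD: the table stays cached
        movec  d0,CACR

Clear FD again (bclr #9,d0) when the loop is done, or no other data
will be loaded into the cache.
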
If bursting is enabled, remember that the processor reads 4 longwords
of data in each burst.  Memory accesses slow the processor down, so
you need to minimize them.  This is achieved if your code accesses
memory contiguously.

   move.l   (a0),d0
   move.l   (4,a0),d1
   move.l   (8,a0),d2
   move.l   (12,a0),d3
   move.l   (16,a0),d4

can be much faster than

   move.l   (a0),d0
   move.l   (20,a0),d1
   move.l   (8,a0),d2
   move.l   (32,a0),d3
   move.l   (4,a0),d4


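Assuming a0 is longword aligned and none of the data is in the cache,
the first example touches two cache lines and needs only two bursts
(10 cycles at best), while the second one touches three lines and
needs three bursts (15 cycles at best).  The difference grows when the
accesses are spread over more lines, so always optimize even for a
worst case scenario.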


Warnings on cache usage
=======================

When using caches, it is important to clear them when the task context
switches or when virtual memory is used.  Just imagine:
Two processes share the same processor; they run from different physical
addresses, but have the same logical address space.  When the operating
system switches from one task to another, the caches still contain data
from the old process.  This can lead to erroneous processing if the
caches are not flushed by the operating system.
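
A sketch of what the operating system can do at the task switch,
assuming a 68030 in supervisor mode.  On a 68020 only the C bit
(bit 3) exists:

        movec  CACR,d0
        or.w   #$0808,d0      ;CD (bit 11) + CI (bit 3)
        movec  d0,CACR        ;Both caches are now empty

Because the data cache is write-through, memory is always up to date
and nothing is lost when the cache is cleared.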


...
...

E.H. Bakke
Bakke SoftDev
1994