# Chapter **Cache**





#### More and faster

Ideally one would desire an infinitely large memory capacity such that any particular word ... would be immediately available .... We are .... forced to recognise the possibility of constructing a hierarchy of memories, each of which has a greater capacity then the preceding but which is less accessible.

#### A.W. Burks, H.H. Goldstine A.Von Neumann

Preliminary Discussion of the logical design of an Electronic Computing machine

Baby 1948: Limited by memory access time. Apple II (1977) CPU: 1000 ns; DRAM: 400 ns







#### Accessing

#### Fast access

*Aim computer architect:* To provide sufficient memory at an economic price.

Fast memory tends to be expensive and volatile. Static RAM (SRAM)

0.5ns – 2.5ns, \$2000 – \$5000 per GB Dynamic RAM (DRAM)

50ns – 70ns, \$20 – \$75 per GB Magnetic disk

5ms – 20ms, \$0.20 – \$2 per GB Magnetic Tape:

> Needs loading – sequential read. Seconds – minutes

Ideal memory: cheap, fast, reliable, permanent.

*Hierarchy :*Lots of cheap storage – decreasing amounts of more expensive memory. Move from cheap to more expensive as required.

Word line

Pass transistor

Bit line

Capacitor

#### Implementation

Implementation details of RAM, help to explain why we use both DRAM and SRAM

DRAM is very simple – 1 transistor and 1 capacitor per bit.

Charged capacitor equals 1, uncharged 0.

Capacitor charge slowly leaks away.

Need to refresh (c.f. Baby). Read contents and write back.

Refresh done on chip.

Multi-megabytes time problem, structure to allow to refresh a row with a single read/write. Refresh overhead small.

Memory access is stored in a square array. Access is done by specifying row number and column number. Chip takes row and column through input chips. *(Pin count)* 

Access time is in the 10s of nanosecond. It stays in step with SRAM performance but a factor 10 down.

*ECC – Error correcting codes.* On chip correction

Simple structure – cheaper to make per cell. Much smaller per cell. Used for cache

Readout on rising and falling clock edge. DDR Also how many bits per cycle. Data path width DDR3 is 64bit

#### SRAM

#### Implementation

SRAM access times are 1-5ns depending on the chip memory size.



SRAM/DRAM cache, lots of development

Sufficient address lines to address all memory. Memory smaller and used for main memory – so speed is vital. Needs ~6 transistors per cell and so Drive the clock too fast is more expensive and takes up more and memory becomes space than DRAM & more power unreliable Why not run just SRAM? At almost any price point a combination of DRAM and SRAM will give better performance. Low power SRAM typically same speed as DRAM Again minimum times are given by minimum setup times, write enable is a not edge triggered. Pulse with minimum width for stable operation. Cache Output via shared line

#### SRAM

#### Tri-state buffer



| Errors                           | Error Correcting codes ECC                                                                                                                        |                                                               |  |
|----------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------|--|
|                                  | Memory bits can end up in the wrong state.                                                                                                        |                                                               |  |
|                                  | Cosmic rays can cause a bit to flip.<br>Higher memory density $\rightarrow$ smaller cells $\rightarrow$ less charge to flip a bit. More errors.   |                                                               |  |
| Number of 1's is even<br>or odd. | ECC memory has extra information to detect bad<br>data.<br>Simplest is a parity bit. 1 per 8 bits. Makes the<br>parity sum odd (or even).         | Error detecting                                               |  |
|                                  | Any single bit error can be detected (but not corrected). Flipping two bits produces a valid word.                                                |                                                               |  |
|                                  | Error correcting codes are more complex.<br>They add bits and define only certain bit patterns<br>as being valid                                  | Language usually<br>allows us to detect and                   |  |
|                                  | In particular single errors produce codes which<br>are only 1 flip from the correct sequence, but two<br>flips from any other valid sequence.     | sometimes recover from<br>single character errors<br>Computer |  |
|                                  | Most common are Hamming codes                                                                                                                     | computer<br>commuter                                          |  |
|                                  | Memory controllers can implement forward error<br>correction, where the bad data is corrected<br>before the data is transferred without referring |                                                               |  |
|                                  | back to memory                                                                                                                                    | Cache                                                         |  |





#### **Overview**

Fetch from main memory is slow. Fast memory is expensive.

Solution a hierarchy which has large amounts of cheap memory and closer to the processor, one or two stages of progressively faster memory the cache

But data still has to flow all the way down from main memory to the CPU. How does this save time?

Spatial locality Temporal locality

How does the CPU/MMU know where to look in the cache for addresses which locate the data/instruction in main memory?

#### Locality

#### Locality

Move from cheap to more expensive as required ... if data is accessed only once or completely at random this would have little benefit.

In general programs memory access is rather highly constrained by the (observed) principle of locality.

#### **Temporal locality**

Items tend to be accessed repeatedly During execution of a loop

#### **Spatial locality**

If the next item is not the same as the last it is likely to be nearby.

Code tends to be executed sequentially.

Data items in arrays are arranged sequentially.

## Memory accesses as a function of time

Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)



Memory Address (one dot per access)



```
Cache
                    Cache performance
                    When the CPU wants something already in cache
                    that is a hit. If it is not in cache that is a miss.
                    Cache miss rate is a measure of how well the
                    system is doing.
                    Miss rate (per memory access) =
                                                          Misses
                                                                         A Number of possible
                                                      Hits + Misses
                                                                         measures
                    May also refer to
                    Misses per instruction = Miss Rate x _Accesses_
                                                          Instruction
                    But how bad is a miss?
                    <memory access time>
                             = (Hit time) x(hit rate) + (Miss rate)*(Miss
                    Penalty)
                    Where Miss penalty is the time to retrieve the
                    item from further up the tree.
                    <memory access time> is still not as good a
                    measure as execution time, but is useful for
                                                                            Cache
                    feature comparisons.
```

#### Misses

#### **Cache Actions**

On cache hit, CPU proceeds normally

On cache miss Stall the CPU pipeline Fetch block from next level of hierarchy

Instruction cache miss Restart instruction fetch

Data cache miss Complete data access

Cache can be multi-level. Up to three cache levels are available on some modern systems. Transfers between adjacent levels. Consider only two levels at a time.

Miss means that data is copied from the next level down into this level of cache.

Levels normally expense and speed of technology. At constant "technology speed" larger caches are slower. May split just for speed. A cache miss at one level down can only be triggered by the level above cache or CPU.

Cache

| Cache Effect                                                                                                                                             | Execution performance                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                   |
|----------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------|
|                                                                                                                                                          | A cache miss, means the CPU needs to wait a certain<br>number of cycles.                                                                                                                                                                                                                                                                                                                                                                                                                                                 |                                                                                   |
|                                                                                                                                                          | Miss Penalty = Access time + transfer time                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | Increase transfer size                                                            |
| Improve CPU<br>Misses become more<br>important. Decrease base CPI:<br>larger proportion on<br>stalls Increase Clock<br>more cycles for a<br>memory stall | <ul> <li>Execution Time <ul> <li>(Instructions + (Memory Stall Cycles))*period</li> <li>(Instructions + (Memory Stall Cycles))*period</li> <li>(Instruction)*<u>Accesses</u> x (miss rate) x (penalty)</li> <li>Instruction</li> </ul> </li> <li>Approximation – miss penalty is different for reads and writes.</li> <li>Take an average ratio of reads/writes <ul> <li>Differs between programs.</li> </ul> </li> <li>Read rate and penalty, write rate and penalty. <ul> <li>Added complexity.</li> </ul> </li> </ul> | Increase transfer size.<br>May reduce miss rate,<br>will increase miss<br>penalty |
|                                                                                                                                                          |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |                                                                                   |

#### Example

### Pentium 4

Level 1 (D-Cache): Capacity=16K, Access=4 cycles

```
Level 2 (D-Cache): Capacity=1024, Access
=18cycles
```

Main memory (D-Cache) Access = 180cycles

Level 1 is longer than one would want in a MIPS machine.

#### Caching terminology

**Block** (line): Unit of storage in the cache Memory is divided into blocks – which map into cache locations.

#### Data Referenced:

Hit: Data in cache Miss: Data not in cache – fetch it from higher level May mean replacing something in cache

#### **Design considerations:**

Placement: where to place a block in cache Replacement: what to remove (overwrite) Granularity: block size, uniformity Writes: action when value in cache is written to Instructions & Data: Uniform or split cache

| Miss types | Sources of misses                                                                                                                                                                                   |                                  |
|------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------|
|            | <i>Compulsory:</i> First time access and the block cannot be in the cache.                                                                                                                          |                                  |
|            | <i>Capacity:</i> If the programme needs more memory than is present in the cache, then some blocks will be discarded and later retrieved,                                                           |                                  |
|            | <i>Conflict:</i> If the cache organisation means that two blocks of main memory are written to the same area of cache. One might over write the other <b>even</b> while there is room in the cache. | A Number of possible<br>measures |
|            | <i>Coherency:</i> requirement to keep multiple caches consistent.                                                                                                                                   |                                  |
|            |                                                                                                                                                                                                     |                                  |
|            |                                                                                                                                                                                                     |                                  |
|            |                                                                                                                                                                                                     |                                  |
|            |                                                                                                                                                                                                     | Cache                            |



Memory locations 0 to N





Cache





#### Deconstructing the address

Index: 6 bits that means there are 64 blocks in cache.

Offset: 4 bits which mean every block is 16 bytes long (could be words)

The TAG is the rest of the rest of the address.

For a given value of index and offset – any value of the tag corresponds to a position in memory which will be stored at that location of the cache.



#### Total size for a cache

This showed blocks being 1 word – normally they are bigger – transfer extra data to utilise locality. Not too many or transfer time increases and hence miss penalty.

So each block has K words. K is a power of two for addressing. K words =  $2^2 K$  bytes =  $2^{(k+2)}$ The cache has  $2^m$  blocks – again a power of 2. Block is then  $2^{(m+k)}$  words or  $2^{(m+k+2)}$  bytes. 32 bit address space –  $2^32$  bytes –  $2^30$  words The cache is overloaded by  $2^32/2^{(m+k+2)}$ So the Tag must have  $2^{(32-m-k-2)}$  bits –which come from the high order of the address.

The byte offset inside the block is the first k bits of the address – and the intervening bits give the block address inside the cache.

Total cache size then needs to include in addition to the data/instruction word the tag bits and the valid bit



64 blocks: 4 words 16 bytes/block Overload is 2^(30-8) = 2^32

Cache

#### Cache Size

#### **Cache Optimisation**

*Spatial locality* indicates we should transfer large blocks of data.

Large cache is clearly good, **but** if size is fixed.

Large blocks mean fewer blocks: may overwrite block before it is finished with *temporal locality* May lead to a higher miss rate; more to be transferred so larger miss penalty

Cache too large and we will see degradation **Cache hit/miss** 

Hit ... access data

Miss

Need longer to fetch so stall the pipeline Fetch required data from the next level *Miss Type:* Instruction --- restart instruction fetch

Data --- Complete data access



#### Comparison

#### Improvements with associativity

Increased associativity decreases miss rate But with diminishing returns Simulation of a system with 64KB D-cache, 16-word blocks, SPEC2000 1-way: 10.3% 2-way: 8.6% Be c

4-way: 8.3%

8-way: 8.1%

#### And increased complexity



Be careful when comparing systems.

Block size improves hit rate Associativity improves hit rate Total size has the largest performance.

Comparisons must be on same size memory, not same number of blocks

Cache



time



Memory address

| Block       |  |
|-------------|--|
| replacement |  |

#### **Replacement algorithm**

Cache is full, which block to throw out?

Direct mapped No choice. Simplest hardware.

#### Fully associative:

*Random:* easiest to implement. (Can use pseudorandom replacement, so it is predictable and makes testing easier)

Most common

```
Least-recently used (LRU): exploiting temporal locality.
Need to store access time.
```

```
First in First out (FIFO) :
```

Easier to implement that LRU. Two blocks, LRU is just

LRU becomes harder as size increases.

I Two blocks, LRU is just alternate. Finish a block. Get another address, go to the other block, either the address is there or write it there.

For a large cache there is no difference. For a small cache LRU is better, but not by much. Rate ~ 11% Difference < 0.5%



| Block<br>replacement (ii) | LR | U               |    |     |  |
|---------------------------|----|-----------------|----|-----|--|
|                           | Or | rewrite numbers |    |     |  |
|                           | 3  | 010             | 4  | 010 |  |
|                           | 6  | 111             | 7  | 111 |  |
|                           | 2  | 000             | 3  | 000 |  |
|                           | 1  | 100             | 2  | 100 |  |
|                           | 7  | 110             | 1  | 110 |  |
|                           | 1  | 100 6           | 10 | 00  |  |
|                           | 4  | 010             | 5  | 010 |  |

You still need to overwrite a number of locations (increment mostly)

And now when you are throwing one out you have to search for the largest number. Miss penalty getting larger.

One might look to throw out blocks whose contents have not been changed – but that depends on the strategy for updating values in cache that are changed.

| Block<br>replacement | Replacement algorithm |  |
|----------------------|-----------------------|--|
| 1                    |                       |  |
|                      |                       |  |

| As     | soc: | 2-way |                | 8-way |       |       |       |
|--------|------|-------|----------------|-------|-------|-------|-------|
| Size   |      | LRU   | Ran            | LRU   | Ran   | LRU   | J     |
| Ran    |      |       |                |       |       |       |       |
| 16 KB  | 5.2% | 5.7%  | 4.7%           | 5.3%  | 4.4%  | 6     | 5.0%  |
| 64 KB  | 1.9% | 2.0%  | 1.5%           | 1.7%  | 1.4%  | 6     | 1.5%  |
| 256 KB | 1.   | 15%   | $1.17^{\circ}$ | %     | 1.13% | 1.13% | 1.12% |
| 1.12%  |      |       |                |       |       |       |       |

Least recently used is intellectually appealing **but** it is more complicated to implement and makes very little difference.

Increasing associativity has a small effect, cache size dominates.

64 to 256 is rather small and there will be a time penalty for the increased size

| Approximations | NMRU                                                  |
|----------------|-------------------------------------------------------|
|                | Not most recently used – just needs 1 bit             |
|                | Victim – Next Victim                                  |
|                | 2 blocks status tracked in each associativity set     |
|                | V (victim) – NV (next victim) – others are O(rdinary) |
|                | On cache hit                                          |
|                | Promote NV to V                                       |
|                | Promote an O at random to NV                          |
|                | Return V to 0                                         |
|                | On cache miss                                         |
|                | Replace V                                             |
|                | Promote NV to V                                       |
|                | Promote an O at random to NV                          |
|                |                                                       |

NMRU – means you are guaranteed not to kick out the last used block and V/NV means that you won't kick out either of the last two.

Locality means that correlations are only short range.

#### Actions on a cache miss

A dedicated controller which: Performs the memory access and fills the cache Creates a stall (cf pipeline stall), not an interrupt. Stall whole processor – easier than pipeline stall, but much more costly for performance. After a miss the instruction register does not

contain a valid instruction. So:

1.Set the PC to PC -4; points at instruction which generated the miss

2.Request read from memory and wait for completion (many cycles)

3.Write the data to cache; writing the upper portion of address (from ALU) to the tag field, set the valid bit

4. Restart execution.

Instruction is resent and this time results in a hit.

Interrupts requiree register store

| Writing | <b>cache</b><br>A ded<br>Perforr | <b>example</b><br>icated controller which:<br>ns the memory access and fills t | the cache                                                                          |
|---------|----------------------------------|--------------------------------------------------------------------------------|------------------------------------------------------------------------------------|
|         |                                  | Write-Through                                                                  | Write-Back                                                                         |
| Policy  |                                  | Data written to cache also<br>written to lower level<br>memory                 | Write data only to the<br>cache. Update lower level<br>when block falls out of the |

|                                               |      | cache |
|-----------------------------------------------|------|-------|
| Debug                                         | Easy | Hard  |
| Do read misses produce writes?                | No   | Yes   |
| Do repeated writes<br>make it to lower level? | Yes  | No    |

|  |  |  |  | Cache |  |
|--|--|--|--|-------|--|

| <b>C</b> ache writing                                           | Cache Hit<br>Write through cache<br>Options<br>Update the cache But memory inconsistent                                                                |                                      |
|-----------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------|
| Assume all other<br>instructions are single<br>cycle            | Could update memory as well.<br>But time consuming                                                                                                     |                                      |
| = $0.88*1 + 0.12 \times 100$<br>= 12.9<br>Significant slow down | Base CPI is 1, then write to memory is around<br>100 cycles 12% of instructions are stores                                                             | SpecInt92 benchmarks<br>on Intel x86 |
|                                                                 | <b>Solution</b> :<br>write buffer: holds data to be written to memory<br>CPU has no need to wait<br>stalls on write only occur if write buffer is full |                                      |
|                                                                 | <b>Write back</b><br>If there is a write hit, only update the cache<br>Keep track of <i>dirty</i> blocks                                               |                                      |
|                                                                 | When a dirty block is overwritten<br>Write it back to memory                                                                                           |                                      |
|                                                                 | <b>Cache miss</b><br><b>Allocate on miss</b> : read the block into cache<br><b>Write around:</b> don't fetch the block.                                |                                      |
|                                                                 |                                                                                                                                                        | Cache                                |



#### Separate Instructions and Data (eg MIPS)

#### I-Cache

#### **D-Cache**

Different requirements Eg no write-back Can access both instruction and data simultaneously

I-Cache miss rate 0.4% D-Cache miss rate 11.4% Spec2000 benchmarks on Embedded MIPS



| Memory<br>organisation                                   |         |           | <b>Higher Performance</b><br>Four bank interleaved – no increase in data path   |             |                  | Processor               |                                                  |                  |  |
|----------------------------------------------------------|---------|-----------|---------------------------------------------------------------------------------|-------------|------------------|-------------------------|--------------------------------------------------|------------------|--|
|                                                          |         |           |                                                                                 |             |                  |                         |                                                  |                  |  |
|                                                          |         |           | Other possible improvements <i>Burst mode</i> : only need full access time for  |             |                  | Bus                     |                                                  |                  |  |
|                                                          |         |           | <i>DDR</i> : transfer on the rising and falling cl<br>edges. (Double Data Rate) | ter<br>lock | Memory<br>bank 0 | Memory<br>bank 1        | Memory<br>bank 2                                 | Memory<br>bank 3 |  |
| Year                                                     | Size    | \$/GB     | <i>QDI</i> , separate inpat and satpat on DDI                                   |             |                  |                         |                                                  |                  |  |
| 1980                                                     | 64Kbit  | \$1500000 | <b>Quantitative</b><br>Need not transfer rate memory to cache, but on overall   |             |                  |                         |                                                  |                  |  |
| 1983                                                     | 256Kbit | \$500000  | performance.                                                                    |             |                  |                         |                                                  |                  |  |
| 1985                                                     | 1Mbit   | \$200000  | Memory stall cycles from cache misses =                                         |             |                  |                         | DDR transfers on rising<br>edge and falling edge |                  |  |
| 1989                                                     | 4Mbit   | \$50000   |                                                                                 |             |                  |                         |                                                  |                  |  |
| 1992                                                     | 16Mbit  | \$15000   | Memory accesses/program * Miss rate * Miss penalty<br>equivalent                |             |                  | QDR separate inputs and |                                                  |                  |  |
| 1996                                                     | 64Mbit  | \$10000   | Instructions/program * Misses/instructions * penalty                            |             |                  |                         |                                                  |                  |  |
| 1998                                                     | 128Mbit | \$4000    | Average memory access time (AMAT)                                               |             |                  |                         |                                                  |                  |  |
| 2000                                                     | 256Mbit | \$1000    | = Hit time*Hit rate + Miss rate ×                                               |             |                  |                         |                                                  |                  |  |
| 2004                                                     | 512Mbit | \$250     | Miss penalty                                                                    |             |                  |                         |                                                  |                  |  |
| 2007                                                     | 1Gbit   | \$50      | Eg hit time is 1 miss penalty is 20. Miss rate is x                             |             |                  |                         |                                                  |                  |  |
| Source: Patterson<br>Computer Organisation<br>and DEsign |         |           | What rate doubles the memory access time?<br>$1^*(1-x) + 20x = 2$<br>19x = 1 x  | = 0.05      | 52               | Ca                      | ache                                             |                  |  |

#### Penalty

#### **Effect on Performance**

When CPU performance increased Miss penalty becomes more significant

Decreasing base CPI Greater proportion of time spent on memory stalls

Increasing clock rate Memory stalls account for more CPU cycles

Can't neglect cache behavior when evaluating system performance.

So in all improvements in processor performance will not translate into execution time improvements unless the cache performance keeps pace with the CPU

| Optimisation                               | Cache optimisation                                                                                                                                                                      |                                    |  |
|--------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------|--|
|                                            | How can we distribute our resources in the cache<br>so as to improve system performance.<br><average access="" time=""> =<br/>(Hit time)*(Hit rate) + (Miss time)*(Miss rate)</average> |                                    |  |
| Rate is a fraction                         | 1.Reduce miss rater2.Reduce miss penaltyT3.Reduce hit timet                                                                                                                             | Hennessey has a simplified version |  |
|                                            | <t> = t.(1-r) + T.r<br/>= t - rt + Tr<br/>If t(1-r) = Tr equal contributions</t>                                                                                                        |                                    |  |
|                                            | t - tr = Tr -> $1 - r = r.T/t$                                                                                                                                                          |                                    |  |
|                                            | 1/r - 1 = T/t -> $1/r = T/t + 1$                                                                                                                                                        |                                    |  |
|                                            | r = t/(t+T) for t< <t< td=""><td></td></t<>                                                                                                                                             |                                    |  |
| r > t/T. Reduce r or t<br>r < t/T reduce T | r~ t/T                                                                                                                                                                                  |                                    |  |
|                                            |                                                                                                                                                                                         | Cache                              |  |