

# Hardware based

Multiple instruction machines find branches an impediment to creating ILP

Branch prediction: fetches and issues instructions Speculation: fetches, issues and executes instructions

## Speculation involves

dynamic branch prediction, chooses which instructions to execute speculation actually executes instructions before the branch is resolved dynamic scheduling to deal with different combinations of basic blocks

Principle – predicted flow of data values determines when to execute instructions. *Data flow execution*. Operations proceed as soon as their operands are available.

Before we split execution into issue and execute. Here we split execution into execute and commit. We insist on *in-order* commit.

## Commit

Principle is instructions are issued and execute and pass their results onto further instructions.

**But** not to make irreversible updates. Such as raising an exception.

To hold these results, computed but not committed a further set of buffers is required.

### **Re-order buffer ROB**

Holds result between complete and commit.

ROB thus must provide operands to subsequent instructions, until commit

| <b>Rob Entry</b><br>Instruction<br>type | destination<br>field     | value<br>field           | ready<br>field        |
|-----------------------------------------|--------------------------|--------------------------|-----------------------|
| branch<br>Store<br>load or ALU          | register #<br>memory loc | value<br>until<br>commit | yes<br>if<br>complete |

### **Execution with Speculation**

#### **I**ssue

get an instruction from the instruction queue Issue if there is space in the *reservation station* (holds instructions between issue and execute) and space in ROB (holds between execution and commit). Update to show buffers in use. A ROB entry is allocated for the result, this tags the instruction - its value is sent to the reservation station.

else stall

There must be somewhere to hold the output state during the next clock cycle. Where ever data must be stable during a cycle there must be a buffer. Execute/commit break means we provide another buffer.

*Execute* Wait for any outstanding operands to become available. Check for RAW hazards

## **Execution with Speculation**

### Write result

Broadcast the result with the ROB tag and write to ROB plus any reservation stations waiting on result. Reservation station for this instruction is released.

If the command is a store and available write to the value field. Not available, then wait for broadcasted value

## Commit

ROB entry reaches head of the queue Is the instruction a store, an incorrect branch, other instruction.

If result is present in the buffer update the target register with the result . For a store update the memory. Incorrect branch, ROB flushed and execution restarted at correct point.

ROB slot reclaimed If the ROB fills execution issue stalls.

Reservation Station (beginning) ROB (End) Separate checks required. Head of queue implies time for in order commit

| 6 Speculation |  |
|---------------|--|
|---------------|--|

# Alternative to ROB

Register renaming

End of execution copy result to buffer. Hold in buffer until make permanent, then copy to register.

Suppose bypass the buffer and copy straight to the register.

Same thing as long as the register is untouched until the commit stage is reached. Better only one data copy.

But will rapidly run out of registers. Increase the number of registers. 20-80 A register may be a *temporary* register or an *architecturally visible* register.

WAW & WAR handled by renaming. Speculation because physical register is not architectural until commit.

When a physical register is mapped to an architectural register the physical register previously mapped is *freed* 

Head of queue implies time for in order commit

# **Register renaming**

When to free a physical register? When no longer needed.

When another instruction that writes to the same architectural register commits.

Must be free or else the other register could not have arrived at the head of the speculation queue.

May be earlier – when no functional unit designates the physical register as a source and has been superseded as architectural register.

Architectural register commit is later, but easier to implement.

## **Speculation disadvantages**

Additional on chip resources. Space and power Recovery takes resources. Erroneous speculation takes expensive resources which are pointless – such as second level cache misses.

Processors may stall speculative events when they require expensive resources until not speculative.

Note difference in allocation and write

| 8 Performance             | <ul> <li>Which is better speed or clever ILP (instruction level parallelism</li> <li>Speed comes from</li> <li>1.Underlying material physics: Si, Ge</li> <li>2.Material processing: feature size. Smaller size faster processing</li> <li>3.Power: higher power faster switching</li> <li>4.Complexity: high complexity slower.</li> </ul> |                                         |
|---------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------|
|                           | It depends !                                                                                                                                                                                                                                                                                                                                |                                         |
|                           | Driving feature size were the foundaries, this was<br>fixed when SUN/HP/IBM/Dec Were competing.<br>Intel (AMD) still have to decide where to work                                                                                                                                                                                           | Note difference in allocation and write |
|                           | Power. Is it better to run two low powered<br>processors or one high powered. Performance<br>requirement / application specific                                                                                                                                                                                                             |                                         |
| X-box v.<br>Playstation 3 | At given power and feature size, should we go for<br>simpler and faster architecture.<br>Or clever algorithms with much more hardware.<br>Still it depends – affects clock rate. CPI v GHz.<br>8 cores v. 10 cores<br>Release in March v September<br>Yield of 50% v 60%?                                                                   |                                         |
|                           |                                                                                                                                                                                                                                                                                                                                             | Speculation                             |

| <b>9</b> Performance (i) | Limits : Ideal machine                                                                                                                            |                             |
|--------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------|
|                          | What is a perfect ILP machine.                                                                                                                    |                             |
|                          | <i>1.Register renaming</i> – infinite virtual registers<br>2.=> all register WAW & WAR hazards are avoided                                        |                             |
|                          | 2. Branch prediction – perfect; no bad predictions                                                                                                |                             |
|                          | 3. Jump prediction – all jumps perfectly predicted (returns, case statements)                                                                     | A bit like a Carnot         |
|                          | <ol> <li>Memory-address alias analysis – addresses<br/>known &amp; a load can be moved before a store<br/>provided addresses not equal</li> </ol> | engine in<br>thermodynamics |
|                          | 5. <i>Perfect caches</i> – 1 cycle latency                                                                                                        |                             |
|                          | 2&3 no control dependencies; perfect speculation & an unbounded buffer of instructions available                                                  |                             |
|                          | 1&4 only true data dependencies                                                                                                                   |                             |
|                          | 5 depends on ILP to hide cache misses. Possible but optimistic                                                                                    |                             |
|                          |                                                                                                                                                   | Speculation                 |

### **10** Performance

#### Ideal machine performance



Numbers from Hennessy on Spec 95

Average ~ 70

Window size Set of instructions examined for simultaneous execution

# Perfect machine is not infinitely fast.

| s | Instr per clock   | Ideal $\infty$ | actual 4      |
|---|-------------------|----------------|---------------|
| ~ | Instr Window size | Ideal $\infty$ | actual 200    |
|   | Rename registers  | Ideal $\infty$ | 48 int, 40 FP |
|   | Branch prediction | Ideal 100%     | 94%-98%       |
|   | Cache             | perfect        | Levels        |
|   | Memory alias      | perfect        | <u>;</u> ;    |
|   |                   |                |               |

How do the different imperfections affect the performance. How far are we from ideal

Actual from '05. Instr per clock is 4 issue, upto 6 initiates.

200 instructions in flight!

| <b>11</b> Performance | Window size and maximum issue                                                                                                                                                                                                                                                                                                             |
|-----------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|                       | Perfect machine                                                                                                                                                                                                                                                                                                                           |
|                       | <ul> <li>1.Looks arbitrarily far ahead to find instructions<br/>with perfect branch prediction</li> <li>2.Rename all registers to avoid WAR and WAW</li> <li>3.Find data dependencies</li> <li>4.Find memory dependencies</li> <li>5.Have enough functional units</li> </ul>                                                              |
|                       | Each instruction in the window must be<br>permanently in the cpu. Hard.<br>Worse we must check every instruction every cycle.<br>Number of checks is window size $\times$ operations in<br>flight $\times$ operands per instruction.<br>For the example $6 \times 200 \times 2 = 2400$ .<br>Increasing window size has serious overheads. |
|                       |                                                                                                                                                                                                                                                                                                                                           |

| <b>12</b> How much possible | Best                                                                                                                                                                                                                                                                                                                                                       |
|-----------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| -                           | Started by observing an average of 70 improvement possible.                                                                                                                                                                                                                                                                                                |
|                             | Running at less than 2% efficiency.                                                                                                                                                                                                                                                                                                                        |
|                             | Saw a couple of places where improves of 3 may be possible.                                                                                                                                                                                                                                                                                                |
|                             | People are looking at various other ways of improving things.                                                                                                                                                                                                                                                                                              |
|                             | Speculating along multiple branches. It seems<br>likely that the system would run faster if we could<br>examine all possible branches. Branches must be<br>parallel, but of course the same variables may exist<br>in more than one branch, which causes problems.<br>The number of paths clearly grows exponentially,<br>limiting its use in complicated. |
|                             | Real but unnecessary dependences. Loop counter for instance.                                                                                                                                                                                                                                                                                               |
|                             |                                                                                                                                                                                                                                                                                                                                                            |

| <b>13</b> Conclusions | How much better can we do?                                                                            |  |
|-----------------------|-------------------------------------------------------------------------------------------------------|--|
|                       | Feature size still shrinking.<br>natural speed up<br>Moore's Law still good                           |  |
|                       | So more transistors means more functional units,<br>more parallel execution – more wasted effort      |  |
|                       | Faster – but green issues                                                                             |  |
|                       | Smaller units – ubiquitous computing                                                                  |  |
|                       | Latest AMD chip coming out this year. Duplication<br>of functional units looks very like the CDC6600. |  |
|                       | Not exactly revolutionary.                                                                            |  |
|                       | Computers are faster – but designers are no<br>smarter!                                               |  |
|                       |                                                                                                       |  |
|                       |                                                                                                       |  |