# Preventing Stalls: 1



| 2 ] | PipeLine |
|-----|----------|
|-----|----------|

| Pipeline efficier                                                                                                                                                                          |                                                                                                                       |  |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------|--|
| Pipeline CPI                                                                                                                                                                               | <ul> <li>Ideal pipeline CPI</li> <li>Structural Stalls</li> <li>Data Hazard Stalls</li> <li>Control Stalls</li> </ul> |  |
| Ideal pipeline CF                                                                                                                                                                          |                                                                                                                       |  |
| Structural hazar                                                                                                                                                                           |                                                                                                                       |  |
| Data hazards: No                                                                                                                                                                           |                                                                                                                       |  |
| Control hazards:                                                                                                                                                                           | Branches and jumps                                                                                                    |  |
| Instruction-Level Parallelism (ILP):<br>Seeks to overlap instruction execution<br>Hardware: dynamically runtime<br>Software: statically compile time<br>Simplest is Loop Level Parallelism |                                                                                                                       |  |
|                                                                                                                                                                                            |                                                                                                                       |  |

| 3 | Loops |  |
|---|-------|--|
|---|-------|--|

#### Loop level parallelism

Dynamic: branch prediction Static: Loop unrolling

Dependence

independent/parallel. Simultaneous execution possible. Can be placed in a pipeline, with only (possibly) structural hazards.

dependent.. Must occur in order; partial overlap possible

Dependent: if instruction2 uses result of instruction 1. *Data Hazard.* **Read after Write Hazard : RAW Hazard** 

Hardware and Software must produce the same result as strict sequential execution.

Preserve order only where vital

Actual hazard: existence; stall length. Depends on implementation of pipeline.

Dependence in program indicates *potential* for hazard. stipulates an order upper bound on possible performance May not be correct, if code is not correct

| <b>4</b> Dependence                | Name dependence                                                                                                                                                   |                                            |
|------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------|
|                                    | Also called anti-dependence.<br>When a memory location or register is re-used.                                                                                    |                                            |
|                                    | Means instruction 2 may write before instruction 1<br>has used the value<br><b>Write after Read, WAR hazard.</b>                                                  |                                            |
|                                    | Or instruction 1 may write after instruction 2 has<br>written, but before the subsequent use is made of<br>the data<br><b>Write after Write, WAW hazard.</b>      |                                            |
|                                    | Both may be resolved by using a separate register. <i>register renaming</i> (hardware or software)                                                                |                                            |
| Dresserve order order              | <b>Control Dependence</b><br>if then else blocks                                                                                                                  | May not be correct, if code is not correct |
| Preserve order only<br>where vital | Often blocks can be executed ignoring conditions,<br>if we can throw away the results.<br>Ensure the system is completely unaffected by<br>unwanted calculations. |                                            |
|                                    | Need to handle <i>exceptions</i> and ensure correct                                                                                                               |                                            |

data flow

```
5 Loops
```

## Assembler

```
for (int pnt=1000; pnt>0; pnt--) {
 arr[pnt] = arr[pnt] + offset;
This Java loop will compile to something like
lw $r2, offset; offset
Loop:
lw $r3, 0($r1); arrEl element
    add $r4,$r3, $r2; add
    sw 0($r1), $r4; store result
    addi $r1, $r1, -4; decrement pnt
    bnez $r1, Loop continue
MIPS 5 step pipeline becomes
lw $r2, offset;
                                            May not be correct, if
Loop:
                                            code is not correct
    lw $r3, 0($r1);
    stall
                        waiting for $r3
    add $r4,$r3, $r2;
                        waiting for $r4
    stall
    stall
    sw 0($r1), $r4;
    addi $r1, $r1, -4;
                        no forward to
    stall
bnch
    bnez $r1, Loop;
                                                    Pipe Effic
```

Why one stall and then two 6 Loops

### **Re-ordering**

```
lw $r2, offset;
Loop:
    lw $r3, 0($r1);
    stall
                         waiting for $r3
    add $r4,$r3, $r2;
    stall
                          waiting for $r4
    stall
    sw 0($r1), $r4;
    addi $r1, $r1, -4;
                         no forward to
    stall
bnch
    bnez $r1, Loop;
9 cycles
lw $r2, offset;
Loop:
    lw $r3, 0($r1);
    addi $r1, $r1, -4; decrement early
    add $r4,$r3, $r2; removes stall
    stall
                          waiting for $r4
    stall
    sw 8($r1), $r4; displacement
bnez $r1, Loop; $r1 already done
Reordered – 7 cycles
```

```
7 Loops
```

#### Unrolling

```
lw $r2, offset;
Loop:
    lw $r3, 0($r1);
    stall
                         stall comes back
    add $r4,$r3, $r2;
    stall*2
    sw 0($r1), $r4;
    lw $r2, offset;
                          second iteration
    sw -4($r1), $r4;
    lw $r2, offset;
                          third iteration
    sw -8($r1), $r4;
    lw $r3, -12($r1);
    stall
    add $r4,$r3, $r2;
    stall*2
    sw -12($r1), $r4;
    addi $r1, $r1, -16; decrement 4
times
    bnez $r1, Loop;
```

#### Unrolled – 26 cycles 6.5 per iteration

Harder if total number is not divisible by the unroll number

```
8 Loops
```

### General upper bound U

Loop unroll m times. Execute unrolled loop U/m and original loop U mod m.

## Optimise unrolled loop

```
lw $r2, offset;
Loop:
    lw $r7, 0($r1);
    lw $r8, -4($r1);
    lw $r9, -8($r1);
    lw $10, -12($r1);
    addi $r1, $r1, -16; decrement 4
times
    add $r3,$r7, $r2; stall hidden
    add $r4,$r8, $r2;
    add $r5,$r9, $r2;
    add $r6,$r10, $r2;
         0($r1), $r3;
    SW
    sw 4($r1), $r4;
sw 8($r1), $r5;
sw 12($r1), $r6;
    bnez $r1, Loop;
Unrolled optimised – 14 cycles 3.5 per iteration
cf 9 originally - speed up > 3
```

#### **9** Unrolling

## Decisions

• Unrolling useful if iterations are independent

- Use different registers to avoid name dependence. Sets limit on size of unroll
- Eliminate test and branch instructions. Will need to modify them at the end of the code
- Independent iterations allow reorder of load and store between loops
- Ensure code delivers same result.

Note number of iterations depends on length of code.

Unroll too many times for a long loop and may start generating cache missed for the code.

Also run out of independent registers.

Long loops have little overhead from house keeping code. Marginal advantage of extra iterations decreases.

Long loops have other ways to hide stalls.

#### **10** Scheduling

### **Split Instruction Decode**

Instruction Decode **ID**. Step of the pipelining. Checks for structural and data hazards Split into two

**Issue:** Decode instructions. Structural hazards ? *Wait for data hazards to clear* **Read Operands:** 

Extension of 5 step pipeline to out of order execution creates possibility of WAR and WAW

Solved by register renaming.

Out of order execution creates problems for exceptions.

**Imprecise exceptions** raised – exceptions which do not look as if the instructions were executed sequential

• Instructions earlier than the exception may not have completed

• Instructions later than the exception may have completed



#### **CDC 6600**



## 131 kWords

### 3 million instructions per second

The word length of the 6600 central memory is 60 bits.

There are 10 built-in independent Peripheral and Control Processors in the 6600 Computer, each having a 4096-word core memory. These memories are in addition to the 6600 central memory.

There are 10 functional units in the 6600 Central Processor, as follows:

- 2 Adders
- 1 Divider
  1 Shift
- 2 Multipliers
  2 Incrementers
- · 1 Boolean
- 1 Branch

These functional units operate on 8 increment registers of 18-bit length, 8 operand registers of 60bit length, and 8 memory address registers of 18-bit length. Up to 32 instructions can be held at once for program loops.

There are several levels of concurrency in the 6600 Computer, broadly consisting of:

- Concurrency in program execution in the 10 built-in Peripheral and Control Processors.
- Concurrency in the 10 basic Central Processor functions.
- Concurrency in the 32 independent banks of the Central Memory.

Eleven programs are run simultaneously on the 6600 Computer. These programs are **not** timeshared.

The 6600 Computer provides maximum operator/ machine communications via its console tube display, which includes keyboard.





## **15** CDC160

## Addressing modes

| Addressing modes                                  |                                                                                                                                                               |                                                                             |  |  |
|---------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------|--|--|
|                                                   | 11     10     9     8     7       OpCode     Mode       F                                                                                                     | 6       5       4       3       2       1       0         Address         E |  |  |
| Direct                                            | Fetch content<br>of address sp                                                                                                                                | ts of E (only 6 bits<br>ace)                                                |  |  |
| Indirect.                                         | If E=0 next word holds address<br>(address all memory)<br>If E not zero it points to a word<br>(from 0-64) which holds the<br>address of the next instruction |                                                                             |  |  |
| Forward Relative: Next instruction is PC+E field  |                                                                                                                                                               |                                                                             |  |  |
| Backward Relative: Next instruction is PC-E field |                                                                                                                                                               |                                                                             |  |  |
| All possible modes<br><b>And</b> limits instruc   |                                                                                                                                                               |                                                                             |  |  |
|                                                   |                                                                                                                                                               |                                                                             |  |  |



**17** CDC160

## **Indirect Jumps**

111000 - Jump Indirect.

111001 - Jump Forward Indirect.

## Shifts

000001000010 - Shift Left. 000001000011 - Shift Left 2 (SN > 37). 000001001000 - Shift Left 3. 000001001001 - Shift Left 6 (SN > 37). 000001001010 - Multiply by 10. 000001001011 - Multiply by 100 (SN > 37).

## **Control Instructions**

000000000000 - Halt. 11111111111 - Error. 000001000001 - Transmit Program Counter into Accumulator.

Note more than 16 instructions. So instructions are effectively vartiable length

No specific function calls to allow storage of PC and transfer of control. TPC Allows then store and jump in fewer instructions.

| <b>18</b> Dynamic<br>scheduling | <b>Scoreboarding</b><br>Developed for the CDC 6600 in the mid 1960's                                                                      |                               |
|---------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------|
|                                 | <u>Execution:</u><br>Goal; 1 instruction per cycle.<br>Instructions executed as early as possible                                         | When no structural<br>hazards |
|                                 | Instruction stall<br>proceed to subsequent instructions<br>Execute unless they depend on previous executing<br>(or stalled) instructions. |                               |
|                                 | Many instructions in simultaneous execution<br>Need hardware to match                                                                     |                               |
|                                 | CDC6600 4 FP units, 5 Memory Refs, 7 integer ops                                                                                          | CDC Load/store                |
|                                 | MIPS only FP units                                                                                                                        |                               |
| See Hennessey<br>Appendix A     | The scoreboard handles<br>hazard detection<br>Example                                                                                     |                               |
|                                 | Two multiplier units, one adder, one divide, one<br>for integer adds and all memory reference<br>calculations                             | Pipe Effic                    |

## **19** Dynamic scheduling



```
20 MIPS
```

## Static v Dynamic scheduling

Static scheduling: can be done at compile time.

Some things are not defined at compile time.

If we have an instruction like

MUL

F0, F1, F2

The instruction cannot proceed until the operands are available. Would like to start some other instructions.

How many depends on how long?

If the contents of F1 need to be loaded from memory, then this instruction may need to wait. *For how long?* 

Depends if F1 is fetched from cache or memory.

That will not be known until run time.

Dynamic scheduling: provides a mechanism to keep the pipeline flowing, using information not available at compile time.

## Scoreboard

Structural hazards cab be mitigated by increasing the number of functional units available (FP add, FP mult,...) and distributing the instructions between them. This takes extra logic.

The scoreboard is some extra circuitry which takes the instructions and distributes the FP operations between multiple functional units – add, multiply, divide.

The aim (as always) is 1 instruction per clock cycle

Three of the steps in the standard MIPS pipeline ID,EX,WB

Are replaced by

Issue, Read Operands, Execution, Write Results

*Issue* decode instructions, check for structural hazards

*Read operands* wait until no data hazards, then read operands

| 22 | MIPS |
|----|------|
| 22 | MIPS |

| Scoreboard |  |  |  |
|------------|--|--|--|
| In order   |  |  |  |

issue

Out of order

Out of order

execution

commit

The scoreboard consists of three parts

Instruction status Functional unit status Register result status

| <b>23</b> Status Units | Instruction s                      | Instruction status                      |                                                        |  |
|------------------------|------------------------------------|-----------------------------------------|--------------------------------------------------------|--|
|                        | Which step th                      | Which step the instruction is executing |                                                        |  |
|                        | Functional u                       | nit status                              |                                                        |  |
|                        | The status of each functional unit |                                         |                                                        |  |
|                        | Busy                               | V                                       | yes/no                                                 |  |
|                        | Op                                 |                                         | operation                                              |  |
|                        | add, subtract                      | ,                                       |                                                        |  |
|                        | Fi                                 |                                         | destination                                            |  |
|                        | register                           | 1                                       | • ,                                                    |  |
|                        | Fj, F                              |                                         | source registers                                       |  |
|                        | Qj, Q                              | -                                       | functional units                                       |  |
|                        | producing Fj,                      |                                         |                                                        |  |
|                        | Rj, F                              | (k                                      | yes, registers ready &                                 |  |
|                        | not read                           |                                         |                                                        |  |
|                        | Register resu                      | ılt status                              |                                                        |  |
|                        | 8                                  |                                         |                                                        |  |
|                        | If a f                             |                                         | write to each register<br>nit has this register as its |  |
|                        |                                    |                                         |                                                        |  |
|                        |                                    |                                         |                                                        |  |
|                        |                                    |                                         |                                                        |  |

## **24** Operation

| <u>Issue:</u> Functional Unit Free and no other unit has same<br>destination register. Scoreboard issues instruction and<br>updates its internal data structure.<br>Protects against WAW hazards.<br>If issue stalls the buffer between IF and Issue fills.                                                                                                            |                             |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------|
| <u>Read Operands:</u><br>Scoreboard monitors availability of source operands<br>(data flow).<br>Source operand available if not earlier issued instruction<br>is going to write to it<br>When source operands available to scoreboard tells the<br>functional unit to begin execution.<br>Resolves RAW hazards – instructions may be sent to<br>execution out of order |                             |
| <i>Execution:</i> Functional Unit executes instruction and informs scoreboard when complete.                                                                                                                                                                                                                                                                           |                             |
| <u>Write Result:</u> Scoreboard checks for WAR hazards and stalls completing instruction if required.                                                                                                                                                                                                                                                                  | Execution "as if"<br>serial |
| Instructions may complete out of order.<br>Instructions may even "overtake" each other.                                                                                                                                                                                                                                                                                |                             |
|                                                                                                                                                                                                                                                                                                                                                                        |                             |

#### **25** Limitations

# Amount of parallelismEach instruction depends on predecessorNone

## Number of scoreboard entries (window size)

Determines how far ahead the pipeline can look for instructions

## Number and type of functional units

How many instructions can occur in parallel. Tend to provide more multiply units, since multiply takes longer than add.

**Presence of anti-dependences and dependences** Lead to RAW and WAW stalls

| <b>26</b> Note                                                                    | <b>Balance</b><br>VAX 8650 had a cycle time of 55ns with a<br>sophisticated pipeline.<br>VAX 8700 had a simpler pipeline which allowed a<br>speed of 45ns.<br>8650 had 20% less CPI, 8700 was 20% faster.<br>But 8700 was simpler – less hardware. |  |
|-----------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|
|                                                                                   | <b>Performance measurement</b><br>Compiler optimisation covers some of the same<br>ground as dynamic scheduling.                                                                                                                                   |  |
| Manufacturers<br>ingenuity replaces<br>simple clock speed.<br>Doesn't always work | Measure improvement with unoptimised code will<br>give an over optimistic idea of improvement<br><b>Be sure what it is you are measuring</b>                                                                                                       |  |

Complex system – non-linear interactions Measurements are hard