## Computation Structures - Tutorial 12

December 4, 2018

## 1 A $\beta$ Machine with a 2-Stage Pipeline

1. Give a plausible implementation of NOP ().

*** Solution ***

A possible (safe) implementation of the $\operatorname{NOP()}$ instruction could be: $\operatorname{ADD}(\mathrm{R} 31$, R31, R31).
2. Give a software solution to the jump problems for each of the following programs:
(a) 1 Main: $\operatorname{ADDC}(\mathrm{R} 31,0, \mathrm{R} 1)$
2 ADDC(R31,2,R2)
3 Incr: $\operatorname{ADDC}(\mathrm{R} 1,5, R 1)$
$4 \quad \operatorname{SUBC}(R 2,1, R 2)$
$5 \quad \mathrm{BT}(\mathrm{R} 2$, Incr $)$
6 Oper: ADD (R1,R3,R3)
(b) 1 Cas1: $\operatorname{ADDC}(\mathrm{R} 1,4, \mathrm{R} 1)$
$2 \quad \operatorname{SUBC}(R 2,12, R 2)$
3 CMPLT (R1,R2,R0)
4 BNE (R0, Cas2)
$5 \quad \operatorname{MULC}(\mathrm{R} 1,5, \mathrm{R} 1)$
$6 \quad$ BR (rtn)
7 Cas2: MULC(R2,5,R2)
8 BR (rtn)

## $\xrightarrow{* * *}$ Solution ***

1. Problem: on a machine with a pipeline, the next instruction execution begins before the previous one completes. This can lead to problems for codes features branching and jumping instructions.
In this case, ADD would be executed even if R2 is true, which would corrupt the register R3 and put the program in an undesired state.

|  | $t$ | $t+1$ | $t+2$ | $t+3$ | $t+4$ | $t+5$ | $t+6$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| IF | ADDC | ADDC | ADDC | SUBC | BT | ADD | $\ldots$ |
| EXE |  | ADDC | ADDC | ADDC | SUBC | BT | ADD ?? |

Possible solutions:

- Using a $\operatorname{NOP()~instruction~between~BT~(line~5)~and~ADD~(line~6).~Drawback:~}$ some clock-cycles (one in this case) are used for doing nothing.
- Re-ordering the instructions: move $\operatorname{ADDC}(\mathrm{R} 1,5, \mathrm{R} 1)$ after BT. Drawback: error-prone, the next instruction should be chosen cautiously. Indeed, it must be taken in the same block as the branching operation and the registers values should remain consistent. Moreover, we lose portability as the code doesn't work on a pipeline-less machine anymore.


## 2 A $\beta$ Machine with a 4-Stage Pipeline

1. Give a software and a hardware solution to the data conflicts problems for the following program:
```
1 ADD(R1,R2,R3)
2 SUB(R3,R4,R5)
3 MULC(R2,5,R17)
4 ADD(R5,R1,R1)
5 SUB(R17,R1,R17)
```


## *** Solution ***

Problem(s):
(1) SUB needs the result of the preceeding ADD in R3 but ADD is only at the ALU phase so the hasn't been written back yet.
(2) The second ADD needs the result of the preceeding SUB in R5 but SUB is only at the WB phase so the result will only be written back at the next clock cycle.
(3) The second SUB needs the results of both the second ADD and MULC which are respectively at phases ALU and WB. Therefore, they haven't written back yet.

|  | $t$ | $t+1$ | $t+2$ | $t+3$ | $t+4$ | $t+5$ | $t+6$ | $t+7$ | $t+8$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| IF | ADD | SUB | MULC | ADD | SUB | $\ldots$ | $\ldots$ | $\ldots$ | $\ldots$ |
| RF |  | ADD | $\operatorname{SUB}^{(1)}$ | MULC | $\operatorname{ADD}^{(2)}$ | $\operatorname{SUB}^{(3)}$ | $\ldots$ | $\ldots$ | $\ldots$ |
| ALU |  |  | ADD $^{(1)}$ | SUB | MULC $^{(2)}$ | ADD $^{(3)}$ | SUB | $\ldots$ | $\ldots$ |
| WB/MEM |  |  |  | ADD | $\operatorname{SUB}^{(2)}$ | MULC $^{(3)}$ | ADD | SUB | $\ldots$ |

- Software solution: in this case, one cannot re-order the instructions (too many conflicts and too few instructions). One could use NOP () operations at the cost of doubling the number of instructions:

```
1 ADD(R1,R2,R3)
2 NOP()
3 NOP()
4 SUB(R3,R4,R5)
5NOP()
6 MULC(R2,5,R17)
7 ADD(R5,R1,R1)
8NOP()
N NOP()
10 SUB(R17,R1,R17)
```

- Hardware solution: one could do the NOP insertion at the hardware level (still we double the execution time). Second solution would consist in using bypasses. We need a ALU-out bypass for problems (1) and (3) and a WB-out bypass for problems (2) and (3).

2. Give a hardware solution to the data conflicts problems for the following program:

1 LD(R1,0,R4)
2 ADD (R1,R4,R5)
3 XOR (R3,R4,R6)

## *** Solution ***

(1) ADD needs the value loaded in R4 from memory by LD, but R4 is only saved at step $t+4$ as the memory is in the phase WB/MEM.
(2) XOR needs the value loaded in R4 from memory by LD (same reason as above).

|  | $t$ | $t+1$ | $t+2$ | $t+3$ | $t+4$ | $t+5$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| IF | LD | ADD | XOR | $\ldots$ | $\ldots$ | $\ldots$ |
| RF |  | LD | $\mathrm{ADD}^{(1)}$ | $\mathrm{XOR}^{(2)}$ | $\ldots$ | $\ldots$ |
| ALU |  |  | $\mathrm{LD}^{(1)}$ | ADD | XOR | $\ldots$ |
| $\mathrm{WB} / \mathrm{MEM}$ |  |  |  | $\mathrm{LD}^{(2)}$ | ADD | XOR |

- Problem (2) can be handled by using a WB-out bypass.
- Problem (1) cannot be handled by using a bypass as the value from the memory is nowhere on the path (because the memory is only queried in the last phase $\mathrm{WB} / \mathrm{MEM}$ ). The only (simple) solution is to introduce a NOP instruction and then use a WB-out bypass:
$1 \operatorname{LD}(\mathrm{R} 1,0, R 4)$
2 NOP()
$3 \operatorname{ADD}(\mathrm{R} 1, \mathrm{R} 4, \mathrm{R} 5)$
4 XOR (R3,R4,R6)

3. If the $\beta$ Machine features 2 bypasses, what will be the result stored at $0 \times 1000$ after the execution of the following program? Why?

1 ADDC(R31,3,R0)
$2 \operatorname{SUBC}(\mathrm{RO}, 1, \mathrm{R} 1)$
3 MUL(R0,R1,R2)
4 XOR(RO,R2,R3)
5 ST(R3,0x1000,R31)

## *** Solution ***

Thanks to the bypasses, there is no conflicts and the result is that same as if this code ran on a pipeline-less machine. The operation performed is $3 \oplus(3 \times 2)$ :

$$
\begin{array}{lllll} 
& 0 & 1 & 1 & \\
\oplus & 1 & 1 & 0 & \\
\hline & 1 & 0 & 1 & =5
\end{array}
$$

|  | $t$ | $t+1$ | $t+2$ | $t+3$ | $t+4$ | $t+5$ | $t+6$ | $t+7$ | $t+8$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| IF | ADDC | $\begin{aligned} & \text { SUBC } \\ & \text { ADDC } \end{aligned}$ | MUL | XOR | ST | $\ldots$ | $\ldots$ | $\ldots$ | ... |
| RF |  |  | SUBC ${ }^{(1)}$ | MUL ${ }^{(2)}$ | XOR ${ }^{(3)}$ | ST ${ }^{(4)}$ | $\ldots$ | ... | $\ldots$ |
| ALU |  |  | ADDC ${ }^{(1)}$ | SUBC ${ }^{(2)}$ | MUL ${ }^{(3)}$ | $\mathrm{XOR}{ }^{(4)}$ | ST | $\ldots$ | $\ldots$ |
| WB/MEM |  |  |  | SUBC ${ }^{(2)}$ | SUBC | MUL | XOR | ST |  |

