Appendix C

2
Solutions to Case Studies and Exercises
Appendix C Solutions
C.1
a.
R1
R1
R2
R2
R2
R4
LD
DADDI
LD
SD
DSUB
BNEZ
DADDI
SD
DADDI
DADDI
DADDI
DSUB
b. Forwarding is performed only via the register file. Branch outcomes and
targets are not known until the end of the execute stage. All instructions
introduced to the pipeline prior to this point are flushed.
LD
R1, 0(R2)
DADDI R1, R1, #1

SD
0(R2), R1
DADDI R2, R2, #4
10
11
DSUB R4, R3, R2

BNEZ R4, Loop
LD
12
13
14
15
16
17
18
W
D
R1, 0(R2)
Since the initial value of R3 is R2 + 396 and equal instance of the loop adds 4
to R2, the total number of iterations is 99. Notice that there are 8 cycles lost to
RAW hazards including the branch instruction. Two cycles are lost after the
branch because of the instruction flushing. It takes 16 cycles between loop
instances; the total number of cycles is 98 16 + 18 = 1584. The last loop
takes two addition cycles since this latency cannot be overlapped with additional loop instances.
c. Now we are allowed normal bypassing and forwarding circuitry. Branch outcomes and targets are known now at the end of decode.
1
LD
R1, 0(R2)
DADDI R1, R1, #1

SD
R1, 0(R2)
DADDI R2, R2, #4

DSUB R4, R3, R2
BNEZ R4, Loop
(incorrect instruction)
LD
R1, 0(R2)
10
11
12
13
Copyright 2012 Elsevier, Inc. All rights reserved.
14
15
16
17
18
Again we have 99 iterations. There are two RAW stalls and a flush after the
branch since the branch is taken. The total number of cycles is 9 98 + 12 =
894. The last loop takes three addition cycles since this latency cannot be
overlapped with additional loop instances.
d. See the table below.
LD
R1, 0(R2)
DADDI R1, R1, #1

SD
R1, 0(R2)
DADDI R2, R2, #4
D
F
DSUB R4, R3, R2

BNEZ R4, Loop
LD
R1, 0(R2)
10
11
12
13
14
15
16
17
18
Again we have 99 iterations. We still experience two RAW stalls, but since
we correctly predict the branch, we do not need to flush after the branch.
Thus, we have only 8 98 + 12 = 796.
e. See the table below.
1
LD
R1,
DADDI R1,
SD
R1,
DADDI R2,
DSUB R4,
BNEZ R4,
LD
0(R2) F1 F2 D1 D2 X1
R1, #1
F1 F2 D1 D2
0(R2)
F1 F2 D1
R2, #4
F1 F2
R3, R2
F1
Loop
R1, 0(R2)
10
11
12
13
14
15
16
17
18 19 20
X2
s
s
s
s
M1
s
s
s
s
M2
s
s
s
s
W1
X1
D2
D1
F2
F1
W2
X2
X1
D2
D1
F2
M1
X2
X1
D2
D1
M2
M1
X2
s
s
W1
M2
M1
X1
D2
W2
W1
M2
X2
X1
W2
W1 W2
M1 M2 W1 W2
X2 M1 M2 W1 W2
F1
F2
D1 D2 X1 X2 M1 M2 W1 W2
We again have 99 iterations. There are three RAW stalls between the LD and
ADDI, and one RAW stall between the DADDI and DSUB. Because of the
branch prediction, 98 of those iterations overlap significantly. The total number of cycles is 10 98 + 19 = 999.
f.
0.9ns for the 5-stage pipeline and 0.5ns for the 10-stage pipeline.
g. CPI of 5-stage pipeline: 796/(99 6) = 1.34.

CPI of 10-stage pipeline: 999/(99 6) = 1.68.
Avg Inst Exe Time 5-stage: 1.34 0.9 = 1.21.
Avg Inst Exe Time 10-stage: 1.68 0.5 = 0.84.
C.2
a. This exercise asks, How much faster would the machine be . . ., which
should make you immediately think speedup. In this case, we are interested in
how the presence or absence of control hazards changes the pipeline speedup.
Recall one of the expressions for the speedup from pipelining presented on
page C-13.
1
Pipeline speedup = ------------------------------------------- Pipeline depth (S.1)
1 + Pipeline stalls
where the only contributions to Pipeline stalls arise from control hazards
because the exercise is only focused on such hazards. To solve this exercise,
we will compute the speedup due to pipelining both with and without control
hazards and then compare these two numbers.
For the ideal case where there are no control hazards, and thus stalls, Equation S.1 yields
1
Pipeline speedup ideal = ------------ ( 4 ) = 4 (S.2)
1+0
where, from the exercise statement the pipeline depth is 4 and the number of
stalls is 0 as there are no control hazards.
For the real case where there are control hazards, the pipeline depth is still
4, but the number of stalls is no longer 0. To determine the value of Pipeline
stalls, which includes the effects of control hazards, we need three pieces of
information. First, we must establish the types of control flow instructions
we can encounter in a program. From the exercise statement, there are three
types of control flow instructions: taken conditional branches, not-taken conditional branches, and jumps and calls. Second, we must evaluate the number
of stall cycles caused by each type of control flow instruction. And third, we
must find the frequency at which each type of control flow instruction occurs
in code. Such values are given in the exercise statement.
To determine the second piece of information, the number of stall cycles created by each of the three types of control flow instructions, we examine how
the pipeline behaves under the appropriate conditions. For the purposes of
discussion, we will assume the four stages of the pipeline are Instruction
Fetch, Instruction Decode, Execute, and Write Back (abbreviated IF, ID, EX,
and WB, respectively). A specific structure is not necessary to solve the exercise; this structure was chosen simply to ease the following discussion.
First, let us consider how the pipeline handles a jump or call. Figure S.43
illustrates the behavior of the pipeline during the execution of a jump or call.
Because the first pipe stage can always be done independently of whether the
control flow instruction goes or not, in cycle 2 the pipeline fetches the
instruction following the jump or call (note that this is all we can doIF must
update the PC, and the next sequential address is the only address known at
this point; however, this behavior will prove to be beneficial for conditional
branches as we will see shortly). By the end of cycle 2, the jump or call
resolves (recall that the exercise specifies that calls and jumps resolve at the
end of the second stage), and the pipeline realizes that the fetch it issued in
cycle 2 was to the wrong address (remember, the fetch in cycle 2 retrieves the
instruction immediately following the control flow instruction rather than the
target instruction), so the pipeline reissues the fetch of instruction i + 1 in
cycle 3. This causes a one-cycle stall in the pipeline since the fetches of
Clock cycle
Instruction
Jump or call
IF
ID
EX
WB
IF
IF
i+1
i+2
stall
i+3
ID
EX
...
IF
ID
...
stall
IF
...
Figure S.43 Effects of a jump of call instruction on the pipeline.
instructions after i + 1 occur one cycle later than they ideally could have.
Figure S.44 illustrates how the pipeline stalls for two cycles when it
encounters a taken conditional branch. As was the case for unconditional
branches, the fetch issued in cycle 2 fetches the instruction after the branch
rather than the instruction at the target of the branch. Therefore, when the
branch finally resolves in cycle 3 (recall that the exercise specifies that
conditional branches resolve at the end of the third stage), the pipeline
realizes it must reissue the fetch for instruction i + 1 in cycle 4, which creates
the two-cycle penalty.
Figure S.45 illustrates how the pipeline stalls for a single cycle when it
encounters a not-taken conditional branch. For not-taken conditional branches,
the fetch of instruction i + 1 issued in cycle 2 actually obtains the correct
instruction. This occurs because the pipeline fetches the next sequential
instruction from the program by defaultwhich happens to be the instruction
that follows a not-taken branch. Once the conditional branch resolves in cycle
3, the pipeline determines it does not need to reissue the fetch of instruction
i + 1 and therefore can resume executing the instruction it fetched in cycle 2.
Instruction i + 1 cannot leave the IF stage until after the branch resolves
because the exercise specifies the pipeline is only capable of using the IF stage
while a branch is being resolved.
Combining all of our information on control flow instruction type, stall
cycles, and frequency leads us to Figure S.46. Note that this figure accounts
for the taken/not-taken nature of conditional branches. With this information
we can compute the stall cycles caused by control flow instructions:
Pipeline stalls real = ( 1 1% ) + ( 2 9% ) + ( 1 6% ) = 0.24
where each term is the product of a frequency and a penalty. We can now plug
the appropriate value for Pipeline stallsreal into Equation S.1 to arrive at the
pipeline speedup in the real case:
1
Pipeline speedup ideal = ------------------- ( 4.0 ) = 3.23
1 + 0.24
(S.3)
Clock cycle
Instruction
Taken branch
IF
i+1
ID
EX
WB
IF
stall
IF
i+2
stall
i+3
ID
...
stall
IF
...
stall
stall
...
Figure S.44 Effects of a taken conditional branch on the pipeline.
Clock cycle
Instruction
Not-taken branch
IF
ID
EX
WB
IF
stall
ID
EX
...
stall
IF
ID
...
stall
IF
...
i+1
i+2
i+3
Figure S.45 Effects of a not-taken conditional branch on the pipeline.
Finding the speedup of the ideal over the real pipelining speedups from Equations S.2 and S.3 leads us to the final answer:
4
Pipeline speedup without control hazards = ---------- = 1.24
3.23
Frequency
(per instruction)
Stalls
(cycles)
1%
Conditional (taken)
15% 60% = 9%
Conditional (not taken)
15% 40% = 6%
Control flow type

Jumps and calls
Figure S.46 A summary of the behavior of control flow instructions.
Thus, the presence of control hazards in the pipeline loses approximately

24% of the speedup you achieve without such hazards.
b. We need to perform similar analysis as in Exercise C.2 (a), but now with
larger penalties. For a jump or call, the jump or call resolves at the end of
cycle 5, leading to 4 wasted fetches, or 4 stalls. For a conditional branch that

is taken, the branch is not resolved until the end of cycle 10, leading to 1
wasted fetch and 8 stalls, or 9 stalls. For a conditional branch that is not
taken, the branch is still not resolved until the end of cycle 10, but there were
only the 8 stalls, not a wasted fetch.
Pipeline stallsreal = (4 1%) + (9 9%) + (8 6%) = .04 + .81 + .48 = 1.33
Pipeline speedupreal = (1/(1 + 1.33)) (4) = 4/2.33 = 1.72
Pipeline speedupwithout control hazards = 4/1.72 = 2.33
If you compare the answer to (b) with the answer to (a), you can see just how
important branch prediction is to the performance of modern high-performance
deeply-pipelined processors.
C.3
a. 2ns + 0.1ns = 2.1ns

b. 5 cycles/4 instructions = 1.25
c. Execution Time = I CPI Cycle Time
Speedup = (I 1 7)/(I 1.25 2.1) = 2.67
d. Ignoring extra stall cycles, it would be: I 1 7/I 1 0.1 = 70
C.4
a. Calculating the branch in the ID stage does not help if the branch is the one
receiving data from the previous instruction. For example, loops which exit
depending on a memory value have this property, especially if the memory is
being accessed through a linked list rather than an array. There are many correct answers to this.
LOOP:
LW
ADD
LW
BNE
R1,
R3,
R2,
R2,
4(R2)
R3, R1
0(R2)
R0, LOOP
# R1 = r2->value
# sum = sum + r2->value
# r2 = r2->next
# while (r2 != null) keep
looping
The second LW and ADD could be reordered to reduce stalls, but there would
still be a stall between the LW and BNE.
C.5
(Same as C.4)
C.6
(Same as C.4)
C.7
a. Execution Time = I CPI Cycle Time

Speedup = (I 6/5 1)/(I 11/8 0.6) = 1.45
b. CPI5-stage = 6/5 + 0.20 0.05 2 = 1.22,
CPI12-stage = 11/8 + 0.20 0.05 5 = 1.425
Speedup = (1 1.22 1)/(1 1.425 0.6) = 1.17
C.8C.11
(Same as C.7)

a. 21 5 = 16 cycles/iteration
C.12
1
L.D F6, 0(R1)
10 11 12
13 14 15 16
17
18 19 20 21
IF ID EX MM WB
MUL.D F4, F2, F0
IF ID s
LD F6, 0(R2)
IF s
ADD.D F6, F4, F6
M1 M2 M3 M4 M5 M6 M7 MM WB
ID EX MM WB
IF
S.D 0(R2), F6
ID s
A1 A2 A3 A4 MM WB
IF s
ID
EX MM WB
IF
ID s
EX MM WB
IF
ID
EX MM WB
IF
ID EX MM WB
DADDIU R1, R1, #

DSGTUI
BEQZ
b. (Same as C.12)
c. (Same as C.12)
C.13
(Same as C.12)
C.14
a. There are several correct answers. The key is to have two instructions using
different execution units that would want to hit the WB stage at the same time.
b. There are several correct answers. The key is that there are two instructions
writing to the same register with an intervening instruction that reads from
the register. The first instruction is delayed because it is dependent on a longlatency instruction (i.e. division). The second instruction has no such delays,
but it cannot complete because it is not allowed to overwrite that value before
the intervening instruction reads from it.

Appendix C

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Appendix C

Uploaded by

Copyright:

Available Formats

2

Solutions to Case Studies and Exercises

DADDI R1, R1, #1

DADDI R2, R2, #4

DSUB R4, R3, R2

DADDI R1, R1, #1

DADDI R2, R2, #4

Copyright 2012 Elsevier, Inc. All rights reserved.

DADDI R1, R1, #1

DADDI R2, R2, #4

DSUB R4, R3, R2

g. CPI of 5-stage pipeline: 796/(99 6) = 1.34.

Solutions to Case Studies and Exercises

Copyright 2012 Elsevier, Inc. All rights reserved.

Figure S.43 Effects of a jump of call instruction on the pipeline.

Copyright 2012 Elsevier, Inc. All rights reserved.

Solutions to Case Studies and Exercises

Figure S.44 Effects of a taken conditional branch on the pipeline.

Figure S.45 Effects of a not-taken conditional branch on the pipeline.

Conditional (not taken)

Control flow type

Figure S.46 A summary of the behavior of control flow instructions.

Thus, the presence of control hazards in the pipeline loses approximately

Copyright 2012 Elsevier, Inc. All rights reserved.

cycle 5, leading to 4 wasted fetches, or 4 stalls. For a conditional branch that

a. 2ns + 0.1ns = 2.1ns

a. Execution Time = I CPI Cycle Time

Copyright 2012 Elsevier, Inc. All rights reserved.

Solutions to Case Studies and Exercises

MUL.D F4, F2, F0

ADD.D F6, F4, F6

DADDIU R1, R1, #

Copyright 2012 Elsevier, Inc. All rights reserved.

You might also like