In class notes

Report
Exploiting Parallelism
• As we saw last week, dynamic instruction issue allows us
to exploit as much parallelism as exists in instruction code
– Our stalling situations are limited to finite resources, cache
misses and branches
– Data dependencies do not limit instruction issuing, they merely
cause instructions to wait in reservation stations
• enough hardware removes or reduces impact of data dependencies
– We explore cache optimizations later in the semester
– Later in this lecture, we examine ways to improve branch
handling
• The term instruction-level parallelism (ILP) is used to
denote how much parallelism exists in straight line code
(code between branches)
– Since branches occur often, we use loop unrolling and other
schemes to provide parallelism across branches, this is known
as loop level parallelism (LLP)
Multiple Issue
• The next step to further exploit ILP is to issue multiple
instructions per cycle
– Send 2 or more instructions in 1 cycle to the execution units
(reservation stations)
• we call such a pipeline a superscalar pipeline
• Two approaches
– Static scheduling – instructions are grouped together by the
compiler
– Dynamic scheduling – using a Tomasulo-based architecture,
instructions are issued in the order they are retrieved as found in
the instruction queue, but if we can issue more than 1 at a time,
we do
• Simple example:
– Issue 1 integer and 1 FP operation since they do not typically
interact and so would not have dependencies
• loads/stores are considered int even if the datum is FP
• Superscalars can be set up to issue up to 8 instructions per
clock cycle
Loop:
L.D
L.D
ADD.D
L.D
L.D
ADD.D
SUB.D
DADDI
DADDI
DADDI
DADDI
DADDI
S.D
DSUBI
BNE
F0, 0(R1)
F1, 0(R2)
F2, F0, F1
F3, 0(R3)
F4, 0(R4)
F5, F3, F4
F6, F3, F2
R1, R1, #8
R2, R2, #8
R3, R3, #8
R4, R4, #8
R5, R5, #8
F6, 0(R5)
R6, R6, #1
R6, Loop
Example
Integer
Loop:
L.D
L.D
L.D
L.D
DADDI
DADDI
DADDI
DSUBI
DADDI
DADDI
BNE
S.D
FP
F0, 0(R1)
F1, 0(R2)
F3, 0(R3)
F4, 0(R4)
R1, R1, #8
R2, R2, #8
R3, R3, #8
R6, R6, #1
R4, R4, #8
R5, R5, #8
R6, Loop
F6, -8(R5)
ADD.D F2, F0, F1
ADD.D F5, F3, F4
SUB.D F6, F3, F2
Stalls for the single issue code would be 1 after each of L.D F0, L.D F4, 3 after
ADD.D, 1 after DSUBI and 1 in the branch delay slot, scheduling can remove
a number of the stalls, using the 2 issue superscalar (without loop unrolling) provides
the code to the right, which consists of 13 cycles of 15 instructions, a CPI = .87
Adding Loop Unrolling
• Given the loop below, scheduling it for a superscalar
would be futile
– The only FP operation has a data dependence with the
instructions that precede and succeed it
– The best we could do is to schedule the DADDI and
ADD.D together, moving the S.D down
• to resolve this, we need to add loop unrolling to the process so that
we have enough instructions to schedule
• assume the ADD.D has a latency of 2 cycles between it and the
S.D, we need to have 2 additional operations between it and the
S.D
Loop: L.D
F2, 0(R1)
ADD.D
F2, F2, F0
Solution: unroll the loop at least
S.D
F2, 0(R1)
3 iterations worth and schedule
DADDI
R1, R1, #8
BNE
R1, R2, Loop
Solution
Integer
FP
Loop:
L.D
F2, 0(R1)
L.D
F4, 8(R1)
L.D
F6, 16(R1)
ADD.D
L.D
F8, 24(R1)
ADD.D
DADDIR1, R1, #32
ADD.D
S.D
F2, -32(R1)
ADD.D
S.D
F4, -16(R1)
S.D
F6, -8(R1)
BNE R1, R2, Loop
S.D
F8, 0(R1)
• CPI for this code is 14 instructions in 11 cycles
F2, F2, F0
F4, F4, F0
F6, F6, F0
F8, F8, F0
– or 11/14 = .79
– with a superscalar that we can achieve a CPI < 1
• Note that if we unrolled the loop 3 times instead of 4, we would have 1
cycle where there was no int operation
• Also note that the FP adder must be pipelined for this to work, from here
on we will assume all FP functional units are pipelined
Another Solution
• Assume we can issue any combination of instructions (that
is, int or FP on either side of the superscalar)
– Would this alter our solution?
– We could still have the same solution, but we could also do this
Loop:
Instruction 1
L.D
F2, 0(R1)
L.D
F6, 16(R1)
L.D
F10, 32(R1)
L.D
F14, 48(R1)
ADD.D F2, F2, F0
ADD.D F6, F6, F0
ADD.D F10, F10, F0
ADD.D F14, F14, F0
S.D
F2, -56(R1)
S.D
F6, -40(R1)
S.D
F10, -24(R1)
S.D
F12, -16(R1)
Instruction 2
L.D
F4, 8(R1)
L.D
F8, 24(R1)
L.D
F12, 40(R1)
DADDIR1, R1, #56
ADD.D F4, F4, F0
ADD.D F8, F8, F0
ADD.D F12, F12, F0
S.D
S.D
BNE
S.D
F4, -48(R1)
F8, -32(R1)
R1, R2, Loop
F14, -8(R1)
Dynamic Scheduling
• With static issue, the compiler is tasked with finding
instructions to schedule
• With dynamic issue, the instructions must already be
available to issue
– Imagine our IF stage fetches 2 instructions per cycle, what
are the chances that one is an int and one is a FP
operation?
– Can the hardware search through the instruction queue (the
fetched instructions) to select two instructions?
• there may not be enough time for that
– Instead, we could issue 1 instruction per cycle, 2 if the
instructions are int & FP
• the following example demonstrates a typical loop being
dynamically unrolled and issued 2 instructions at a time (without
the 1 int + 1 FP restriction)
Example On Tomasulo
Loop #
Instr
Issues
Executes
Memory
CDB
Comments
1
L.D F0, 0(R1)
1
2
3
4
First issue
1
ADD.D
1
5
8
Wait for L.D
1
S.D F4, 0(R1)
2
3
1
DSUBI R1, R1, #8
2
4
1
BNE R1, R2, Loop
3
6
2
L.D F0, 0(R1)
4
7
2
ADD.D
4
10
2
S.D F4, 0(R1)
5
8
2
DSUBI R1, R1, #8
5
9
2
BNE R1, R2, Loop
6
11
3
L.D F0, 0(R1)
7
12
3
ADD.D
7
15
3
S.D F4, 0(R1)
8
13
3
DSUBI R1, R1, #8
8
14
3
BNE R1, R2, Loop
9
16
9
Wait for ADD.D
5
Wait for ALU
Wait for DSUBI
8
9
Wait for BNE
13
Wait for L.D
14
Wait for ADD.D
10
Wait for ALU
Wait for DSUBI
13
14
Wait for BNE
18
Wait for L.D
19
Wait for ADD.D
15
Wait for ALU
Wait for DSUBI
Resource Usage for Loop
Clock #
Integer ALU
2
1/L.D
3
1/S.D
4
1/DSUBI
5
FP ALU
Data cache
CDB
1/L.D
1/L.D
1/ADD.D
1/DSUBI
7
2/L.D
8
2/S.D
2/L.D
1/ADD.D
9
2/DSUBI
1/S.D
2/L.D
10
2/ADD.D
2/DSUBI
12
3/L.D
13
3/S.D
3/L.D
2/ADD.D
14
3/DSUBI
2/S.D
3/L.D
15
3/ADD.D
3/DSUBI
18
19
3/ADD.D
3/S.D
VLIW Approach
• Related to the static superscalar approach is one
known as the Very Long Instruction Word
• Here, the compiler bundles instructions that can be
issued together so that they can be fetched and issued
as a bundle
– Unlike the superscalar, the number of instructions
scheduled together can be greatly increased
– The combination of instructions scheduled together will
reflect the available hardware units, for instance
•
•
•
•
Up to 2 load/stores
Up to 2 FP
Up to 2 integer
1 branch
• The instructions are bundled together so that the CPU
fetches and issues 1 VLIW at a time
– Loop unrolling is required to support this
Example: 5 Instruction VLIW
Memory 1
Memory 2
L.D F0, 0(R1)
L.D F6, -8(R1)
L.D F10, -16(R1)
L.D F14, -24(R1)
L.D F18, -32(R1)
L.D F22, -40(R1)
L.D F26, -48(R1)
S.D F4, 0(R1)
S.D F8, -8(R1)
S.D F12, -16(R1)
S.D F16, -24(R1)
S.D F20, 24(R1)
S.D F24, 16(R1)
S.D F28, 8(R1)
FP 1
FP 2
ADD.D F4, F0, F2
ADD.D F8, F6, F2
ADD.D F12, F10, F2
ADD.D F16, F14, F2
ADD.D F20, F10, F2
ADD.D F24, F22, F2
Int
ADD.D F28, F26, F2
DSUBI
R1, R1,
#56
BNE R1,
R2, Loop
Note that we do not fill the branch delay slot because we are using
speculation. This code executes 7 iterations worth of the loop in only 9
cycles, providing a CPI of 9 / 23 = .39.
Multi-Issue Processor Approaches
Name
Issue
Structure
Hazard
Detection
Scheduling
Distinguishing Examples
characteristics
Superscalar
Dynamic
Hardware
Static
In-order
execution
MIPS and
ARM
Superscalar
Dynamic
Hardware
Dynamic
Some out-oforder
execution, no
speculation
None
Superscalar
Dynamic
Hardware
Dynamic
with
speculation
Out-of-order
completion,
speculation
Intel Pentium,
Intel Core
i3/i5/i7, AMD
Phenom, IBM
Power 7
VLIW
Static
Software
Static
Compiler finds TI C6x
all hazards
EPIC
Static
Software
Static
same
Itanium
Supporting Superscalar
• By expanding our architecture to a superscalar, we
confront a serious problem
– Recall from appendix A that branches make up perhaps
16% of a benchmark
– Assuming a uniform distribution of instructions, this
means that a branch occurs roughly every 6 instructions
• for FP benchmarks, it may be more like every 8-9 instructions
– There is just not enough instructions within a basic block
(between branches) to obtain enough instructions to issue
multiple instructions per cycle
– Loop unrolling is one approach to handle this, but in
general, to support multiple issue, we need instruction
speculation
• that is, we need to speculate which instruction(s) should next be
fetched and/or scheduled
Branch Handling Schemes
• For our pipeline, we used a branch delay slot or
assumed branches were not taken
– We could also use branch prediction to predict taken or not
taken to control whether we immediately take the branch
as soon as the new PC value is taken or not
• this is only useful in longer pipelines where the branch location is
computed earlier in the pipeline than the branch condition
• the prediction is made using a cache, addressed by the
instruction’s address (or some bits of the address)
• Static branch prediction
– Accumulate statistics on each branch in a program and
base the branch prediction on those earlier runs
• see figure C.17 (p C-27) which demonstrates miss-prediction rates
on some SPEC92 benchmarks of between 5% and 22%
– Static approach is not ideal because it does not consider
the behavior of the current program
Dynamic Branch Prediction
• For each branch, we store a single bit to indicate if
that branch was last taken or not taken
– If the buffer (cache) says “taken”, we will assume the
branch should be taken and if the buffer says “not taken”,
we will assume the branch should not be taken
– If the prediction is incorrect, we modify the buffer
• There is a problem with this simple approach
– Assume we have a for loop which iterates 100 times
• the first prediction is “not taken”
– we have never been here before or the last time we were here, we exited
the for loop and the branch was not taken
• we reset our prediction bit to “taken” and predict correctly for the
next 98 iterations
• in the last iteration, we predict taken but the branch is not taken
• 2 miss-predictions out of 100 (98%), can we improve?
Improved Scheme
• An alternative approach is to use a 2-bit scheme
– In general, it takes 2 miss-predictions to change our next
prediction so we would miss-predict 1 time out of 100
– We can extend this to be an n-bit scheme which store the result
of n previous branches, if the value >= 2n-1 / 2, we predict taken
See table C.19 which
shows the accuracy of
the 2-bit predictor on
several SPEC89
benchmarks – average
miss-prediction is
about 7%
An n-bit predictor
provides little
improvement over
the 2-bit scheme
Other Branch Prediction Schemes
• We want to improve over even
the 2-bit scheme because of
the amount of speculation
required to support our
superscalar
• Consider the code above to the
right
– An analysis shows that if the
first two branches are taken,
then the third branch will not be
taken
– The equivalent MIPS code
might look like that shown to
the right
if(aa==2) aa=0;
if(bb==2) bb=0;
if(aa!=bb) …
DSUBI
BNEZ
DADD
L1: DSUBI
BNEZ
DADD
L2: DSUB
BEQZ
R3, R1, #2
R3, L1
R1, R0, R0
R3, R2, #2
R3, L2
R2, R0, R0
R3, R1, R2
L3
Correlating Branch Predictors
• Although the previous example is probably not very
common, it demonstrates that some branches can be
predicted, at least in part, in correlation with other branches
– A compiler may recognize the correlation
• The problem with the branch prediction scheme (whether 1bit, 2-bit or n-bit) is that it stores information only local to
the one branch
– Another approach is to have several predictors and choose the
predictor based on the result of several recent branches
• A correlating (or 2-level) predictor can provide more
accurate predictions
– The top level contains the locations of various predictors based
on the result of more than one branch
– Based on the top level, a different prediction is consulted
– Both the top level and the branch predictor can be altered over
time
Continued
• An (m, n) predictor uses the behavior of the last m
branches
– to choose from 2m branch predictors
– each branch predictor is an n-bit predictor
• A (1, 2) predictor only uses the behavior of the last branch
and selects a 2-bit predictor, this is what we had previously
• A (2, 2) predictor might be useful for our previous example
of if(aa!=bb) because it will select a 2-bit predictor
– based on whether aa==2 or aa!=2 and whether bb==2 or bb!=2
• An (m, n) predictor scheme will require a buffer of 2m * n
* k bits where k is the number of branches that we want to
store predictions for
– If we wanted to use a (2, 2) prediction scheme for 1K branches,
it would take 22 * 2 * 1K = 8Kbits or 1KByte
• remember, this is stored in an on-chip cache
Tournament Predictors
• The tournament prediction
scheme extends the correlating
predictor by having multiple
predictors per branch
– At least one global and one local
– The prediction is created as a
combination of the various
predictors
• we won’t go over the details here
– The chart to the right compares 2bit predictor schemes, the top two
being non-correlating predictors of
4096 and unlimited entries and the
bottom being a (2, 2) correlating
predictor using both local and
global data
More Results
Miss-prediction rates given
total predictor size and
type of predictors
Overall miss-prediction rate
FP predictions are more
accurate than int predictions
probably because there are
more and lengthier loops in
FP benchmarks
Branch Target Buffer
• Our prediction schemes so far only store the prediction
itself, not where the branch, if taken, is taking us
– For some architectures, a pure prediction without a target
location makes sense, recall the MIPS R4000 pipeline where
the branch target was computed earlier in the pipeline than the
branch prediction
– However, for our superscalars, it would be necessary for us to
know the predicted PC value as well as the prediction
The branch target buffer,
shown to the right, stores
for each PC value, a predicted
PC value
To be most efficient, we
would only want to
cache branch instructions
Using the Branch Target Buffer
• We fetch the branch prediction & target location in
the IF stage from the buffer (an on-chip cache) at the
same time we fetch the instruction
– Is instruction a branch and branch is predicted as taken?
• immediately change the PC to be the target location
• if cache hit and successful prediction, penalty = 0
– Is instruction a branch and predicted to not be taken?
• just increment PC as normal
• if cache hit and successful prediction, penalty = 0
– Is instruction not a branch?
• increment PC, no penalty
– We have two sources of branch penalty
• the branch is predicted as taken but is not taken (this is a buffer hit
with a miss-prediction)
• the branch is taken but is not currently in the buffer (a miss)
Performance of Branch Target Buffer
• In MIPS, the penalty is a 2 cycle stall
– One cycle to know the result of the branch as computed in
the ID stage
– One cycle to modify the cache (on a miss or missprediction)
• many high performance processors have lengthy pipelines with
branch penalties on the order of 15 cycles instead of 2!
• Assume a prediction accuracy of 90%
• Assume a buffer hit rate of 90%
– For the two sources of stalls, we have
• hit & miss-prediction = .90 * .10 = .09
• miss = .10
– Branch penalties then are (.09 + .10) * 2 = .38
– From appendix C, we averaged about between .38 and .56
cycles of penalties from our various schemes
Two Other Speculation Techniques
• Branch folding
– Instead of just storing the branch prediction and the target
location in a branch target buffer, you also store the
destination instruction
• in this way, a successfully predicted branch not only has a 0 cycle
penalty but actually has a penalty of -1 because you are fetching
the new instruction at the same time you are fetching the branch
instruction!
• Return address predictors
– By using a small cache-based stack, you can push return
addresses and so return statement locations can be
predicted just as any other branch
• these are indirect jumps and by using the stack, you save on an
extra potential memory (or register) reference
– Miss-prediction rates vary but a buffer of 8 elements
offers a low miss-prediction rate of under 10%
Integrated Instruction Fetch Unit
• Returning to the idea of a dynamically scheduled
superscalar, we see that not only do we have to fetch 2
(or more) instructions per cycle in order to have
instructions to issue
– But if an instruction is a branch, we must quickly decide
whether to and where to branch to
– The IIFU is a unit whose job is to feed the pipeline with
speculated instructions at a rate that will permit the issue
stage to issue as many instructions as are permissible
– This unit must perform
• branch prediction through a branch buffer
• instruction prefetch to ensure that instructions are always available
in the instruction queue
• memory access and buffering which might involve fetching
multiple cache lines in one cycle
Implementing Speculation
• For the MIPS pipeline, a miss-prediction is
determined in the ID stage (prediction fetched in IF)
– a 2 cycle penalty arises (1 cycle if we don’t modify the
buffer) which can be resolved simply by flushing 1
instruction and stalling 1 cycle
• In most processors, a miss-prediction may not be
recognized for several cycles, and cleaning up
afterward may be a challenge
– Consider for instance that the branch relies on a
computation and that the computation contains a RAW
hazard
– Later, speculated instructions may be issued and executed
with their results forwarded to other reservation stations,
register, and even memory
– How do we fix this problem of tracking down what was
changed and reset it?
Re-Order Buffer
• The challenge arises because we have out-of-order
completion of instructions
• The Re-Order Buffer (ROB) stores all instructions
issued in the order they were issued, not the order
they were completed
– Results are not sent to the register file or memory, they are
sent to the ROB
– When an instruction reaches the front of the ROB
• if it is not a speculated instruction, or it is a speculated instruction
whose speculation has been confirmed, then it is allowed to store
its results and leave the ROB
• if it is a branch, we must wait until the result of the branch is
known – if speculation was correct, remove the branch
• if speculation was incorrect, all instructions in the ROB after it
were the result of the miss-speculation, flush them all
Superscalar Tomasulo + ROB
New Implementation
• With the ROB in place, we have a slightly different
approach
– Instructions fetched as before
– Instructions issued to reservation stations as are available,
using register renaming to avoid detected WAW and WAR
hazards
• issue stalls if there are no available reservation stations OR if the
ROB is full
• the instruction’s ROB slot number is used for forwarding purposes
both of this instruction’s results and any other reservation stations
which will forward results to this instruction
– Execution as before
– Write result sends the result on the CDB and to the ROB
but not directly to the register file or store unit
– Commit – this stage means that the instruction has reached
the front of the ROB and was speculated correctly
• an instruction commits by forwarding its result to the register file
and/or store unit, and leaving the ROB
Tomasulo With Speculation
Cycle
Instruction
Issue
Exec
Mem Acc
CDB
Commit
Comments
1
LD R2, 0(R1)
1
2
3
4
5
First issue
1
DADDIU R2, R2, #1
1
5
6
7
Wait for LD
1
SD R2, 0(R1)
2
3
7
Wait for add
1
DADDIU R1, R1, #4
2
3
8
Commit in order
1
BNE R2, R3, Loop
3
7
8
Wait for add
2
LD R2, 0(R1)
4
5
7
9
No delay
2
DADDIU R2, R2, #1
4
8
9
10
Wait for LD
2
SD R2, 0(R1)
5
6
10
Wait for add
2
DADDIU R1, R1, #4
5
6
11
Commit in order
2
BNE R2, R3, Loop
6
10
11
Wait for add
3
LD R2, 0(R1)
7
8
10
12
No delay
3
DADDIU R2, R2, #1
7
11
12
13
Wait for LW
3
SD R2, 0(R1)
8
9
13
Wait for add
3
DADDIU R1, R1, #4
8
9
14
Commit in order
3
BNE R2, R3, Loop
9
13
14
Wait for add
4
6
7
9
10
Implementation Issues
• We need each of the following steps to take place in 1 cycle
apiece
– Fetch some number of instructions from instruction cache and
branch prediction/target buffer
• apply the prediction
• update the buffer on miss-predictions and cache misses
– Find instructions from the queue that can be issued in parallel
•
•
•
•
are reservation stations available for the collection of instructions?
is there room in the ROB?
handle WAW/WAR hazard through register renaming
move instructions to ROB
– Execute at the functional units as data become available
(number of cycles varies by type of functional unit)
– Send results out on CDB if it is not currently busy
– Commit the next instruction(s) in the ROB
• forwarding data to register/store units
• with the multi-issue, we might want to commit > 1 instruction per cycle
or the ROB becomes a potential bottleneck
How Much Can We Speculate?
• Speculation requires having a number of instructions in the
instruction queue
– This requires having an IIFU and the ability to grab several
instructions per cycle (a wide bus)
– It may also require non-blocking caches or caches that can respond
with multiple refill lines per cycle
• The logic to issue instructions requires searching for resource
conflicts
– reservation station limitations
– restrictions on the instructions that can be issued at the same time,
e.g., 1 FP and 1 int or only 1 memory operation
• And additionally, the logic in the issue stage must handle
WAR/WAW hazards/register renaming
– There is a limitation in terms of the time it takes to perform these
tasks and the energy requirements
– Today, an 8-issue superscalar is the largest number so far proposed
• in practice, only “up to 6-issue” superscalars exist and most can issue on the
order of 2-4 instructions per cycle
Window Size
• If the issue unit must look at n instructions (n is our
window size) to find those that can be issued
simultaneously, it requires (n-1)2 comparisons
• A window of 4 instructions would require 9 comparisons, a
window of 6 instructions would require 25 comparisons
• These comparisons must take place in a short amount of
time so that the instructions can be issued in the same
cycle and data forwarded to reservation stations and ROB
• That’s a lot of work in 1 cycle!
– We explore a study on ILP limitations in 2 weeks
including the limitations imposed by window sizes
• Modern processors have a window size of 2-8 instructions
although the most aggressive has a window 32
– remember window size how many instructions can be compared, not the
number issued
Cost of Miss-speculation
• The question comes down to
– Is the amount of effort of executing miss-speculated
instructions and the recovery process more time consuming
than the time spent waiting on a branch condition’s result if
we did not speculate?
– The answer in most cases is no, the speculation route is by
far more effective
– This is especially true of FP benchmarks where the amount
of miss-speculated code is commonly below 5%
• for integer benchmarks, the amount of time spent on missspeculated code can be as high as 45%
– Since cleaning up after a miss-speculation is largely a
matter of flushing the ROB and refilling the instruction
queue, recovery itself is quick and not very complicated
Speculating Through Branches
• The idea of speculation is to predict the behavior of a
branch in order to immediately start fetching
instructions from the speculated direction of the
branch
• To promote ILP and LLP in the face of a dynamically
scheduled multiple issue processor, we may wind up
having to speculated across branches
– That is, speculating a second branch before the first
branch’s speculation is determined
• consider for instance nested for-loops or a loop which contains a
single if-else statement
• if we are attempting to issue say 6 instructions per cycle, we may
find 2 branches within 1 cycle’s worth of instructions!
– Speculation recovery becomes more complicated and to
date, no processor has implemented a full recovery scheme
for multiple branch speculation
– We will consider compiler-solutions in a couple of weeks

similar documents