In class notes

Report
Dynamic Scheduling
• Pipelines rely on instruction flow as scheduled by the
compiler (whether optimized or not)
– We could continue to fetch instructions as scheduled by the
compiler, but we could use hardware to anticipate
problems and dynamically schedule the instructions to the
execution units (here, referred to as functional units)
– This was the approach taken in the CDC 6600 in the late
60s, using a new piece of hardware called the scoreboard
– The ID unit has an additional duty
• issuing the instruction to the proper functional unit when the
circumstances permit it
– The scoreboard keeps track of the waiting and executing
instructions, monitoring data dependencies and structural
hazards
• this leads to out-of-order completion of instructions, but as long
as data dependencies are maintained, this is only a problem when
we have exceptions (we’ll worry about that later)
Example to Motivate the Scoreboard
• Consider the following code:
–
–
–
–
L.D
MUL.D
ADD.D
DIV.D
F0, 0(R3)
F2, F0, F10
F4, F2, F12
F6, F8, F14
• In the MIPS FP pipeline, we have a data hazard after the
L.D resulting in a stall and a data hazard after the MUL.D
resulting in 6 stalls
– The ADD.D cannot move into A1 until F2 has been forwarded
out of MUL.D’s M7 stage, thus the ADD.D stalls for 6 cycles in
ID and DIV.D stalls for 6 cycles in IF
– There is no need for the DIV.D to stall since it is not data
dependent on previous instructions, but there is a structural
hazard in that ID cannot permit DIV.D to move into it while
ADD.D is waiting
– The scoreboard gets around such problems
• The IF stage is the same
• The ID stage is now broken
into a two stages:
decode/issue and read
operands
The New MIPS
Pipeline
– Decode/issue decodes the
instructions and tracks data
dependencies, and
manipulates the scoreboard
– It then passes the instruction
onto the proper functional unit
for execution
– Read operands is moved to
the functional units to obtain
register values right from the
register file, or from
functional units that forward
their results
We might have additional FUs such
– Functional units will write
as another FP + or FP * as needed
directly to the register file
thus performing the WB step
Issuing Logic
• Given an instruction, examine its
– Source registers – do either values need to be forwarded from
other FUs?
– Destination register – does the result get forwarded to another
FU?
• here, we have to be aware of WAW and WAR hazards
– Instruction type – is a functional unit available?
• If a functional unit is available and no other active
instruction has the same destination register, then issue the
instruction
– If another instruction has this destination register, then it
creates a potential WAW hazard
– If either no functional unit is available or a WAW hazard exists,
we postpone issuing this instruction which results in stalling
the pipeline
• but notice that we no longer stall the pipeline for situations that should
not require it like the prior example
WAW and WAR Hazards
• In the 5-stage pipeline, the only type of data hazards was
the RAW hazard (read after write)
– LD
R1, 0(R2)
– DADDI R3, R1, #1
• we are supposed to read R1 after we write R1 but because of where LD
accesses the data cache, if we didn’t prevent it, we would read R1 in
DADDI’s ID stage before the LD could write to R1
• The WAW hazard is a write after write – the writes occur in
the wrong order
– This is only possible if we permit out of order completion, but
that is necessary when we introduce floating point, multiply and
divide operations to the 5 stage pipeline
• The WAR hazard is a write after read but performed in the
wrong order because the read has been postponed
– This can happen with the scoreboard
– We’ll visit a WAR example in a little bit, after we explore how
the scoreboard works
Scoreboard Structure
• The scoreboard is a data structure which
maintains three types of information
– Instruction status (which stage issued instruction is in)
– Functional unit status (for each functional unit, its
status, where its destination and source registers are
from, and boolean flags to indicate whether the source
values are available yet)
– Register result status (for each register, which
functional unit might be writing to it)
–a
How the Scoreboard Works
• IF stage – same as in the pipeline, just fetch next
instruction
• Issue stage – take instruction from IF stage
– If functional unit is not busy and destination register
does not have a pending result (WAW hazard) issue
instruction
•
•
•
•
Busy  ‘yes’
OP  operation,
Fi  destination register, Fj/FK  source registers
Qj/Qk  functional unit producing the source registers (if
any)
• Rj/Rk  ‘yes’/‘no’ depending on whether source value is
available yet
• Result register for Fi  functional unit name
Continued
• Read operands
– Occurs at the functional unit when Rj & Rk are set to
yes
• Rj/Rk  ‘no’, Qj/Qk  0, Fj/Fk  values
• Execution takes place in the functional unit
– Execution complete is set to true when functional unit
completes execution
• Write result
– This step takes place after execution is complete and
after any potential WAR hazard goes away (because of
a read taking place elsewhere)
– This stage takes a cycle by itself, so it happens the
cycle after execution completes
• Result register  result, busy  ‘no’, functional unit
becomes available
Given the code under
the instruction list
with the following
assumptions:
Execution time in
cycles:
Add = 2
Multiply = 10
Divide = 40
Register write and
register read of same
register occur in
separate cycles
Only 1 FU can read
operands at a time
FUs: 1 int unit,
1 FP div,
2 FP Mult,
1 FP adder
Example
Continued
• During cycles 2-4, second L.D cannot issue because it also requires the
same Integer functional unit, so the scoreboard just tracks the one L.D as it
goes through read operands, execution and write result stages
• Example
continues:
• Second L.D is
issued
• MUL.D is
issued while
L.D reads
operands
• SUB.D is
issued while
L.D executes
the load but
MUL.D cannot
yet read
operand F2 (F0
is read)
• At this point,
L.D’s
execution has
completed,
MUL.D is
waiting to read
F2, SUB.D is
waiting to read
F2, and DIV.D
is issued
Continued
• In cycle 9:
Continued
– L.D writes its results, MUL.D can waits 1 cycle to read the register
because the same register cannot be read while it is being written to
• In cycle 10:
– L.D leaves the scoreboard, MUL.D reads its register, SUB.D and
DIV.D continue to wait to read operands since they cannot yet read
operands while MUL.D does this, ADD.D is not issued because of
structural hazard with SUB.D (only one FP adder available)
• In cycle 11:
– SUB.D reads its operands while MUL.D begins execution, DIV.D
cannot yet read operands and ADD.D is still stalled
• In cycle 12:
– SUB.D begins execution while MUL.D continues execution, DIV.D
cannot read operands since F0 has not been produced by MUL.D
• In cycle 13:
– SUB.D finishes execution while MUL.D continues
• In cycle 14:
– SUB.D writes its result, MUL.D continues, DIV.D still cannot read
• In cycle 15:
Continued
– SUB.D leaves the scoreboard and now ADD.D is issued, MUL.D continues
to execute while DIV.D waits
• In cycle 16:
– ADD.D reads operands, MUL.D continues to execute while DIV.D waits
• In cycles 17:
– ADD.D executes, MUL.D finishes executing, DIV.D waits
• In cycle 18:
– ADD.D executes, MUL.D writes its result, DIV.D waits (it cannot read it
this cycle)
• In cycle 19:
– MUL.D leaves the scoreboard, ADD.D cannot yet write its result because of
a WAR conflict with DIV.D (if the ADD.D writes its result now, the DIV.D
could read the wrong value for F6), DIV.D reads its operands
• In cycle 20:
– DIV.D begins execution (this will take 40 total cycles), ADD.D writes its
result to F6
• In cycle 21-61:
– DIV.D continues execution, ADD.D leaves the scoreboard in 21, DIV.D
continues execution until 59, writes result in 60, leaves at 61
Benefits and Costs of the Scoreboard
• Benefits – dynamic nature minimizes stalls from structural
hazards due to data dependencies
• Costs –a little extra logic and data storage for the
scoreboard itself are both needed, the primarily cost is in
the added buses that permit forwarding from FUs to FUs
and registers
– About 4 times as many buses as in an ordinary pipeline
• Restrictions
– There is only so much parallelism available in the instruction
stream, other techniques might be employed to improve this
– Limitations based on the size of the scoreboard and number of
FUs
– WAW and WAR hazards still stall instructions in their FUs
• We can try to remove the WAR and WAW hazards using a
variation of the scoreboard
Re-examining Data Dependencies
• We will refer to three types of dependencies
– Data dependencies – as before, a datum is needed by a
later instruction but is not available from an earlier
instruction in time
• these can arise either because
– instruction i produces a result that is used by j but i has not produced it
by the time j needs it
– instruction j is data dependent on instruction k and k is data dependent
on instruction I
• these are RAW hazards as we saw with the 5-stage pipeline
– Name dependencies – two instructions reference the same
named entity but where there is no data dependency
• these are WAW and WAR hazards
• There are also control dependencies which we will
consider later
Continued
• If two instructions are data dependent, they must
be executed in their proper order
– The data dependency is a property of the program itself
(the logic) and the situation arises because of the
overlapped nature of the pipeline execution of the
instructions
• note that data dependencies are not limited to registers but
can also flow through memory, consider the following:
– L.D
– ADD.D
– S.D
F0, 0(R1)
F6, F0, F4
F2, 0(R1)
• although it is unlikely that any architecture would have the
store executing before the load, we must preserve the
dependency because the store will wipe out the value that
will be used in the ADD.D
Name Dependencies
• There are two types of name dependencies between
instructions i and j (i precedes j)
– Antidependence – instruction j writes to a name that i
reads but j finishes writing before i gets to read
– Output dependence – instruction i and j write to the same
named item without an intervening read, but the writes are
in the wrong order
• Name dependencies may not be obvious if you focus
on one iteration of a loop
– We will consider a means of enhancing ILP (instruction
level parallelism) by unrolling a loop across iterations
– In such a case, we can find additional name dependencies
because the same registers are referenced in each loop
iteration, but if we have 2 iterations worth of loop body,
we will find the registers used in the first iteration are also
referenced in the second
Register Renaming
• Recall that the compiler generates registers to be used by
the processor to compute values
– Registers are often selected using a graph coloring algorithm
that ensures unique data are placed in separate registers as much
as possible to retain as much data in the CPU as possible
– However, because of the pipeline and out of order execution
hardware, the compiler’s designation of data to registers may
cause conflicts
• data dependencies cannot be helped, we have to live with them
• name dependencies arise because of limitations on registers
– An optimizing compiler may be able to rename registers to
reduce the number of name dependencies that could potentially
arise
• Dynamic register renaming is a superior approach
– This can be handled using Tomasulo’s algorithm where potential
name dependencies are defeated through register renaming as
the instructions are issued to their functional units
Control Dependencies
• A control dependence arises due to conditional
branches
– Code along one branch should only be completed if the
condition dictates that the branch should be taken
– This means that you should not execute the if clause if the
condition is false, or bypass the else clause if the
condition is true, or that the loop body should be skipped
if the condition is false, etc
• However, with speculation, we may take or ignore
branches improperly and so we have to dynamically
determine if we speculated incorrectly and resolve it
– Control dependencies can lead to data and name
dependencies that would not normally exist
– Miss-speculation can also lead to exceptions (see the next
slide)
Control Dependence Problems
L1:
DADD R2, R3, R4
BEQZ R2, L1
LW
R1, 0(R2)
…
L2:
DADD
BEQZ
DSUB
…
OR
R1, R2, R3
R4, L2
R1, R5, R6
R7, R1, R8
• The code above to the left contains a data dependence
between the DADD and BEQZ but also DADD and LW
– Imagine that we speculate that the branch is not taken, this
would cause us to begin execution of LW
– Further imagine that we do not compute the condition for
numerous stages so that the LW completes first
• If we have miss-speculated, then R2 == 0 and we will have attempted to
load from memory location 0(R2) = 0, which is part of the OS and will
most likely cause a memory violation interrupt
• Thus, our miss-speculation directly caused an exception
• In the code above to the right, the value of R1 changes
depending on whether we took the branch or not, so a missspeculation leads to a data hazard
Loop Unrolling
• Compiler scheduling allows us to
remove a number of stalling
situations
– However in concise code, there may
be too few instructions to move
around
– Consider the code to the right
• there are stalls after L.D (1), ADD.D (2),
DSUBI (1) and the branch delay slot
– The best we can do to schedule the
code is shown to the right below, but
we cannot remove all stalls
• One strategy is to have the compiler
“unroll” the for loop some number
of iterations to provide additional
instructions to schedule
Loop:
L.D
ADD.D
S.D
DSUBI
BNE
F0, 0(R1)
F4, F0, F2
F4, 0(R1)
R1, R1, #8
R1, R2, Loop
Loop:
L.D
DSUBI
ADD.D
stall
BNE
S.D
F0, 0(R1)
R1, R1, #8
F4, F0, F2
R1, R2, Loop
F4, 8(R1)
The 2 stalls assumes no
collision in the MEM stage
Continued
• We can unroll the loop 4 times
– We will keep 4 iterations worth of L.D, ADD.D and S.D but only have 1
DSUBI and 1 BNE
– This requires adjusting displacements for our L.D and S.D operations and
the immediate datum for the DSUBI
– Below to the left is the unrolled code, to the right it is scheduled, no stalls!
Loop:
L.D
ADD.D
S.D
L.D
ADD.D
S.D
L.D
ADD.D
S.D
L.D
ADD.D
S.D
DSUBI
BNE
F0, 0(R1)
F4, F0, F2
F4, 0(R1)
F0, -8(R1)
F4, F0, F2
F4, -8(R1)
F0, -16(R1)
F4, F0, F2
F4, -16(R1)
F0, -24(R1)
F4, F0, F2
F4, -24(R1)
R1, R1, #32
R1, R2, Loop
Loop:
L.D
L.D
L.D
L.D
ADD.D
ADD.D
ADD.D
ADD.D
S.D
S.D
DSUBI
S.D
BNE
S.D
F0, 0(R1)
F6, -8(R1)
F10, -16(R1)
F14, -24(R1)
F4, F0, F2
F8, F6, F2
F12, F10, F2
F16, F14, F2
F4, 0(R1)
F8, -8(R1)
R1, R1, #32
F12, 16(R1)
R1, R2, Loop
F16, 8(R1)
Complications of Loop Unrolling
• Obviously we need far more registers
– This is sometimes referred to as register pressure
• If the loop iterates n times instead of 100, how do we know
we can unroll it 4 times?
– We could still unroll it 4 times and have a separate loop to
iterate the final 0-3 iterations as needed
• The compiler must adjust all memory displacements
appropriately and the immediate datum, although this may
be mathematically challenging to us, it’s a simple piece of
code for the compiler
– Our latencies for the example were made up to illustrate this
example, imagine if we still had a 7-cycle multiply, unrolling a
similar loop that performed MUL.D instead of ADD.D would
be more complex – you will do this in the homework
– Additionally, we must have a pipelined FP unit or else this
would not be worth doing – two consecutive ADD.Ds would
cause the second to stall
• what would happen if we were to do DIV.D instead of ADD.D?
Dynamic Scheduling
• We enhance the scoreboard concept with register renaming
– This is known as the Tomasulo architecture
• Our new pipeline distributes the 5 stages of MIPS
differently
– IF – as before, instructions are fetched and placed into an
instruction queue
– ID – separated into issue and read operand stages as with the
scoreboard
– EX – distributed pipelined functional units (e.g., FP adder, FP
mult, integer), each with its own register sets (plural) to store
multiple waiting instructions
– MEM – a separate memory unit with a load buffer and a store
buffer – instructions line up here to perform loads and stores
– WB – values are forwarded directly from functional unit to
functional unit, store buffer and registers via a common data
bus
MIPS FP Using Tomasulo’s Algorithm
• Issue
New Stages
– Next instruction from instruction queue
– Decode, check for structural hazards, if reservation station
available, issue to reservation station
• Read operands
– Operands forwarded from Load unit, Register file or other
Functional Unit to reservation station so that “read
operands” when available
• this means that we no longer wait on reading both operands when
they are available like we did in the scoreboard
• a reservation station stores operation, location operand sources
and destination location
– Instructions wait at reservation stations and can execute
out of order based on order that operands are received
– Register renaming is applied dynamically in the
reservation stations to avoid WAR and WAW hazards
• see the example on the next slide
Register Renaming Example
• In the code to the right
– F6 causes a RAW hazard between
ADD.D and S.D
• nothing can be done about this
– F6 causes a WAW hazard between
ADD.D to MUL.D
– F8 causes a WAR hazard between
ADD.D to SUB.D
• We rename F6 to S between the
ADD.D and S.D
– This does nothing to prevent the
RAW hazard, but prevents the WAW
hazard
• We rename F8 to T between the
SUB.D and MUL.D
– This removes the WAR hazard with
the ADD.D
DIV.D
ADD.D
S.D
SUB.D
MUL.D
F0, F2, F4
F6, F0, F8
F6, 0(R1)
F8, F10, F14
F6, F10, F8
DIV.D
ADD.D
S.D
SUB.D
MUL.D
F0, F2, F4
S, F0, F8
S, 0(R1)
T, F10, F14
F6, F10, T
As long as S continues to be
used for F6 and T continues
to be used for F8, there is no
problem with this renaming
• Execution
Stages Continued
– operands are “forwarded” to the reservation station
• once operands are available (must wait for any RAW hazards),
execution can begin in the functional unit
• if multiple instructions’ operands become available at the same
time, they can all begin execution simultaneously as long as they
are at different functional units
• if two instructions of the same functional unit become available at
the same time, the instruction to begin is selected randomly
(except for loads/stores which are handled in order)
– Any instruction that is control dependent is not allowed to
execute until the branch condition has been confirmed
• this may slow down processing but it ensures that miss-speculation
cannot cause an exception
• Write result
– Once a value has been computed, it is forwarded via CDB
to waiting reservation stations, store buffer and register file
• however if multiple functional units complete at the same time,
forwarding has to be done in sequential order
How Register Renaming Works
• We need to implement register renaming in hardware
• A register renamer (hardware) is added to the issue stage
– It maintains a list of temporary registers, called T registers
– It maintains a rename table
• When the renamer sees a destination register that causes a
WAW or WAR conflict
– It substitutes the destination register with the next available T
register and adds the old and new registers to the register table
– Now, for every new instruction’s source register that matches an
entry in the table, replace it with the T register
– The architecture might have as many as 64 T registers which are
rotated (when a register is no longer in use, it becomes available
to be reused)
– This requires examining the table each cycle
• the table is in reality a simple hardware-stored vector that can
accommodate an access per clock cycle
Example
• Consider the code to the
right
– Assume that the loop will be
unrolled in a Tomasulo-type
architecture
– How will register renaming
operate assuming that the
next available T register is
T9?
Loop: L.D
F4, 0(Rx)
MUL.DF2, F0, F4
DIV.D F8, F4, F2
L.D
F4, 0(Ry)
ADD.D F6, F0, F4
SUB.D F8, F8, F6
S.D
F8, 0(Ry)
In the first iteration becomes:
Loop: L.D
T9, 0(Rx)
MUL.DT10, F0, T9
DIV.D T11, T9, T10
T12, 0(Ry)
Each successive iteration of the loop will continue L.D
ADD.D T13, F0, T12
To use additional T registers, the next iteration
SUB.D T14, T11, T13
Will use T15-T20, followed by T21-T26, etc
S.D
T14, 0(Ry)
If we reach T63, we cycle around to T0
Example
L.D
F6, 32(R2)
L.D
F2, 44(R3)
MUL.DF0, F2, F4
SUB.D F8, F2, F6
DIV.D F10, F0, F6
ADD.D F6, F8, F2
Instruction
L.D
L.D
MUL.D
SUB.D
DIV.D
ADD.D
F6, 34(R2)
F2, 45(R3)
F0, F2, F4
F8, F2, F6
F10, F0, F6
F6, F8, F2
Name Busy Op
Load1 yes Load
They have changed the
Load2
execution times of all FP
Add1
operations to be 4
Add2
cycles so that
FP – FP delays are 3
Add3
cycles and FP – store
Mult1
delays are 2 cycles
Mult2
Field
Qi
Vj
Issue
*
Vk
R2
F0
Qj
F2
Write
Execute Result
Qk
F4
A
34+Regs[R2]
F6
Load1
F8
F10
Continued
Name Busy Op
Load1 yes L.D
Load2 yes L.D
Add1 yes
Add2
Add3
Mult1 yes
Mult2
Vj
Vk
R2
R3
Qj
Qk
SUB.D
Mem[45+ Mem[34+
Regs[R3]] Regs[R2]]
MUL.D
Mem[45+
Regs[R3]]
Regs[F4]
A
34+Regs[R2]
45+Regs[R3]
Register renaming is handled through forwarding from the functional unit(s)
responsible for delivering the operands, this is denoted through the Qj/Qk
fields above and the register status below
Field
Busy
F0
F2
F4
Mult1 Load2
F6
F8
F10
Load1 Add1
Dynamic Loop Unrolling
• The Tomasulo architecture improves over the Scoreboard
because instructions can complete without having to wait
on WAW and WAR hazards
• However, the real power behind the approach is the
dynamic issuing of instructions
– This allows instructions to be sent to reservation stations as
soon as they reach the front of the instruction queue unless a
functional unit’s reservation stations are all filled
• Rather than using a compiler to unroll and schedule a loop,
the hardware can execute many iterations of a loop without
waiting for one iteration to complete
– Recall in the pipeline, one iteration would stall the pipeline
waiting for completion because of the lengthy computation
times for FP operations
– The only disadvantage of the dynamic approach is that we are
not removing branch operations as we did with the compiler,
however, with proper branch speculation, we can minimize the
branches’ impacts on performance
Continued
• With dynamic loop unrolling, we have to worry about
an excessive number of WAR and WAW hazards
arising by the use of the same registers across loop
iterations
– Previously, the compiler took care of register renaming for
us by altering the registers used in each iteration
– Now, we have to rely on the register renamer and the T
registers
– How many iterations can we unroll? In part, it will be
restricted by the number of T registers available
• in our previous example, we used 6 T registers per iteration and
had 64 T registers, so at most, we could only accommodate 10
unrolled loops at any one time
• this seems more than reasonable because there shouldn’t be that
iterations that need to be unrolled to account for any FP latencies,
but recall that our loop performed DIV.D
• if we have a divider that takes more than 10 cycles to execute, we
could easily run out of registers
Example
Loop:
L.D
MUL.D
F0, 0(R1)
F4, F0, F2
S.D
DSUBI
BNE
F4, 0(R1)
R1, R1, #-8
R1, R2, Loop
RAW hazard with L.D
RAW hazard with MUL.D, which will take
several cycles
RAW hazard with DSUBI
Branch hazard
Here is the similar loop to the one we unrolled by compiler
The MUL.D’s latency causes stalls that cannot be removed through
instruction scheduling because there are not enough instructions
available
Loop unrolling and scheduling resolved all stalls, however loop unrolling
is not without its own problems (compiler complexity, need for many more
registers, problems if the number of loop iterations is not divisible by the
number of times we unrolled the loop (4 in the example)
Instruction
L.D
MUL.D
S.D
L.D
MUL.D
S.D
Name
Load1
Load2
Add1
Add2
Add3
Mult1
Mult2
Store1
Store2
Tomasulo Solution
Issue
*
F0, 0(R1)
F4, F0, F2
F4, 0(R1)
F0, 0(R1)
F4, F0, F2
F4, 0(R1)
Busy Op
yes Load
no
no
no
no
no
no
no
no
Field
Busy
Vj
F0
F2
Load1
Execute
Vk
R1
F4
Qj
F6
Write Result
Qk
F8
F10
A
Regs[R1] + 0
Continued
Instruction
L.D
MUL.D
S.D
L.D
MUL.D
S.D
Name
Load1
Load2
Add1
Add2
Add3
Mult1
Mult2
Store1
Store2
Issue
*
*
*
*
F0, 0(R1)
F4, F0, F2
F4, 0(R1)
F0, 0(R1)
F4, F0, F2
F4, 0(R1)
Busy
yes
yes
no
No
no
yes
no
yes
no
Op
Load
Load
Vj
Field
Busy
Vk
Qj
Write Result
*
Qk
A
Regs[R1] + 0
Regs[R1] - 8
R1
MUL
Store
Execute
*
Regs[F2] Load1
Regs[F2] Load2
Regs[R1]
F0
F2
Load2
Mult1
F4
F6
Mult1
F8
F10
Continued
Instruction
L.D
MUL.D
S.D
L.D
MUL.D
S.D
Name
Load1
Load2
Add1
Add2
Add3
Mult1
Mult2
Store1
Store2
Issue
*
*
*
*
*
*
F0, 0(R1)
F4, F0, F2
F4, 0(R1)
F0, 0(R1)
F4, F0, F2
F4, 0(R1)
Busy
no
Yes
no
no
no
yes
yes
yes
yes
Op
Vj
Execute
*
*
Write Result
*
*
Vk
Qj
Qk
Load
R1
MUL
MUL
Store
Store
Field
Busy
Regs[F2]
Regs[F2] Load2
Regs[R1]
Regs[R1]-8
F0
F2
Load2
A
Regs[R1] - 8
F4
Mult1
Mult1
Mult2
F6
F8
F10
Mult2
Continued
Instruction
L.D
MUL.D
S.D
L.D
MUL.D
S.D
Name
Load1
Load2
Add1
Add2
Add3
Mult1
Mult2
Store1
Store2
Issue
*
*
*
*
*
*
F0, 0(R1)
F4, F0, F2
F4, 0(R1)
F0, 0(R1)
F4, F0, F2
F4, 0(R1)
Busy Op
no
no
no
no
no
no
yes MUL
no
yes Store
Field
Busy
Vj
Execute
*
*
*
*
----
Vk
Qj
Write Result
*
*
*
Qk
A
Regs[F2] Load2
Regs[R1]-8
F0
F2
Load2
F4
Mult2
F6
F8
F10
Mult2
Summary on Tomasulo
• Notice in the previous example that we unrolled the
loop two times and the hardware was able to execute
the two iterations with only RAW hazards as delays
– However, given the limitation on FUs, we were not able to
unroll the loop further
• with fully pipelined functional units, or multiple functional units,
and numerous reservation stations, we could continue to execute
iterations of the loop without delay
– Additionally, if memory accesses cause a stall because of
a cache miss, the hardware is able to continue issuing
instructions out of order, thus greatly improving
performance over shutting down the pipeline because of
the cache miss
– Also, compiler optimizations are no longer essential to
obtain performance increases
Continued
• The bottleneck for Tomasulo is primarily the single CDB
– We can improve on this by having two, one for FP and one for
INT values
• We also see impacts because of
–
–
–
–
The limited number of FUs
The limited number of reservation stations
Restrictions on single data accesses (loads/stores) per cycle
The added cost because of additional logic to look for potential
WAR/WAW hazards and perform register renaming
• However, the dynamic issuing allowed by the Tomasulo
approach will encourage the use of speculation to keep the
hardware busy
• This will also lead to an easier way to perform superscalar
pipelining (we will cover this idea next week)

similar documents