Real Processor Architectures
• Now that we’ve seen the basic design elements for
modern processors, we will take a look at several
specific processors
– We start with the 486 pipeline to see how NOT to do a
pipeline
• recall Intel x86 is a CISC with variable length instructions,
memory-register addressing, some complex addressing modes and
some complex instructions
• we will compare it to the much more efficient MIPS pipeline
– We then consider dynamic issue superscalars of varying
degrees of sophistication
– To understand the Pentium architecture, we must look at
how it avoided the pitfalls of the 486 by issuing microcode
rather than machine instructions, so this requires that we
also look at microprogrammed control units and microcode
486 Processor
• The instruction set was almost identical to the 386
(and thus was still CISC based)
– They added a floating point functional unit to the
processor so that it could execute the floating point
operations introduced for the x86 math coprocessor
– This FP functional unit provided a degree of parallel
processing in that while FP operations were executed,
the pipeline would continue to fetch and execute
integer operations
– It contained an 8KB combined instruction/data cache
(later expanded to 16KB)
• The big difference between the 386 and 486
though was the pipeline, the first Intel processor
with a pipeline
– However, because of the CISC nature of the
instruction set, the pipeline is not particularly efficient
The 486 Pipeline
• They used a 5 stage pipeline
– Fetch 16 bytes worth of instruction
• this may be 1 instruction (or even a part of 1 instruction), or
multiple instructions
– Decode stage 1 – was an entire instruction fetched? If not,
this stage stalls
• divide up the 16 bytes into instruction(s)
– Decode stage 2 – decode the next instruction, fetch
operands from registers
– Execution – ALU operations, branch operations, cache
(mov) instructions
• this stage may take multiple cycles if an ALU operation requires 1
or more memory accesses (e.g., add x, 5 takes 2 memory accesses)
– Write result of load or ALU operation to register
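As a rough illustration, the stage flow above can be captured in a toy cycle-count model; the one-cycle-per-stage assumption and the "extra EX cycles" encoding are simplifications for illustration, not Intel's actual timing:

```python
# Toy cycle-count model of a 486-style 5-stage pipeline.
# Each instruction is described only by how many EXTRA EX cycles it
# needs (0 = simple register op; memory-operand ops need more).

def cycles_486(instructions):
    """instructions: list of extra EX cycles per instruction."""
    total = 4  # cycles to fill Fetch, D1, D2, EX before the first WB
    for extra in instructions:
        total += 1 + extra  # ideally one instruction completes per cycle
    return total

# five simple ops: 5 stages + 4 more completions = 9 cycles
print(cycles_486([0, 0, 0, 0, 0]))   # 9
# three simple ops, then "add x, 5" needing 2 extra EX cycles
print(cycles_486([0, 0, 0, 2]))      # 10
```

The point of the model: every memory operand adds directly to the total, which is why memory-register ALU instructions hurt this pipeline so much.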
486 Difficulties
• Stalls arise for numerous reasons
– 17 byte long instructions require 2 instruction fetch stages
– Any ALU memory-register or memory-immediate takes at
least 1 additional cycle, possibly two if the memory item
was both a source and destination
• such a situation stalls instructions in the decode 2 stage
• or in the EX stage if the result is written back to memory
– Complex addressing modes can cause stalls
• pointer accessing (indirect addressing) is available which takes 2
memory accesses
• scaled addressing mode can involve both an add and a shift
• again, stalls occur in the decode 2 stage
– Branch instructions have a 3 cycle penalty because
branches are computed in the EX stage (4th stage) and
some loop operations take more than 1 cycle to compute
adding a further stall
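The cost of these stall sources can be combined into an effective CPI; the instruction-mix fractions below are assumptions chosen for illustration, not measured 486 data:

```python
# Back-of-envelope effective CPI for a 486-like pipeline.
def effective_cpi(base_cpi, stall_sources):
    """stall_sources: list of (fraction_of_instructions, penalty_cycles)."""
    return base_cpi + sum(f * p for f, p in stall_sources)

# assumed mix: 20% memory-operand ALU ops (1-cycle stall),
# 15% taken branches (3-cycle penalty)
print(effective_cpi(1.0, [(0.20, 1), (0.15, 3)]))  # 1.65
```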
486 Examples
• The first example has three data movements with no penalties
• The second example has a data hazard requiring 1 cycle of stall
• The third example illustrates a branch penalty
486 Overall Architecture
ARM Cortex A-8 Processor
• Dual-issue superscalar with static scheduling but dynamic
issue detection through a scoreboard
– Up to 2 instructions per cycle
• 14-stage pipeline (see next slide)
– Branch prediction is performed in the AGU (address generation
unit) using:
• Dynamic branch prediction with 512-entry two-way set associative branch
target buffer
• 4K global history buffer
– when branch target buffer misses, a prediction is obtained from the global history
buffer
• 8-entry return stack
– an incorrect branch prediction flushes the entire pipeline
– Instruction decode is 5 stages long and up to 2 instructions
decoded per cycle
• 8 bytes fetched from cache
• if neither instruction is a branch, PC is incremented
• stage 4 in this 5 stage mini-pipeline is the scoreboard and issue logic
A-8 Pipeline
A-8 Execution Unit
• Either instruction can go to the load/store pipeline, but not both
• ALU pipe 1 is for simple integer operations
• Multiplies use ALU pipe 0 and can accommodate up to 2 in one cycle
• Structural hazards are rare because the compiler attempts to schedule pairs of
instructions to not use the same instruction pipe at the same time
• Data hazards are detected during decode by the scoreboard and may either stall
both instructions or just the second of the pair; the compiler is responsible for
attempting to prevent such stalls (note that forwarding is only available from
WB (E5) to E0)
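A minimal sketch of the kind of checks a scoreboard could make at decode on an instruction pair; the pipe names, field layout and rule set are simplified assumptions, not ARM's actual issue logic:

```python
# Sketch of decode-time dual-issue checks for an A-8-style scoreboard.
# Each instruction: dict with 'pipe' ('alu0', 'alu1', 'ls'), 'dst', 'srcs'.

def can_dual_issue(i0, i1):
    # structural hazard: only one instruction may use the load/store pipe
    if i0["pipe"] == "ls" and i1["pipe"] == "ls":
        return False
    # RAW hazard within the pair: second instruction reads the first's result
    if i0["dst"] in i1["srcs"]:
        return False
    return True

add  = {"pipe": "alu0", "dst": "r1", "srcs": ["r2", "r3"]}
use  = {"pipe": "alu1", "dst": "r4", "srcs": ["r1", "r5"]}
ldr  = {"pipe": "ls",   "dst": "r6", "srcs": ["r7"]}
ldr2 = {"pipe": "ls",   "dst": "r8", "srcs": ["r7"]}

print(can_dual_issue(add, ldr))   # True: different pipes, no RAW hazard
print(can_dual_issue(add, use))   # False: r1 dependence stalls the pair
print(can_dual_issue(ldr, ldr2))  # False: both need the load/store pipe
```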
A-8 Performance
• The ideal CPI for the A-8 is .5 (2 instructions issued per cycle)
• Here, you see that the ideal is not achievable and that, aside from
the mcf and gzip benchmarks, the greatest source of stalls is
pipeline stalls (not cache misses)
Pentium Architecture
• Recall our examination of the Intel 486 pipeline
– variable-length instructions, variable complexity of
operations, memory-register ALU operations, etc. led to poor
performance
• In order to improve performance using RISC features, the
Pentium architects had to rethink things – they were stuck
with their CISC instruction set (for backward
compatibility)
– in CISC architectures, a machine instruction is first translated
into a sequence of microinstructions
– each microinstruction is a lengthy string of 1s and 0s, each of
which refer to one control signal in the machine
– there needs to be a process to translate each machine instruction
into microinstructions and execute each microinstruction – this
is done by collecting machine instructions and their associated
microinstructions into microprograms
Why Microinstructions?
• The Pentium architecture uses a microprogrammed
control unit
– there is already a necessary step of decoding a machine
instruction into microcode
• Now, consider each microinstruction:
– equal length
– executes in the same amount of time (unless hazards arise)
– branches are at the microinstruction level and are more
predictable than machine language level branching
• In a RISC architecture, the simplicity of each instruction
allows it to be carried out directly in hardware in 1 cycle
(usually)
– Intel realized that to efficiently pipeline their CISC architecture,
they had to pipeline the microinstructions instead of machine
instructions
Control and Micro-Operations
• An example architecture
is shown to the right
• Each of the various
connections is controlled
by a particular control
signal
– MBR to the AC
controlled with signal
C11
– PC to MAR by C2
– AC to ALU C7
• note that this figure is
incomplete
• A microprogram is a
sequence of microoperations
this is not an x86 architecture!
Example
• Consider a CISC instruction such as Add R1, X
– X copied into MAR and a memory read signaled
– datum returned across data bus to MBR
– adder sent values in R1 and MBR, adding the two, storing
result back into R1
• This sequence can be written in terms of microoperations as:
– t1: MAR ← (IR(address))
– t2: MBR ← Memory
– t3: R1 ← (R1) + (MBR)
• or, using the accumulator as an intermediate:
– t3: Acc ← (R1) + (MBR)
– t4: R1 ← (Acc)
• t1 – t4 are clock cycles; each
microinstruction executes in a
separate clock cycle
• Each micro-operation is handled by one or more
control signals
– For instance, MBR ← Memory is C5
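The Add R1, X sequence above can be played out as a tiny interpreter, one function per clock step. Register names follow the slides; the memory contents and addresses are made up for illustration:

```python
# Interpreter for the Add R1, X micro-operation sequence (accumulator
# variant). Each step_t* function stands for the control signals fired
# in that clock cycle.

memory = {0x10: 5}   # X lives at address 0x10 and holds 5 (assumed values)
regs = {"R1": 7, "MAR": 0, "MBR": 0, "ACC": 0, "IR_addr": 0x10}

def step_t1():  regs["MAR"] = regs["IR_addr"]           # MAR <- (IR(address))
def step_t2():  regs["MBR"] = memory[regs["MAR"]]       # MBR <- Memory
def step_t3():  regs["ACC"] = regs["R1"] + regs["MBR"]  # Acc <- (R1) + (MBR)
def step_t4():  regs["R1"]  = regs["ACC"]               # R1  <- (Acc)

for step in (step_t1, step_t2, step_t3, step_t4):
    step()
print(regs["R1"])  # 12: one CISC add became four micro-operations
```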
Control Memory
Each microprogram consists of
one or more microinstructions, each
stored in a separate
entry of the control
memory
The control
memory itself is
firmware, a
program stored in
ROM, that is placed
inside of the control
unit
[Figure: layout of the control memory. The Fetch, Indirect and
Interrupt cycle routines come first, followed by the Execute cycle
routines, one per op code (e.g., AND, ADD); each routine ends with a
jump to Fetch, Indirect, Execute, Interrupt or an op code routine]
Note: each micro-program ends with a branch to the
Fetch, Interrupt, Indirect or Execute micro-program
Example of Three Micro-Programs
• Fetch:
– t1: MAR ← (PC)                      C2
– t2: MBR ← Memory                    C0, C5, CR
      PC ← (PC) + 1                   C*
– t3: IR ← (MBR)                      C4
• Indirect:
– t1: MAR ← (IR(address))             C8
– t2: MBR ← Memory                    C0, C5, CR
– t3: IR(address) ← (MBR(address))    C4
• Interrupt:
– t1: MBR ← (PC)                      C1
– t2: MAR ← save address              C*
      PC ← routine address            C*
– t3: Memory ← (MBR)                  C12, CW
– CR – read control to system bus
– CW – write control to system bus
• C0 – C12 refer to the previous figure
• C* are signals not shown in the figure
Horizontal vs. Vertical
Micro-Instructions
• Vertical micro-instructions use function codes that need
additional decoding
– format: micro-instruction address, jump condition, function codes
– the micro-instruction address points to a branch target in the
control memory; the branch is taken if the condition bit is true
• Horizontal micro-instructions contain 1 bit for every control
signal controlled by the control unit
– format: micro-instruction address, jump condition, internal CPU
control signals, system bus control signals
– this micro-instruction requires 1 bit for every control line, so it is
longer than the vertical micro-instruction and therefore takes more
space to store, but does not require additional time to decode by
the control unit
Micro-programmed
Control Unit
Continued
• Decoder analyzes IR
– delivers starting address of op code’s micro-program in
control store
• address is placed in a micro-program counter (here, called a
Control Address Register)
• Loop on the following
– sequencer signals read of control memory using address
in microPC
– item in control memory moved to control buffer register
– contents of control buffer register generate control
signals and next address information
• if the micro-instructions are vertical, decoding is required here
– sequencer moves next address to control address register
• next instruction (add 1 to current)
• jump to new part of this microprogram
• jump to new machine routine
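The loop above can be sketched as follows; the control memory contents, addresses and signal strings are invented for illustration, not a real microprogram:

```python
# Sketch of the microprogrammed control-unit loop: read control memory
# at the control address register (CAR), emit the signals, compute the
# next address (sequential, or a jump supplied by the opcode decoder).

control_memory = {
    0:  ("signals MAR<-PC",  1),     # fetch routine, step 1
    1:  ("signals MBR<-Mem", 2),     # fetch routine, step 2
    2:  ("signals IR<-MBR",  None),  # None: jump to the op code's routine
    10: ("signals for ADD",  0),     # ADD routine, then back to fetch
}
opcode_start = {"ADD": 10}           # starting addresses from the decoder

def run(opcode, max_steps=10):
    car, trace = 0, []
    for _ in range(max_steps):
        signals, nxt = control_memory[car]   # sequencer reads control memory
        trace.append(signals)                # control buffer register fires
        car = opcode_start[opcode] if nxt is None else nxt
        if car == 0:                         # back at fetch: one machine
            break                            # instruction is complete
    return trace

print(run("ADD"))
```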
Pentium IV: RISC features
• All RISC features are implemented at the
microinstructions level instead of machine
instruction level as seen in typical RISC
processors
– Microinstruction-level pipeline
– Dynamically scheduled micro-operations
– Reservation stations (128) and multiple functional
units (7)
– Branch speculation via branch target buffer
• speculation at micro-instruction level, not machine level
• instead of an ROB, decisions are made at the reservation
stations so that a miss-speculation causes reservation
stations to flush their contents, correct speculation causes
reservation stations to forward results to registers/store units
– Trace cache used (discussed shortly)
Pentium Pipeline
• Fetch machine instruction (3 stages)
• Decode machine instruction into microinstructions (2
stages)
• Superscalar issues multiple microinstructions (2 stages)
– register renaming occurs here, up to 3 microinstructions can be
issued per cycle – 2 integer and 1 FP
• Execute of microinstructions (1 stage)
– Functional units are pipelined and can take from 1 up to
approximately 32 cycles to execute
• Write back (3 stages)
• Commit (3 stages)
– up to 3 microinstructions can commit in any cycle
Pentium IV Overall Architecture
Specifications
• 7 functional units:
– 2 simple ALUs (add, compare, shift) – ½ cycle execution
to accommodate up to 2 micro-operations per cycle
– 1 complex ALU (integer multiplication and division) –
multicycle, pipelined
– 1 load unit and 1 store unit – including address
computation
– 1 FP move (register to register move and convert)
– 1 FP arithmetic unit (+, -, *, /) – multicycle, pipelined,
some SIMD execution permitted on these units
• 128 registers for renaming
– reservation stations are used rather than a re-order buffer
– instructions must wait in reservation stations longer than in
Tomasulo’s version, waiting for speculation results
Trace Cache
• The trace cache is an instruction cache
– It caches not just individual instructions or even memory
refill lines, it caches blocks of instructions that have
recently been executed together
• In this way, the trace cache encodes branch behavior
implicitly
• Additionally, miss-speculated instructions would be
discarded from a trace cache
• The trace cache was developed for the Pentium IV, so it
stores microinstructions (not machine instructions)
• Combining a trace cache and branch target buffer
together minimize microinstruction fetch and decoding
– As long as the microinstructions remain in the trace cache
– Miss-predictions at the microinstruction level are far rarer
than miss-predictions at the machine level
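A toy version of the idea, keying cached micro-op sequences on the starting address plus the branch outcomes taken along the path; the real Pentium IV indexing and replacement are more involved than this sketch:

```python
# Toy trace cache: stores dynamic micro-op sequences ("traces"),
# indexed by (start address, branch outcomes along the path).

class TraceCache:
    def __init__(self):
        self.traces = {}

    def fill(self, start_pc, branch_outcomes, uops):
        self.traces[(start_pc, tuple(branch_outcomes))] = list(uops)

    def lookup(self, start_pc, branch_outcomes):
        return self.traces.get((start_pc, tuple(branch_outcomes)))

tc = TraceCache()
# cache a path through a taken branch at its second micro-op
tc.fill(0x400, [True], ["uop_a", "uop_b_br", "uop_c"])

print(tc.lookup(0x400, [True]))   # hit: fetch and decode are skipped
print(tc.lookup(0x400, [False]))  # miss: a different path must be decoded
```

The hit case is what makes the branch behavior "implicit": the taken branch is baked into which trace was stored.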
Source of Stalls
• This architecture is very complex and relies on
being able to fetch and decode instructions quickly
– The process breaks down when
• less than 3 instructions can be fetched in 1 cycle
• the trace cache misses, or branches are miss-predicted
• less than 3 instructions can be issued because they
either are not 2 int + 1 FP or because of structural
hazards
• limitation of reservation stations
• data dependencies between functional units cause
stalls because other instructions have to wait at their
reservation stations
• data cache access results in a miss
Continued
• Stalls manifest themselves in two places
– The issue stage
• branch miss-predictions
• cache misses
• reservation stations full
– The commit stage
• branch miss-predictions
• instructions not ready to commit yet
• these are not actually stalls, but because instructions are
committed in the order they were issued, a later instruction
may wait to commit because earlier instructions are time
consuming; and if the later instruction is a branch,
improperly fetched instructions due to miss-speculation
may continue to occur
• branch computation not yet available
Continued
• Miss-prediction rates (at the micro-operation
level) are very low
– About .8% for integer benchmarks and .1% for
floating point benchmarks
• notice how FP benchmarks continue to have high
predictability because they involve a lot of for loops which
are very predictable, integer benchmarks tend to have more
conditional statements which are less predictable
– At the machine language level, miss-speculation is
between .1% and 1.5%
• Trace cache has nearly a 0% miss rate
– The L1 and L2 data caches have miss rates of around
6% and .5% respectively
– The machine’s effective CPI ranges from around 1.2 to
5.85 with an average of around 2.2 (machine
instructions, not micro-operations)
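To see how the quoted miss rates could feed into CPI, a back-of-envelope calculation; the data-access rate and the miss penalties below are assumed values, not figures from these slides:

```python
# Rough memory-stall contribution to CPI from cache misses.
def mem_stall_cpi(accesses_per_instr, l1_rate, l2_rate, l1_pen, l2_pen):
    # l2_rate is treated as misses per access overall, a simplification
    return accesses_per_instr * (l1_rate * l1_pen + l2_rate * l2_pen)

# slides' miss rates (6% L1, .5% L2); assume 0.4 data accesses per
# instruction, 10-cycle L1 miss penalty, 100-cycle L2 miss penalty
print(round(mem_stall_cpi(0.4, 0.06, 0.005, 10, 100), 3))  # 0.44
```

Even small miss rates add a sizable fraction of a cycle to every instruction, which is part of why the effective CPI sits well above the ideal.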
Earlier Pentiums
• Pipeline changes:
– Pentium pipeline: 2-issue superscalar, 5 stages
– Pentium Pro pipeline: 14 stages
– Pentium III pipeline 14 stages (shown earlier in
these slides)
– Pentium IV pipeline 21 stages (minimum) and
eventually widened to 31 (minimum)
• Address bus widened to support 64 GB of memory
• Conditional instructions introduced (we will
cover this next week)
• Faster clock cycles introduced
– From 1 GHz to 1.5 GHz, eventually up to 3.2 GHz
• the clock rate is so fast that it takes 2 complete cycles for an
instruction or data to cross the chip
• Increased reservation stations
– PIII: 40, PIV: 128
• up to 128 instructions can be in some state of
execution simultaneously
Pentium IV versus AMD Opteron
• The Opteron uses dynamic scheduling,
speculation, a shallower pipeline, issue and
commit of up to 3 instructions per cycle, 2-level
cache; the chip has a similar transistor count although
it runs at only 2.8 GHz
• The Opteron uses a RISC instruction set, so
instructions are machine instructions, not
microinstructions
– P4 has a higher CPI on all benchmarks except mcf
• AMD is more than twice the P4 on this benchmark
– So for most cases, instructions take fewer cycles to
complete (lower CPI) in the AMD than the P4 but the
P4 has a slightly faster clock to offset this
Intel Core i7
• The i7 extends on the Pentium approach
– Aggressive out of order speculation
– Deep pipeline (14 stages)
• instruction fetch – retrieves 16 bytes to decode
• there is a separate instruction fetch unit (IFU) that feeds a
queue that can store up to 18 instructions at a time
– unlike the Pentium, decoding is done using a step called
macro-op fusion which combines instructions that have
independent micro-ops that can execute in parallel
• if a loop is detected that contains fewer than 28
instructions or 256 bytes, these instructions will remain
in a buffer to repeatedly be issued (rather than repeated
instruction fetches)
Continued
• Instruction fetch also includes
– The use of a multilevel branch target buffer and a return
address stack for speculation
• miss-predictions cause a penalty of about 15 cycles
– A 32 KB instruction cache
• Decoding first converts machine instructions into
microcode and breaks instructions into two types using
four decoders
– Simple micro-operation instructions (2 each)
– Complex micro-operation instructions (2 each)
• Instruction issue can issue up to 6 micro-operations per
cycle to
– 36 centralized reservation stations
– 6 functional units including 1 load and 2 store units that
share a memory buffer connected to 3 different data caches
i7
Architecture
i7 Performance
• CPI for various SPEC06 benchmarks
• Average CPI is 1.06 for the integer programs and .89 for FP
• These figures count machine instructions issued (not micro-ops), so
obtaining the values is not completely transparent
• The Pentium and i7 are both susceptible to miss-speculation, which
results in "wasted" work; up to 40% of the total work that goes into
the SPEC06 benchmarks is wasted
• Waste also arises from cache misses (10 cycles or more lost with an
L1 miss, 30-40 for L2 misses and as much as 135 for L3 misses)
Multicore Processors
• With additional space on the chip, the
strategy today is to equip the processor
with multiple cores
– Each core is a separate processor with its
own local cache, local bus, etc
– An additional cache is commonly added
to the chip so that there is an L1 (within
each core), L2 (on the chip, shared among
cores) and L3 (off chip)
• We will briefly consider multicore
processors later when we consider
thread-level parallelism and true
parallel processing
• We wrap up our examination of
processors by looking at multicore
performance as the number of cores
increases
Multicore Performance
• Three things are apparent when considering the
performance of the multi-core processors
– First, obviously, the IBM Power 7 outperforms the other two in
every case
– The speedup is close to but not always linear to the number of
cores
• doubling the number of cores does not guarantee twice the performance
– There is a greater potential for speedup on FP benchmarks for
the Power7 than on the int benchmarks
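The sub-linear scaling can be explained with Amdahl's law: any serial fraction of the work caps the achievable speedup. The 5% serial fraction below is purely illustrative:

```python
# Amdahl's law: speedup on n cores when only a fraction p of the
# work is parallelizable.
def speedup(cores, parallel_fraction):
    return 1.0 / ((1 - parallel_fraction) + parallel_fraction / cores)

for n in (2, 4, 8):
    print(n, round(speedup(n, 0.95), 2))
# with 5% serial work, 8 cores give only about 5.9x, not 8x
```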
A Balancing Act
• Improving one aspect of our processor does not
necessarily improve performance
– in fact, it might harm performance
• consider lengthening the pipeline depth and increasing
clock speed in the P4 without adding reservation stations
or using the trace cache
• stalls will arise at the issue stage thus defeating any benefit
from the longer pipeline
• cache misses will have a greater impact, not a lesser
impact, when the clock speed is increased
• Modern processor design takes a lot of effort to
balance out the factors
– without accurate branch prediction and speculation
hardware, stalls from miss-predicted branches will
drop performance greatly
• we saw this in both the ARM and i7 processors
Continued
• As clock speeds increase
– Stalls from cache misses create a bigger impact on CPI, so
larger caches and cache optimization techniques are
needed
• To support multiple issue of instructions
– we need a larger cache-to-processor bandwidth, which can
take up valuable space
• As we increase the number of instructions that
can be issued
– we need to increase the number of reservation stations and
reorder buffer size
• Some compiler optimizations can also be applied to
help support the distributed nature of the hardware
(we look at this next week)
