Itanium Processor Microarchitecture

Report
by Harsh Sharangpani and Ken Arora
Presented by Teresa Watkins
4/16/02
First implementation of the IA64 instruction set architecture
Targets memory latency, memory address disambiguation,
and control flow dependencies
0.18 micron process, 800MHz
EPIC design style shifts more responsibilities to compiler
Challenge
Try to identify which improvements discussed in this class
found their way into the Itanium.
Idea
The compiler has a larger instruction window than the hardware. Communicate to the hardware more of the information gleaned at compile time.
Six instructions wide and ten stages deep
Tries to minimize the latency of the most frequent operations
Hardware support for compile-time indeterminacies
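A rough sketch in C of the fetch/issue granularity behind "six instructions wide" (the bundle layout comes from the IA-64 ISA, not these slides; names are illustrative): each 128-bit bundle carries three instruction slots plus a template field, and the six-wide core consumes two bundles per clock.

```c
#include <stdint.h>

/* Illustrative only: an IA-64 bundle is 128 bits wide and holds three
 * instruction slots plus a template field that tells dispersal which
 * execution-unit types the slots need. */
typedef struct {
    uint64_t slot[3];        /* each slot occupies 41 bits of the bundle */
    uint8_t  template_field; /* template + stop-bit encoding             */
} Bundle;

/* Two bundles form the raw material for one six-instruction issue group. */
typedef struct {
    Bundle pair[2];
} FetchGroup;
```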
Software-initiated prefetch (requests filtered by the instruction cache)
A prefetch must be issued 12 cycles before the branch to hide the latency
L2 -> streaming buffer -> instruction cache
Four-level branch predictor hierarchy to prevent a 9-cycle pipeline stall
Decoupling buffer holds up to 8 bundles of code
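A minimal sketch, assuming a simple ring-buffer organization (the real structure is not described here), of how an 8-bundle decoupling buffer lets instruction fetch run ahead while the back end is stalled:

```c
#include <stdbool.h>
#include <stddef.h>

#define DECOUPLE_DEPTH 8

typedef struct { unsigned long long slot[2]; } Bundle; /* placeholder payload */

typedef struct {
    Bundle entries[DECOUPLE_DEPTH];
    size_t head, tail, count;
} DecoupleBuffer;

/* Front end pushes fetched bundles; when the buffer is full, fetch throttles. */
static bool buf_push(DecoupleBuffer *b, Bundle in) {
    if (b->count == DECOUPLE_DEPTH) return false;
    b->entries[b->tail] = in;
    b->tail = (b->tail + 1) % DECOUPLE_DEPTH;
    b->count++;
    return true;
}

/* Back end pops bundles; an empty buffer means it sees a fetch bubble. */
static bool buf_pop(DecoupleBuffer *b, Bundle *out) {
    if (b->count == 0) return false;
    *out = b->entries[b->head];
    b->head = (b->head + 1) % DECOUPLE_DEPTH;
    b->count--;
    return true;
}
```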
Compiler provides branch hint directives:
• explicit branch predict (BRP) instructions
• hint specifiers on branch instructions
which provide:
• branch target addresses
• static hints on branch direction
• indicators for when to use dynamic predictors
Four types of predictors:
• Resteer 1: single-cycle predictor (4 BRP-programmed TARs)
• Resteer 2: adaptive multi-way and return predictors (dynamic)
• Resteer 3 & 4: branch address calculation and correction
  - Resteer 3 includes a “perfect-loop-exit predictor”
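An illustrative sketch of the hierarchy idea behind these resteer stages: several prediction stages are consulted, and a later, more accurate stage may redirect the front end at a higher bubble cost. The stage interface, names, and penalties below are assumptions, not the hardware's.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     predict_taken;
    uint64_t target;
} Prediction;

typedef struct {
    /* Each stage returns true if it has an opinion for this fetch address. */
    bool (*lookup)(uint64_t fetch_pc, Prediction *out);
    int  resteer_penalty;   /* pipeline bubbles if this stage overrides */
} PredictStage;

/* Walk the hierarchy (e.g. hint-programmed TARs, adaptive/return predictors,
 * branch address calculation and correction); the last stage to disagree
 * wins, at the cost of its resteer penalty. */
static Prediction predict(PredictStage *stages, int n, uint64_t pc,
                          int *bubbles) {
    Prediction best = { false, pc + 16 };   /* default: fall through */
    *bubbles = 0;
    for (int i = 0; i < n; i++) {
        Prediction p;
        if (stages[i].lookup(pc, &p) &&
            (p.predict_taken != best.predict_taken || p.target != best.target)) {
            best = p;
            *bubbles = stages[i].resteer_penalty;
        }
    }
    return best;
}
```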
Plentiful Resources
Organized around 9 issue ports:
• two memory
• two integer
• two FP
• three branch
Execution units:
• four integer units
• four multi-media units
• two load/store units
• three branch units
• two extended precision FP units
• two single precision FP units
• SIMD allows up to 20 parallel operations per clock
Dispersal follows the high-level semantics provided by the IA-64 ISA
Check for:
• Independence (determined by stop bits)
• Oversubscription (determined by the bundle's instruction template)
The template allows for simplified dispersal routing
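A hedged sketch of the two dispersal checks just listed; the unit classes and the 2/2/2/3 port budget come from the slide above, while the encoding details are simplified stand-ins for the real template semantics.

```c
#include <stdbool.h>

typedef enum { UNIT_M, UNIT_I, UNIT_F, UNIT_B, UNIT_KINDS } UnitKind;

typedef struct {
    UnitKind kind;
    bool     stop_after;   /* stop bit: later instructions start a new group */
} Slot;

/* Issue-port budget per clock: 2 memory, 2 integer, 2 FP, 3 branch. */
static const int PORT_BUDGET[UNIT_KINDS] = { 2, 2, 2, 3 };

/* Return how many of the incoming slots can issue together this cycle:
 * stop at a stop bit (independence) or when a port class is oversubscribed. */
static int disperse(const Slot *slots, int n) {
    int used[UNIT_KINDS] = { 0 };
    int issued = 0;
    for (int i = 0; i < n; i++) {
        UnitKind k = slots[i].kind;
        if (used[k] + 1 > PORT_BUDGET[k]) break;   /* oversubscription */
        used[k]++;
        issued++;
        if (slots[i].stop_after) break;            /* explicit group boundary */
    }
    return issued;
}
```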
Two types of register renaming (virtual register addressing):
Register Stacking: reduces function call and return overhead by stacking a new register frame on top of the old frame, so the caller's registers do not have to be explicitly saved (not supported in the FP registers)
Register Rotation: supports software pipelining by accessing the registers through an indirection based on the iteration count
If software allocates more virtual registers than are physically available (overflow), the Register Stack Engine takes control of the pipeline to spill register values to memory, and the reverse for underflow. No pipeline flushes required :)
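A minimal sketch of the rotation indirection, assuming a modulo rename over a rotating region (the register numbers, region size, and helper names are illustrative, not the exact IA-64 definition):

```c
#include <stdint.h>

#define ROT_BASE   32   /* first rotating register (illustrative) */
#define ROT_SIZE   64   /* size of the rotating region (illustrative) */

typedef struct {
    int rrb;            /* register rename base, bumped each iteration */
} RotationState;

/* Map an architectural register number to a physical one: registers in the
 * rotating region are offset by the rename base, so each loop iteration's
 * "r32" lands in a different physical register. */
static int rename_rotating(const RotationState *st, int arch_reg) {
    if (arch_reg < ROT_BASE || arch_reg >= ROT_BASE + ROT_SIZE)
        return arch_reg;                          /* static region: unchanged */
    int offset = (arch_reg - ROT_BASE + st->rrb) % ROT_SIZE;
    if (offset < 0) offset += ROT_SIZE;
    return ROT_BASE + offset;
}

/* The software-pipelined loop branch rotates the registers by moving rrb. */
static void loop_branch_rotate(RotationState *st) {
    st->rrb = (st->rrb - 1) % ROT_SIZE;
}
```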
Integer register file
• 128 entries
• 8 read ports
• 6 write ports
• post-increment performed by otherwise idle ALU and write ports
FP register file
• 128 entries
• 8 read ports
• 4 write ports, separated into odd and even banks
• supports double extended-precision arithmetic
Predicate register file: 1-bit entries with 15 read and 11 write ports
Non-blocking cache with a scoreboard-based "stall on use" control strategy
The pipeline stalls only when the data is actually needed, not as soon as a hazard such as a cache miss is detected
Deferred-stall strategy (hazard evaluation in the REG stage) allows more time for dependencies to resolve
The stall is taken in the EXE stage, where input latches snoop returning data values for the correct data using the existing register bypass hardware.
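A small sketch of the stall-on-use idea: a miss only marks the destination register pending, and the pipeline stalls when a consumer actually reads it. The structure below is illustrative, not the Itanium scoreboard.

```c
#include <stdbool.h>

#define NUM_REGS 128

static bool pending[NUM_REGS];   /* scoreboard: true while a miss is outstanding */

/* A load miss does not stall: it just records that the destination is not ready. */
static void note_load_miss(int dest_reg) {
    pending[dest_reg] = true;
}

/* The returning fill clears the scoreboard entry (and is bypassed to consumers). */
static void note_fill_return(int dest_reg) {
    pending[dest_reg] = false;
}

/* Checked for each source operand of a later instruction: stall only if the
 * value it needs is still outstanding. */
static bool must_stall(int src_reg) {
    return pending[src_reg];
}
```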
Predication: turns a control dependency into a data dependency by executing both paths of a branch and squashing the incorrect instructions before they change machine state (speculative predicate register file vs. architectural predicate register file)
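A toy C illustration of the if-conversion idea: both sides execute, and a one-bit predicate decides which result is allowed to become architectural state (modeled here as a conditional commit).

```c
/* Compute |x| without a control-flow decision over the work itself:
 * the compare produces a predicate, both candidate results are computed,
 * and only the one whose predicate is true updates machine state. */
static int predicated_abs(int x) {
    int p    = (x < 0);        /* compare writes a predicate, not the PC */
    int neg  = -x;             /* "then" side, executed unconditionally  */
    int keep = x;              /* "else" side, executed unconditionally  */
    return p ? neg : keep;     /* predicated commit of one result        */
}
```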
Handles up to three branches in parallel each cycle, using priority encoding to determine the earliest taken branch.
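The priority-encode step could be modeled roughly like this (the types are illustrative): of up to three branches handled in a cycle, the earliest one in program order that is taken supplies the redirect.

```c
#include <stdint.h>

typedef struct {
    int      taken;    /* did this branch resolve taken? */
    uint64_t target;   /* redirect target if it did      */
} BranchResult;

/* Returns the index of the earliest taken branch, or -1 if none is taken. */
static int earliest_taken(const BranchResult br[3]) {
    for (int i = 0; i < 3; i++)       /* program order = index order */
        if (br[i].taken)
            return i;
    return -1;
}
```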
Exception tokens
In the FP registers, deferred exceptions are noted by storing a NaTVal value in the NaN space; the integer registers instead carry an extra bit as the exception token (NaT). Because the NaT bit does not fit in memory when a register is spilled, it must be saved into a special UNaT register, and it is restored during fills.
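A sketch of how the deferred-exception token propagates, modeling the extra NaT bit alongside each integer value (the helper names here are mine, not IA-64 mnemonics):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t value;
    bool     nat;      /* deferred-exception token ("Not a Thing") */
} IntReg;

/* A speculative load defers a fault by delivering NaT instead of trapping. */
static IntReg spec_load(const uint64_t *addr, bool would_fault) {
    IntReg r = { 0, true };
    if (!would_fault) { r.value = *addr; r.nat = false; }
    return r;
}

/* Ordinary operations propagate NaT from any NaT source. */
static IntReg add_regs(IntReg a, IntReg b) {
    IntReg r = { a.value + b.value, a.nat || b.nat };
    return r;
}

/* The check operation: true means recovery code must run. */
static bool check_speculation(IntReg r) {
    return r.nat;
}
```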
ALAT structure
If a store writes to the same memory address between the time a speculative (advanced) load executes and the time its value is consumed, the ALAT invalidates the speculative load value and recovery is initiated. ALAT checks can be issued in parallel with the consuming instruction.
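A hedged sketch of an ALAT-like table, keyed by the advanced load's destination register and snooped by store addresses; the sizing and indexing are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

#define ALAT_ENTRIES 32

typedef struct {
    bool     valid;
    int      dest_reg;   /* register the advanced load wrote */
    uint64_t addr;       /* address it loaded from           */
} AlatEntry;

static AlatEntry alat[ALAT_ENTRIES];

/* An advanced load allocates an entry recording its address. */
static void advanced_load(int dest_reg, uint64_t addr) {
    AlatEntry *e = &alat[dest_reg % ALAT_ENTRIES];  /* toy indexing */
    e->valid = true; e->dest_reg = dest_reg; e->addr = addr;
}

/* Later stores search by address and invalidate any matching entry. */
static void store_snoop(uint64_t addr) {
    for (int i = 0; i < ALAT_ENTRIES; i++)
        if (alat[i].valid && alat[i].addr == addr)
            alat[i].valid = false;
}

/* The check (issued with the consumer): true means the value is still good,
 * false means recovery must re-execute the load. */
static bool check_advanced_load(int dest_reg) {
    AlatEntry *e = &alat[dest_reg % ALAT_ENTRIES];
    return e->valid && e->dest_reg == dest_reg;
}
```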
First Level Cache
• data and instruction caches separate
• 16 Kbytes each, 32-byte line size (6 instructions/cycle from the I-cache)
• four-way set-associative
• dual ported
• 2-cycle latency, fully pipelined
• write through
• physically addressed and tagged
• single-cycle, 64-entry, fully associative iTLB (backed up by an on-chip hardware page walker)
• iTLB and cache tags have an additional port to check addresses on a miss
Second Level Cache
• combined data and instructions
• 96 Kbytes, 64-byte line size
• six-way set-associative
• two banks
• four-state MESI for multiprocessor coherence
• 4 double-precision operands per clock to the FP register file
Third Level Cache
• 4 Mbytes
• 64-byte line size
• four-way set-associative
• 128-bit bus at core speed (12 Gbytes/s bandwidth)
• MESI protocol
Optimal Cache Management
• Memory locality hints
  - guide allocation and replacement strategies
• Bias hints
  - optimize MESI latency
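The IA-64 hint encodings themselves are not shown on this slide; as a loose portable analogy, GCC/Clang's __builtin_prefetch carries a temporal-locality argument that plays a similar role in steering cache allocation. The loop below is purely illustrative.

```c
/* Loose analogy only (not the IA-64 hint encoding): the third argument of
 * __builtin_prefetch is a temporal-locality hint from 0 (no reuse expected,
 * avoid displacing useful lines) to 3 (high reuse, keep in all levels). */
void scan_once(const double *a, const double *b, double *out, int n) {
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 16], 0 /* read */, 0 /* streamed, no reuse */);
        __builtin_prefetch(&b[i % 64], 0 /* read */, 3 /* heavily reused     */);
        out[i] = a[i] * b[i % 64];
    }
}
```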
• 64-bit system bus, source-synchronous data transfer (2.1 Gbytes/sec)
• Multi-drop shared system bus uses MESI coherence protocol
• Four-way glueless multiprocessor system support (4 processor nodes)
• Multiple nodes connected through high speed interconnects
• Transaction based bus protocol allows 56 pending transactions
• ‘Defer mechanism’ for OoO data transfers and transaction completion
Non-blocking caches as seen in
“Lockup-Free Instruction Fetch/Prefetch Cache Organization”
Prefetch - decoupled prefetch based on branch hints as seen in
“A Scalable Front-End Architecture for Fast Instruction Delivery”
- software initiated prefetch as seen in
“Design and Evaluation of a Compiler Algorithm for Prefetching”
Memory locality hints for more efficient use of caches
Speculation - extra bit for deferred exception tokens
What else?
Do you think they made a simple, scalable hardware implementation?
