Intel Pentium 4 Processor
Presented by
Michele Co
(much slide content courtesy of Zhijian Lu and Steve Kelley)
Introduction (Zhijian)
– Willamette (11/2000)
Instruction Set Architecture (Zhijian)
Instruction Stream (Steve)
Data Stream (Zhijian)
What went wrong (Steve)
Pentium 4 revisions
– Northwood (1/2002)
– Xeon (Prestonia, ~2002)
– Prescott (2/2004)
Dual Core
– Smithfield
Intel Pentium 4 processor
– Latest IA-32 processor, equipped with the full set of IA-32 SIMD operations
– First implementation of Intel's new "NetBurst" microarchitecture
Intel Architecture, 32-bit (IA-32)
– 80386 instruction set (1985)
– CISC, 32-bit addresses
– "Flat" memory model
Registers
– Eight 32-bit general-purpose registers
– Eight 80-bit x87 FP stack registers
– Six segment registers
IA-32 (cont'd)
Addressing modes
– Register indirect (mem[reg])
– Base + displacement (mem[reg + const])
– Base + scaled index (mem[reg + (2^scale x index)])
– Base + scaled index + displacement (mem[reg + (2^scale x index) + displacement])
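As a hedged aside (not from the original slides), a compiler typically lowers a C array access onto the last of these modes; the function below is illustrative only, with the rough IA-32 form shown in a comment:

/* Illustrative mapping of a C array access onto
 * "base + scaled index + displacement":
 *   roughly  mov eax, [ebx + esi*4 + 8]
 * base = rec, index = i, scale = 2 (so 2^scale = 4), displacement = 8 */
int third_field(const int *rec, int i) {
    return rec[i + 2];   /* address = rec + 4*i + 8 */
}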
SIMD instruction sets
– MMX (Pentium II)
» Eight 64-bit MMX registers, integer ops only
– SSE (Streaming SIMD Extension, Pentium III)
» Eight 128-bit XMM registers
Pentium III vs. Pentium 4 Pipeline
Comparison Between Pentium 3 and Pentium 4 Execution on MPEG4 Benchmarks @ 1 GHz
Instruction Set Architecture
Pentium 4 ISA = Pentium 3 ISA + SSE2 (Streaming SIMD Extensions 2)
SSE2 is an architectural enhancement to the IA-32 architecture
Extends MMX and the SSE extensions with 144 new instructions:
– 128-bit SIMD integer arithmetic operations
– 128-bit SIMD double-precision floating-point operations
– Enhanced cache and memory management
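A minimal C sketch (assumed, not from the slides) of the 128-bit packed double-precision operations SSE2 adds, using the emmintrin.h intrinsics; the function name and the requirement that n be a multiple of 2 are assumptions of the sketch:

#include <emmintrin.h>   /* SSE2 intrinsics */

/* Add two arrays of doubles, two elements per 128-bit XMM register. */
void add_packed_doubles(double *dst, const double *a, const double *b, int n) {
    for (int i = 0; i < n; i += 2) {                 /* assumes n is even */
        __m128d va = _mm_loadu_pd(&a[i]);            /* load 2 packed doubles */
        __m128d vb = _mm_loadu_pd(&b[i]);
        _mm_storeu_pd(&dst[i], _mm_add_pd(va, vb));  /* packed double add */
    }
}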
Comparison Between SSE and SSE2
Both support operations on the 128-bit XMM registers
SSE only supports 4 packed single-precision floating-point values
SSE2 supports more:
2 packed double-precision floating-point values
16 packed byte integers
8 packed word integers
4 packed doubleword integers
2 packed quadword integers
Data type sizes (word = 2 bytes):
– Double quadword: 128 bits
– Quadword: 64 bits
– Doubleword: 32 bits
Hardware Support for SSE2
Adder and multiplier units in the SSE2 engine are 128 bits wide, twice the width of those in the Pentium 3
Increased load/store bandwidth for floating-point values
– Loads and stores are 128 bits wide
– One load plus one store between an XMM register and the L1 cache can complete in one clock cycle
SSE2 Instructions (1)
Data movements
– Move data between XMM registers, and between XMM registers and memory
Double-precision floating-point operations
– Arithmetic instructions on both scalar and packed values
Logical instructions
– Perform logical operations on packed double-precision floating-point values
SSE2 Instructions (2)
Compare instructions
– Compare packed and scalar double-precision floating-point values
Shuffle and unpack instructions
– Shuffle or interleave double-precision floating-point values in packed double-precision floating-point operands
Conversion instructions
– Convert between doubleword and double-precision floating-point, or between single-precision and double-precision floating-point
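A small illustrative example (assumed, not from the slides) of the conversion and shuffle categories, using SSE2 intrinsics; the constants are arbitrary:

#include <emmintrin.h>   /* SSE2 intrinsics */

/* Convert the two low 32-bit integers of an XMM register to packed doubles,
 * then swap the two doubles with a shuffle. */
__m128d convert_and_swap(void) {
    __m128i ints = _mm_set_epi32(0, 0, 7, 3);   /* low two lanes hold 3 and 7 */
    __m128d d    = _mm_cvtepi32_pd(ints);       /* -> {3.0, 7.0} */
    return _mm_shuffle_pd(d, d, 1);             /* -> {7.0, 3.0} */
}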
SSE2 Instructions (3)
Packed single-precision floating-point instructions
– Convert between single-precision floating-point and doubleword integer operands
128-bit SIMD integer instructions
– Operations on integers contained in XMM registers
Cacheability control and instruction ordering
– More control over caching of data when storing from XMM registers to memory, and additional control of instruction ordering on store operations
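A hedged sketch of the cacheability-control category: a non-temporal (streaming) store writes from an XMM register to memory while bypassing the caches, followed by a fence to control store ordering. The function name and the alignment/size assumptions are the sketch's, not the slides':

#include <emmintrin.h>   /* SSE2: _mm_stream_pd, _mm_mfence */

/* Fill a buffer with a constant using streaming stores.
 * Assumes dst is 16-byte aligned and n is a multiple of 2. */
void stream_fill(double *dst, double value, int n) {
    __m128d v = _mm_set1_pd(value);
    for (int i = 0; i < n; i += 2)
        _mm_stream_pd(&dst[i], v);   /* non-temporal store, bypasses caches */
    _mm_mfence();                    /* order the stores before later accesses */
}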
Pentium 4 is equipped with the full set of IA-32 SIMD technology; all existing software can run correctly on it.
AMD has decided to embrace and implement SSE and SSE2 in future CPUs.
Instruction Stream
What’s new?
– Added Trace Cache
– Improved branch predictor
µop – micro-op: an already-decoded, RISC-like operation
– Front end – instruction fetch and issue
Front End
Prefetches instructions that are likely to be executed
Fetches instructions that haven't already been prefetched
Decodes instructions into µops
Generates µops for complex instructions or special-purpose code
Predicts branches
Three methods of prefetching:
– Instructions only – Hardware
– Data only – Software
– Code or data – Hardware
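As an illustration of the "Data only – Software" path above, a hedged C sketch using the SSE prefetch intrinsic; the prefetch distance of 16 elements is an arbitrary, untuned choice:

#include <xmmintrin.h>   /* _mm_prefetch (SSE) */

/* Sum an array while hinting the hardware to pull upcoming data into cache. */
float sum_with_prefetch(const float *a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
        _mm_prefetch((const char *)&a[i + 16], _MM_HINT_T0);   /* hint only */
        s += a[i];
    }
    return s;
}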
Single decoder that can operate at a maximum of 1 instruction per cycle
Receives instructions from the L2 cache 64 bits at a time
Some complex instructions must enlist the help of the microcode ROM
Trace Cache
Primary instruction cache in NetBurst
Stores decoded µops
~12K µop capacity
On a Trace Cache miss, instructions are fetched and decoded from the L2 cache
What is a Trace Cache?
Example code:
I1 …
I2 br r2, L1
I3 …
I4 …
I5 …
L1: I6 …
I7 …
– Traditional instruction cache: stores the instructions in static program order (I1–I5, then I6, I7)
– Trace cache: stores the dynamic execution path across the taken branch (I1, I2, I6, I7) as a single trace
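A toy C sketch (illustrative only, not the hardware algorithm) of the idea: follow the predicted path across branches and pack the decoded instructions into one trace line, instead of stopping at the taken branch the way a conventional instruction cache line does. The Insn record, fetch, and predict_taken callbacks are hypothetical:

#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int  addr;        /* instruction address */
    bool is_branch;
    int  target;      /* branch target, if is_branch */
} Insn;

/* Fill `trace` with up to `cap` instructions along the predicted path from `pc`. */
size_t build_trace(int pc, Insn (*fetch)(int), bool (*predict_taken)(int),
                   Insn *trace, size_t cap) {
    size_t n = 0;
    while (n < cap) {
        Insn i = fetch(pc);
        trace[n++] = i;
        /* continue at the predicted next instruction, not the static next line */
        pc = (i.is_branch && predict_taken(i.addr)) ? i.target : i.addr + 1;
    }
    return n;
}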
Pentium 4 Trace Cache
Has its own branch predictor that directs where instruction fetching needs to go next in the Trace Cache
Removes
– Decoding costs on frequently executed instructions
– Extra latency to decode instructions upon branch mispredictions
Microcode ROM
Used for complex IA-32 instructions (> 4 µops), such as string move, and for fault and interrupt handling
When a complex instruction is encountered, the Trace Cache jumps into the microcode ROM, which then issues the µops
After the microcode ROM finishes, the front end of the machine resumes fetching µops from the Trace Cache
Branch Prediction
Predicts ALL near branches
– Includes conditional branches, unconditional calls and returns, and indirect branches
Does not predict far transfers
– Includes far calls, irets, and software interrupts
Branch Prediction
Dynamically predicts the direction and target of branches based on the PC, using the BTB
If no dynamic prediction is available, statically predict:
– Taken for backwards-looping branches
– Not taken for forward branches
Traces are built across predicted branches to avoid branch penalties
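The static rule above can be written as a one-line predicate (an illustrative model, not Intel's hardware): with no BTB entry, a branch whose target lies at a lower address, i.e. a backward looping branch, is predicted taken.

#include <stdbool.h>
#include <stdint.h>

bool static_predict_taken(uint32_t branch_pc, uint32_t target_pc) {
    return target_pc < branch_pc;   /* backward -> taken, forward -> not taken */
}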
Branch Target Buffer
Uses a branch history table and a branch target buffer to predict branches
Updating occurs when the branch is retired
Return Address Stack
16 entries
Predicts return addresses for procedure calls
Allows branches and their targets to coexist in a single cache line
– Increases parallelism since decode bandwidth is not wasted
Branch Hints
P4 permits software to provide hints to the branch prediction and trace formation hardware to enhance performance
Take the form of prefixes to conditional branch instructions
Used only at trace build time; have no effect on already-built traces
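From C, the closest handle to these hints is GCC/Clang's __builtin_expect (the hint prefixes themselves, documented as 0x3E "predict taken" and 0x2E "predict not taken" on conditional branches, are emitted at the assembly level); whether a given toolchain actually produces the prefix for the P4 is an assumption here, and handle_error is hypothetical:

/* Mark the error path as statically unlikely. */
extern void handle_error(int err);   /* hypothetical error handler */

void check_result(int err) {
    if (__builtin_expect(err != 0, 0))   /* cold-path hint */
        handle_error(err);
}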
Out-of-Order Execution
Designed to optimize performance by handling the most common operations in the most common context as fast as possible
126 µops can be in flight at once
– Up to 48 loads / 24 stores
Instructions are fetched and decoded by a translation engine
The translation engine builds instructions into sequences of µops
Stores µops to the trace cache
The trace cache can issue 3 µops per cycle
Can dispatch up to 6 µops per cycle
– Exceeds the trace cache issue and retirement bandwidth (3 µops per cycle)
– Allows for greater flexibility in issuing µops to different execution units
Execution Units
Double-pumped ALUs
– The ALU executes an operation on both the rising and falling edges of the clock cycle
Can retire 3 µops per cycle
Precise exceptions
Reorder buffer is used to organize completed µops
Also keeps track of branches and sends updated branch information to the BTB
Execution Pipeline
Data Stream of Pentium 4 Processor
Register Renaming
Register Renaming (2)
8-entry architectural register file
128-entry physical register file
2 RATs (Register Alias Tables)
– Frontend RAT and Retirement RAT
Data does not need to be copied between register files when the instruction retires
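A toy C sketch (illustrative only) of the structures named above: a frontend RAT mapping the 8 architectural registers onto a 128-entry physical register file. Free-list reclamation and the retirement RAT are omitted:

#include <stdio.h>

#define NUM_ARCH 8     /* architectural registers */
#define NUM_PHYS 128   /* physical register file entries */

static int rat[NUM_ARCH];          /* architectural -> physical mapping */
static int next_free = NUM_ARCH;   /* naive allocator for the sketch */

/* Rename the destination of a new µop: allocate a fresh physical register. */
static int rename_dest(int arch_reg) {
    rat[arch_reg] = next_free++ % NUM_PHYS;
    return rat[arch_reg];
}

int main(void) {
    for (int r = 0; r < NUM_ARCH; r++)   /* initial identity mapping */
        rat[r] = r;
    int first  = rename_dest(0);         /* first write to register 0 */
    int second = rename_dest(0);         /* a second write gets a new register */
    printf("arch reg 0 -> p%d, then p%d\n", first, second);
    return 0;
}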
On-chip Caches
L1 instruction cache (Trace Cache)
L1 data cache
L2 unified cache
All caches are non-inclusive and use a pseudo-LRU replacement algorithm
L1 Instruction Cache
Execution Trace Cache stores decoded µops
Removes decoder latency from main execution loops
Integrates the path of program execution flow into a single line
L1 Data Cache
Supports up to 4 outstanding load misses
Load latency
– 2 clocks for integer
– 6 clocks for floating-point
1 load and 1 store per clock
Speculative loads
– Assume the access will hit the cache
– "Replay" the dependent instructions on a miss
L2 Cache
Load latency
– Net load access latency of 7 cycles
Non-blocking
Bandwidth
– One load and one store in one cycle
– A new cache operation can begin every 2 cycles
– 256-bit wide bus between L1 and L2
– 32 bytes/transfer x 1.5 GHz = 48 GB/s @ 1.5 GHz
Data Prefetcher in L2 Cache
Hardware prefetcher monitors the data reference pattern
Brings cache lines in automatically
Attempts to stay 256 bytes ahead of the current data access location
Prefetches for up to 8 simultaneous, independent streams
Store and Load
Out-of-order store and load operations
– Stores are always committed in program order
48 loads and 24 stores can be in flight
Store buffers and load buffers are allocated at the allocation stage
– 24 store buffers and 48 load buffers in total
Store operations are divided into two parts:
– Store data
– Store address
Store data is dispatched to the fast ALU, which operates twice per cycle
Store address is dispatched to the store AGU (address generation unit), once per cycle
Store-to-Load Forwarding
Forwards data from a pending store buffer entry to a dependent load
Load stalls still happen when the bytes requested by the load are not exactly the same as the bytes in the pending store buffer
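A hedged illustration of the mismatch case: the 4-byte load below needs bytes from a pending 1-byte store, so the data cannot simply be forwarded from the store buffer and the load waits. The function is illustrative, not from the slides:

#include <stdint.h>

uint32_t narrow_store_wide_load(uint32_t *p) {
    *(uint8_t *)p = 0xFF;   /* narrow store, still sitting in the store buffer */
    return *p;              /* wider overlapping load: bytes don't match exactly */
}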
System Bus
Delivers data at 3.2 GB/s
– 64-bit wide bus
– Four data phases per clock cycle (quad-pumped)
– 100 MHz clocked system bus
– 8 bytes x 4 transfers/clock x 100 MHz = 3.2 GB/s
Reduced Cache Size
Increased Bandwidth and Lower Latency
What Went Wrong
No L3 cache
– Original plans called for a 1 MB cache
– Intel's idea was to strap a separate memory chip, perhaps an SDRAM, on the back of the processor to act as the L3
– But that added another 100 pads to the processor, and would have also forced Intel to devise an expensive cartridge package to contain the processor and cache memory
Small L1 Cache
Only 8k!
– Doubled size of L2 cache to compensate
Compare with
– AMD Athlon – 128K
– Alpha 21264 – 64K
– 32K
– Itanium – 16K
Loses consistently to AMD
In terms of performance, the Pentium 4 is as slow as or slower than existing Pentium III and AMD Athlon processors
In terms of price, an entry-level Pentium 4 sells for about double the cost of a similar Pentium III or AMD Athlon based system
The 1.5 GHz clock rate is more hype than substance
Northwood
Differences from Willamette
– Socket 478
– 21-stage pipeline
– 512 KB L2 cache
– 2.0 GHz, 2.2 GHz clock frequency
– 0.13 µm fabrication process (130 nm)
» 55 million transistors
Prescott
Differences
– 31-stage pipeline!
– 1 MB L2 cache
– 3.8 GHz clock frequency
– 0.09 µm fabrication process (90 nm)
