Bulldozer: An Approach to Multithreaded Compute Performance

Michael Butler, Leslie Barnes,
Debjit Das Sarma, Bob Gelinas
This paper appears in: IEEE Micro
March/April 2011 (vol. 31 no. 2)
pp. 6-15
Microprocessor Architecture
speaker: 박세준
1. Motivation
2. Introduction
3. Block diagram
4. Key features
5. Function block highlights
6. Bulldozer-based SoC
AMD has been focusing on increasing core count for highly parallel server workloads
Two basic observations
1. Future SoCs will support multiple execution threads
• Build around the smallest possible building module
2. Cores will operate in a constrained power environment
• Power-reduction techniques:
filtering, speculation reduction, data-movement minimization
Performance per watt!!
Bulldozer is a new direction in x86 core design
Bulldozer is the first x86 design to share
substantial hardware between multiple threads
Bulldozer is a hierarchical design with
sharing at nearly every level
Bulldozer is a high-frequency-optimized design
Instead of peak single-thread performance, average
(throughput) performance is increased
Major contributions
Scaling the core structures
Aggressive frequency goal (low gates per clock)
Block diagram
It combines two independent cores into a module
with a shared level-2 (L2) cache
Improved area and power efficiency
The module can fetch and
decode up to four x86
instructions per clock.
Each core can service two
loads per cycle.
Shared Frontend
• Decoupled predict and
fetch pipelines
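A quick sanity check on the numbers above; the 4.0 GHz clock here is a hypothetical value for illustration, not a figure from the paper:

```python
# Peak per-module bandwidth implied by the figures above.
# The 4.0 GHz clock is an assumed value for illustration only.
CLOCK_GHZ = 4.0
DECODE_PER_CYCLE = 4        # x86 instructions decoded per clock, per module
LOADS_PER_CYCLE = 2         # loads serviced per clock, per core
CORES_PER_MODULE = 2

peak_decode = DECODE_PER_CYCLE * CLOCK_GHZ                    # Ginstructions/s
peak_loads = LOADS_PER_CYCLE * CORES_PER_MODULE * CLOCK_GHZ   # Gloads/s

print(peak_decode, peak_loads)
```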
Block diagram
• Per thread, ALU resources decrease by 33% (two ALUs versus three in the prior core), while FPU resources increase by 33% when a single thread uses the whole shared FPU
Key features
1. Multithreading microarchitecture
Appropriate use of replication and shared hardware
Main advantage comes from sharing the instruction cache and branch predictor
Reinforced frontend (larger reorder buffer, BTB)
2. Decoupled branch-prediction from instruction fetch pipelines
Enablement of instruction prefetch using the prediction queue
Instruction control unit (reorder buffer) increased to 128 entries
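A toy model of the decoupled frontend idea (illustrative only, not the actual design): the predictor runs ahead of fetch and pushes predicted addresses into a prediction queue, so addresses waiting in the queue can be prefetched before the fetch pipeline needs them.

```python
from collections import deque

class DecoupledFrontend:
    """Sketch of a predict pipeline decoupled from fetch by a queue."""
    def __init__(self, depth=8):
        self.queue = deque()          # prediction queue
        self.depth = depth
        self.prefetched = set()       # I-cache lines already prefetched

    def predict(self, addr):
        """Predictor pushes a predicted fetch address and triggers prefetch."""
        if len(self.queue) >= self.depth:
            return False              # queue full: predictor stalls
        self.queue.append(addr)
        self.prefetched.add(addr)     # prefetch hides the I-cache miss
        return True

    def fetch(self):
        """Fetch pipeline consumes the oldest predicted address."""
        if not self.queue:
            return None
        addr = self.queue.popleft()
        hit = addr in self.prefetched # miss latency already hidden
        return addr, hit

fe = DecoupledFrontend()
for a in (0x1000, 0x1040, 0x2000):    # predictor runs ahead of fetch
    fe.predict(a)
first = fe.fetch()                    # oldest address, already prefetched
```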
3. Register renaming and operand delivery
the scheduler and operand handling are the biggest power consumers in the integer
execution unit
• PRF-based renaming microarchitecture for power efficiency
• Eliminates data replication
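A minimal sketch of PRF-based renaming (names and sizes are illustrative, not AMD's): architectural registers map to entries in one physical register file, so each result is written once and later consumers read the same PRF entry; no value is copied between a rename buffer and an architectural file.

```python
NUM_PHYS = 8  # assumed size for this sketch

class Renamer:
    def __init__(self):
        self.map_table = {}                 # arch reg -> phys reg
        self.free_list = list(range(NUM_PHYS))
        self.prf = [0] * NUM_PHYS           # the single copy of each value

    def rename_dest(self, arch):
        """Allocate a fresh phys reg for a new write to `arch`."""
        phys = self.free_list.pop(0)
        old = self.map_table.get(arch)      # old entry freed at retire
        self.map_table[arch] = phys
        return phys, old

    def read(self, arch):
        return self.prf[self.map_table[arch]]

r = Renamer()
p0, _ = r.rename_dest("rax")    # rax -> p0
r.prf[p0] = 7                   # execute: write result once into the PRF
p1, old = r.rename_dest("rax")  # a later write gets a new phys reg
r.prf[p1] = 9
# The old mapping p0 still holds 7 until retire; no data was moved.
```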
4. FMAC and media extension
FMAC (floating-point multiply-accumulate) units deliver significant peak execution bandwidth
One FPU per module, shared between the two cores like a coprocessor
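What makes a *fused* multiply-accumulate different is the single rounding: the product is not rounded before the add. The sketch below emulates that in plain Python (the helper name is mine) by computing a*b + c exactly with `fractions.Fraction` and converting to `float` once:

```python
from fractions import Fraction

def fma_emulated(a, b, c):
    """Emulate a fused multiply-accumulate: compute a*b + c exactly,
    then round once (Fraction -> float conversion is correctly rounded)."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

# With a separate multiply and add, the product is rounded first,
# losing its lowest bit; the fused form keeps it.
a = 1.0 + 2.0 ** -30           # exactly representable in a double
c = -(1.0 + 2.0 ** -29)        # exactly representable in a double

naive = a * a + c              # product rounded before the add -> 0.0
fused = fma_emulated(a, a, c)  # single rounding -> 2.0 ** -60
```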
Function block highlights
Branch prediction
multilevel BTB
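The idea of a multilevel BTB can be sketched as a small, fast first-level table backed by a larger second level; a level-2 hit supplies the target and promotes the entry. Sizes and the replacement policy here are illustrative assumptions, not Bulldozer's:

```python
class MultilevelBTB:
    """Toy two-level branch target buffer (illustrative sizes)."""
    def __init__(self, l1_size=4):
        self.l1 = {}            # small, fast level-1 BTB
        self.l2 = {}            # larger level-2 BTB (capacity not enforced)
        self.l1_size = l1_size

    def update(self, pc, target):
        """Install/refresh a taken branch's target in L1, spilling to L2."""
        if len(self.l1) >= self.l1_size and pc not in self.l1:
            victim = next(iter(self.l1))        # simplistic FIFO victim
            self.l2[victim] = self.l1.pop(victim)
        self.l1[pc] = target

    def lookup(self, pc):
        """Return (target, level) on a hit, or (None, None) on a miss."""
        if pc in self.l1:
            return self.l1[pc], 1
        if pc in self.l2:
            target = self.l2.pop(pc)
            self.update(pc, target)             # promote into L1
            return target, 2
        return None, None

btb = MultilevelBTB()
for pc in range(6):              # 6 branches overflow the 4-entry L1
    btb.update(pc, pc + 100)
target, level = btb.lookup(0)    # spilled to L2, then promoted back
```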
Instruction cache
64 Kbyte, two-way set-associative,
cache shared between both threads
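The geometry implied by a 64-Kbyte, two-way set-associative cache; the 64-byte line size is an assumption for illustration, since the slide does not state it:

```python
# Derive index/offset split for a 64-KB, 2-way set-associative cache.
CACHE_BYTES = 64 * 1024
WAYS = 2
LINE_BYTES = 64                                # assumed line size

sets = CACHE_BYTES // (WAYS * LINE_BYTES)      # sets = capacity / (ways * line)
index_bits = sets.bit_length() - 1             # log2(sets)
offset_bits = LINE_BYTES.bit_length() - 1      # log2(line size)
```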
Function block highlights
Branch fusion (Intel's term: macro-fusion); up to four x86 instructions per cycle
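Branch fusion can be illustrated as collapsing an adjacent compare + conditional jump into one dispatched op, so the pair consumes one decode slot instead of two. The mnemonics and fusion rule below are simplified for the sketch; the real rules are more restrictive:

```python
FUSIBLE_FIRST = {"cmp", "test"}   # simplified: ops that can lead a fused pair

def fuse(stream):
    """Collapse cmp/test followed by a jcc into one fused macro-op."""
    out, i = [], 0
    while i < len(stream):
        op = stream[i]
        nxt = stream[i + 1] if i + 1 < len(stream) else None
        if op in FUSIBLE_FIRST and nxt is not None and nxt.startswith("j"):
            out.append(f"{op}+{nxt}")   # one fused macro-op
            i += 2
        else:
            out.append(op)
            i += 1
    return out

# 6 x86 instructions become 4 macro-ops: fits one 4-wide decode cycle.
ops = fuse(["add", "cmp", "jne", "mov", "test", "jz"])
```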
Bulldozer execution pipeline
Function block highlights
Integer scheduler and execution
Renaming via PRF (physical register file)
Floating point
The FPU is a coprocessor shared between the two integer cores
L2 cache
The two cores share a unified L2 cache
Bulldozer-based SoC
1. In single threading, peak performance is sacrificed, but overall throughput increases
2. In single threading, the FPU is more important (one thread can use the whole shared FPU)
3. Server workloads need ALU performance
Bulldozer can deliver a significant performance improvement in the same
power envelope
The end
