Omar Aragon
Abdel Salam Sayyad
This presentation is missing the
references used
• Features
• Block diagram
• Microarchitecture
• Pipeline
• Cache
• Memory controller
• HyperTransport
• InterCPU Connections
Features
• 64-bit x86-based microprocessor
• On-chip double-data-rate (DDR) memory controller [low memory latency]
• Three HyperTransport links [connect to other devices without support chips]
• Out-of-order, superscalar processor
• Adds 64-bit (48-bit virtual and 40-bit physical) addressing and expands the number of registers
• Supports legacy 32-bit applications without modification or recompilation
• Doubles the number of registers
• Integer general-purpose registers (GPRs): 16 each
• Streaming SIMD Extension (SSE) registers: 16 each
• Satisfies the register allocation needs of more than 80% of functions appearing in a typical program
• Connected to memory through an integrated memory controller
• High-performance I/O subsystem via the HyperTransport bus
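The addressing widths listed in the features can be turned into address-space sizes with a little arithmetic. The sketch below is only an illustration; the names `VIRTUAL_BITS`, `PHYSICAL_BITS`, and `address_space_bytes` are chosen here and do not come from the slides.

```python
# Address widths as stated in the features list.
VIRTUAL_BITS = 48
PHYSICAL_BITS = 40

TIB = 1 << 40  # one tebibyte in bytes


def address_space_bytes(bits: int) -> int:
    """Number of distinct byte addresses reachable with `bits` address bits."""
    return 1 << bits


virtual_tib = address_space_bytes(VIRTUAL_BITS) // TIB
physical_tib = address_space_bytes(PHYSICAL_BITS) // TIB

print(f"virtual address space:  {virtual_tib} TiB")   # 256 TiB
print(f"physical address space: {physical_tib} TiB")  # 1 TiB
```

For comparison, a 32-bit address reaches only 4 GiB, which is why the wider addressing is listed as a headline feature.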
Block diagram
Microarchitecture
• Works with fixed-length micro-ops and dispatches them into two independent schedulers: one for integer, and one for floating point and multimedia (MMX, 3DNow, SSE, and SSE2)
• Load and store micro-ops go to the load/store unit
• Up to 11 micro-ops are dispatched each cycle to the following execution resources:
• Three integer execution units
• Three address generation units
• Three floating-point and multimedia units
• Two load/store accesses to the data cache
Pipeline
• Long enough for high frequency and short enough for good IPC (instructions per cycle)
• Fully integrated from instruction fetch through DRAM access
• Execute pipeline is typically:
• 12 stages for integer
• 17 stages for floating point
• Data cache access occurs in stage 11
• On an L1 cache miss, the pipeline accesses the L2 cache and, in parallel, the request goes to the system request queue
• The pipeline in the DRAM controller runs at the same frequency as the core
Memory, Cache, and HyperTransport
Cache
• Separate L1 instruction and data caches
• Each is 64 Kbytes, 2-way set associative, with a 64-byte cache line
• L2 cache (data and instructions)
• Size: 1 Mbyte, 16-way set associative
• Uses a pseudo-least-recently-used (pseudo-LRU) replacement policy
• Independent L1 and L2 translation look-aside buffers (TLBs)
• The L1 TLB is fully associative and stores thirty-two 4-Kbyte page translations and eight 2-Mbyte/4-Mbyte page translations
• The L2 TLB is four-way set associative with 512 4-Kbyte entries
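The cache and TLB parameters above determine the number of sets, the address bits used for indexing, and how much memory the TLBs can map. The following is a minimal sketch of that arithmetic, assuming power-of-two sizes; the helper name `cache_geometry` is made up for this illustration.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes size, associativity, and line size are powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2(line size)
    index_bits = sets.bit_length() - 1         # log2(number of sets)
    return sets, offset_bits, index_bits


KB = 1024

# L1: 64 Kbytes, 2-way, 64-byte lines -> 512 sets, 6 offset bits, 9 index bits
l1 = cache_geometry(64 * KB, ways=2, line_bytes=64)

# L2: 1 Mbyte, 16-way, 64-byte lines -> 1024 sets, 6 offset bits, 10 index bits
l2 = cache_geometry(1024 * KB, ways=16, line_bytes=64)

# TLB reach with 4-Kbyte pages: entries * page size
l1_tlb_reach = 32 * 4 * KB    # 128 Kbytes
l2_tlb_reach = 512 * 4 * KB   # 2 Mbytes

print("L1 (sets, offset bits, index bits):", l1)
print("L2 (sets, offset bits, index bits):", l2)
print("L1 TLB reach (Kbytes):", l1_tlb_reach // KB)
print("L2 TLB reach (Kbytes):", l2_tlb_reach // KB)
```

The L2 TLB covers 2 Mbytes with 4-Kbyte pages, which is twice the size of the 1-Mbyte L2 cache.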
Onboard Memory Control
• 128-bit memory bus
• Latency reduced and bandwidth doubled
• Multiprocessor: each processor has its own memory interface and its own memory
• Available memory scales with the number of processors
• Up to 8 registered DDR DIMMs per processor
• Memory bandwidth of up to 5.3 Gbytes/s per processor
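The 5.3 Gbytes/s figure is consistent with a 128-bit bus if the memory runs at roughly 333 million transfers per second. That transfer rate is an assumption made for this sketch (it corresponds to DDR333 memory) and is not stated on the slide. The last line also illustrates how aggregate bandwidth grows with processor count, using a hypothetical 4-processor system.

```python
BUS_WIDTH_BITS = 128             # from the slide
TRANSFERS_PER_SEC = 333_000_000  # ASSUMPTION: DDR333-class memory, not from the slide

bytes_per_transfer = BUS_WIDTH_BITS // 8  # 16 bytes per transfer
peak_bytes_per_sec = bytes_per_transfer * TRANSFERS_PER_SEC
peak_gbytes_per_sec = round(peak_bytes_per_sec / 1e9, 1)

print(f"peak per processor: {peak_gbytes_per_sec} Gbytes/s")  # 5.3

# Each processor has its own controller, so peak aggregate bandwidth scales
# with the number of processors (hypothetical 4-processor system).
processors = 4
aggregate_gbytes_per_sec = round(processors * peak_bytes_per_sec / 1e9, 1)
print(f"peak aggregate for {processors} processors: {aggregate_gbytes_per_sec} Gbytes/s")
```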
HyperTransport
• Bidirectional, serial/parallel, scalable, high-bandwidth, low-latency bus
• Packet based
• 32-bit words regardless of physical width
• Facilitates power management and low latencies
HyperTransport in the Opteron
• 16 CAD HyperTransport (16-bit wide; CAD = Command, Address, Data)
• Processor-to-processor and processor-to-chipset
• Bandwidth of up to 6.4 Gbytes/s (per HT port)
• 8-bit wide HyperTransport for components such as normal
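The 6.4 Gbytes/s per-port figure can be reproduced from the 16-bit link width if each direction carries 1.6 billion transfers per second and both directions are counted together. The transfer rate is an assumption for this sketch (an 800 MHz link clock with data on both clock edges); the slide gives only the width and the total.

```python
LINK_WIDTH_BITS = 16                # from the slide (16 CAD)
TRANSFERS_PER_SEC = 1_600_000_000   # ASSUMPTION: 800 MHz clock, double data rate
DIRECTIONS = 2                      # the link is bidirectional

bytes_per_transfer = LINK_WIDTH_BITS // 8              # 2 bytes
per_direction = bytes_per_transfer * TRANSFERS_PER_SEC  # bytes/s, one direction
per_port = per_direction * DIRECTIONS                   # bytes/s, both directions

print(f"per direction: {per_direction / 1e9} Gbytes/s")  # 3.2
print(f"per port:      {per_port / 1e9} Gbytes/s")       # 6.4
```

Under the same assumed transfer rate, an 8-bit link would carry half as much per direction.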
InterCPU Connections
• Multiple CPUs connected through a proprietary extension
running on additional HyperTransport interfaces
• Allows support of a cache-coherent, Non-Uniform Memory
Access, multi-CPU memory access protocol
• Non-Uniform Memory Access
• Separate cache memory for each processor
• Memory access time depends on memory location (i.e., local memory is faster than non-local memory)
• Cache coherence
• Integrity of data stored in local caches of a shared resource
• Each CPU can access the main memory of another
processor, transparent to the programmer
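The non-uniform access cost described above can be illustrated with a toy model in which each extra HyperTransport hop adds a fixed delay. Everything specific here is hypothetical: the 4-processor square topology in `LINKS` and the latency constants `LOCAL_NS` and `HOP_NS` are illustrative values, not measurements or figures from the slides.

```python
from collections import deque

# HYPOTHETICAL 4-processor square: each CPU links to two neighbours.
LINKS = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}

LOCAL_NS = 100  # illustrative local DRAM latency (not a measured value)
HOP_NS = 50     # illustrative added latency per HyperTransport hop


def hops(src: int, dst: int) -> int:
    """Minimum number of links between two CPUs (breadth-first search)."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in LINKS[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError(f"CPU {dst} is not reachable from CPU {src}")


def access_ns(cpu: int, home: int) -> int:
    """Modelled latency for `cpu` reading memory attached to `home`."""
    return LOCAL_NS + HOP_NS * hops(cpu, home)


for home in range(4):
    print(f"CPU 0 -> memory on CPU {home}: {access_ns(0, home)} ns")
```

In this model the programmer issues the same load regardless of where the data lives; only the latency differs, which matches the "transparent to the programmer" point above.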
