ALU Architecture and ISA Extensions
Lecture notes from MKP, H. H. Lee and S. Yalamanchili
Reading
• Sections 3.2-3.5 (only those elements covered
in class)
• Sections 3.6-3.8
• Appendix B.5
• Goal: Understand the
  - ISA view of the core microarchitecture
  - Organization of functional units and register files into basic data paths
(2)
Overview
• Instruction Set Architectures have a purpose
  - Applications dictate what we need
• We only have a fixed number of bits
  - Impact on accuracy
• More is not better
  - We cannot afford everything we want
• Basic Arithmetic Logic Unit (ALU) Design
  - Addition/subtraction, multiplication, division
(3)
Reminder: ISA
[Figure: ISA view of the processor and memory. Programmer-visible state: the register file (entries labeled 0x00 to 0x1F), connected through the processor internal buses to the Arithmetic Logic Unit (ALU) and the memory interface. Programmer-invisible state: program counter, instruction register, kernel registers. Memory map of the byte-addressed memory (addresses up to 0xFFFFFFFF): reserved, text segment, data segment (static), dynamic data, stack.]
Who sees what?
(4)
Arithmetic for Computers
• Operations on integers
  - Addition and subtraction
  - Multiplication and division
  - Dealing with overflow
• Operations on floating-point real numbers
  - Representation and operations
• Let us first look at integers
(5)
Integer Addition (3.2)
• Example: 7 + 6
• Overflow if result out of range
  - Adding +ve and –ve operands, no overflow
  - Adding two +ve operands
    o Overflow if result sign is 1
  - Adding two –ve operands
    o Overflow if result sign is 0
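The sign rules above can be checked with a short sketch. It assumes 32-bit operands passed as unsigned bit patterns (0 to 2^32 - 1); the names add32 and to_signed32 are illustrative and do not come from the lecture.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(word):
    """Interpret a 32-bit pattern as a two's complement integer."""
    return word - (1 << 32) if word & 0x80000000 else word

def add32(a, b):
    """Add two 32-bit patterns; return (result pattern, overflow flag)."""
    result = (a + b) & MASK32
    sign_a, sign_b, sign_r = a >> 31, b >> 31, result >> 31
    # Overflow only when the operands share a sign and the result's sign differs.
    overflow = (sign_a == sign_b) and (sign_r != sign_a)
    return result, overflow
```

For example, add32(0x7FFFFFFF, 1) adds two positive operands and returns a result whose sign bit is 1, so the overflow flag is set.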
(6)
Integer Subtraction
• Add negation of second operand
• Example: 7 – 6 = 7 + (–6), in 2’s complement representation
    +7:  0000 0000 … 0000 0111
    –6:  1111 1111 … 1111 1010
    +1:  0000 0000 … 0000 0001
• Overflow if result out of range
  - Subtracting two +ve or two –ve operands, no overflow
  - Subtracting +ve from –ve operand
    o Overflow if result sign is 0
  - Subtracting –ve from +ve operand
    o Overflow if result sign is 1
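A minimal sketch of subtraction as "add the negation", again assuming 32-bit operands given as unsigned bit patterns. The helper names negate32 and sub32 are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def negate32(b):
    """Two's complement negation: invert every bit, then add 1."""
    return ((~b) + 1) & MASK32

def sub32(a, b):
    """Compute a - b as a + (-b); return (result pattern, overflow flag)."""
    result = (a + negate32(b)) & MASK32
    sign_a, sign_b, sign_r = a >> 31, b >> 31, result >> 31
    # Overflow only when the operand signs differ and the result's sign
    # differs from the sign of a.
    overflow = (sign_a != sign_b) and (sign_r != sign_a)
    return result, overflow
```

negate32(6) yields the pattern 1111 … 1010 shown in the example, and sub32(7, 6) returns the pattern for +1 with no overflow.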
(7)
ISA Impact
• Some languages (e.g., C) ignore overflow
  - Use MIPS addu, addiu, subu instructions
• Other languages (e.g., Ada, Fortran) require raising an exception
  - Use MIPS add, addi, sub instructions
  - On overflow, invoke exception handler
    o Save PC in exception program counter (EPC) register
    o Jump to predefined handler address
    o mfc0 (move from coprocessor register) instruction can retrieve EPC value, to return after corrective action (more later)
• ALU design leads to many solutions. We look at one simple example
(8)
Integer ALU (arithmetic logic unit) (B.5)
• Build a 1 bit ALU, and use 32 of them (bit-slice)
[Figure: ALU block with inputs a and b, an operation control input, and a result output.]
(9)
Single Bit ALU
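Implements only AND and OR operations
[Figure: 1-bit ALU. Inputs A and B feed an AND gate and an OR gate; a multiplexer controlled by Operation selects the AND output (input 0) or the OR output (input 1) as Result.]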
(10)
Adding Functionality
• We can add additional operators (to a point)
• How about addition?
cout = a·b + a·cin + b·cin
sum = a ⊕ b ⊕ cin
[Figure: 1-bit full adder with inputs a, b, CarryIn and outputs Sum, CarryOut.]
• Review full adders from digital design
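The two full-adder equations can be written directly as bit operations. This is a small sketch; the function name full_adder is chosen here for illustration.

```python
def full_adder(a, b, cin):
    """One-bit full adder; a, b, and cin are each 0 or 1."""
    total = a ^ b ^ cin                       # sum = a XOR b XOR cin
    cout = (a & b) | (a & cin) | (b & cin)    # cout = ab + a*cin + b*cin
    return total, cout
```

For every input combination, total + 2 * cout equals a + b + cin, which is a quick way to confirm the equations against ordinary addition.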
(11)
Building a 32-bit ALU
[Figure: 32-bit ALU built from 32 one-bit ALUs, ALU0 to ALU31. ALUi takes ai, bi, and CarryIn and produces Resulti and CarryOut; the CarryOut of each ALU is the CarryIn of the next. All 32 ALUs share the Operation control.]
(12)
Subtraction (a – b) ?
• Two's complement approach: just negate b and add 1.
• How do we negate?
• A clever solution: invert each bit of b and supply the extra 1 through the CarryIn of ALU0, so that the adder computes a + (NOT b) + 1 = a – b.
[Figure: the 1-bit ALU gains a Binvert control that selects b or its complement as the adder input. In the 32-bit ALU (ALU0 to ALU31), a sub signal sets Binvert and the CarryIn of ALU0 to 1.]
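The bit-slice construction with Binvert can be simulated one slice at a time. This is a sketch under stated assumptions: the operation encoding (0 = AND, 1 = OR, 2 = add) and the function names are chosen here for illustration, and a single binvert argument drives both Binvert and the CarryIn of slice 0.

```python
def alu_slice(a, b, carry_in, binvert, operation):
    """One-bit ALU slice: operation 0 = AND, 1 = OR, 2 = add."""
    b = b ^ binvert                      # Binvert selects b or its complement
    total = a ^ b ^ carry_in
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)
    result = (a & b, a | b, total)[operation]
    return result, carry_out

def alu32(a, b, binvert, operation):
    """Chain 32 slices; the carry ripples from slice 0 up to slice 31."""
    carry = binvert                      # supplies the "+1" for subtraction
    word = 0
    for i in range(32):
        bit, carry = alu_slice((a >> i) & 1, (b >> i) & 1, carry, binvert, operation)
        word |= bit << i
    return word
```

With binvert = 1 and operation = 2, alu32(7, 6, 1, 2) computes 7 + (NOT 6) + 1, which is 7 - 6.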
(13)
Tailoring the ALU to the MIPS
• Need to support the set-on-less-than instruction (slt)
  - remember: slt is an arithmetic instruction
  - produces a 1 if rs < rt and 0 otherwise
  - use subtraction: (a-b) < 0 implies a < b
• Need to support test for equality (beq $t5, $t6, label)
  - use subtraction: (a-b) = 0 implies a = b
(14)
What is Result31 when (a-b) < 0?
[Figure: 32-bit ALU in which every 1-bit ALU has an additional Less input, selected as the result when Operation = 3. The Less inputs of ALU1 to ALU31 are 0. ALU31 produces the Set and Overflow outputs, and Set is connected to the Less input of ALU0.]
Unsigned vs. signed support
(15)
Test for equality
• Notice control lines (Bnegate, Operation):
    000 = and
    001 = or
    010 = add
    110 = subtract
    111 = slt
• Note: Zero is a 1 when the result is zero!
• Note test for overflow!
[Figure: final 32-bit ALU (ALU0 to ALU31). Bnegate drives both Binvert and the CarryIn of ALU0; the Less inputs of ALU1 to ALU31 are 0; Zero is derived from Result0 to Result31; ALU31 produces Set and Overflow.]
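A word-level sketch of the five control codes and the Zero output. It assumes the top control bit is Bnegate and the low two bits select the operation; the function name mips_alu is illustrative. For slt it combines the sign bit with the overflow signal, which is one way to keep the comparison correct when a - b overflows; the simpler design uses the sign bit alone.

```python
MASK32 = 0xFFFFFFFF

def mips_alu(control, a, b):
    """Return (result, zero) for the 3-bit control codes listed above."""
    bnegate, operation = control >> 2, control & 0b11
    operand_b = (~b & MASK32) if bnegate else b
    adder_out = (a + operand_b + bnegate) & MASK32   # Bnegate also drives CarryIn
    if operation == 0b00:
        result = a & operand_b
    elif operation == 0b01:
        result = a | operand_b
    elif operation == 0b10:
        result = adder_out
    else:
        # slt: Set is the sign bit of a - b, corrected by the overflow signal.
        sign = adder_out >> 31
        overflow = ((a >> 31) != (b >> 31)) and (sign != (a >> 31))
        result = sign ^ int(overflow)
    zero = int(result == 0)                          # Zero is 1 when the result is 0
    return result, zero
```

A beq comparison corresponds to mips_alu(0b110, a, b): the operands are equal exactly when the returned zero flag is 1.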
(16)
ISA View
[Figure: CPU/Core containing the register file ($0, $1, …, $31) and the ALU.]
• Register-to-Register data path
• We want this to be as fast as possible
(17)
Multiplication (3.3)
• Long multiplication
    multiplicand      1000
    multiplier      × 1001
                      1000
                     0000
                    0000
                   1000
    product        1001000
• Length of product is the sum of operand lengths
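The long multiplication above is a sequence of shifts and conditional adds. A minimal sketch for unsigned operands follows; the function name and the bits parameter are assumptions made for illustration.

```python
def shift_add_multiply(multiplicand, multiplier, bits=4):
    """Multiply two unsigned `bits`-wide values with shifts and adds."""
    product = 0
    for i in range(bits):
        if (multiplier >> i) & 1:            # examine multiplier bit i
            product += multiplicand << i     # add the shifted multiplicand
    return product                           # fits in 2 * bits bits
```

shift_add_multiply(0b1000, 0b1001) reproduces the product 1001000 from the example, and the largest 4-bit case, 15 × 15 = 225, still fits in 8 bits.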
(18)
A Multiplier
• Uses multiple adders
  - Cost/performance tradeoff
• Can be pipelined
  - Several multiplications performed in parallel
(19)
MIPS Multiplication
• Two 32-bit registers for product
  - HI: most-significant 32 bits
  - LO: least-significant 32 bits
• Instructions
  - mult rs, rt / multu rs, rt
    o 64-bit product in HI/LO
  - mfhi rd / mflo rd
    o Move from HI/LO to rd
    o Can test HI value to see if product overflows 32 bits
  - mul rd, rs, rt
    o Least-significant 32 bits of product → rd
Study Exercise: Check out signed and
unsigned multiplication with QtSPIM
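Before trying the exercise in QtSPIM, the HI/LO split can be previewed with a small model. This is a sketch, not the simulator's behavior verbatim: registers are passed as unsigned 32-bit patterns, and the function name mult with its signed flag is an assumption that stands in for the mult/multu pair.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(word):
    """Interpret a 32-bit pattern as a two's complement integer."""
    return word - (1 << 32) if word & 0x80000000 else word

def mult(rs, rt, signed=True):
    """Model mult (signed=True) or multu (signed=False); return (HI, LO)."""
    if signed:
        rs, rt = to_signed32(rs), to_signed32(rt)
    product = (rs * rt) & 0xFFFFFFFFFFFFFFFF   # 64-bit two's complement pattern
    return product >> 32, product & MASK32
```

The same bit patterns give different HI values: 0xFFFFFFFF times 2 is -2 when signed (HI is all ones) but 0x1FFFFFFFE when unsigned (HI is 1). For unsigned operands, a nonzero HI indicates that the product does not fit in 32 bits.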
(20)
Division (3.4)
                        1001      quotient
    divisor   1000 ) 1001010   dividend
                    -1000
                        10
                        101
                        1010
                       -1000
                          10   remainder
• n-bit operands yield n-bit quotient and remainder
• Check for 0 divisor
• Long division approach
  - If divisor ≤ dividend bits
    o 1 bit in quotient, subtract
  - Otherwise
    o 0 bit in quotient, bring down next dividend bit
• Restoring division
  - Do the subtract, and if remainder goes < 0, add divisor back
• Signed division
  - Divide using absolute values
  - Adjust sign of quotient and remainder as required
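The restoring scheme described above can be sketched for unsigned operands as follows. The function name and the bits parameter are illustrative; the zero-divisor check is done in software, as the slide suggests.

```python
def restoring_divide(dividend, divisor, bits=32):
    """Unsigned restoring division; returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("check for 0 divisor before dividing")
    remainder, quotient = 0, 0
    for i in range(bits - 1, -1, -1):
        # Bring down the next dividend bit.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        remainder -= divisor                 # trial subtraction
        if remainder < 0:
            remainder += divisor             # restore; quotient bit is 0
            quotient <<= 1
        else:
            quotient = (quotient << 1) | 1   # quotient bit is 1
    return quotient, remainder
```

restoring_divide(0b1001010, 0b1000) returns quotient 1001 and remainder 10, matching the long division example.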
(21)
Faster Division
• Can’t use parallel hardware as in multiplier
  - Subtraction is conditional on sign of remainder
• Faster dividers (e.g. SRT division) generate multiple quotient bits per step
  - Still require multiple steps
• Customized implementations for high performance, e.g., supercomputers
(22)
MIPS Division
• Use HI/LO registers for result
  - HI: 32-bit remainder
  - LO: 32-bit quotient
• Instructions
  - div rs, rt / divu rs, rt
  - No overflow or divide-by-0 checking
    o Software must perform checks if required
  - Use mfhi, mflo to access result
Study Exercise: Check out signed and unsigned division with QtSPIM
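A sketch of signed division into HI/LO, following the "divide using absolute values, then adjust signs" recipe. It assumes the quotient is negative when the operand signs differ and the remainder takes the sign of the dividend; the function name div and the software zero check are illustrative, and results should be compared against QtSPIM.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(word):
    """Interpret a 32-bit pattern as a two's complement integer."""
    return word - (1 << 32) if word & 0x80000000 else word

def div(rs, rt):
    """Model signed div: return (HI, LO) = (remainder, quotient) patterns."""
    a, b = to_signed32(rs), to_signed32(rt)
    if b == 0:
        raise ZeroDivisionError("hardware does not check; software must")
    quotient, remainder = divmod(abs(a), abs(b))   # divide absolute values
    if (a < 0) != (b < 0):
        quotient = -quotient        # operand signs differ: negative quotient
    if a < 0:
        remainder = -remainder      # remainder takes the dividend's sign
    return remainder & MASK32, quotient & MASK32
```

Under these assumptions -7 / 2 gives quotient -3 and remainder -1, whereas Python's own floor division gives -7 // 2 == -4.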
(23)
ISA View
[Figure: CPU/Core containing the register file ($0, $1, …, $31), the ALU, and a Multiply/Divide unit with the Hi and Lo registers.]
• Additional function units and registers (Hi/Lo)
• Additional instructions to move data to/from these registers
  - mfhi, mflo
• What other instructions would you add? Cost?
(24)
Floating Point (3.5)
• Representation for non-integral numbers
  - Including very small and very large numbers
• Like scientific notation
  - –2.34 × 10^56 (normalized)
  - +0.002 × 10^–4 (not normalized)
  - +987.02 × 10^9 (not normalized)
• In binary
  - ±1.xxxxxxx (base 2) × 2^yyyy
• Types float and double in C
(25)
IEEE 754 Floating-point Representation
Single Precision (32-bit)
    bit 31        S (sign)      1 bit
    bits 30-23    exponent      8 bits
    bits 22-0     significand   23 bits
    value = (–1)^sign × (1 + fraction) × 2^(exponent – 127)
Double Precision (64-bit)
    bit 63        S (sign)                  1 bit
    bits 62-52    exponent                  11 bits
    bits 51-32    significand               20 bits
    bits 31-0     significand (continued)   32 bits
    value = (–1)^sign × (1 + fraction) × 2^(exponent – 1023)
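The single-precision fields can be extracted with Python's struct module, which packs a number into its IEEE 754 encoding. This is a sketch; the function names are illustrative, and field_value applies the formula above, so it holds for normalized numbers only (exponent field between 1 and 254).

```python
import struct

def decode_single(value):
    """Split a number's single-precision encoding into its three fields."""
    (word,) = struct.unpack(">I", struct.pack(">f", value))
    sign = word >> 31
    exponent = (word >> 23) & 0xFF
    fraction = word & 0x7FFFFF
    return sign, exponent, fraction

def field_value(sign, exponent, fraction):
    """Apply (-1)^sign x (1 + fraction) x 2^(exponent - 127)."""
    return (-1) ** sign * (1 + fraction / 2 ** 23) * 2.0 ** (exponent - 127)
```

For -0.75 = -1.1 (base 2) × 2^-1, the fields are sign 1, exponent 126, and a fraction with only its top bit set.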
(26)
Floating Point Standard
• Defined by IEEE Std 754-1985
• Developed in response to divergence of representations
  - Portability issues for scientific code
• Now almost universally adopted
• Two representations
  - Single precision (32-bit)
  - Double precision (64-bit)
(27)
FP Adder Hardware
• Much more complex than integer adder
• Doing it in one clock cycle would take too long
  - Much longer than integer operations
  - Slower clock would penalize all instructions
• FP adder usually takes several cycles
  - Can be pipelined
Example: FP Addition
(28)
FP Adder Hardware
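[Figure: FP adder datapath, annotated with four steps.]
• Step 1: compare the exponents and shift the significand of the smaller number right to align
• Step 2: add the significands
• Step 3: normalize the sum
• Step 4: round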
(29)
FP Arithmetic Hardware
• FP multiplier is of similar complexity to FP adder
  - But uses a multiplier for significands instead of an adder
• FP arithmetic hardware usually does
  - Addition, subtraction, multiplication, division, reciprocal, square-root
  - FP ↔ integer conversion
• Operations usually take several cycles
  - Can be pipelined
(30)
ISA Impact
• FP hardware is coprocessor 1
  - Adjunct processor that extends the ISA
• Separate FP registers
  - 32 single-precision: $f0, $f1, … $f31
  - Paired for double-precision: $f0/$f1, $f2/$f3, …
    o Release 2 of the MIPS ISA supports 32 × 64-bit FP registers
• FP instructions operate only on FP registers
  - Programs generally do not perform integer ops on FP data, or vice versa
  - More registers with minimal code-size impact
(31)
ISA View: The Co-Processor
[Figure: CPU/Core (registers $0, $1, …, $31; ALU; Multiply/Divide unit with Hi and Lo), Co-Processor 1 (FP registers $0, $1, …, $31; FP ALU), and Co-Processor 0 (BadVaddr, Status, Cause, and EPC registers, covered later).]
• Floating point operations access a separate set of 32-bit registers
  - Pairs of 32-bit registers are used for double precision
(32)
ISA View
• Distinct instructions operate on the floating point registers (pg. A-73)
  - Arithmetic instructions
    o add.d fd, fs, ft (double precision) and add.s fd, fs, ft (single precision)
• Data movement to/from floating point coprocessors
  - mfc1 rt, fs and mtc1 rd, fs
• Note that the ISA design implementation is extensible via co-processors
• FP load and store instructions
  - lwc1, ldc1, swc1, sdc1
    o e.g., ldc1 $f8, 32($sp)
Example: DP Mean
(33)
Associativity
• Floating point arithmetic is not associative
• Parallel programs may interleave operations in unexpected orders
  - Assumptions of associativity may fail
    With x = -1.50E+38, y = 1.50E+38, z = 1.0:
      (x+y)+z:  x+y = 0.00E+00,  then (x+y)+z = 1.00E+00
      x+(y+z):  y+z = 1.50E+38,  then x+(y+z) = 0.00E+00
  - Need to validate parallel programs under varying degrees of parallelism
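The same values reproduce the effect in a few lines. Note one assumption: Python floats are double precision rather than single precision, but 1.0 is still far below the precision available at a magnitude of 1.5e38, so the outcome matches the table.

```python
x, y, z = -1.5e38, 1.5e38, 1.0

left = (x + y) + z     # x + y is exactly 0.0, so the sum is 1.0
right = x + (y + z)    # y + z rounds back to 1.5e38, so the sum is 0.0
```

The two groupings differ by exactly the value of z, which is lost when it is added to y first.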
(34)
Performance Issues
• Latency of instructions
  - Integer instructions can take a single cycle
  - Floating point instructions can take multiple cycles
  - Some (FP Divide) can take hundreds of cycles
• What about energy? (We will get to that shortly)
• What other instructions would you like in hardware?
  - Would some applications change your mind?
• How do you decide whether to add new instructions?
(35)
Multimedia (3.6, 3.7, 3.8)
• Lower dynamic range and precision requirements
  - Do not need 32-bits!
• Inherent parallelism in the operations
(36)
Vector Computation
• Operate on multiple data elements (vectors) at a time
• Flexible definition/use of registers
• Registers hold integers, floats (SP), doubles (DP)
  - 128-bit register
    o 1 × 128-bit integer
    o 2 × 64-bit double precision
    o 4 × 32-bit single precision
    o 8 × 16-bit short integers
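The "flexible use" of a 128-bit register can be illustrated by reading the same 16 bytes under each of the four layouts. This is a sketch: the register variable is a hypothetical stand-in for a vector register, and little-endian byte order is assumed.

```python
import struct

# A hypothetical 128-bit register value, held as 16 raw bytes.
register = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)

as_singles = struct.unpack("<4f", register)       # 4 x 32-bit single precision
as_doubles = struct.unpack("<2d", register)       # 2 x 64-bit double precision
as_shorts = struct.unpack("<8h", register)        # 8 x 16-bit short integers
as_integer = int.from_bytes(register, "little")   # 1 x 128-bit integer
```

The bits never change; only the instruction that operates on the register decides how they are partitioned into elements.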
(37)
Processing Vectors
• When is this more efficient?
• When is this not efficient?
• Think of 3D graphics, linear algebra and media processing
[Figure: data moving between memory and vector registers.]
(38)
Case Study: Intel Streaming SIMD Extensions
• Eight 128-bit XMM registers
  - x86-64 adds 8 more registers, XMM8-XMM15
• 8, 16, 32, 64 bit integers (SSE2)
• 32-bit (SP) and 64-bit (DP) floating point
• Signed/unsigned integer operations
• IEEE 754 floating point support
• Reading Assignment:
  - http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
  - http://neilkemp.us/src/sse_tutorial/sse_tutorial.html#I
(39)
Instruction Categories
• Floating point instructions
  - Arithmetic, movement
  - Comparison, shuffling
  - Type conversion, bit level
• Integer
• Other
  - e.g., cache management
• ISA extensions!
• Advanced Vector Extensions (AVX)
  - Successor to SSE
[Figure: diagram with elements labeled register, memory, register.]
(40)
Arithmetic View
• Graphics and media processing operates on vectors of 8-bit and 16-bit data
  - Use 64-bit adder, with partitioned carry chain
    o Operate on 8×8-bit, 4×16-bit, or 2×32-bit vectors
  - SIMD (single-instruction, multiple-data)
• Saturating operations
  - On overflow, result is largest representable value
    o c.f. 2s-complement modulo arithmetic
  - E.g., clipping in audio, saturation in video
[Figure: 64-bit word partitioned as 4×16-bit and as 2×32-bit.]
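A partitioned add and its saturating variant can be sketched for the 4×16-bit case. The assumptions here are unsigned lanes packed into a 64-bit integer; the function name packed_add16 is illustrative.

```python
def packed_add16(a, b, saturate=False):
    """Add four unsigned 16-bit lanes packed in 64-bit words a and b."""
    result = 0
    for lane in range(4):
        shift = 16 * lane
        total = ((a >> shift) & 0xFFFF) + ((b >> shift) & 0xFFFF)
        if saturate:
            total = min(total, 0xFFFF)   # clamp to largest representable value
        else:
            total &= 0xFFFF              # wrap; the carry never crosses lanes
        result |= total << shift
    return result
```

With a = 0x0001_FFFF_0002_FFFF and b = 0x0001_0001_0003_0002, the two lanes that overflow wrap to small values in the modulo version and clamp to 0xFFFF in the saturating version, while the neighboring lanes are unaffected.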
(41)
SSE Example
// A 16-byte = 128-bit vector struct
struct Vector4
{
    float x, y, z, w;
};

// Add two constant vectors and return the resulting vector
Vector4 SSE_Add ( const Vector4 &Op_A, const Vector4 &Op_B )
{
    Vector4 Ret_Vector;
    __asm
    {
        MOV EAX, Op_A               // Load pointers into CPU regs
        MOV EBX, Op_B
        MOVUPS XMM0, [EAX]          // Move unaligned vectors to SSE regs
        MOVUPS XMM1, [EBX]
        ADDPS XMM0, XMM1            // Add vector elements
        MOVUPS [Ret_Vector], XMM0   // Save the return vector
    }
    return Ret_Vector;
}
More complex example (matrix multiply) in Section 3.8, using AVX
From http://neilkemp.us/src/sse_tutorial/sse_tutorial.html#I
(42)
Characterizing Parallelism
                                   Data Streams
                                   Single    Multiple
    Instruction Streams  Single    SISD      SIMD
                         Multiple  MISD      MIMD
• SISD: today's serial computing cores (von Neumann model)
• SIMD: single instruction multiple data stream computing, e.g., SSE
• MIMD: today's multicore
• Characterization due to M. Flynn*
*M. Flynn (September 1972). "Some Computer Organizations and Their Effectiveness". IEEE Transactions on Computers, C–21 (9): 948–960.
(43)
Parallelism Categories
From http://en.wikipedia.org/wiki/Flynn%27s_taxonomy
(44)
Data Parallel vs. Traditional Vector
[Figure: Vector Architecture: vector registers A and B feed a pipelined functional unit that writes vector register C. Data Parallel Architecture: registers feed an array of processing elements drawn as squares; process each square in parallel (data parallel computation).]
(45)
ISA View
[Figure: CPU/Core (registers $0, $1, …, $31; ALU; Multiply/Divide unit with Hi and Lo) alongside the SIMD registers (XMM0, XMM1, …, XMM15) and a Vector ALU.]
• Separate core data path
• Can be viewed as a co-processor with a distinct
set of instructions
(46)
Domain Impact on the ISA: Example
Scientific Computing
• Floats
• Double precision
• Massive data
• Power constrained
Embedded Systems
• Integers
• Lower precision
• Streaming data
• Security support
• Energy constrained
(47)
Summary
• ISAs support operations required of application domains
  - Note the differences between embedded and supercomputers!
  - Signed, unsigned, FP, SIMD, etc.
• Bounded precision effects
  - Software must be careful about how the hardware is used, e.g., associativity
  - Need standards to promote portability
• Avoid “kitchen sink” designs
  - There is no free lunch
  - Impact on speed and energy → we will get to this later
(48)
Study Guide
• Perform 2’s complement addition and subtraction
(review)
• Add a few more instructions to the simple ALU
  - Add an XOR instruction
  - Add an instruction that returns the max of its inputs
  - Make sure all control signals are accounted for
• Convert real numbers to single precision floating
point (review) and extract the value from an
encoded single precision number (review)
• Execute the SPIM programs (class website) that
use floating point numbers. Study the
memory/register contents via single step
execution
(49)
Study Guide (cont.)
• Write a few simple SPIM programs for
  - Multiplication/division of signed and unsigned numbers
    o Use numbers that produce >32-bit results
    o Move to/from HI and LO registers (find the instructions for doing so)
  - Addition/subtraction of floating point numbers
• Try to write a simple SPIM program that demonstrates that floating point operations are not associative (this takes some thought and review of the range of floating point numbers)
• Look up additional SIMD instruction sets and compare
  - ARM NEON, AltiVec, AMD 3DNow!
(50)
Glossary
• Co-processor
• Data parallelism
• Data parallel computation vs. vector computation
• Instruction set extensions
• Overflow
• MIMD
• Precision
• SIMD
• Saturating arithmetic
• Signed arithmetic support
• Unsigned arithmetic support
• Vector processing
(51)
