### Quiz-2 Topics

```ECE-3056-B
Quiz-2 Topic Areas
John Copeland
March 28, 2014
05 Single-Cycle Datapath
Asynchronous Logic –combinatorial logic
Synchronous Logic – latches, registers
Data Path – actual determined by control signals and mux'es
Instruction Decoding – determines values of control signals
3 Types of Instruction Formats – how used to simplify decoding
Opcode( bits 31-26) "funct" (bits (5-0) determines ALU operation
PC Register – 3 possible inputs: pc+4, branch, jump
Given a diagram like slide 05-28, or VHDL like 05-32; know how to figure
which control lines are set (true) for: add, lw, sw, j, jr, beq See 05-33, 34,
35, or 37
Energy use depends on average number of gates that switch per cycle
Which uses more: Register Set or ALU (during add).
Given VHDL code, or combinatorial logic, for simple ALU, be able to
modify to add another operation (e.g., shift, subtract, bit-wise logic, …)
2
Datapath With Jumps Added
Slide 05-37
3
06 Multi-Cycle Datapath
Divide Datapath into five phases – Fetch Instr., Decode Instr., Execute
(calculate), Memory R/W, Write Back - (IF, ID, EX, MEM, WB)
Given a diagram like slide 06-8, 11, 17, 18, 20, 22, 23,24; know how to figure
which control lines (ALUOp, ALUSrcA, MemWrite, …) are set (true) for: add,
lw, sw, j, jr, beq, …
Interpret a state diagram like 06-10 or 06-30 – how many clock cycles for br,
For given MIPS assembler code: how many cycles to execute.
State diagram for decoding instruction - ways to implement:
ROM Implementation – very inefficient
PLA " – Inputs: 5 Opcode bits and 4 state bits (“funct” bits decoded apart)
Outputs: 1 for each control line (control signal)
Vertical Lines: Minimized by choosing similar bits for closely related
functions (06-32).
CISC (complex instruction set computer) – uses microcode, more ticks/instr
Unexpected change in flow of control: Exception (error) vs. Interrupt (I/O)
Two registers just for Exception Handling: cause and status
4
Pipelining Example
add \$14, \$5, \$6
lw \$13, 24(\$1)
add \$12, \$3, \$4
0
M
u
x
1
sub \$11, \$2, \$3
lw \$10, 20(\$1)
Note what is happening in the register file
IF/ID
ID/EX
EX/MEM
MEM/WB
4
PC
Instruction
memory
Instruction
Shift
left 2
register 1
data 1
register 2
Write
data 2
register
Write
data
0
M
u
x
1
Zero
ALU ALU
result
Data
memory
Write
data
16
Instruction
[20– 16]
Instruction
[15– 11]
Pipeline stage
execution time
Sign
extend
data
1
M
u
x
0
32
0
M
u
x
1
RegDst
Slide 08a-21
5
Pipelined Control
Instruction
R-format
lw
sw
beq
Calculation stage control
lines
Reg
ALU
ALU
ALU
Dst
Op1
Op0
Src
1
1
0
0
0
0
0
1
X
0
0
1
X
0
1
0
Memory access stage
control lines
Mem
Mem
0
0
0
0
1
0
0
0
1
1
0
0
Write-back
stage control
lines
Reg
Mem
write to Reg
1
0
1
1
0
X
0
X
• Control signals derived from
instruction
– As in single-cycle
implementation
• Pass control signals along like
data
Slide 08a-23
07 Performance
Determined by: Algorithm & Language, Compiler, (hardware) Architecture
Different Metrics, and different numbers for different "bench mark" programs
SPEC2005 – performance doing a standard set of bench mark programs
"Throughput" – work done per unit time
"Response Time" (latency) – time to do a task (e.g., display a Web page)
Trading throughput versus latency (for multitasking or parallel processing)
Instructions per second = clock rate / average cycles per instruction
Programs with heavy I/O or memory R/W's may depend less on CPU speed.
Amdahl's Law – know definition, and do calculations like 07-18.
Corollary – "make the common case fast"
Increased clock rate  higher temperature and lower energy efficiency,
7
08a Pipelined Datapath
Use clocked registers to divide Datapath's five stages – Fetch Instr.,
Decode Instr., Execute, Memory R/W, Write Back - (IF, ID, EX, MEM, WB)
Use more clocked registers to delay control signals as instruction moves
through stages (08a-23).
Hazards: Structural, Control, Data
Structural – need to separate Instruction and Data memory (or caches)
Control – guess "to branch" or not – stall when wrong – how?
Can branch be determined 2 or 3 instructions ahead - if so ?
How many real br instructions are there? only bre (branch if equal)
Jump/Br must have next instruction started before (kill if wrong)
Data – add logic devices to "forward" data needed by next instruction.
Forwarding unit – controls when to forward – from EXE or MEM stage
Problem with" Load Word" and Use –requires a" nop" insertion unless
order of following instructions can be swapped ("code scheduling").
MIPS can predict branch always taken (wrong when loop ends)
More complicated CPU's have "dynamic" branch prediction.
MIPS – Exception causes Jump to "Exception Handler” procedure in OS. 8
08b Pipelined – Thread Level Parallelism
Flynn's Taxonomy - Single/Multiple Instructions Single/Multiple Data
Multithreading – MIMD – parallel threads or LWPs (light-weight processes)
Process – running program with state (register values, include PC, pipeline r's).
Scheduled by OS –swapped out, restarted later.
Threads have own state, stack memory (local data), but share static data
Hardware assisted switching between threads – separate hardware registers
instructions may be interleaved in pipeline ("Fine-Grain Multithreading")
"Course-Grained Multithreading" – swap on long stall (like cache miss)
Synchronization between threads – requires hardware "atomic commands"
"Test and Set" without switching threads between the Test and the Set.
When altering common data, thread must tell OS "start/end Critical Path"
Simultaneous Multithreading on multiple cores (data paths).
Amdahl's Law – increased multithreading reaches "point of diminishing
return”
Big time Parallelism – requires special many-core computers, high-level
languages, shared and local memory. "Grid" has cores on 2-d grid of buses.
Communication (assigning tasks, collecting results) – by passing messages.
9
09a Memory
Different levels – large, slow, and low cost  small, fast, and high cost.
Hard disk (Terabyte)
\$ 0.25 per Gigabyte
millisecond
DRAM (Gigabyte)
\$ 20.00 per Gigabyte
several nanoseconds
Static Ram (Megabytes) \$ 2000 per Gigabyte
fraction of nanosecond
Registers
On CPU – limited area and power
Use multiple levels to get low average cost, high-speed for data being used.
Multicore computers need a fast cache memory for each core
Data movement takes more energy than computation
What to put in cache? Temporal (recent) and Spatial (nearby)
How to locate data.
Direct Cache: address parts: [ tag ][ cache address or index][ byte location ]
Size of cache: = 2^(no. of index bits) x 2^((no. of byte location bits)
= number of blocks x block size
Each block also has valid bit, dirty bit, tag.
To read or write data, look at indexed block, match tag, select byte location.
If not found (tag did not match), put contents in write queue if "dirty" before
getting new data from memory (all byte-addresses that start with tag + index)
"Write Through" immediately update memory, "Write Back" just write cache.
10
Previous State of Cache
Index
V
000
N
001
N
010
Y
011
N
100
N
101
N
110
Y
111
N
Tag
Data
11
Mem[11010]
09a-20
10
Mem[10110]
Then This Happens
Hit/miss
Cache block
16
10 000
?
000
3
00 011
?
011
16
10 000
?
000
What is new State of Cache?
09a-21