A Pre-RTL, Power-Performance Accelerator Simulator
Enabling Large Design Space Exploration of Customized Architectures
Yakun Sophia Shao, Brandon Reagen,
Gu-Yeon Wei, David Brooks
Harvard University
Beyond Homogeneous Parallelism
[Figure: three design points, General-Purpose Cores (CPU), Programmable Accelerators (DSP, GPU), and Application-Specific Accelerators (ASIP, ASIC), compared along Energy Efficiency, Flexibility, Programmability, and Design Cost.]
Today’s SoC
[Figure: OMAP 4 SoC block diagram. Blocks: ARM Cores, Audio DSP, Video DSP, Face, Imaging, GPU, DMA, USB, SD. Interconnect: System Bus, two Secondary Buses, Tertiary Bus.]
Today’s SoC
[Figure: Apple A7. Other Blocks: 61%; CPU + L2$ + GPU: 39%.]
[Figure: Harvard VLSI-ARCH Group SoC Tapeout.]
Today’s SoC
[Figure: schematic SoC with two CPUs, GPU/DSP, Buses, Mem Interface, and nine accelerator (Acc) blocks.]
Future Accelerator-Centric Architectures
[Figure: Big Cores, Small Cores, GPU/DSP, Shared Resources, Memory Interface, and a Sea of Fine-Grained Accelerators.]
How to decompose an application to accelerators?
How to rapidly design lots of accelerators?
How to design and manage the shared resources?
[Labels: Flexibility, Design Cost, Programmability]
Aladdin: A pre-RTL, Power-Performance Accelerator Simulator
[Figure: Aladdin takes Unmodified C-Code and Accelerator Design Parameters (e.g., # FU, mem. BW) as inputs and reports Power/Area and Performance. It models an Accelerator-Specific Datapath with a Private L1/Scratchpad and connects to Shared Memory/Interconnect Models.]
“Accelerator Simulator”: Design Accelerator-Rich SoC Fabrics and Memory Systems
“Design Assistant”: Understand Algorithmic-HW Design Space before RTL
[Labels: Flexibility, Programmability, Design Cost]
Future Accelerator-Centric Architecture
[Figure: Big Cores, Small Cores, GPU/DSP, Shared Resources, Memory Interface, and a Sea of Fine-Grained Accelerators.]
Aladdin can rapidly evaluate a large design space of accelerator-centric architectures.
Aladdin Overview
[Figure: Aladdin flow.]
Optimization Phase: C Code -> Optimistic IR -> Initial DDDG -> Idealistic DDDG
Realization Phase: Idealistic DDDG -> Program Constrained DDDG -> Resource Constrained DDDG (using Acc Design Parameters)
Outputs: Performance and Activity; Activity -> Power/Area Models -> Power/Area
DDDG: Dynamic Data Dependence Graph
From C to Design Space
C Code:
for(i=0; i<N; ++i)
c[i] = a[i] + b[i];
IR Dynamic Trace:
0. r0 = 0 // i = 0
1. r4 = load(r0 + r1) // load a[i]
2. r5 = load(r0 + r2) // load b[i]
3. r6 = r4 + r5
4. store(r0 + r3, r6) // store c[i]
5. r0 = r0 + 1 // ++i
6. r4 = load(r0 + r1) // load a[i]
7. r5 = load(r0 + r2) // load b[i]
8. r6 = r4 + r5
9. store(r0 + r3, r6) // store c[i]
10. r0 = r0 + 1 // ++i
…
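The trace above can be turned into a dependence graph by linking each instruction to the most recent writer of every register it reads. The sketch below is a minimal illustration of that idea, not Aladdin's implementation. The names `TraceOp`, `build_dddg`, `critical_path`, and `vector_add_trace` are hypothetical, only register (true) dependences are tracked, and memory dependences are left out on the assumption that `a`, `b`, and `c` do not alias.

```c
#include <assert.h>

#define NUM_REGS  8
#define MAX_SRCS  3
#define MAX_TRACE 64
#define NO_REG    (-1)

/* One dynamic IR instruction: destination register and source registers.
 * NO_REG marks an unused slot (a store has no destination). */
typedef struct {
    int dst;
    int src[MAX_SRCS];
} TraceOp;

/* Trace entries 0..10 from the slide (two iterations of c[i] = a[i] + b[i]).
 * r1, r2, r3 hold the base addresses and are never written in the trace. */
const TraceOp vector_add_trace[11] = {
    { 0,      { NO_REG, NO_REG, NO_REG } },  /*  0. r0 = 0                */
    { 4,      { 0, 1, NO_REG } },            /*  1. r4 = load(r0 + r1)    */
    { 5,      { 0, 2, NO_REG } },            /*  2. r5 = load(r0 + r2)    */
    { 6,      { 4, 5, NO_REG } },            /*  3. r6 = r4 + r5          */
    { NO_REG, { 0, 3, 6 } },                 /*  4. store(r0 + r3, r6)    */
    { 0,      { 0, NO_REG, NO_REG } },       /*  5. r0 = r0 + 1           */
    { 4,      { 0, 1, NO_REG } },            /*  6. r4 = load(r0 + r1)    */
    { 5,      { 0, 2, NO_REG } },            /*  7. r5 = load(r0 + r2)    */
    { 6,      { 4, 5, NO_REG } },            /*  8. r6 = r4 + r5          */
    { NO_REG, { 0, 3, 6 } },                 /*  9. store(r0 + r3, r6)    */
    { 0,      { 0, NO_REG, NO_REG } },       /* 10. r0 = r0 + 1           */
};

/* For each trace entry, record which earlier entry produced each source
 * register (-1 if the value is live-in). These links are the graph edges. */
void build_dddg(const TraceOp *trace, int n, int parents[][MAX_SRCS])
{
    int last_writer[NUM_REGS];
    for (int r = 0; r < NUM_REGS; ++r)
        last_writer[r] = -1;

    for (int i = 0; i < n; ++i) {
        for (int s = 0; s < MAX_SRCS; ++s) {
            int reg = trace[i].src[s];
            assert(reg < NUM_REGS);
            parents[i][s] = (reg == NO_REG) ? -1 : last_writer[reg];
        }
        if (trace[i].dst != NO_REG) {
            assert(trace[i].dst < NUM_REGS);
            last_writer[trace[i].dst] = i;
        }
    }
}

/* Longest dependence chain, counted in nodes. Producers always precede
 * consumers in a trace, so one forward pass is enough. */
int critical_path(int parents[][MAX_SRCS], int n)
{
    int depth[MAX_TRACE];
    int longest = 0;
    assert(n <= MAX_TRACE);
    for (int i = 0; i < n; ++i) {
        depth[i] = 1;
        for (int s = 0; s < MAX_SRCS; ++s) {
            int p = parents[i][s];
            if (p != -1 && depth[p] + 1 > depth[i])
                depth[i] = depth[p] + 1;
        }
        if (depth[i] > longest)
            longest = depth[i];
    }
    return longest;
}
```

Calling `build_dddg(vector_add_trace, 11, parents)` links the add at entry 3 to the loads at entries 1 and 2, and links the second iteration's loads to the index update at entry 5, which is the serial chain through `r0` that the loop-level optimization later removes.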
From C to Design Space
Initial DDDG
[Figure: Initial DDDG built from the IR trace above, with nodes numbered as in the trace. Index nodes: 0. i=0, 5. i++, 10. i++. Loads: 1. ld a, 2. ld b, 6. ld a, 7. ld b, 11. ld a, 12. ld b. Adds: 3, 8, 13. Stores: 4. st c, 9. st c, 14. st c.]
From C to Design Space
Idealistic DDDG
[Figure: the Initial DDDG next to the Idealistic DDDG, both over trace nodes 0 to 14 (index nodes 0, 5, 10; loads 1, 2, 6, 7, 11, 12; adds 3, 8, 13; stores 4, 9, 14).]
From C to Design Space
Optimization Phase: C->IR->DDDG
• Include application-specific customization strategies.
• Node-Level:
– Bit-width Analysis
– Strength Reduction
– Tree-height Reduction
• Loop-Level:
– Remove dependences between loop index variables
• Memory Optimization:
– Memory-to-Register Conversion
– Store-Load Forwarding
– Store Buffer
• Extensible
– e.g. Model CAM accelerator by matching nodes in DDDG
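As an illustration of one node-level optimization listed above, tree-height reduction rebalances a serial chain of associative operations into a tree so that independent operations can execute in parallel. The sketch below is a hypothetical example, not code from Aladdin: `sum_chain` and `sum_tree` are illustrative names, and the reported depth counts dependent additions on the longest path.

```c
#include <assert.h>

#define MAX_TERMS 64

/* Serial reduction ((x0 + x1) + x2) + ...: every add waits on the
 * previous one, so the dependence chain is n - 1 adds long. */
int sum_chain(const int *x, int n, int *depth)
{
    assert(n >= 1);
    int acc = x[0];
    for (int i = 1; i < n; ++i)
        acc += x[i];
    *depth = n - 1;
    return acc;
}

/* Balanced reduction: values are added pairwise, level by level.
 * Adds within one level are independent of each other, so the
 * dependence chain is only as long as the number of levels. */
int sum_tree(const int *x, int n, int *depth)
{
    assert(n >= 1 && n <= MAX_TERMS);
    int buf[MAX_TERMS];
    for (int i = 0; i < n; ++i)
        buf[i] = x[i];

    int levels = 0;
    while (n > 1) {
        int half = n / 2;
        for (int i = 0; i < half; ++i)
            buf[i] = buf[2 * i] + buf[2 * i + 1];
        if (n % 2 == 1) {          /* odd element passes through unchanged */
            buf[half] = buf[n - 1];
            ++half;
        }
        n = half;
        ++levels;
    }
    *depth = levels;
    return buf[0];
}
```

For eight integer inputs both functions return the same sum, but the chain has a depth of 7 adds while the tree has a depth of 3. The rewrite relies on associativity, so with floating-point values the two orders can round differently.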
From C to Design Space
One Design
Acc Design Parameters:
• Memory BW <= 2
• 1 Adder
[Figure: the Idealistic DDDG (nodes 0 to 19) and its schedule under these parameters, drawn as Resource Activity (MEM, +) per Cycle. Scheduled nodes shown: 0. i=0; 1. ld a and 2. ld b (MEM MEM); 3. + (+); 4. st c (MEM); 5. i++ (+); 6. ld a and 7. ld b (MEM MEM); 8. + (+); 9. st c (MEM).]
From C to Design Space
Another Design
Acc Design Parameters:
• Memory BW <= 4
• 2 Adders
[Figure: the Idealistic DDDG (nodes 0 to 19) and its schedule under these parameters, drawn as Resource Activity (MEM, +) per Cycle. Two iterations proceed together: loads 1, 2, 6, 7 (MEM MEM MEM MEM); adds 3, 8 (+ +); stores 4, 9 (MEM MEM); then loads 11, 12, 16, 17 (MEM MEM MEM MEM); adds 13, 18 (+ +); stores 14, 19 (MEM MEM). Index updates 5, 10, 15 also use the adders (+).]
From C to Design Space
Realization Phase: DDDG->Estimates
• Constrain the DDDG with program and user-defined
resource constraints
• Program Constraints
– Control Dependence
– Memory Ambiguation
• Resource Constraints
– Loop-level Parallelism
– Loop Pipelining
– Memory Ports
– # of FUs (e.g., adders, multipliers)
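The effect of resource constraints on a dependence graph can be sketched with a simple cycle-by-cycle scheduler. The code below is a hedged illustration, not Aladdin's scheduler: `Node`, `vector_add_graph`, and `schedule` are hypothetical names, every operation is assumed to take one cycle, ready nodes are issued greedily in index order, and index updates are left out on the assumption that loop-index dependences have already been removed.

```c
#include <assert.h>

#define MAX_NODES 64

typedef enum { OP_MEM, OP_ADD } OpKind;

typedef struct {
    OpKind kind;
    int dep[2];   /* indices of producer nodes, -1 if unused */
} Node;

/* Build the per-iteration pattern of c[i] = a[i] + b[i]:
 * ld a, ld b, add (needs both loads), st c (needs the add). */
void vector_add_graph(Node *g, int iters)
{
    assert(4 * iters <= MAX_NODES);
    for (int k = 0; k < iters; ++k) {
        int b = 4 * k;
        g[b + 0] = (Node){ OP_MEM, { -1, -1 } };        /* ld a */
        g[b + 1] = (Node){ OP_MEM, { -1, -1 } };        /* ld b */
        g[b + 2] = (Node){ OP_ADD, { b, b + 1 } };      /* +    */
        g[b + 3] = (Node){ OP_MEM, { b + 2, -1 } };     /* st c */
    }
}

/* Each cycle, issue ready nodes in index order until the memory ports
 * or adders for that cycle are used up. A node is ready once all of its
 * producers were issued in an earlier cycle. Returns the cycle count
 * and fills cycle_of[i] with the issue cycle of node i. */
int schedule(const Node *g, int n, int mem_ports, int adders, int *cycle_of)
{
    assert(n <= MAX_NODES && mem_ports > 0 && adders > 0);
    for (int i = 0; i < n; ++i)
        cycle_of[i] = -1;

    int done = 0, cycle = 0;
    while (done < n) {
        int mem_used = 0, add_used = 0;
        for (int i = 0; i < n; ++i) {
            if (cycle_of[i] != -1)
                continue;
            int ready = 1;
            for (int d = 0; d < 2; ++d) {
                int p = g[i].dep[d];
                if (p != -1 && (cycle_of[p] == -1 || cycle_of[p] >= cycle))
                    ready = 0;
            }
            if (!ready)
                continue;
            if (g[i].kind == OP_MEM && mem_used < mem_ports) {
                ++mem_used;
                cycle_of[i] = cycle;
                ++done;
            } else if (g[i].kind == OP_ADD && add_used < adders) {
                ++add_used;
                cycle_of[i] = cycle;
                ++done;
            }
        }
        ++cycle;
    }
    return cycle;
}
```

Under these assumptions, four iterations of the vector add take 7 cycles with two memory ports and one adder, and 4 cycles with four memory ports and two adders. Changing only the two resource arguments yields a different design point from the same graph.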
From C to Design Space
Power-Performance per Design
[Figure: Power versus Cycle, with one point per design: (Memory BW <= 4, 2 Adders) and (Memory BW <= 2, 1 Adder).]
From C to Design Space
Design Space of an Algorithm
[Figure: Power versus Cycle across the design space of the algorithm.]
Aladdin Validation
[Figure: validation flow. Aladdin path: C Code -> Aladdin -> Power/Area and Performance. RTL path: C Code -> Verilog, either written by an RTL Designer or generated by Vivado HLS after HLS C Tuning; Verilog -> ModelSim (Activity, Performance) and Design Compiler (Power/Area).]
Aladdin enables rapid design space
exploration for accelerators.
Aladdin enables pre-RTL simulation of
accelerators with the rest of the SoC.
[Figure: SoC components paired with simulators. Big Cores: MARSx86; Small Cores: XIOSim; Shared Resources: Cacti/Orion2; GPU: GPGPU-Sim; Memory Interface: DRAMSim2; Sea of Fine-Grained Accelerators.]
Modeling Accelerators in a
SoC-like Environment
[Figure: system diagrams with Acc, Core, Cache, and Memory; plot of Power (mW) versus Time (Million Cycles) for block=16 and block=32, labeled With Memory Contention.]
Aladdin: A pre-RTL, Power-Performance Accelerator Simulator
• Architectures with 1000s of accelerators will be
radically different; new design tools are needed.
• Aladdin enables rapid design space exploration of
future accelerator-centric platforms.
• You can find Aladdin at
http://vlsiarch.eecs.harvard.edu/aladdin