ASIPs - ChipEx

SoC Subsystem Acceleration
using Application-Specific
Processors (ASIPs)
Markus Willems
Product Manager
Synopsys
SoC Design
• What to do when the performance
of your main processor is insufficient?
– Go multicore?
• Application mapping difficult,
resource utilisation unbalanced
– Add hardwired accelerators?
• Balanced but inflexible
ASIPs: application-specific processors
• Anything between general-purpose µP and hardwired data-path
• Deploys classic hardware tricks (parallelism and customized datapaths) while retaining programmability
– Hardware efficiency with software programmability
Agenda
• ASIPs as accelerators in SoCs
• How to design ASIPs
• Examples
• Conclusions
Architectural Optimization Space
ASIP architectural
optimization space
Parallelism
Specialization
Architectural Optimization Space
Parallelism
• Instruction-level parallelism (ILP)
  – Orthogonal instruction set (VLIW)
  – Encoded instruction set
• Data-level parallelism
  – Vector processing (SIMD)
• Task-level parallelism
  – Multicore
  – Multithreading
Architectural Optimization Space
Specialization
• App.-specific data types
  – Integer, fractional, floating-point, bits, complex, vector…
• App.-specific instructions
  – App.-spec. memory addressing: direct, indirect, post-modification, indexed, stack indirect…
  – App.-spec. data processing: any exotic operator; single or multicycle
  – App.-spec. control processing: jumps, subroutines, interrupts, HW do-loops, residual control, predication…; relative or absolute, address range, delay slots…
• Pipeline
• Connectivity & storage matching application’s dataflow
  – Distributed regs, sub-ranges
  – Multiple mem’s, sub-ranges
IP Designer:
ASIP Design and Programming
Synopsys - Full Spectrum Processor Technology Provider
32-bit ARC HS Processors
High-Performance for Embedded Applications
• Over 3100 DMIPS @ 1.6 GHz*
• HS Family products
  – HS34 CCM, HS36 CCM plus I&D cache
  – HS234, HS236 dual-core
  – HS434, HS436 quad-core
• Configurable so each instance can be optimized for performance and power
• 53 mW* of power; 0.12 mm² area in 28-nm process*
• Custom instructions enable integration of proprietary hardware
*Worst case 28-nm silicon and conditions
[Block diagram labels: ARCv2 ISA / DSP; 10-stage pipeline; ALU; Late ALU; Multiplier; Divider; MAC & SIMD; ARC Floating Point Unit; User Defined Extensions; Instruction Cache; Data Cache; Instruction CCM; Data CCM; Memory Protection Unit; Real-Time Trace; JTAG; legend: Optional]
Pedestrian Detection and HOG
• Pedestrian detection
• Standard feature in luxury vehicles
• Moving to mid-size and compact vehicles
in the next 5-10 years, also due to
legislation efforts
• Implementation requirements
• Low cost
• Low power (small form factor, and/or battery powered)
• Programmable (to allow for in-field SW upgrades)
• Most popular algorithm for pedestrian detection is
Histogram of Oriented Gradients (HOG)
Histogram Of Oriented Gradients
• Grey scale conversion
• Scale to multiple resolutions
• Gradient computation
• Histogram computation per block
• Normalization of the histograms
• SVM per window position
• Non-max suppression
Scale to Multiple Resolutions
Use a fixed 64x128-pixel detection window.
Apply this detection window to scaled frames.
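As a rough illustration of scanning a fixed 64x128 window over scaled frames, the sketch below counts window positions across an image pyramid. The 8-pixel window stride and the 1.2 scale factor between pyramid levels are assumptions chosen for illustration; the slides do not state either value.

```python
def window_positions(width, height, win_w=64, win_h=128, stride=8):
    """Count positions of the fixed detection window in one frame.

    stride=8 is an assumed step between neighbouring windows.
    """
    if width < win_w or height < win_h:
        return 0
    nx = (width - win_w) // stride + 1
    ny = (height - win_h) // stride + 1
    return nx * ny


def pyramid(width, height, scale=1.2, win_w=64, win_h=128):
    """Yield successively downscaled frame sizes that still fit the window.

    scale=1.2 is an assumed ratio between pyramid levels.
    """
    w, h = float(width), float(height)
    while int(w) >= win_w and int(h) >= win_h:
        yield int(w), int(h)
        w /= scale
        h /= scale


levels = list(pyramid(640, 480))
total = sum(window_positions(w, h) for w, h in levels)
print(len(levels), "pyramid levels,", total, "window positions per frame")
```

Because the window size stays fixed, larger pedestrians are found in the more strongly downscaled frames.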
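Gradient Computation
Apply Sobel operators:
$$\begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} +1 & 0 & -1 \\ +2 & 0 & -2 \\ +1 & 0 & -1 \end{bmatrix}$$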
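A minimal pure-Python sketch of the gradient step, using the two Sobel kernels from the slide. Deriving a magnitude and an unsigned orientation in [0, 180) degrees per pixel is an assumption (it is the usual HOG convention, but the slides do not spell it out), and the function names are hypothetical.

```python
import math

# Sobel kernels as given on the slide.
SOBEL_Y = [[+1, +2, +1], [0, 0, 0], [-1, -2, -1]]   # responds to vertical change
SOBEL_X = [[+1, 0, -1], [+2, 0, -2], [+1, 0, -1]]   # responds to horizontal change


def apply3x3(img, kernel, y, x):
    """Weighted sum of the 3x3 neighbourhood centred on (y, x)."""
    return sum(kernel[j][i] * img[y - 1 + j][x - 1 + i]
               for j in range(3) for i in range(3))


def gradients(img):
    """Return (magnitude, orientation) maps; border pixels are left at 0."""
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    ang = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = apply3x3(img, SOBEL_X, y, x)
            gy = apply3x3(img, SOBEL_Y, y, x)
            mag[y][x] = math.hypot(gx, gy)
            # Unsigned orientation in degrees, folded into [0, 180).
            ang[y][x] = math.degrees(math.atan2(gy, gx)) % 180.0
    return mag, ang


# A vertical edge: dark columns on the left, bright columns on the right.
edge = [[0, 0, 10, 10] for _ in range(4)]
mag, ang = gradients(edge)
```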
Histogram Computation
The image is divided into 8x8-pixel cells. For every block of 2x2 cells, apply Gaussian weights and compute 4 histograms of orientation of gradients.
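The block-level histogram step could look roughly like the sketch below: one 16x16 block (2x2 cells of 8x8 pixels) yields four orientation histograms, with each pixel's vote scaled by a Gaussian centred on the block. The 9 orientation bins, the Gaussian sigma of 8 pixels, and the absence of bin interpolation are assumptions made to keep the example short; the slides give none of these details.

```python
import math

CELL = 8          # cell size in pixels (from the slide)
BLOCK_CELLS = 2   # a block is 2x2 cells (from the slide)
BINS = 9          # assumed number of orientation bins over [0, 180)


def block_histograms(mag, ang, by, bx, sigma=8.0):
    """Four orientation histograms (one per cell) for the block at (by, bx).

    mag/ang are per-pixel gradient magnitude and orientation in degrees.
    sigma is an assumed width for the Gaussian weighting window.
    """
    size = CELL * BLOCK_CELLS
    centre = (size - 1) / 2.0
    hists = [[0.0] * BINS for _ in range(BLOCK_CELLS * BLOCK_CELLS)]
    for y in range(size):
        for x in range(size):
            d2 = (y - centre) ** 2 + (x - centre) ** 2
            weight = math.exp(-d2 / (2.0 * sigma * sigma))
            cell = (y // CELL) * BLOCK_CELLS + (x // CELL)
            b = int(ang[by + y][bx + x] / (180.0 / BINS)) % BINS
            hists[cell][b] += weight * mag[by + y][bx + x]
    return hists


# Uniform test input: every pixel votes into bin 0 with magnitude 1.
ones = [[1.0] * 16 for _ in range(16)]
zeros = [[0.0] * 16 for _ in range(16)]
hists = block_histograms(ones, zeros, 0, 0)
```

The inner loop is a scatter-add into a small table, which is the kind of operation the histogram ASIP later in the deck targets with special registers and instructions.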
Normalization of the Histograms
(1) L2 Normalization
(2) clipping (saturation)
(3) L2 Normalization
Support Vector Machine
Linear classification of histograms for every 64x128 window position.
Non-Max Suppression
Cluster multi-scale dense scan of detection windows and select unique
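The three normalization steps and the linear SVM decision can be sketched as follows. The clipping threshold of 0.2 and the small epsilon are assumptions (common choices for this normalize, clip, renormalize scheme), and the function names are hypothetical.

```python
import math


def l2_normalize(v, eps=1e-6):
    """Scale v to unit L2 norm; eps (assumed) guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in v) + eps * eps)
    return [x / norm for x in v]


def normalize_block(v, clip=0.2):
    """(1) L2 normalization, (2) clipping, (3) L2 normalization.

    clip=0.2 is an assumed saturation threshold.
    """
    v = l2_normalize(v)
    v = [min(x, clip) for x in v]
    return l2_normalize(v)


def svm_score(features, weights, bias):
    """Linear SVM decision value; a window counts as a detection if it is > 0."""
    return sum(f * w for f, w in zip(features, weights)) + bias


block = normalize_block([3.0, 4.0, 0.0])
score = svm_score([1.0, 2.0], [0.5, -1.0], 0.25)
```

Per window, the classifier is a single dot product over the concatenated block histograms, so its cost grows with the number of window positions rather than with any per-window branching.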
HOG Functional Validation on ARC HS
(640 x 480 pixels)
[Figure: task assignment #1 on the subsystem platform. Task labels: Grey scale conversion, Rescaling, Gradient, Normalization, Histogram, SVM, Non-max suppression. Platform labels: Dedicated Streaming Interconnect (FIFOs); D; ASIP1, ASIP2, …, ASIPn; AXI local interconnect; HS (Subs. ctrl) with DCCM; DMA, Sync & I/O]
• OpenCV float profiling results: 2.6 G cycles per frame
• Fixed point profiling results: 2.4 G cycles per frame
Profiling (640 x 480 pixels, at 30 FPS)
Task | ARC HS G cycles | % | # ARC HS equivalent
Grey scale conversion | 0.1 | 0.2% | 0.07
Scale to multiple resolutions | 1.6 | 2.3% | 1.0
Gradient computation | 17.3 | 26% | 10.8
Histogram computation per block | 31.9 | 47% | 20.0
Normalization of the histograms | 1.2 | 1.8% | 0.8
SVM per window position | 15.7 | 23% | 9.8
Non-max suppression | 0.004 | 0.01% | 0.002
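The "# ARC HS equivalent" column appears to follow from dividing each task's cycle count by the 1.6 GHz ARC HS clock quoted earlier. The sketch below redoes that arithmetic; reading the G-cycle figures as cycles per second of 30 FPS video is an interpretation, not something the slide states, and the results match the slide only to within rounding of the published inputs.

```python
HS_CLOCK_GHZ = 1.6  # ARC HS clock from the DMIPS figure earlier in the deck

# G cycles per task, copied from the profiling table.
profile_gcycles = {
    "Grey scale conversion": 0.1,
    "Scale to multiple resolutions": 1.6,
    "Gradient computation": 17.3,
    "Histogram computation per block": 31.9,
    "Normalization of the histograms": 1.2,
    "SVM per window position": 15.7,
    "Non-max suppression": 0.004,
}

total_gcycles = sum(profile_gcycles.values())

# Number of 1.6 GHz cores each task would need on its own.
hs_equivalent = {task: g / HS_CLOCK_GHZ for task, g in profile_gcycles.items()}

for task, g in profile_gcycles.items():
    share = 100.0 * g / total_gcycles
    print(f"{task:34s} {g:8.3f} G  {share:6.2f}%  {hs_equivalent[task]:7.3f} cores")

print(f"Total: {total_gcycles:.3f} G cycles, "
      f"{total_gcycles / HS_CLOCK_GHZ:.1f} ARC HS cores")
```

The total comes out at roughly 42 cores, in line with the "~40" HS cores listed for the HS-only configuration in the comparison.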
Task Assignment #2
[Figure: task assignment #2 on the subsystem platform. Task labels: Grey scale conversion, Rescaling, Gradient, Normalization, Histogram, SVM, Non-max suppression. Platform labels: Dedicated Streaming Interconnect (FIFOs); D; ASIP1, ASIP2, ASIP4; AXI local interconnect; HS (Subs. ctrl) with DCCM; L3 Ext. DRAM; DMA, Sync & I/O]
ASIP Example: HISTOGRAM
• Vector-slot next to existing scalar instructions (VLIW)
• 16x(8/16)-bit vector register files
• 16x8-bit SRAM interface
• 16x8-bit FIFO interfaces
• Vector arithmetic instructions
• Special registers and instructions to compute histograms
4x size increase & 200x speedup (relative to RISC template)
Implemented in less than 1 week
Task Assignment #3
[Figure: task assignment #3 on the subsystem platform. Task labels: Grey scale conversion, Rescaling, Gradient, Normalization, Histogram, SVM, Non-max suppression. Platform labels: Dedicated Streaming Interconnect (FIFOs); D; ASIP1, ASIP2, ASIP3, ASIP4; AXI local interconnect; HS (Subs. ctrl) with DCCM; L3 Ext. DRAM; DMA, Sync & I/O]
Task Assignment #4
[Figure: task assignment #4 on the subsystem platform. Task labels: Grey scale conversion, Rescaling, Gradient, Normalization, Histogram, SVM, Non-max suppression. Platform labels: Dedicated Streaming Interconnect (FIFOs); D; ASIP1’, ASIP2, ASIP3, ASIP4; AXI local interconnect; HS (Subs. ctrl) with DCCM; L3 Ext. DRAM; DMA, Sync & I/O]
Task Assignment #4
[Figure: task assignment #4’ on the subsystem platform. Task labels: Grey scale conversion, Rescaling, Gradient, Normalization, Histogram, SVM, Non-max suppression. Platform labels: Dedicated Streaming Interconnect (FIFOs); D; ASIP1’, ASIP2, ASIP3, ASIP4; AXI local interconnect; HS with DCCM; L2 SRAM; L3 Ext. DRAM; DMA, Sync & I/O]
Comparison
Platform configuration | #HS (MHz) | #ASIP (MHz) | ARC Functions | ASIP Functions
1: HS | ~40 | 0 | All | None
2: HS + ASIPs | 2 (1600) | 2.5 (500) | Greyscale, Rescaling, Normalization, Non-max suppr., Display | Gradient, Histogram, SVM
3: HS + ASIPs | 1 (1600) | 3.5 (500) | Greyscale, Rescaling, Non-max suppr., Display | Gradient, Histogram, Normalization, SVM
4: HS + ASIPs | 1 (500) | 4 (500) | Greyscale, Non-max suppr., Display | Rescaling, Gradient, Histogram, Normalization, SVM
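One crude way to read the comparison is to add up cores times clock frequency for each configuration. The sketch below does that; it assumes the HS-only row runs at 1600 MHz (the slide gives no clock for that row), and the aggregate is only a rough proxy, since it ignores differences in core size and power.

```python
# (label, number of HS cores, HS clock in MHz, number of ASIPs, ASIP clock in MHz)
# The 1600 MHz clock for configuration 1 is an assumption; the rest is from the table.
configurations = [
    ("1: HS only", 40, 1600, 0, 0),
    ("2: HS + ASIPs", 2, 1600, 2.5, 500),
    ("3: HS + ASIPs", 1, 1600, 3.5, 500),
    ("4: HS + ASIPs", 1, 500, 4, 500),
]


def aggregate_mhz(n_hs, f_hs, n_asip, f_asip):
    """Sum of (core count x clock) over both core types, in MHz."""
    return n_hs * f_hs + n_asip * f_asip


aggregate = {label: aggregate_mhz(*rest) for label, *rest in configurations}

baseline = aggregate["1: HS only"]
for label, mhz in aggregate.items():
    print(f"{label:15s} {mhz:8.0f} core-MHz  ({baseline / mhz:5.1f}x less than HS only)")
```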
Final Results
• 1 ARC HS, 4 ASIPs, AXI interconnect, private SRAM, L2 SRAM
• 30 frames/second at 500 MHz
• Functionally identical to OpenCV reference
• TSMC 28nm
• ASIP gate count: 330k gates
• ASIP power consumption: ~130mW
• Power/performance/area via ASIPs
• Scaling due to multi-core, specialization and SIMD usage
• Performance gains and power efficiency due to tailored
instruction sets and dedicated memory architecture
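The headline figures can be turned into per-frame budgets with a little arithmetic. This sketch uses only numbers from the slide (30 frames/second, 500 MHz, ~130 mW). The power figure covers the ASIPs only, so the energy value is not a whole-subsystem number.

```python
FRAMES_PER_SECOND = 30
CLOCK_HZ = 500e6          # clock frequency from the final results
ASIP_POWER_W = 0.130      # approximate ASIP power from the final results

# Cycles each core can spend on one frame at this clock and frame rate.
cycles_per_frame_per_core = CLOCK_HZ / FRAMES_PER_SECOND

# ASIP energy spent per frame, in millijoules.
asip_energy_per_frame_mj = ASIP_POWER_W / FRAMES_PER_SECOND * 1e3

print(f"{cycles_per_frame_per_core / 1e6:.1f} M cycles per frame per core")
print(f"{asip_energy_per_frame_mj:.2f} mJ of ASIP energy per frame")
```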
Scenario: Need for Flexible FEC Core
• Existing and emerging standards use advanced
FEC schemes like turbo coding, LDPC and Viterbi
• Instead of duplication of FEC cores, need for reconfigurable architecture at minimum power and
area
[Figure labels: DVB-X?; .11n; LDPC-A; LDPC-C; .11n Vit; UMTS Turbo-B; 3GPP-LTE; .16e turbo-A; LDPC-D; FlexFEC (turbo/LDPC/Vit)]
Architecture Refinement to Increase Throughput:
Increased ILP from 2 to 6
ILP: 2 FU (scalar+vector unit)
ILP: 6 FU (1 scalar+5 vector units)
No duplication for arithmetic
functionality
For exploiting ILP to increase
throughput
2 FUs for local memory access
Fast Area/Performance Trade-off
(40nm logical synthesis Processor only)
Area: 0.189 sqmm and 0.177 sqmm
[Figure: cycle count (axis 0 to 100) versus total number of processor functional units (2 to 6); series: ldpc - layer 6, ldpc - layer 8, turbo - beta, turbo - output]
Architectural Exploration
FU Utilization: 2 to 5
• Vector slot separated in different FUs without overlapping functionality
• Local memory access congestion
[Figure: two FU utilization charts (axis 0.0 to 100.0) over the kernels layer6, layer7, layer8, alpha, beta, output. First chart series: scalar, vector. Second chart series: scalar, vector alu, vector spec, vector vmem, vector bg vmem]
Architectural Exploration
More Balanced FU Utilization: 5 to 6
[Figure: FU utilization chart (axis 0.0 to 90.0) over the kernels ldpc - layer6, ldpc - layer7, ldpc - layer8, turbo - alpha, turbo - beta, turbo - output; series: scalar, vector alu, vector spec, vector vmem, vector vmem2, vector bg vmem]
Highly Efficient C-compilation
Vast Majority of 6 FU Used
Latest IP Available from IMEC
Blox-LDPC ASIP
Instances available
Conclusion
• ASIPs enable programmable
accelerators
• IP Designer enables efficient
design and programming of
ASIPs
• “Programmable datapath”
ASIPs offer performance,
area and power comparable
to hardwired accelerators
• ASIPs enable balanced
multicore SoC architectures
