Performance Modeling and Validation of
C66x DSP multilevel cache memory system
Rama Venkatasubramanian, Pete Hippleheuser, Oluleye Olorode,
Abhijeet Chachad, Dheera Balasubramanian, Naveen Bhoria, Jonathan Tran,
Hung Ong and David Thompson
Texas Instruments Inc, Dallas TX
Pre-silicon Performance Validation
• Improved IPC
– Better energy efficiency
• Processor memory systems are becoming increasingly complex
– Industry trend toward multicore rather than higher clock speeds; memory
systems are becoming difficult to validate.
• Cost of a bug fix: increases exponentially the longer the bug goes
undetected through the design flow.
• Performance validation goal: identify and fix all performance bugs
during the design development phase.
– Modeling and validation of a multi-level memory system is complex.
• Novelty of this work:
– A unique latency crediting scheme allows pre-silicon performance
validation with minimal increase in CPU simulation time.
– A reusable performance validation framework across the DV stack
(multiple levels of design verification).
C66x DSP Memory system architecture
• Two levels of on-die caches
– 32KB direct-mapped L1I cache
– 32KB 2-way set-associative writeback L1D cache
– 1MB 4-way private unified L2 cache
• L1/L2 configurable as SRAM, cache, or both
• Controllers operate at the CPU clock rate
– Minimizes CPU read latency
• DMA
– Slave DMA engine
– Internal DMA engine
• Stream-based prefetch engine
• Coherency: all-inclusive coherent memory system

[Block diagram: C66x DSP core (fetch, dispatch, execute stages; register files A and B; L/S/M/D functional units) connected to L1P SRAM/cache (32KB) and L1D SRAM/cache (32KB), backed by L2 SRAM/cache (1MB) with prefetch and DMA paths; interrupt controller, emulation/embedded debug, and power management blocks alongside.]
Performance bottlenecks
Typical architectural constraints in a processor memory system:
• Memory system pipeline stalls
– Stalls due to movement of data (controlled by the availability of buffer space)
– Stall conditions inserted to avoid a hazard scenario
• Arbitration points
– Memory access arbitrated between multiple requestors, or data arbitration on a
shared bus
• FIFOs
• Bank stalls and bank conflicts
– Bank stalls
– Bank conflicts: burst-mode SRAMs used to implement the memories
• Bandwidth management architecture
– Bandwidth requirement dictated by an application (real-time applications)
– A minimum bandwidth may have to be guaranteed
• Miscellaneous stalls
Performance Validation framework
• Implementation:
– Theoretical analysis based on the system microarchitecture
– Model framework developed in the "Specman-e" language
– Overlaid on top of the functional verification environment
– Complex, but scalable architecture
• Overall goal:
– Identify any performance bottlenecks in the memory system pipeline
– Measure worst-case latency for all transactions
– Ensure there are no blocking/hang scenarios at arbitration points
– Reusable framework across the DV stack
Performance Validation framework (contd.)
• Model probes into the design
– All controllers, internal interfaces, etc.
• Measures the number of cycles for each transaction
– Initiated by the CPU, DMA, or cache operations
• Stalls at arbitration points, bandwidth management, etc. are tracked
• A novel latency-credit-based transaction modeling system was developed
to determine the true latency incurred by a transfer in the system
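As a rough illustration of the tracking described above, here is a minimal sketch in Python (the actual model is written in Specman-e; all class, method, and transaction names here are invented for illustration) of per-transaction cycle counting with stall crediting:

```python
# Illustrative sketch only: records start/end cycles per transfer and
# accumulates stall "credits" (arbitration losses, buffer-full stalls,
# bank conflicts) so the true latency can be separated from stall time.

class TransactionMonitor:
    def __init__(self):
        self.active = {}     # txn_id -> start cycle
        self.stalls = {}     # txn_id -> credited stall cycles
        self.latencies = {}  # txn_id -> raw flight time in cycles

    def start(self, txn_id, cycle):
        self.active[txn_id] = cycle
        self.stalls[txn_id] = 0

    def stall(self, txn_id, cycles=1):
        # Credit stall cycles observed at a probe point.
        self.stalls[txn_id] += cycles

    def finish(self, txn_id, cycle):
        self.latencies[txn_id] = cycle - self.active.pop(txn_id)

    def adjusted_latency(self, txn_id):
        # True latency = raw flight time minus credited stall cycles.
        return self.latencies[txn_id] - self.stalls[txn_id]

mon = TransactionMonitor()
mon.start("cpu_rd_0", cycle=100)
mon.stall("cpu_rd_0", cycles=3)    # e.g. lost arbitration for 3 cycles
mon.finish("cpu_rd_0", cycle=112)
print(mon.adjusted_latency("cpu_rd_0"))  # 12 raw cycles - 3 credits = 9
```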
Example 1 – Single traffic stream
• CPU load from L2 SRAM.
• Miss in the L1I cache; the request reaches the unified L2 controller.
– Ex: the transaction goes through A3, P0, and P1, reads the data
from L2 SRAM, and the data is returned to the program memory
controller.
• Flight time for the entire transfer is calculated inside the memory
system.

• Pipeline stages – rectangles
• Arbitration points – circles
• FIFO shown for illustration purposes
Latency Crediting methodology
• The model tracks the transfer through the system, and the buffer space
availability in the pipeline stages and arbitration points.
• Assume:
– Total flight time for the transfer within the L2 controller = tL2lat
– Pipeline stall cycles = tI0, tI1, tP0, tP1, etc.
– Arb stall cycles = tA0, tA3, etc.
– Unused arb stall cycles: tA1 = tA2 = 0 (arbitration paths not taken)
– Adjusted latency tAdjLat inside the L2 controller for this transfer:
tAdjLat = tL2lat − (tI0 + tI1 + tP0 + tP1 + …) − (tA0 + tA3 + …)
• Ideally, the adjusted latency for the transfer should equal the pipeline depth
inside the controller.
– Measuring latency to that level of accuracy would require a cycle-accurate
performance validation model → impractical.
• Hence the adjusted latency for each transfer was measured and ensured
to be within an acceptable latency defined by the architecture.
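The crediting arithmetic can be sketched as follows (Python, with made-up cycle values; the pipeline depth and slack constants are hypothetical placeholders for the architecture-defined bound):

```python
# Sketch of the latency-crediting check. Values are illustrative, not
# taken from the C66x design.

def adjusted_latency(t_total, pipeline_stalls, arb_stalls):
    # tAdjLat = tL2lat - sum(pipeline stall credits) - sum(arb stall credits)
    return t_total - sum(pipeline_stalls) - sum(arb_stalls)

t_l2_lat = 14            # total flight time inside the L2 controller
pipe = [2, 0, 1, 0]      # tI0, tI1, tP0, tP1
arb = [3, 2]             # stall credits on the arbitration paths taken

t_adj = adjusted_latency(t_l2_lat, pipe, arb)

PIPELINE_DEPTH = 6       # hypothetical depth of the L2 pipeline
ACCEPTABLE_SLACK = 2     # hypothetical architecture-defined tolerance
assert t_adj <= PIPELINE_DEPTH + ACCEPTABLE_SLACK
print(t_adj)             # 14 - 3 - 5 = 6
```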
Example 2: Multiple concurrent traffic streams
• Three concurrent streams:
– CPU program read from L2 SRAM
– CPU data read from the MDMA path (through the FIFO)
– A coherence transaction – say, a writeback-invalidate operation that
arbitrates for the L2 cache, checks for hit or miss, writes back the
data (through the MDMA path), and invalidates the cache entry
• The model has to be aware of:
– Interactions of the pipeline stages
– How to apply credits accordingly
• Millions of functional tests in the regression suite
• Every conceivable traffic type is inferred by the model and tracked
Performance bugs identification
• The data collected is plotted per memory controller interface or
transaction type.
• If a latency value is above a certain threshold (a checker value), it is either a
design bug or incorrect modeling in the performance validation environment,
which is fixed and re-analyzed.
– Checkers are modeled based on theoretical analysis; outliers are analyzed.
• Over time, the resulting plot shows the minimum and maximum number of
cycles spent by any given transfer type in that particular memory controller
for the various stimuli provided by the testbench.
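A sketch of this checker-and-outlier flow (Python; the transaction-type names and checker limits below are hypothetical, standing in for the theoretically derived ones):

```python
# Illustrative sketch: collect per-type min/max latency and flag any
# sample above its checker value for design/model review.

from collections import defaultdict

CHECKER_LIMIT = {          # hypothetical max acceptable cycles per type,
    "cpu_l2_read": 20,     # derived from theoretical pipeline analysis
    "dma_l2_write": 28,
}

def analyze(samples):
    """samples: iterable of (txn_type, latency_cycles).
    Returns per-type [min, max] latencies and the outliers to review."""
    stats = defaultdict(lambda: [float("inf"), 0])
    outliers = []
    for txn_type, lat in samples:
        lo, hi = stats[txn_type]
        stats[txn_type] = [min(lo, lat), max(hi, lat)]
        if lat > CHECKER_LIMIT[txn_type]:
            outliers.append((txn_type, lat))  # design bug or model bug
    return dict(stats), outliers

stats, outliers = analyze([
    ("cpu_l2_read", 12), ("cpu_l2_read", 19),
    ("cpu_l2_read", 35),          # above the checker value -> flagged
    ("dma_l2_write", 25),
])
print(stats["cpu_l2_read"], outliers)
```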
Bandwidth analysis and validation
• The C66x DSP supports various bandwidth management configurations.
• Theoretical expectations for the various bandwidth settings, when
multiple requestors are arbitrating for a resource, are calculated:
– Example: CPU and DMA traffic arbitrating for the L2 SRAM resource.
– The various configurations and resulting throughput are tabulated below.

[Diagram: CPU and DMA requestors arbitrated (bandwidth management) for
access to L2 SRAM; the arbiter's bandwidth configuration is programmable.]
CPU priority > DMA priority. A bank conflict occurs if the same bank is
accessed within 4 cycles.
Legend: CPU transfers marked 1, 2, 3; DMA transfers marked A, B.

Example cycle-by-cycle schedules (first cycles shown):
BW config 1: 1 A 2 B 3 A 1 B 2 A 3 B …  (CPU and DMA strictly alternate)
BW config 2: 1 2 A 3 1 B 2 3 A 1 2 B …  (two CPU transfers per DMA transfer)
BW configs 3–6: progressively longer CPU bursts (1 2 3 …) between DMA slots

Expected throughput:
BW config | CPU | DMA
    1     | 50% | 50%
    2     | 66% | 33%
    3     | 60% | 20%
    4     | 66% | 11%
    5     | 70% |  5%
    6     | 72% |  3%
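The throughput expectations can be reproduced mechanically from a cycle schedule. The sketch below (Python; the schedule string is an idealized approximation of BW config 1, not an exact RTL trace) shows the arithmetic:

```python
# Sketch: derive expected CPU/DMA throughput from a cycle-by-cycle
# arbitration schedule. '1'..'3' are CPU transfers, 'A'/'B' are DMA
# transfers; any other character would be a stall cycle.

def throughput(schedule):
    cpu = sum(1 for c in schedule if c in "123")
    dma = sum(1 for c in schedule if c in "AB")
    n = len(schedule)
    return round(100 * cpu / n), round(100 * dma / n)

cfg1 = "1A2B3A1B2A3B"       # CPU and DMA alternate every cycle
print(throughput(cfg1))     # (50, 50), matching BW config 1's 50%/50%
```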
Bandwidth validation (contd.)
• Efficiency of bandwidth allocation
– To improve energy efficiency in the system
– Targeted stress tests were written to exercise full-bandwidth scenarios on
all the interfaces. With bandwidth arbitration enabled, the total bandwidth
utilized is plotted per requestor.
– Ex: the bandwidth that each of the requestors – CPU, DMA, and the
coherence engine – gets when they access the same resource (L2 SRAM)
concurrently:
• Total available L2 SRAM bandwidth is 32 bytes/cycle.
• But when all three requestors are accessing L2 SRAM, the L2 controller provides
a maximum of only 24 bytes/cycle, which may or may not be the architectural intent.
• Scenarios like this are highlighted to the design team for review.
• The architecture is revised during the design phase accordingly.
Validation of Cache coherency operations
• The C66x DSP core memory system supports block and global cache operations.
– Global cache operations
– Block cache operations
• The DSP core supports a snoop interface between the data memory controller and
the L2 controller to support an all-inclusive coherent memory system.
– For snoop operations, the latency of each snoop transaction is tracked and reviewed
against architectural expectations.
• The latency of each cache coherency operation is a function of the cache size,
the number of clean/dirty/valid lines in the cache, the block word count,
and the number of empty lines in the cache.
• For different step sizes of cache size and block size, the total number of cycles
taken for each operation is determined and a formula is derived.
– Used by the performance validation model with random stimuli, such that whenever a cache
operation is initiated, the latency and credits are checked against their respective formulae.
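A sketch of what such a formula check might look like (Python; the formula, its coefficients, and the tolerance below are purely hypothetical, standing in for the empirically derived ones):

```python
# Illustrative sketch: check a block writeback-invalidate operation's
# measured latency against a derived cost formula. All coefficients
# are made up for illustration.

def expected_wbinv_cycles(dirty_lines, valid_lines, block_words,
                          cycles_per_writeback=4, cycles_per_invalidate=1,
                          fixed_overhead=10):
    """Hypothetical model: dirty lines cost a writeback, valid lines an
    invalidate, plus per-word block traversal and fixed overhead."""
    return (fixed_overhead
            + dirty_lines * cycles_per_writeback
            + valid_lines * cycles_per_invalidate
            + block_words)

measured = 74
expected = expected_wbinv_cycles(dirty_lines=8, valid_lines=16, block_words=16)
assert measured <= expected + 4   # tolerance defined by the architecture
print(expected)                   # 10 + 32 + 16 + 16 = 74
```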
Conclusion
• Post-silicon performance validation to identify performance issues comes very late in the
design development cycle and can prove very costly.
– There is an ever-increasing need to detect performance issues early, so that the
cost of the design fixes needed is minimized.
• The "Specman-e" model overlays on top of the functional simulation framework.
– Collates traffic information for every transfer in the system.
– Computes the total latency incrementally, and calculates the expected latency either
from the theoretical equations or from default values based on pipeline depth.
• Numerous performance bugs were identified and fixed during the design development phase.
• The performance validation model probes into the design and is reused across all
levels of the DV stack with minimal simulation time overhead.
– The framework can guarantee that performance is validated across the entire
design/system rather than just within the unit-level functional verification environment.
• Furthermore, between different revisions of the processor cores, if a feature added
at a later stage or a functional bug fix introduces a performance bug, the
performance model checker will fail, thus catching any performance issue
created by design changes.
Q&A
Thank you
[email protected]