RT Bridge - University of Illinois at Urbana

Predictable Integration
of Safety-Critical Software
on COTS-based Embedded Systems
Marco Caccamo
University of Illinois
at Urbana-Champaign
Outline
• Motivation
• PRedictable Execution Model (PREM)
– Peripheral scheduler & real-time bridge
– Memory-centric scheduling
• MemGuard
– Memory bandwidth Isolation
• Colored Lockdown
– Cache space management
Predictable Integration of Safety-Critical Software on COTS-based Embedded Systems
Real-Time Applications
• Resource intensive real-time applications
– Multimedia processing(*), real-time data analytics(**), object tracking
• Requirements
– Need more performance at lower cost → Commercial Off-The-Shelf (COTS)
– Performance guarantee (i.e., temporal predictability and isolation)
(*) ARM, QoS for High-Performance and Power-Efficient HD Multimedia, 2010
(**) Intel, The Growing Importance of Big Data and Real-Time Analytics, 2012
Modern System-on-Chip (SoC)
• More cores
– Freescale P4080 has 8 cores
• More sharing
– Shared memory hierarchy (LLC, MC, DRAM)
– Shared I/O channels
• More performance, less energy, less cost
• But, isolation?
SoC: challenges for RT safety-critical systems
• In a multicore chip, memory controllers, last level cache, memory, on-chip network and I/O channels are globally shared by cores. Unless a globally shared resource is overprovisioned, it must be partitioned/reserved/scheduled. Otherwise:
– Complexity, cost and schedule: the schedulability analysis, testing and temporal certification of an IMA partition in a core will also depend on tasks running in other cores.
– Safety concerns: a change of software in one core could cause tasks in other cores’ IMA partitions to miss their deadlines. This is unacceptable!
Problem: Shared Memory Hierarchy
[Figure: App 1 to App 4 run on Core1 to Core4, above the Shared Last Level Cache (LLC), the Memory Controller (MC) and DRAM. Annotations: "Space sharing", "Access contention".]
• Shared hardware resources
• OS has little control
Problem: Task-Peripheral conflict (1 core)
• Task-peripheral conflict:
– Master peripheral working for Task B.
– Task A suffers cache miss.
– Processor activity can be stalled due to interference at the FSB level.
• How relevant is the problem?
– Up to 49% increased wcet for memory intensive tasks.
– Contention for access to main memory can greatly increase a task worst-case computation time!
• This effect MUST be considered in wcet computation!!
[Figure: CPU running Task A and Task B on the Front Side Bus; Host PCI Bridge; DDRAM; master and slave peripherals on the PCI Bus.]
Sebastian Schonberg, Impact of PCI-Bus Load on Applications in a PC Architecture, RTSS 03
Experiment: Task and Peripherals
• Experiment on Intel platform, typical embedded system speed.
• PCI-X 133 MHz, 64 bit, fully loaded by traffic generator peripheral.
• Task suffers continuous cache misses.
• Up to 44% wcet increase.
Experiment: 2 Cores Interference
• Task A suffers max number of cache misses (92% stall time).
• Task B has variable cache stall time.
• Adding PCI-E peripheral interference -> 196% WCET increase!
• WCET increase is proportional to cache stall time.
• Max WCET increase ~= cache stall time of task A.
• Multicore interference is a serious problem!!!
Problem: Bus Contention
• Two DMA peripherals transmitting at full speed on PCI-X bus.
• Round-robin arbitration does not allow timing guarantees.
Transaction Length    Bandwidth (256B)
No interference       596MB/s (100%)
128 bytes             441MB/s (74%)
256 bytes             346MB/s (58%)
512 bytes             241MB/s (40%)
[Figure: transfer timelines of the two peripherals (time axis 0 to 16, CPU and RAM shown) for three cases: NO BUS SHARING; BUS CONTENTION, 50% / 50%; BUS CONTENTION, 33% / 66%.]
• Integration nightmare!!!
Cache Delay Analysis (contention-based access)
• Compute worst case increase on task computation time due to peripheral interference (single core system).
• Main idea: treat the memory subsystem as a switch that multiplexes accesses between the CPU and peripherals.
• The same analysis was later extended to multicore platforms.
[Figure: task cache fetches and peripheral bandwidth over time t; the task's wcet (no interference) and the resulting wcet increase.]
R. Pellizzoni and M. Caccamo, "Impact of Peripheral-Processor Interference on WCET Analysis of Real-Time Embedded Systems", IEEE Transactions on Computers (TC), Vol. 59, No. 3, March 2010.
Modeling I/O traffic: Peripheral Arrival Curve
• Key idea: the maximum task delay depends on the amount of peripheral traffic (single core).
• Peripheral arrival curve (a function of t, indexed by i): maximum amount of time required by all peripherals to access main memory.
• Can be obtained using…
– Measurement
– Distributed traffic analysis
– Enforced through engineering solution (more on that later…)
The Need for Engineering Solutions
• Analysis bounds are tight but depend on very peculiar arrival patterns.
• Average case significantly lower than worst case.
– Main issue: COTS arbiters are not designed for predictability.
• We propose engineering solutions to:
1. schedule memory accesses at high level (coarse granularity) → memory-centric real-time scheduling,
2. control cores’ memory bandwidth usage,
3. manage cache space in a predictable manner.
Peripheral Scheduling
• Solution: enforce peripheral schedule (single resource scheduling).
• No need to know low-level parameters!
• COTS peripherals do not provide block functionality, so how do we do this?
[Figure: "IMPLICIT SCHEDULE ENFORCEMENT" timeline (0 to 16) with BLOCK intervals marked on the two peripheral transfer timelines; CPU and RAM shown.]
Real-Time I/O Management System
• Real-Time Bridge interposed between peripheral and bus.
• RT-Bridge buffers incoming/outgoing data and delivers it predictably.
• Peripheral Scheduler enforces traffic isolation.
[Figure: CPU, RAM, North Bridge and South Bridge; PCIe, ATA and PCI-X links; four RT Bridge blocks; Peripheral Scheduler.]
E. Betti, S. Bak, R. Pellizzoni, M. Caccamo and L. Sha, "Real-Time I/O Management System with COTS Peripherals", IEEE Transactions on Computers (TC), Vol. 62, No. 1, pp. 45-58, January 2013.
Peripheral Scheduler
• Peripheral Scheduler receives data_rdy_i information from Real-Time Bridges and outputs block_i signals.
• Server provides isolation by enforcing a timing reservation.
• Fixed priority, cyclic executive etc. can be implemented in HW with very little area.
Scheduler (FP) logic:
EXEC_1 = READY_1
EXEC_2 = READY_2 and not EXEC_1
...
EXEC_i = READY_i and not EXEC_1 … and not EXEC_(i-1)
[Figure: Server_1 … Server_i receive data_rdy_1 … data_rdy_i and drive block_1 … block_i; the FP scheduler connects the READY_i and EXEC_i signals.]
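The fixed-priority equations above can be sketched in software as follows. This is only an illustrative model, not the VHDL from the prototype: the function names are hypothetical, and the mapping from EXEC_i to block_i (block a peripheral whenever its server is not executing) is an assumption made for the example.

```python
def fixed_priority_exec(ready):
    """Return the EXEC signals for a list of READY signals.

    Index 0 is the highest priority server:
    EXEC_1 = READY_1, EXEC_i = READY_i and not EXEC_1 ... and not EXEC_(i-1).
    """
    exec_signals = []
    higher_executing = False
    for is_ready in ready:
        executing = is_ready and not higher_executing
        exec_signals.append(executing)
        higher_executing = higher_executing or executing
    return exec_signals


def block_signals(ready):
    """Assumed mapping: a peripheral is blocked unless its server executes."""
    return [not executing for executing in fixed_priority_exec(ready)]


if __name__ == "__main__":
    ready = [False, True, True]
    print(fixed_priority_exec(ready))
    print(block_signals(ready))
```

At most one EXEC signal is high at any time, which is what makes the shared bus behave as a single scheduled resource.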
Real-Time Bridge
• FPGA System-on-Chip design with CPU, external memory, and custom
DMA Engine.
• Connected to main system and peripheral through available PCI/PCIe
bridge modules.
[Figure: Real-Time Bridge block diagram. Labeled blocks: FPGA CPU, Interrupt Controller (IntMain, IntFPGA), Memory Controller, Local RAM, DMA Engine, PLB, PCI bridges, Controlled Peripheral, Host CPU, Main Memory; signals data_rdy and block.]
Real-Time Bridge
• The controlled peripheral reads from and writes to Local RAM instead of Main Memory (completely transparent to the peripheral).
• DMA Engine transfers data between Main Memory and Local RAM.
Peripheral Virtualization
• RT-Bridge supports peripheral virtualization.
• Single peripheral (e.g., Network Interface Card) can service different software partitions.
• HW virtualization enforces strict timing isolation.
Implemented Prototype
• Xilinx TEMAC 1 Gb/s Ethernet card (integrated on FPGA).
• Optimized virtual driver implementation with no software packet copy (PowerPC running Linux).
• Full VHDL HW code and SW implementation available.
Evaluation
• 3 x Real-Time Bridges, 1 x Traffic Generator with synthetic traffic.
• Rate Monotonic with Sporadic Servers.
Peripheral    Transfer Time    Budget    Period
RT Bridge     7.5ms            9ms       72ms
Generator     4.4ms            5ms       8ms
• Utilization 1, harmonic periods.
• Scheduling flows without peripheral scheduler (block always low) leads to deadline misses!
[Figure: transfer traces for the Generator and the three RT-Bridges.]
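As a quick check of the "Utilization 1, harmonic periods" remark, the server budgets and periods from the table can be summed exactly. The sketch below assumes that each of the three Real-Time Bridges uses the same 9ms/72ms reservation; the helper names are illustrative only.

```python
from fractions import Fraction

# (name, budget in ms, period in ms), taken from the table above.
# Assumption: all three RT Bridges share the same reservation.
reservations = [
    ("RT Bridge 1", 9, 72),
    ("RT Bridge 2", 9, 72),
    ("RT Bridge 3", 9, 72),
    ("Generator", 5, 8),
]


def total_utilization(servers):
    """Sum of budget/period over all servers, as an exact fraction."""
    return sum(Fraction(budget, period) for _, budget, period in servers)


def periods_are_harmonic(servers):
    """True if every period divides every longer period."""
    periods = sorted({period for _, _, period in servers})
    return all(longer % shorter == 0
               for shorter, longer in zip(periods, periods[1:]))


if __name__ == "__main__":
    print(total_utilization(reservations))
    print(periods_are_harmonic(reservations))
```

Three bridges contribute 3 x 9/72 = 3/8 and the generator contributes 5/8, so the reservations fill the bus completely.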
Evaluation
• Same setup and reservations as on the previous slide, now with the peripheral scheduler.
• No deadline misses with peripheral scheduler.
[Figure: transfer traces for the Generator and the three RT-Bridges.]
Testbed (single core, distributed)
• Embedded testbed used to prove the applicability of our techniques.
• System objective: control a 3DOF Quanser helicopter.
– Non-linear control.
– 100 Hz sensing and actuation.
• End-to-end delay control using:
– I/O Management System.
– Real-Time Bridge
Testbed (single core, distributed)
• Sensor Node performs sensing/actuation.
• Control node executes control algorithm.
• Data exchanged on real-time network.
[Figure: Sensor Node (with the Quanser 3DOF helicopter) and Control Node connected by the RT Network.]
Testbed
[Figure: testbed diagram with three nodes (sensing/actuation node, Control Node, GUI Node) and an RT Switch. Labeled components: CPU, RAM, Mem logic, RT Bridge, PCI, Peripheral Scheduler, Disturb Traffic Generator, RT NIC cards, NICs, ADC/DAC card. Labeled flows: sensing data, actuation.]
Real-Time Bridge Demo
Predictable Execution Model (PREM uni-core)
• (The rule) Real-time embedded applications should be compiled according to a new set of rules to achieve predictability
• (The effect) The execution of a task can be divided into a memory intensive phase (with cache prefetching) and a local computation phase (with cache hits)
• (The benefit) High-level coscheduling can be enforced among all active components of a COTS system → contention for accessing shared resources is implicitly resolved by the high-level coscheduler without relying on low level arbiters
R. Pellizzoni, E. Betti, S. Bak, G. Yao, J. Criswell, M. Caccamo, R. Kegley, "A Predictable Execution Model for COTS-based Embedded Systems",
Proceedings of 17th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), Chicago, USA, April 2011.
Memory-centric scheduling (multicore)
• It uses the PREM task model: each task is composed of a sequence of intervals, each including a memory phase followed by a computation phase.
• It enforces a coarse-grain TDMA schedule for granting memory access to each core.
• Each core can be analyzed in isolation as if tasks were running on a “single-core equivalent” platform.
G. Yao, R. Pellizzoni, S. Bak, E. Betti, and M. Caccamo, "Memory-centric scheduling for multicore hard real-time systems", Real-Time Systems Journal, Vol. 48, No. 6, pp. 681-715, November 2012.
Two cores example: TDMA slot of core 1
[Figure: schedule of jobs J1, J2 and J3 on a 0 to 12 time axis, showing memory phases and computation phases against the TDMA slot of core 1.]
• With a coarse-grained TDMA, tasks on one core can perform memory accesses only when the TDMA slot is granted.
• Core isolation.
Memory-centric scheduling: three rules
• Assumption: fixed priority, partitioned scheduling
• Rule 1: enforce a coarse-grain TDMA schedule among
the cores for granting access to main memory;
• Rule 2: raise scheduling priority of memory phases over
execution phases when TDMA memory slot is granted;
• Rule 3: memory phases are non-preemptive.
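Rule 1 can be illustrated with a small helper that answers two questions: which core owns the memory slot at time t, and when a given core may next start a memory access. This is a minimal sketch assuming equal-length slots assigned to cores in round-robin order; the function names and the slot length of 4 time units are hypothetical and chosen only to match the 0/4/8/12 time axis of the two-core example.

```python
def slot_owner(t, slot_len, num_cores):
    """Core (0-based) that owns the TDMA memory slot at time t."""
    return (t // slot_len) % num_cores


def next_memory_start(t, core, slot_len, num_cores):
    """Earliest time >= t that falls inside a memory slot of `core`.

    Only checks slot ownership; it does not check that a whole
    non-preemptive memory phase (Rule 3) fits in the rest of the slot.
    """
    tdma_period = slot_len * num_cores
    position = (t - core * slot_len) % tdma_period
    if position < slot_len:
        return t  # already inside the core's own slot
    return t + (tdma_period - position)


if __name__ == "__main__":
    # Two cores, slots of 4 time units: core 0 owns [0,4), core 1 owns [4,8), ...
    for t in (0, 5, 8):
        print(t, slot_owner(t, 4, 2), next_memory_start(t, 0, 4, 2))
```

Because the slot assignment depends only on time, each core can reason about its own memory access windows without knowing what the other cores execute.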
Raise priority of mem. phases during TDMA slot
[Figure: two schedules of jobs J1, J2 and J3 on a 0 to 12 time axis, showing memory phases and computation phases.]
Make memory phases non-preemptive
[Figure: two schedules of jobs J1, J2 and J3 on 0 to 12 time axes.]
Summary of two cores example
[Figure: three schedules of jobs J1, J2 and J3 on a 0 to 12 time axis; legend: Memory Phase, Execution Phase, TDMA Memory Slot.]
(a) TDMA-only Scheduling: Rule 1 – TDMA memory schedule
(b) TDMA + Memory Promotion Scheduling: Rule 2 – Prioritize memory phases during a TDMA memory slot
(c) Real-Time Memory Centric Scheduling: Rule 3 – memory phases are non-preemptive
Intuition of response time analysis
[Figure: jobs J1 to J5 on a 0 to 40 time axis, split into a memory chain and an execution chain.]
The linearized TDMA model:
1. b is the memory bandwidth assigned to the core (b = TDMA_slot / TDMA_period);
2. each memory phase is inflated by a factor 1/b; each execution phase is inflated by a factor 1/(1-b);
3. interfering jobs that contribute to worst case response time can be separated as a memory chain followed by an execution chain.
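The inflation step of the linearized model is simple arithmetic and can be sketched as below. The numbers (slot of 4 out of a period of 8, a job with a memory phase of 2 and an execution phase of 3) are made up for illustration; only the factors 1/b and 1/(1-b) come from the slide.

```python
from fractions import Fraction


def assigned_bandwidth(tdma_slot, tdma_period):
    """b = TDMA_slot / TDMA_period, the memory bandwidth share of the core."""
    return Fraction(tdma_slot, tdma_period)


def inflate(memory_phase, execution_phase, b):
    """Inflate a job's phases as in the linearized TDMA model.

    The memory phase is stretched by 1/b and the execution phase by 1/(1-b).
    Requires 0 < b < 1.
    """
    if not 0 < b < 1:
        raise ValueError("b must be strictly between 0 and 1")
    return memory_phase / b, execution_phase / (1 - b)


if __name__ == "__main__":
    b = assigned_bandwidth(4, 8)   # hypothetical two-core split
    print(b, inflate(Fraction(2), Fraction(3), b))
```

With b = 1/2 both phases double; a core with a smaller share b pays more on its memory phases and less on its execution phases.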
Pipelining memory and exec. phases
[Figure: jobs J1 to J5 on a 0 to 40 time axis, with the memory chain and the execution chain overlapped.]
Key observations:
• The inflated memory and execution phases can run in parallel.
• Only ONE joint job contributes to both memory and execution chains (in this figure, J3 is the joint job).
Worst-case response time of Job Ji
[Equation (shown as an image on the slide) with four annotated terms: 1. upper bound of the memory phase of the joint job; 2. memory blocking from one lower priority job; 3. either memory or computation from hp(i); 4. computation of job under analysis.]
1. Both the memory and the computation of the joint job
2. Longest memory phase of one job with lower priority (due to non-preemptive memory)
3. The max of memory and computation phase for each higher priority job
4. The computation phase of the job under analysis
Schedulability of synthetic tasks
[Figure: schedulability ratio of synthetic task sets.]
• In an 8-core, 10-task system, the memory-centric scheduling bound is superior to the contention-based scheduling bound.
[Figure: schedulability ratio with the contour line at the 50% schedulable level (ratio = 0.5).]
Memory Interference
[Figure: foreground slowdown ratio (y-axis, 1.0 to 2.2) on an Intel Core2 (two cores, L2, shared memory). Foreground: the benchmark on the X-axis; background: 470.lbm. X-axis benchmarks: 470.lbm (2.1GB/s), 437.leslie3d (1.6GB/s), 462.libquantum (1.5GB/s), 410.bwaves (1.5GB/s), 471.omnetpp (1.4GB/s).]
• Key observations:
– Memory bandwidth (variable) != CPU bandwidth (constant)
– Memory controller → queuing/access delay is unpredictable
Memory Access Pattern
[Figure: two plots of LLC misses over time (ms).]
• Memory access patterns vary over time
• Static resource reservation is inefficient
Memory Bandwidth Isolation
• MemGuard provides an OS mechanism to enforce
memory bandwidth reservation for each core
H. Yun, G. Yao, R. Pellizzoni, M. Caccamo, L. Sha, "MemGuard: Memory Bandwidth Reservation System for
Efficient Performance Isolation in Multi-core Platforms", to appear at IEEE RTAS, April 2013.
MemGuard
• Characteristics
– Memory bandwidth reservation system
– Memory bandwidth: guaranteed + best-effort
– Prediction based dynamic reclaiming for efficient
utilization of guaranteed bandwidth
– Maximize throughput by utilizing best-effort
bandwidth whenever possible
• Goal
– Minimum memory performance guarantee
– A dedicated (slower) memory system for each core in
multi-core systems
Memory Bandwidth Reservation
• Idea
– Control interference by regulating per-core memory traffic
– OS monitors and enforces each core’s memory bandwidth usage
• Using per-core HW performance counter (PMC) and scheduler
[Figure: core activity (computation, memory fetch) and remaining budget on a 0 to 20 time axis, with "Enqueue tasks" and "Dequeue tasks" events marked.]
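The regulation idea can be modeled in a few lines: a core gets a budget of memory accesses per period, the counter is charged on every access, and the core is throttled once the budget is used up until the next period replenishes it. This is a toy model under stated assumptions, not the MemGuard kernel module; the class and method names are hypothetical, and the "dequeue tasks" and "enqueue tasks" events of the figure are represented only by a `throttled` flag.

```python
class BandwidthRegulator:
    """Toy per-core regulator: `budget` memory accesses per period."""

    def __init__(self, budget):
        self.budget = budget
        self.used = 0
        self.throttled = False

    def on_period_start(self):
        """Replenish the budget (corresponds to 'enqueue tasks')."""
        self.used = 0
        self.throttled = False

    def on_memory_access(self, count=1):
        """Charge `count` accesses; return False if the core is throttled."""
        if self.throttled:
            return False
        self.used += count
        if self.used >= self.budget:
            # Budget exhausted: stand-in for the counter overflow
            # that triggers 'dequeue tasks'.
            self.throttled = True
        return True


if __name__ == "__main__":
    regulator = BandwidthRegulator(budget=2)
    print([regulator.on_memory_access() for _ in range(3)])
    regulator.on_period_start()
    print(regulator.on_memory_access())
```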
Guaranteed Bandwidth: rmin
• Definition
– Minimum memory transfer rate
• when requests are back-logged in the DRAM controller
• worst-case access pattern: same bank & row miss
• Example (PC6400-DDR2*)
– Peak B/W: 6.4GB/s
– Measured minimum B/W: 1.2GB/s
(*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting)
Memory Bandwidth Reservation
• System-wide reservation rule
– up to the guaranteed bandwidth r_min: $\sum_{i=1}^{m} B_i \le r_{min}$, where m is the number of cores
• MemGuard approximates a dedicated (ideal) memory subsystem
– bandwidth: B_i (bytes/sec)
– latency: 1/B_i (sec/byte)
Memory Bandwidth Reclaim
• Key objective
– Utilize guaranteed bandwidth efficiently
• Regulator
– Predicts memory usage based on history
– Donates surplus to the reclaim manager at the beginning of
every period
– When remaining budget (assigned – donated) is depleted,
tries to reclaim from the reclaim manager
• Reclaim manager
– Collects the surplus from all cores
– Grants reclaimed bandwidth to individual cores on demand
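The interaction between the per-core regulator and the reclaim manager can be sketched as follows. This is a simplified model with hypothetical names: the predictor is left out (the predicted usage is passed in as a number), and the pool is a plain counter that is cleared at every period boundary.

```python
class ReclaimManager:
    """Collects donated surplus and grants it back on demand."""

    def __init__(self):
        self.pool = 0

    def reset(self):
        """Assumed behavior: unused surplus does not carry over to the next period."""
        self.pool = 0

    def donate(self, amount):
        self.pool += amount

    def reclaim(self, requested):
        """Grant as much of `requested` as the pool still holds."""
        granted = min(requested, self.pool)
        self.pool -= granted
        return granted


def start_period(assigned, predicted, manager):
    """Donate the predicted surplus; return the budget the core keeps."""
    surplus = max(0, assigned - predicted)
    manager.donate(surplus)
    return assigned - surplus


if __name__ == "__main__":
    manager = ReclaimManager()
    print(start_period(100, 60, manager))   # core A expects to use less
    print(start_period(100, 120, manager))  # core B expects to need more
    print(manager.reclaim(30), manager.reclaim(30))
```

If core A's prediction turns out to be too low, it may find the pool already drained by core B, which is the misprediction case that makes this a soft reservation.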
Hard/Soft Reservation on MemGuard
• Hard reservation (w/o reclaiming)
– Guarantees memory bandwidth Bi regardless of other cores
– Selectively applicable on a per-core basis
• Soft reservation (w/ reclaiming)
– Does not guarantee the reserved bandwidth: misprediction can cause error cases
– Error rate is small (shown in evaluation)
• Best-effort bandwidth
– After all cores use their given budgets, and before the next period begins, MemGuard broadcasts to all cores to continue executing
Evaluation Platform
[Figure: Intel Core2Quad with Core 0 to Core 3, each with L1-I and L1-D caches; two L2 caches; System Bus; DRAM.]
• Intel Core2Quad 8400, 4MB L2 cache, PC6400 DDR2 DRAM
• Modified Linux kernel 3.6.0 + MemGuard kernel module
– https://github.com/heechul/memguard/wiki/MemGuard
• Used all 29 benchmarks from SPEC2006, plus synthetic benchmarks
Isolation Effect of Reservation
[Figure: isolation plot. Core 0: 1.0 GB/s for X-axis; Core 2: 0.2 – 2.0 GB/s for lbm; reference: Solo [email protected]/s.]
• Sum b/w reservation ≤ rmin (1.2GB/s) → Isolation
– 1.0GB/s(X-axis) + 0.2GB/s(lbm) = rmin
Effects of Reclaiming and Spare Sharing
• Guarantee foreground ([email protected]/s)
• Improve throughput of background ([email protected]/s): 368%
Effect of MemGuard
• Soft real-time application on each core.
• Provides differentiated memory bandwidth
– weight for each core=1:2:4:8 for the guaranteed b/w, spare bandwidth sharing is enabled
LVL3 Cache & Storage Interference
• Inter-core interference
– The biggest issue wrt modular certification
– Fetches by one core might evict cache blocks owned by
another core
– Hard to analyze!
• Inter-task/inter-partition interference
• Intra-task interference
– Also present in single-core systems; intra-task interference
is mainly a result of cache self-eviction.
Inter-Core Interference: Options
• Private cache
– It is often not the case: the majority of COTS multicore platforms have last level cache shared among cores
• Cache-Way Partitioning
– Easy to apply, but inflexible
– Reducing number of ways per core can greatly increase cache conflicts
• Colored Lockdown
– Our proposed approach
– Use coloring to solve cache conflicts
– Fine-grained assignment of cache resources (page size – 4Kbytes)
– Use cache locking instructions to lock “hot” pages of rt critical tasks → locked pages cannot be evicted from cache
R. Mancuso, R. Dudko, E. Betti, M. Cesati, M. Caccamo, R. Pellizzoni, "Real-Time Cache Management
Framework for Multi-core Architectures", to appear at IEEE RTAS, Philadelphia, USA, April 2013.
How Coloring Works
• The position inside the cache of a cache block depends on the
value of index bits within the physical address.
• Key idea: the OS decides the physical memory mapping of a task’s virtual memory pages → manipulate the indexes to map different pages into non-overlapping sets of cache lines (colors)
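The color of a physical page follows directly from the cache geometry. The sketch below assumes a physically indexed cache whose way size is a multiple of the page size; the parameters in the example (1 MiB cache, 16 ways, 4 KiB pages) are hypothetical values chosen for illustration, not a description of the platform used in the experiments.

```python
def num_colors(cache_size, ways, page_size):
    """Number of page colors: bytes per way divided by the page size."""
    way_size = cache_size // ways
    return way_size // page_size


def page_color(phys_addr, page_size, colors):
    """Color of the physical page containing `phys_addr`.

    The color is given by the index bits just above the page offset,
    i.e. the physical page number modulo the number of colors.
    """
    return (phys_addr // page_size) % colors


if __name__ == "__main__":
    colors = num_colors(cache_size=1 << 20, ways=16, page_size=4096)
    print(colors)
    for addr in (0x5000, 0x15000, 0x6000):
        print(hex(addr), page_color(addr, 4096, colors))
```

Pages 0x5000 and 0x15000 share a color and therefore compete for the same cache sets, while 0x6000 maps to a disjoint group of sets; the OS can avoid such conflicts by choosing which physical pages back a task's virtual pages.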
How Coloring Works
• You can think of a set associative cache as an array…
[Figure: the cache drawn as an array of 16 colors (rows) by 32 ways (columns).]
How Coloring Works
• You can think of a set associative cache as an array…
• Using only cache-way partitioning, you are restricted to assign
cache blocks by columns.
• Note: assigning one way turns it into a direct-mapped cache!
How Coloring + Locking Works
• You can think of cache as an array…
• Combining coloring and locking, you can assign arbitrary
position to cache blocks independently of replacement policy
Colored Lockdown Final goal
• Aimed model: suffer cache misses in hot memory regions only once
– During the startup phase, prefetch & lock the hot memory regions
– Sharp improvement in terms of WCET reduction (and schedulability)
[Figure: two timelines of task T1 on CPU1 and task T2 on CPU2 with their hot regions; legend: startup, memory access, execution.]
Detecting Hot Regions
• In the general case, the cache is not large enough to hold the working set of all running rt critical tasks.
• For each rt critical task, we can identify some high-usage virtual memory regions, called hot memory regions. Such regions can be identified through profiling.
• Critical tasks do NOT color dynamically linked libraries. Dynamic memory allocation is allowed only during the startup phase.
Detecting Hot Regions
• How can we detect hot pages? Given an addr. space:
– Their location is unknown
– Their absolute virtual memory addresses change from run to run
[Figure: process address space with text, data and heap sections and a hot region.]
Detecting Hot Regions
• Execute the unmodified task inside a profiling environment
– Instrumentation code added at run-time
– Memory accesses are caught
• The output is the list of every single accessed virtual memory address
• We keep per-page access counters. Hotter pages will record a higher number of accesses.
[Figure: Observed Task inside the Profiling Environment.]
Detecting Hot Regions
• Rank the virtual pages by number of accesses.
• Since absolute addresses change from run to run, identify each page as a pair of values:
– The index of the section which contains the page
– The offset, expressed in pages, from the beginning of the section
E.g.: virtual page #: 0x8040A → Section #3 (text) + 0x3
• Execute the task again outside the profiling environment to obtain an unaltered list of sections.
• Compute the relative position of a hot page according to the unaltered list of sections.
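The ranking and the section-relative encoding can be sketched in a few lines. The sketch assumes 4 KiB pages and a hypothetical section table in which section 3 starts at virtual page 0x80407, which is what the example above (page 0x8040A → section 3 + 0x3) implies; the function names and the sample trace are made up for illustration.

```python
from collections import Counter

PAGE_SHIFT = 12  # assumes 4 KiB pages


def rank_pages(addresses):
    """Return virtual page numbers sorted from most to least accessed."""
    counts = Counter(addr >> PAGE_SHIFT for addr in addresses)
    return [page for page, _ in counts.most_common()]


def to_section_relative(page, sections):
    """Map a virtual page number to (section index, page offset).

    `sections` maps a section index to (first page, number of pages).
    Returns None if the page belongs to no listed section.
    """
    for index, (first_page, page_count) in sections.items():
        if first_page <= page < first_page + page_count:
            return index, page - first_page
    return None


if __name__ == "__main__":
    # Hypothetical access trace and section list.
    trace = [0x8040A123, 0x8040A456, 0x8040A789, 0x80408000]
    sections = {3: (0x80407, 8)}
    for page in rank_pages(trace):
        print(hex(page), to_section_relative(page, sections))
```

Because the pair (section index, offset) does not depend on where the loader places the sections, the same profile can be reused on later runs of the task.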
Detecting Hot Regions
• The final memory profile will look like:
Rank    # + page offset
A       1 + 0x0002
B       1 + 0x0004
C       25 + 0x0000
D       1 + 0x0001
E       25 + 0x0003
I       3 + 0x0000
K       4 + 0x0000
O       6 + 0x0002
P       1 + 0x0005
Q       1 + 0x0000
...
– Where A, B, … is the page ranking
– Where “#” is the section index
– It can be fed into the kernel to perform selective Colored Lockdown
• How many pages should be locked per process?
– Task WCET reduction as a function of locked pages has approximately a convex shape; convex optimization can be used for allocating cache among rt critical tasks
EEMBC Results
• EEMBC Automotive benchmarks
– Benchmarks converted into periodic tasks
– Each task has a 30 ms period
• ARM-based platform
– 1 GHz Dual-core Cortex-A9 CPU
– 1 MB L2 cache + private L1 (disabled)
• Tasks observed on Core 0
– Each plotted sample summarizes execution of 100 jobs
• Interference generated with synthetic tasks on Core 1
EEMBC Results
• Angle to time conversion benchmark (a2time)
• Baseline reached when 4 hot pages are locked / 81% accesses caught
EEMBC Results
• CAN remote data request benchmark (canrdr)
• Baseline reached when 3 pages are locked / 91% accesses caught
EEMBC Results
• Same experiment executed on 7 EEMBC benchmarks
Benchmark    Total Pages    Hot Pages    % Accesses in Hot Pages
a2time       15             4            81%
basefp       21             6            97%
bitmnp       19             5            80%
cacheb       30             5            92%
canrdr       16             3            85%
rspeed       14             4            85%
tblook       17             3            81%
EEMBC Results
• One benchmark at a time scheduled on Core 0
• Only the hot pages are locked
[Figure: results for three configurations: No Prot. / No Interf.; No Prot. / Interf.; Prot. / Interf.]
EEMBC Results
• Four benchmarks at a time scheduled on Core 0
• Only the hot pages are locked
[Figure: results per priority level, from Prio 4 (top priority) to Prio 1 (low priority).]
Conclusions
• In a multicore chip, memory controllers, last level cache, memory, on-chip network and I/O channels are globally shared by cores. Unless a globally shared resource is overprovisioned, it must be partitioned/reserved/scheduled.
• We proposed a set of engineering solutions to:
1. schedule memory accesses at high level (PREM + memory-centric scheduling),
2. control cores’ memory bandwidth usage (MemGuard),
3. manage cache space in a predictable manner (Colored Lockdown).
• We demonstrated our techniques on different platforms based on Intel and ARM, and tested them against other options.
• Questions?
Acknowledgements
• Part of this research is joint work with Prof. Lui Sha and Prof. Rodolfo Pellizzoni
• This presentation is from selected research sponsored by
– National Science Foundation (NSF), Office of Naval Research (ONR)
– Lockheed Martin Corporation
– Rockwell Collins
• Graduate students and postdocs involved in this research: Stanley Bak, Heechul Yun, Renato Mancuso, Roman Dudko, Emiliano Betti, Gang Yao
References
• E. Betti, S. Bak, R. Pellizzoni, M. Caccamo and L. Sha, "Real-Time I/O Management System with COTS Peripherals", IEEE Transactions on Computers (TC), Vol. 62, No. 1, pp. 45-58, January 2013.
• R. Pellizzoni, E. Betti, S. Bak, G. Yao, J. Criswell, M. Caccamo, R. Kegley, "A Predictable Execution Model for COTS-based Embedded Systems", Proceedings of 17th RTAS, Chicago, USA, April 2011.
• G. Yao, R. Pellizzoni, S. Bak, E. Betti, and M. Caccamo, "Memory-centric scheduling for multicore hard real-time systems", Real-Time Systems Journal, Vol. 48, No. 6, pp. 681-715, November 2012.
• H. Yun, G. Yao, R. Pellizzoni, M. Caccamo, L. Sha, "MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isolation in Multi-core Platforms", to appear at IEEE RTAS, April 2013.
• R. Mancuso, R. Dudko, E. Betti, M. Cesati, M. Caccamo, R. Pellizzoni, "Real-Time Cache Management Framework for Multi-core Architectures", to appear at IEEE RTAS, Philadelphia, USA, April 2013.