Software for embedded multiprocessors
Micrel Lab @ DEIS





Introduction to embedded OS
Code parallelization
Interprocessor communication
Task allocation
Power management
Introduction to embedded operating systems
OS Overview

- Kernel, device drivers, boot loader, user interface, file system and utilities
- Kernel components:
  - Interrupt handler
  - Scheduler
  - Memory manager
  - System services (networking and IPC)
- The kernel runs in a protected memory space (kernel space) with full access to the HW, while apps run in user space
- Apps communicate with the kernel via system calls
[Diagram: applications invoke library functions such as printf(), strcpy() and open(), which in turn enter the kernel through system calls such as write() and open()]

OS Overview (II)

- The operating system takes control of the execution on:
  - Timer interrupts
  - I/O interrupts
  - System calls
  - Exceptions (undefined instruction, data abort, page faults, etc.)
Processes

- A process is a unique execution of a program.
  - Several copies of a program may run simultaneously or at different times.
- A process has its own state:
  - registers;
  - memory.
- The operating system manages processes.
Process state

- A process can be in one of three states:
  - executing on the CPU;
  - ready to run;
  - waiting for data.

[State diagram: a ready process gets the CPU and becomes executing; an executing process is preempted back to ready, or needs data and becomes waiting; a waiting process that gets data becomes ready, or gets data and the CPU and resumes executing]
Processes and CPUs

- Activation record: copy of process state.
- Context switch:
  - current CPU context goes out;
  - new CPU context goes in.

[Diagram: activation records of process 1, process 2, ... stored in memory; the CPU holds the PC and registers of the currently running process]
Terms

- Thread = lightweight process: a process that shares its memory space with other processes.
- Reentrancy: ability of a program to be executed several times with the same results.
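As an illustrative fragment (not from the slides; the two hypothetical helper functions are mine), the following C code contrasts a non-reentrant routine that keeps state in a static buffer with a reentrant version that only touches caller-provided data:

#include <stddef.h>

/* Non-reentrant: the static buffer is shared by all callers, so
   overlapping or repeated executions can interfere with each other. */
char *to_upper_static(const char *s) {
    static char buf[64];
    size_t i;
    for (i = 0; s[i] != '\0' && i < sizeof(buf) - 1; i++)
        buf[i] = (s[i] >= 'a' && s[i] <= 'z') ? s[i] - 32 : s[i];
    buf[i] = '\0';
    return buf;
}

/* Reentrant: all state lives in parameters and locals, so every
   execution gives the same result for the same inputs. */
void to_upper_r(const char *s, char *dst, size_t len) {
    size_t i;
    for (i = 0; s[i] != '\0' && i + 1 < len; i++)
        dst[i] = (s[i] >= 'a' && s[i] <= 'z') ? s[i] - 32 : s[i];
    dst[i] = '\0';
}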
Processes in POSIX

- Create a process with fork:
  - the parent process keeps executing the old program;
  - the child process executes a new program.

[Diagram: process a calls fork(), producing a second copy of process a, which then becomes process b]

The fork call creates the child:

childid = fork();
if (childid == 0) {
  /* child operations */
} else {
  /* parent operations */
}
execv()

Overlays the child's code with the program stored in a file:

childid = fork();
if (childid == 0) {
  execv("mychild", childargs);
  perror("execv");
  exit(1);
}
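A runnable sketch combining the two fragments above (the program name "mychild" and its argument vector come from the slide; everything else is standard POSIX):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void) {
    char *childargs[] = { "mychild", NULL };
    pid_t childid = fork();

    if (childid == 0) {
        /* child: overlay this process with the program in file "mychild" */
        execv("mychild", childargs);
        perror("execv");      /* reached only if execv fails */
        exit(1);
    } else if (childid > 0) {
        /* parent: keeps executing the old program, then reaps the child */
        waitpid(childid, NULL, 0);
    } else {
        perror("fork");
    }
    return 0;
}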
Context switching

- Who controls when the context is switched?
- How is the context switched?
Co-operative multitasking

- Improvement on co-routines:
  - hides the context switching mechanism;
  - still relies on processes to give up the CPU.
- Each process allows a context switch at a cswitch() call.
- A separate scheduler chooses which process runs next (a minimal sketch follows this slide).
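As a highly simplified illustration (not the slides' cswitch() mechanism; the two task functions are hypothetical), a cooperative scheduler can be reduced to a loop that runs each task until the task voluntarily returns control:

#include <stdio.h>

/* Each task does a bounded chunk of work and then returns,
   which is its way of "giving up the CPU". */
static void task_a(void) { puts("task A slice"); }
static void task_b(void) { puts("task B slice"); }

int main(void) {
    void (*tasks[])(void) = { task_a, task_b };
    const int ntasks = 2;

    for (int round = 0; round < 3; round++)   /* simple round-robin */
        for (int i = 0; i < ntasks; i++)
            tasks[i]();   /* a buggy task that never returns starves the others */
    return 0;
}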
Problems with co-operative multitasking

- Programming errors can keep other processes out:
  - a process never gives up the CPU;
  - a process waits too long to switch, missing input.
Context switching

- Must copy all registers to the activation record, keeping the proper return value for the PC.
- Must copy the new activation record into the CPU state.
- How does the program that copies the context keep its own context?
Preemptive multitasking

- Most powerful form of multitasking:
  - the OS controls when contexts switch;
  - the OS determines what process runs next.
- Uses a timer interrupt to call the OS and switch contexts.

[Diagram: a timer periodically interrupts the CPU]


Flow of control with preemption

[Timeline: P1 runs until a timer interrupt enters the OS; the OS resumes P1; at the next interrupt the OS switches to P2]
Preemptive context switching

- The timer interrupt gives control to the OS, which saves the interrupted process's state in an activation record.
- The OS chooses the next process to run.
- The OS installs the desired activation record as the current CPU state.
Operating systems

- The operating system controls resources:
  - who gets the CPU;
  - when I/O takes place;
  - how much memory is allocated.
- The most important resource is the CPU itself.
  - CPU access is controlled by the scheduler.
Design Issues

- Kernel space / user space / real-time space
- Monolithic versus micro-kernel
  - Monolithic: OS services (including device drivers, networking and filesystem) run in privileged mode (easier to make efficient) (Linux, WinNT)
  - Microkernel: privileged mode only for task management, scheduling, IPC, interrupt handling, memory management (more robust) (QNX, VxWorks)
- Pre-emptable kernel or not
- Memory management versus shared memory
- Dedicated versus general
Embedded vs General Purpose OS

- Small footprint
- Stability (must run for years without manual intervention)
- Hardware watchdogs
- Little power
- Autonomous reboot (safely and instantly)
Taxonomy

- High-end embedded systems
  - Down-sized derivatives of existing general-purpose OSes (routers, switches, PDAs, set-top boxes)
- Deeply embedded OS
  - Small OSes with a handful of basic functions, designed from the ground up for a particular application
  - They typically lack a high-performance GUI and networking (automotive control, digital cameras, mobile phones)
  - They are statically linked to the application: after compilation, the OS kernel and the applications are concatenated into a single package that can be loaded onto the embedded machine
Run-time environment

- Boot routine + embedded libraries
- Java and C++ offer functionality such as memory management, threading, task synchronization and exception handling
[Diagram: system components. Typical OS configuration: user programs run on top of the operating system, which runs on the hardware. Typical embedded configuration: a single user program, including the (real-time) operating system services, runs directly on the hardware]
Dynamic VS Static Loading

- Dynamic loading
  - The OS is loaded as a separate entity and applications are dynamically loaded in memory (more flexibility, code relocation is needed) (e.g. uClinux)
- Static loading
  - The OS is linked and loaded together with the applications (no flexibility, higher predictability) (e.g. eCos, RTEMS)
  - The OS is a set of libraries that provide OS services
- How about:
  - Memory protection? (shared address space)
  - System calls? (no CPU mode change required)
  - Process creation (fork, exec)? (shared address space, no overloading)
  - File system? (only for input/output data)
Static Loading

- No address space separation
- User applications run with the same access privilege as the kernel
- Functions are accessed as function calls, no "system calls":
  - No need for copying parameters and data
  - No need for state saving
  - Speed and control
Dynamic Loading

[Diagram: the OS is loaded at boot time at a constant address; at run time, processes are loaded from the file system into system memory and relocated (compile-time addresses != run-time addresses), with address space separation]
Focus on software for
embedded multiprocessors
Embedded vs. General Purpose

- Embedded Applications – Asymmetric Multi-Processing
  - Differentiated processors
  - Specific tasks known early
  - Mapped to dedicated processors
  - Configurable and extensible processors: performance, power efficiency
  - Communication: shared local memories, HW FIFOs, other direct connections
  - Dataflow programming models
  - Classical example: smart mobile – RISC + DSP + media processors

- Server Applications – Symmetric Multi-Processing
  - Homogeneous cores
  - General tasks known late
  - Tasks run on any core
  - High-performance, high-speed microprocessors
  - Communication: coherent memory, large coherent memory space on the multi-core die or bus
  - SMT programming models (Simultaneous Multi-Threading)
  - Examples: large server chips (e.g. Sun Niagara, 8x4 threads), scientific multi-processors
Parallel programming of
embedded multiprocessors
Parallelism & Programming Models

- MP is difficult: concurrency, and "fear of concurrency"
- No robust and general models to automatically extract concurrency in 20-30+ years of research
- Many programming models/libraries – SMT, SMP
  - OpenMP, MPI (Message Passing Interface); a minimal MPI sketch follows this slide
- Users manually modify code:
  - Concurrent tasks or threads
  - Communications
  - Synchronisation
- Today:
  - Coarse-grained (whole application / data-wise) concurrency – unmodified source + MP scheduler
  - API for communications and synchronisation
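A minimal sketch of explicit message passing with MPI (the value being exchanged and the two-rank setup are illustrative only):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* explicit, user-written communication */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}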
Sequential execution model

- The most common
  - Supported by traditional (imperative) languages (C, C++, Fortran, etc.)
  - Huge bulk of legacy code
- The most well understood
  - We are trained to solve problems algorithmically (as a sequence of steps)
  - Microprocessors were originally designed to run sequential code
- The easiest to debug
  - Tracing the state of the CPU
  - Step-by-step execution
- But… it HIDES parallelism!!
Types of Parallelism

- Instruction Level Parallelism (ILP)
  - Compilers & HW are mature
- Task Parallelism
  - Parallelism explicit in the algorithm
  - Between filters without a producer/consumer relationship
- Data Parallelism
  - Between iterations of a stateless filter
  - Placed within a scatter/gather pair (fission)
  - Can't parallelize filters with state
- Pipeline Parallelism
  - Between producers and consumers
  - Stateful filters can be parallelized

[Diagram: data parallelism as a scatter/gather pair, pipeline parallelism as a producer/consumer chain, task parallelism between independent branches]
Parallelizing Loops: a Key Problem

- 90% of execution time is in loops
- Partial success in automatic extraction
  - Mostly "well-behaved loops"
  - Challenges: dependency analysis & interaction with data placement
- Cooperative approaches are common
  - The programmer drives automatic parallelization (OpenMP)
- FORALL
  - No "loop-carried dependences"
  - Fully parallel (see the OpenMP sketch below)
- FORACROSS
  - Some "loop-carried dependences"
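A hedged FORALL sketch using OpenMP (the vector-add kernel and the size N are illustrative, not taken from the slides):

#include <omp.h>

#define N 1024

/* c[i] depends only on a[i] and b[i], so there are no loop-carried
   dependences (FORALL): the runtime may distribute iterations across cores. */
void vector_add(const float *a, const float *b, float *c) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}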
Parallelized loops rely on barrier synchronization.
Barrier with Pthreads

- The master core only initializes the synchronization structures (pthread_mutex_init()) and creates the worker threads (pthread_create()).
- The serial region is executed by the master; at the end of the parallel region all threads meet on the barrier (a minimal pthread sketch follows).
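For the symmetric (pthreads) case, a minimal sketch of this structure using the standard pthread_barrier_t; the worker function and thread count are assumptions for illustration:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
static pthread_barrier_t bar;

static void *worker(void *arg) {
    long id = (long)arg;
    /* PARALLEL REGION: each thread computes its own share */
    printf("thread %ld working\n", id);
    pthread_barrier_wait(&bar);           /* BARRIER: wait for all workers */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];

    /* SERIAL REGION: only the master initializes the synchronization
       structures and creates the workers */
    pthread_barrier_init(&bar, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);

    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&bar);
    return 0;
}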
Pthreads on Heterogeneous CPUs?

[Diagram: heterogeneous MPSoC with a master CPU and several slave CPUs, each with a private memory, connected through an interconnect to a shared memory]

ISSUES
- There is an OS running on each core.
- There is no means for the master core to fork new threads on the slave nodes.
- Use of pthreads is not a suitable solution.

SOLUTION
- Standalone implementation.
- Master/slave cores instead of threads.
- Synchronization through shared memory.
SPMD Barrier

- All cores initialize the synchronization structures and common data in shared memory.
- Slaves notify their presence on the barrier to the master.
- The master releases the slaves as soon as it is ready to start the parallel region.
- SERIAL REGION: additional serial code is only executed by the master core, while the slaves wait on the barrier.
- PARALLEL REGION: executed by all cores.
Code implementation flow

- The original C code is processed by the parallelizing compiler to obtain parallel code, which gcc then compiles into the binary code run on the MPARM platform.

[Diagram: master CPU and slave CPUs with private memories, connected by an interconnect to a shared memory]
Runtime Library

The Runtime Library is responsible for:
- Initializing the needed synchronization structures, creating new worker threads (in the original implementation) and coordinating their parallel execution over multiple cores
- Providing the implementation of synchronization facilities (locks, barriers)
Code Execution

- Each CPU executes the same program. Based on the CPU id, we separate the portions of code to be executed by the master and by the slaves.
- The master CPU executes the serial code, initializes the synchronization structures (barriers, locks) in shared memory, etc.
- The slave CPUs only execute the parallel regions of code, behaving like the typical slave threads.

int main() {
  …
  if (MASTERID) {
    /* serial code, synchronization */
  }
  …
  if (SLAVEID) {
    /* parallel code */
  }
  …
}

[Diagram: master and slave CPUs with private memories connected through a NoC interconnect to the shared memory holding the synchronization structures]
Synchronization

- Parallel programming through shared memory requires global and point-to-point synchronization.
- On symmetric architectures, implementations use the pthreads library synchronization facilities; on heterogeneous architectures, HW semaphores must be used.

Pthreads version:

void lock(pthread_mutex_t *lock)
{
  pthread_mutex_lock(lock);
}

void unlock(pthread_mutex_t *lock)
{
  pthread_mutex_unlock(lock);
}

HW semaphore version:

void lock(int *lock)
{
  while(*lock);
}

void unlock(int *lock)
{
  *lock = 0;
}
Typical Barrier implementation

Shared counters protected by locks:

struct barrier {
  lock_type lock;
  int entry_count;
  int exit_count;
} *bar;

LOCK(bar->lock);
bar->entry_count++;
if (bar->entry_count < nproc) {
  UNLOCK(bar->lock);
  while(bar->entry_count != nproc);
  LOCK(bar->lock);
  bar->exit_count++;
  if (bar->exit_count == nproc)
    bar->entry_count = 0x0;
  UNLOCK(bar->lock);
} else {
  bar->exit_count = 0x1;
  if (bar->exit_count == nproc)
    bar->entry_count = 0x0;
  UNLOCK(bar->lock);
}
while(bar->exit_count != nproc);
Barrier Implementation Issues

ISSUES
- This approach is not very scalable
- Every processor notifies its arrival at the barrier by increasing the value of a common shared variable
- As the number of cores increases, contention for the shared resource may increase significantly

POSSIBLE SOLUTION
- A vector of flags, one per core, instead of a single shared counter
New Barrier Implementation

- Only the master spin-waits on each slave's flag to detect their presence on the barrier; each slave updates its own flag.
- Only the slaves spin-wait on a shared counter that is updated by the master: no busy-waiting due to contention on a shared counter.

typedef struct Barrier {
  int entered[NSLAVES];
  int usecount;
} Barrier;

void Master_Wait(Barrier *b, int num_procs) {
  int i;
  for (i = 1; i < num_procs; i++)
    while (!b->entered[i]);
  /* reset flags to 0 */
}

void Master_Release(Barrier *b) {
  b->usecount++;
}

void Slave_Enter(Barrier *b, int id) {
  int ent = b->usecount;
  b->entered[id] = 1;
  while (ent == b->usecount);
}
Compiler aware of synchronization cost?

- A lightweight implementation of the synchronization structures allows parallelized code with a large number of barrier instructions to still perform better than the serial version.
- It would be useful to let the compiler know about the cost of synchronization. This would allow it not only to select the parallelizable loops, but also to establish whether the parallelization is worthwhile.
- For workloads well distributed across the cores the proposed barrier performs well, but for a high degree of load imbalance an interrupt-based implementation may be better suited. The compiler may choose which barrier instruction to insert depending on the amount of busy waiting.
Performance analysis

- Ideal execution time was estimated by simulating the computational load of the various core configurations on a single core.
- Ideal synchronization time was calculated by making the master wait on the barrier after all slaves had entered.
- The time needed for initializing the synchronization structures in shared memory was measured on one core; it is expected to be invariant across the various configurations.
- As expected, execution time almost halves with the doubling of the number of working cores.
- In the real case, synchronization requires additional waiting time: simultaneous accesses to the shared memory generate additional traffic on the bus that produces significant overhead.
- Overall execution time is also lengthened by the waiting cycles due to polling on the synchronization structures in shared memory; those cycles also include the contribution of concurrent accesses to shared memory.

[Chart: upper triangular 32x32 matrix filling; cost (*) of Texec, Tsync and Tinit (ideal and overhead) for the serial version and for 1, 2, 4 and 8 cores]

(*) Overall number of cycles normalized by the number of cycles spent for an ideal bus transaction (1 read + 1 write)
Performance analysis 2

- For a small computational load (i.e. few matrix elements, upper triangular 32x32 matrix filling), initialization and synchronization have a big impact on overall performance: no speedup.
- For a bigger computational load (1024x1024 matrix filling), the initialization and synchronization contributions go completely unnoticed: big speedup margin.
- Speedup is heavily limited by frequent accesses to shared memory. Would pure computation follow the profile of the blue bars? Would cacheable shared memory regions help?
- Possible optimizations on barriers in order to reduce accesses to shared memory.
- Possible optimizations on initialization, serializing / interleaving accesses to the bus.

[Charts: cost of Texec, Tsync and Tinit (ideal and overhead) for the serial version and for 1, 2, 4 and 8 cores, for the 32x32 and the 1024x1024 upper triangular matrix filling]
Example: MP-Queue library

- MP-Queue is a library intended for message passing among different cores in an MPSoC environment.
- Highly optimized C implementation:
  - Low-level exploitation of data structures and semaphores: low overhead
  - Data transfer optimized for performance: analyses of disassembled code
  - Synch operations optimized for minimal interconnect utilization
- Producer-consumer paradigm, different topologies:
  - 1-N
  - N-1
  - N-N
Communication library API

1. Autoinit_system()
   1. Every core has to call it at the very beginning.
   2. Allocates data structures and prepares the semaphore arrays.
2. Autoinit_producer()
   1. To be called by a producer core only.
   2. Requires a queue id.
   3. Creates the queue buffers and signals its position to the n consumers.
3. Autoinit_consumer()
   1. To be called by a consumer core only.
   2. Requires a queue id.
   3. Waits for the n producers to be bound to the consumer structures.
4. Read()
   1. Gets a message from the circular buffer (consumer only).
5. Write()
   1. Puts a message into the circular buffer (producer only).

A usage sketch follows this list.
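A hedged usage sketch of this API on two cores. Only the call names come from the list above; the prototypes, the queue id, the message layout and the two application helpers are assumptions for illustration:

/* Assumed prototypes for the MP-Queue calls named on this slide
   (the actual signatures in the library may differ). */
void Autoinit_system(void);
void Autoinit_producer(int queue_id);
void Autoinit_consumer(int queue_id);
void Read(int queue_id, unsigned int *msg);
void Write(int queue_id, const unsigned int *msg);

/* Hypothetical application code. */
void fill_message(unsigned int *msg);
void process_message(const unsigned int *msg);

#define QUEUE_ID   0
#define MSG_WORDS 32

void producer_main(void) {
    unsigned int msg[MSG_WORDS];
    Autoinit_system();              /* every core: structures and semaphores */
    Autoinit_producer(QUEUE_ID);    /* create buffers, signal the consumers */
    for (;;) {
        fill_message(msg);
        Write(QUEUE_ID, msg);       /* put a message into the circular buffer */
    }
}

void consumer_main(void) {
    unsigned int msg[MSG_WORDS];
    Autoinit_system();
    Autoinit_consumer(QUEUE_ID);    /* wait for the producers to bind */
    for (;;) {
        Read(QUEUE_ID, msg);        /* get a message from the circular buffer */
        process_message(msg);
    }
}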
Communication semantics

- Notification mechanisms available:
  - Round robin.
  - Notify all.
  - Target core specifying.
- The i-th producer:
  - Gets the write position index.
  - Puts data onto the buffer.
  - Signals either one consumer (round-robin / fixed) or all consumers (notify all).
- The i-th consumer:
  - Gets the read position index.
  - Gets data from the buffer.
  - Signals either one producer (round-robin / fixed) or all producers (notify all).
Architectural Flexibility

1. Multi-core architectures with distributed memory.
2. Purely shared memory based architectures.
3. Hybrid platforms.
Transaction Chart

- Shared bus accesses are minimized as much as possible:
  - Local polling on scratchpad memories.
  - Insertion and extraction indexes are stored in shared memory and protected by a mutex.
- The data transfer section involves the shared bus:
  - Critical for performance.
Sequence diagrams

- 1 producer and 1 consumer (parallel activity).
- Synch time vs pure data transfer:
  - Local polling on the scratchpad semaphore.
  - Signaling to the remote core's scratchpad.
  - "Pure" data transfer to and from the FIFO buffer in shared memory.

[Sequence diagrams for the measured message sizes, in words]
Communication efficiency

- Comparison against ideal point-to-point communication.
- 1-N queues leverage bus pipelining:
  - bigger asymptotic efficiency.
- Interrupt-based notification allows more than one task per core:
  - significant overhead (up to 15%).
- Low-level optimizations are critical!

[Charts: efficiency for 16 words per token and 32 words per token; annotation: beyond this size the multiple load / multiple store sequence is not produced any more by the compiler]
Growth of assembly length for copy sections

- The gcc compiler avoids inserting the multiple load / multiple store loop from 32 words on.
- Code size would otherwise rise exponentially.
- Where high throughput is required, a less compact but more optimized representation is desired.
Compiler-aware optimization benefits

- The compiler may be forced to unroll data transfer loops (a sketch of the idea follows).
- About 15% improvement with 32-word messages.
- A typical JPEG 8x8 block is encoded in a 32-word struct:
  - 8x8x16-bit data.
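As a hedged illustration of the idea (not the library's actual copy code), a 32-word transfer can be written as a plain loop or forced into an unrolled form that trades code size for throughput:

#define TOKEN_WORDS 32

/* Compact form: the compiler may keep this as a loop for larger sizes. */
void copy_loop(volatile unsigned int *dst, const unsigned int *src) {
    for (int i = 0; i < TOKEN_WORDS; i++)
        dst[i] = src[i];
}

/* Unrolled form (factor 4): fewer branches per transferred word,
   larger code footprint. */
void copy_unrolled(volatile unsigned int *dst, const unsigned int *src) {
    for (int i = 0; i < TOKEN_WORDS; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
}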
Task allocation in MPSoC
architectures
Application Mapping

[Diagram: a task graph T1…T8 is mapped onto processors Proc. 1 … Proc. N, each with a private memory, connected by an interconnect; the resulting schedule must fit the available resources and meet the deadline]

- The problem of allocating, scheduling and selecting frequencies for task graphs on multi-processors in a distributed real-time system is NP-hard.
- New tool flows are needed for efficient mapping of multi-task applications onto hardware platforms.
Approach

Focus:
- Statically scheduled applications.

Objectives:
- Complete approach to allocation, scheduling and frequency selection:
  - High computational efficiency w.r.t. commercial solvers;
  - High accuracy of generated solutions.
- New methodology for multi-task application development:
  - To quickly develop multi-task applications;
  - To easily apply the optimal solution found by our optimizer.
Target architecture

- An architectural template for a message-oriented, distributed-memory MPSoC:
  - Support for message exchange between the computation tiles;
  - Single-token communication;
  - Availability of local memory devices at the computation tiles and of remote memories for program data.
- The throughput requirement is reflected in the maximum tolerable scheduling period T of each processor (activities Act. A … Act. N repeat with period T).
- Several MPSoC platforms available on the market match this template:
  - The Silicon Hive Avispa-CH1 processor;
  - The Cradle CT3600 family of multiprocessors;
  - The Cell Processor;
  - The ARM MPCore platform.
Application model

A task graph represents:
- A group of tasks T
- Task dependencies
- Execution times expressed in clock cycles: WCN(Ti)
- Communication times (writes & reads) expressed as WCN(WTiTj) and WCN(RTiTj)
- These values can be back-annotated from functional simulation

[Example graph: Task1 … Task6, each node annotated with its execution time WCN(Ti) and each arc with the write cost WCN(WTiTj) and the read cost WCN(RTiTj)]
Example

Graph of activities:
- Number of nodes: 12
- Number of CPUs: 2
- Node type: Normal, Branch, Conditional, Terminator
- Node behaviour: Or, And, Fork, Branch

Outputs of the optimizer:
- Task allocation
- Task scheduling
- Arc priorities
- Frequency & voltage settings

#define TASK_NUMBER 12
#define N_CPU 2

//Node Type: 0 NORMAL; 1 BRANCH; 2 STOCHASTIC
uint node_type[TASK_NUMBER] = {0,1,1,0,..};

//Node Behaviour: 0 AND; 1 OR; 2 FORK; 3 BRANCH
uint node_behaviour[TASK_NUMBER] = {2,3,3,..};

uint task_on_core[TASK_NUMBER] = {1,1,2,1};
int schedule_on_core[N_CPU][TASK_NUMBER] = {{1,2,4,8}..};
uint queue_consumer[..][..] = {{1,2,2,1,..},
                               {0,0,0,1,1,.},
                               {0,0,0,0,0,1,1..},
                               {0,0,0,0,..}..};

[Diagram: a 12-node activity graph with fork, branch, and/or nodes (T1…T12, arcs a1…a14) mapped onto the 2 CPUs, with the resulting schedule fitting the resources and the deadline]

Queue ordering optimization

[Diagram: two alternative orderings of the communications between CPU1 and CPU2; depending on the position of communication C2 in the queue, task T4 either has to wait or can run immediately while T2, T3, T5 and T6 proceed]

Communication ordering affects system performance.
Synchronization among tasks

[Diagram: tasks T1…T4 scheduled on Proc. 1 and Proc. 2; T4 is suspended until the data it depends on becomes available through communication C2, and is then re-activated]

- Non blocked semaphores
Logic Based Benders Decomposition

Decomposes the problem into 2 sub-problems:
- Allocation & assignment of frequency settings → IP (INTEGER PROGRAMMING)
  - Objective function: minimizing energy consumption during execution and communication of tasks
  - Subject to memory constraints; produces a valid allocation
- Scheduling → CP (CONSTRAINT PROGRAMMING)
  - Objective function: minimizing energy consumption during frequency switching
  - Subject to the real-time constraint; infeasibilities are fed back to the master problem as no-good (linear) constraints

- The process continues until the master problem and the sub-problem converge, providing the same value.
- The methodology has been proven to converge to the optimal solution [J.N. Hooker and G. Ottosson].
Application Development Methodology

[Flow: in the characterization phase, the simulator produces the CTG and the application profiles; in the optimization phase, the optimizer computes the solution; the application development support then yields the optimal SW application implementation, which is executed on the platform]
GSM Encoder

- Task graph:
  - 10 computational tasks;
  - 15 communication tasks.
- Throughput required: 1 frame / 10 ms.
- With 2 processors and 4 possible frequency & voltage settings:
  - Without optimizations: 50.9 μJ
  - With optimizations: 17.1 μJ (-66.4%)
Energy Management

- Basic techniques: shutdown and DVFS
- Advanced techniques: feedback control
Energy Optimization in MPSoCs

Two main steps:
- Workload allocation to processing elements: task mapping and scheduling
- After workload allocation, the resources of the processing elements should be adapted to the required performance to minimize energy consumption:
  - shut-down
  - voltage scaling
Shut-Down

When the system is idle the processor can be placed into a low-power state. In order of decreasing power level (and decreasing reactivity):
- core clock gating (woken up by a timer interrupt) – no need for context restore
- core power gating (woken up by on-chip peripherals)
- chip power gating (woken up by external, on-board interrupts) – need for context restore
Frequency/Voltage Scaling

- DVFS: adapting frequency/voltage to the workload
- Frequency must be scaled with voltage to keep circuit functionality
- Dynamic power goes with the square of Vdd and linearly with clock speed:

  P = C_eff * Vdd^2 * f

- Scaling V/f by a factor s -> power scales as s^3
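Spelling out the slide's s^3 claim as a one-line derivation (same symbols as above):

  P(s) = C_{eff}\,(s\,V_{dd})^{2}\,(s\,f) = s^{3}\,C_{eff}\,V_{dd}^{2}\,f

Since at speed s the execution time of a task stretches by 1/s, the energy per task still improves, scaling roughly as s^2.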
Power Manager Implementation

- A power management policy consists of algorithms that use input information (workload, operational conditions) and parameter settings to generate commands that steer the mechanisms [Pouwelse03].
- A dynamic power management system is a set of rules and procedures that move the system from one operating point to another as events occur [IBM/Montavista 02].
Power Manager Components

- Monitoring
  - Utilization, idle times, busy times
  - Per-task based (e.g. Vertigo), global utilization (e.g. Grunwald)
- Prediction
  - Averaging (e.g. EMA), filtering (e.g. LMS) – a minimal EMA sketch follows
- Control
  - Shutdown, DVFS
  - Open-loop, closed-loop (e.g. adaptive control)
- Update rule
  - Establishes the decision points
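A hedged sketch of the "averaging (e.g. EMA)" prediction step in C; the fixed-point format and the smoothing factor are illustrative choices, not values from the slides:

#include <stdint.h>

/* Exponential moving average of observed CPU utilization, in 16.16
   fixed point.  A larger ALPHA forgets history more quickly. */
#define FP_ONE  (1u << 16)
#define ALPHA   ((uint32_t)(0.25 * FP_ONE))   /* illustrative smoothing factor */

static uint32_t predicted_util = 0;

/* Called at every decision point with the utilization measured over
   the last observation window (also in 16.16 fixed point). */
uint32_t ema_update(uint32_t measured_util) {
    predicted_util = (uint32_t)((uint64_t)ALPHA * measured_util / FP_ONE
                   + (uint64_t)(FP_ONE - ALPHA) * predicted_util / FP_ONE);
    return predicted_util;   /* used by the controller to pick f/Vdd */
}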
Traditional Approach

[Block diagram: per-task and global utilization monitors and an idle time monitor observe TASK 1, TASK 2, …; they feed an idle time predictor and a workload predictor which, through an update rule, drive the shutdown controller and the DVFS controller]
Limitations

- Assuming no specific information from the applications is available, traditional approaches are based on the observation of utilization history.
- Slow adaptation impacts system reactivity:
  - Specific techniques for tracking interactive tasks have been proposed [Flautner2002];
  - For soft real-time (multimedia) applications deadlines may be missed.
- Frequent voltage oscillations impact energy efficiency:
  - Square relationship between power and voltage;
  - Cost of switching (power/time/functionality).
Multimedia applications

- Multimedia applications can be represented as communicating objects.
  - Ex.: GStreamer multimedia framework.

[Diagram: a chain of objects exchanging data through pads]
Streaming Applications

- Multimedia streaming applications are going multicore:
  - Objects are mapped to tasks that are distributed on the cores to increase performance;
  - Specific tasks can be executed by hardware accelerators such as IPU and GPU units.

[Diagram: pipeline stages P0…P3 exchanging data, mapped onto CORE #0, CORE #1 and CORE #2]
Overcoming Traditional Approaches

- Key observation: multimedia applications are mapped onto MPSoCs as communicating tasks.
- A pipeline with single and parallel blocks (split-join) communicating with queues.
- Feedback paths are also common.

[Example: a software FM radio with stages P0…P5, a split-join parallel section, and an external peripheral]
Frequency, Voltage Setting Middleware

- Migration + dynamic f, Vdd setting is critical for energy management.
- DVFS for multi-stage producer-consumer streaming exploits information on the occupancy of the synchronization queues.
- At equilibrium, the average output rate should match the input rate in each queue, so the occupancy level is monitored to adjust PE speed and Vdd. [Carta TECS07]

[Diagram: each pipeline stage N runs on its own processor and OS; a per-stage frequency controller reads the occupancy of the queue towards stage N+1 through the communication and synchronization layer and sets the frequency/Vdd requested across the user, OS/MW and HW layers]
Middleware Support

Producer side:

main() {
  produce_data();
  write_queue();
}

Consumer side:

main() {
  read_queue();
  consume_data();
}

Inside the communication library:

check_queue_level();
run_control_algorithm();
set_frequency_and_voltage();

- Almost OS-independent
  - If interrupts are used, an OS-specific ISR must be written
- Easy integration into the communication library support for GStreamer, OpenMAX
Feedback Controlled DVS

- Technique to perform run-time energy optimization of pipelined computation in MPSoCs.
- Queue occupancy provides an estimation of the level of performance required by the preceding block.
- Unlike traditional approaches, the idea is to look at the level of occupancy of the inter-processor queues to compute the speed of the processing elements (a controller sketch follows).

[Diagram: pipeline stages connected by queues; a speed control block adjusts each processing element based on the occupancy of its queues]
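A hedged sketch of such a controller; the thresholds, the number of operating points and the set_frequency() hook are illustrative assumptions, and the actual policy in [Carta TECS07] is more elaborate:

/* Assumed platform hook: applies one of the supported f/Vdd operating points. */
void set_frequency(int level);

#define N_LEVELS    4      /* illustrative number of operating points */
#define HIGH_WATER 75      /* queue occupancy (%) above which we speed up */
#define LOW_WATER  25      /* queue occupancy (%) below which we slow down */

static int cur_level = N_LEVELS - 1;

/* Called by the communication library on every queue insertion/extraction
   with the current occupancy of this element's input queue. */
void speed_control(int occupancy_percent) {
    if (occupancy_percent > HIGH_WATER && cur_level < N_LEVELS - 1)
        cur_level++;               /* consumer is falling behind: speed up */
    else if (occupancy_percent < LOW_WATER && cur_level > 0)
        cur_level--;               /* queue nearly empty: save energy */
    set_frequency(cur_level);
}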
Thermal optimization

- Energy/power optimization is NOT thermal optimization!
- Need for temperature awareness
- OS-MPSoC thermal studies
Research directions

- Focus on embedded multimedia streaming and interactive applications
- Efficient automatic code parallelization for embedded multiprocessors
- Efficient communication and synchronization infrastructure
- Static + dynamic task allocation for performance/energy/thermal balancing
- EU projects:
  - PREDATOR, REALITY
