### slides

```SCHEDULING STREAMING
COMPUTATIONS
Kunal Agrawal
THE STREAMING MODEL

Computation is represented by a
directed graph:
Nodes: Computation Modules.
 Edges: FIFO Channels between nodes.
 Infinite input stream.
 We only consider acyclic graphs (dags).


When modules fire, they consume
data from incoming channels and
produce data on outgoing channels.
2
CACHE-CONSCIOUS SCHEDULING
OF STREAMING APPLICATIONS
with Jeremy T. Fineman, Jordan
Krage, Charles E. Leiserson, and
Sivan Toledo
GOAL: Schedule the computation to
minimize the number of cache misses
on a sequential machine.
DISK ACCESS MODEL
cache
M/B
CPU
block
Slow
Memory
B
The cache has M/B blocks each of size B.
 Cost = Number of cache misses.
 If CPU accesses data in cache, the cost is 0.
 If CPU accesses data not in cache, then there is a
cache miss of cost 1. The block containing the
requested data is read into cache.
 If the cache is full, some block is evicted from
cache to make room for new blocks.

CONTRIBUTIONS



The problem of minimizing cache misses
is reduced to a problem of graph
partitioning.
THEOREM: If the optimal algorithm has X
cache misses given a cache of size M,
there exists a partitioned schedule that
incurs O(X) cache misses given a cache
of size O(M).
In other words, some partitioned
schedule is O(1) competitive given O(1)
memory augmentation.
OUTLINE

Cache Conscious Scheduling
Streaming Application Model
 The Sources of Cache Misses and Intuition Behind
Partitioning
 Proof Intuition
 Thoughts


Model and Source of Deadlocks
 Deadlock Avoidance Using Dummy Items.
 Thoughts

STREAMING APPLICATIONS
i:1
a
s:60
i:4
o:2
b
s:20
o:2
i:1
o:4
i:4 s:40
When a module v fires, it
c
i:1
d
s:35
o:1
o:1
must load s(v) state,
 consumes i(u,v) items from incoming edge(s) (u,v), and
 produces o(v,w) items on outgoing edge(s) (v,w).

ASSUMPTIONS:
All items are unit sized.
 The source consumes 1 item each time it fires.
 Input/output rates and state sizes are known.
 The state size of modules is at most M.

DEFINITION: GAIN
i:1
a
s:60
i:4
o:2
b
gain: 1/2
s:20 o:2
i:1
o:4
i:4 s:40
c
i:1
o:1
gain: 1
d
s:35
o:1
gain: 1
VERTEX GAIN: Number of vertex u firings per source firing.
gain(u)  
o(x, y) /i(x, y) , where p is a path from s to u.
(x,y )p
EDGE GAIN: The number of items produced along the edge
(u,v) per source firing.
gain(u,v)  gain(u).o(u,v)
A graph is well-formed iff all gains are well-defined.
OUTLINE

Cache Conscious Scheduling
Streaming Application Model
 The Sources of Cache Misses and Intuition
Behind Partitioning
 Proof Intuition
 Thoughts


Model and Source of Deadlocks
 Deadlock Avoidance Using Dummy Items.
 Thoughts

CACHE MISSES DUE TO STATE LOAD
1
s:60
2
1
s:20
4
8
s:40
1
1
s:35
1
STRATEGY: Push items through.
COST PER INPUT ITEM: The sum of the
state sizes   u s(u) .


IDEA: Reuse the state once loaded.

B:1, M:100
Cache
Slow
Memory
CACHE MISSES DUE TO DATA ITEMS
1
s1:60
2
1
s2:20
4
8
s3:40
1
1
s4:35
1
STRATEGY: Once loaded, execute module
many times by adding large buffers
between modules.
COST PER INPUT ITEM: Total number of
items produced on all channels per input
item    gain(u,v) .
(u,v)


B:1, M:100
Cache
Slow
Memory
PARTITIONING: REDUCE CACHE MISSES
1
s1:60
2
1
s2:20
4
8
s3:40
1
1
s4:35
1
STRATEGY: Partition into segments that fit
in cache and only add buffers on cross edges
C --- edges that go between partitions.

COST PER INPUT ITEM:  
(u,v)C

gain(u,v) .
B:1, M:100
Cache
Slow
Memory
WHICH PARTITION?
1
s1:60
2
1
s2:20
4
8
s3:40
1
1
s4:35
1
STRATEGY: Partition into segments that fit
in cache and only add buffers on cross edges
C --- edges that go between partitions.

COST PER INPUT ITEM:  
(u,v)C
LESSON: Cut small gain edges.

gain(u,v) .
B:1, M:100
Cache
Slow
Memory
OUTLINE

Cache Conscious Scheduling
Streaming Application Model
 The Sources of Cache Misses and Intuition Behind
Partitioning
 Proof Intuition
 Thoughts


Model and Source of Deadlocks
 Deadlock Avoidance Using Dummy Items.
 Thoughts

IS PARTITIONING GOOD?




Show that the optimal scheduler can not do much better
than the best partitioned scheduler.
THEOREM: On processing T items, if the optimal
algorithm given M-sized cache has X cache misses, then
some partitioning algorithm given O(M) cache has at
most O(X) cache misses.
The number of cache misses due to a partitioned
scheduler is  (u,v)C gain(u,v) . The best partitioned
scheduler should minimize (u,v )C gain(u,v).


We must prove the matching lower bound on the

optimal scheduler’s
cache misses.
OPTIMAL SCHEDULER WITH CACHE M
S: segment with state size at least 2M.
e = gm(S): the edge with the minimum gain within S.
u
e
v
S
u fires X times.
CASE 1: At least 1 item produced by u is processed by v.
 Cost  (M).
CASE 2: All items are buffered within S.
 The cheapest place to buffer is at e.
  Cost (X.gain(e) /gain(u)).
 If X  2M.gain(u) /gain(e), Cost  (M).
In both cases, Cost/firing of u  (gain(e) /gain(u)).
LOWER BOUND
Divide the pipeline into segments of size between 2M and 3M.
e1
ui
vi
ei
ek
Si
Source node fires T times.
 Consider the optimal scheduler with M cache.
 Number of firings of ui  T.gain(ui ).
 Cost due to Si per firing of ui (gain(e) /gain(u)).
 Total cost due to Si (T.gain(e)).
 Total Cost over
 all segments   T  gain(ei ) .






MATCHING UPPER BOUND
Divide the pipeline into segments of size between 2M and 3M.
e1
ui
ei
vi
ek
Si
Source node fires T times.
 Cost of optimal scheduler with M cache   T


  gain(e ).
Consider the partitioned schedule that cuts all ei.
Each segment has size at most 6M.
is  O T
 The total cost of that schedule


i
  gain(e ).
i
Therefore, if this partitioned schedule has constant
factor memory augmentation,
it provides constant
competitiveness in the number of cache misses.
GENERALIZATION TO DAG

Say we partition a DAG such that


Each component has size at most O(M).
When contracted, the components form a
dag.
GENERALIZATION TO DAG

Say we partition a DAG such that





Each component has size at most O(M).
When contracted, the components form a
dag.
If C is the set of cross edges, (u,v )C gain(u,v)
is minimized over all such partitions.

The optimal schedule
 has cost/item

gain(u,v) .


(u,v)C
Given constant factor memory
augmentation, a partitioned schedule
has cost/item O 
gain(u,v) .

(u,v)C

WHEN B ≠ 1

LOWER BOUND: The optimal algorithm has cost

 1/B

(u,v )C

gain(u,v)
UPPER BOUND: With constant factor memory
augmentation:
 Pipelines: Upper bound matches the lower
bound.
 DAGs: Upper bound matches the lower bound
as long as each component of the partition has
O(M/B) incident cross edges.
FINDING A GOOD PARTITION



For pipelines, we can find a good-enough
partition greedily and the best partition using
dynamic programming.
For general DAGs, finding the best partition is
NP-complete.
Our proof is approximation-preserving. An
approximation algorithm for the partitioning
problem, will work for our problem.
CONCLUSIONS AND FUTURE WORK


We can reduce the problem of minimizing cache
misses to the problem of calculating the best
partition.
Solving the partitioning problem:
Approximation algorithms.
 Exact solution for special cases such as SP-DAGs.



Space bounds: Bound the buffer sizes on cross
edges.
Cache-conscious scheduling for multicores.
STREAMING COMPUTATIONS
WITH FILTERING
with Peng Li, Jeremy Buhler, and
Roger D. Chamberlain
GOAL: Devise mechanisms to avoid
deadlocks on applications with
filtering and finite buffers.
OUTLINE

Cache Conscious Scheduling
Streaming Application Model
 The Sources of Cache Misses and Intuition Behind
Partitioning
 Proof Intuition
 Thoughts


Model and Source of Deadlocks
 Deadlock Avoidance Using Dummy Items.
 Thoughts

FILTERING APPLICATIONS MODEL
Data dependent filtering: The number of items
produced depends on the data.
 When a node fires, it




has a compute index (CI), which monotonically increases,
consumes/produces 0 or 1 items from input/output channels,
input/output items must have index = CI.
A node can not proceed until it is sure that it has
received all items of its current CI.
 Channels can have unbounded delays.

3 1
3 2 1
2 1
A
B
C
U
3
2
1
X
Y
2
1
2
1 A item with index 1
Compute index
Filtering can cause deadlocks due to finite buffers.
3
2
4
5
v
3
2
6
u
1
1
x
4
3
w
A deadlock example (channel buffer size is 3).
CONTRIBUTIONS

Deadlock avoidance mechanism using dummy or
heartbeat messages sent at regular intervals
Provably correct --- guarantees deadlock freedom.
 No global synchronization.
 No dynamic buffer resizing.


Efficient algorithms to compute dummy intervals
for structured DAGs such as series parallel DAGs
and CS4 DAGs
OUTLINE

Cache Conscious Scheduling
Streaming Application Model
 The Sources of Cache Misses and Intuition Behind
Partitioning
 Proof Intuition
 Thoughts


Model and Source of Deadlocks
 Deadlock Avoidance Using Dummy Items.
 Thoughts

THE NAÏVE ALGORITHM

Filtering Theorem


If no node ever filters any token, then the system cannot deadlock
The Naïve Algorithm
Sends a dummy on every filtered item.
 Changes a filtering system to a non-filtering system.

2 1
1 A token with index 1
A
u
X
2 1
1 A dummy with index 1
COMMENTS ON THE NAÏVE ALGORITHM

Pros


Easy to schedule dummy items
Cons
Doesn’t utilize channel buffer sizes.
 Sends many unnecessary dummy items, wasting both
computation and bandwidth.


Next step, reduce the number of dummy items.
THE PROPAGATION ALGORITHM
Computes a static dummy schedule.
 Sends dummies periodically based on dummy intervals.
 Dummy items must be propagated to all downstream
nodes.

4
3
2
5
u
1
v
3
6
2
6
Dummy
interval
Channel buffer size
4
36
Comp. Index: 6
Index of last dummy: 0
6 – 0 >= 6, send a dummy
w
5
4
1
x
6
COMMENTS ON THE PROPAGATION
ALGORITHM

Pros
Takes advantage of channel buffer sizes.
 Greatly reduces the number of dummy items
compared to the Naïve Algorithm.


Cons
Does not utilize filtering history.
 Dummy items must be propagated.


Next step, eliminate propagation
Use shorter dummy intervals.
 Use filtering history for dummy scheduling.

THE NON-PROPAGATION ALGORITHM
Send dummy items based on filtering history
 Dummy items do not propagate.
 If (index of filtered item – index of previous
token/dummy) >= dummy interval, send a dummy

4
3
2
5
u
1
v
3
6
Dummy
interval
Channel buffer size
3
4
w
2
5
4
1
x
Data filtered
3 Current Index: 3
Index of last
token/dummy: 0
3 – 0 >= 3, send a dummy
COMPARISON OF THE ALGORITHMS
Performance measurement
# of dummies sent
 Fewer dummies are better


Non-Propagation Algorithm
is expected to be the best in
most cases
Dummies (M)

500000
50000
5000
Naïve
Prop.

Experimental data


Mercury BLASTN (biological
app.)
787 billion input elements
NonProp.
500
50
5
0.5
0.05
32
256
2048
HOW DO WE COMPUTE THESE INTERVALS


Exponential time algorithms for general DAGs,
since we have to enumerate cycles.
Can we do better for structured DAGs?
Yes.
 Polynomial time algorithms for SP DAGs
 Polynomial time algorithms for CS4 DAGs --- a class
of DAGs where every undirected cycle has a single
source and a single sink.

CONCLUSIONS AND FUTURE WORK




Designed efficient deadlock-avoidance algorithms
using dummy messages.
Find polynomial algorithms to compute dummy
interval for general DAGs.
Consider general models: allowing multiple
outputs from one input and feedback loops.
The reverse problem: computing efficient buffer
sizes from dummy intervals.
```