Report

Routing Algorithms ECE 284 On-Chip Interconnection Networks Spring 2014 1 Routing • Will assume 2D mesh in this talk • How flits are routed from source to destination can greatly impact network congestion • Two types of routing: • Oblivious routing: routing does not consider or depend on the current traffic condition • Adaptive routing: takes into consideration current traffic condition to determine the routing path (tries to get around congested areas) • Oblivious routing simpler (less expensive) to implement • This talk will review existing oblivious routing algorithms 2 Routing Algorithm Objectives • Maximize throughput – How much load the network can handle • Minimize latency – Minimize routing delay between source and destination 3 Dimension-Ordered Routing (DOR) (also called XY routing) D S either minimal XY or YX routing to the destination (here it uses XY route with probability 1.0) 4 DOR (XY) Routing with Uniform Traffic • For an N = K x K mesh, N/2 nodes are in the top half. • 1/2 of its traffic will cross the bisection. • Traffic crossing bisection uniformly distributed across K channels. • Therefore, maximum channel load for DOR with uniform traffic is: ϒ(DOR, uniform) = [ (N/2) * (1/2) ] / K = K/4 5 Problem with DOR (XY) Routing • Minimal hop count • But, in the worst-case, the links can get overly congested. e.g., transpose traffic pattern. (0, 0) -> (3, 3) (1, 0) -> (3, 2) (2, 0) -> (3, 1) ϒwc(DOR) = K – 1 >> ϒ(DOR, uniform) in the worst-case. 6 Valiant Load-Balancing (VAL) [1981] D S randomly chosen intermediate node minimal XY routing to any intermediate node, then minimal XY routing to destination node 7 Valiant Load-Balancing (VAL) • Works by turning any traffic pattern into 2 phases of uniform traffic patterns, even adversarial or worst-case traffic patterns. • In effect, it evenly load-balances the traffic. • Worst-case channel load ϒwc(VAL) = 2 * ϒ(DOR, uniform) = 2 * (K/4) = K/2 which is 1/2 network capacity relative to DOR and uniform traffic. 8 Valiant Load-Balancing (VAL) • Effective network capacity normalized throughput is 1/2 capacity. • However, average hop count is 2X DOR. • 1/2 capacity was thought to be the optimal worst-case throughput for any routing algorithm. 9 ROMM [1995] D S intermediate node randomly chosen only in the minimal direction to destination minimal XY routing to an intermediate node only in the minimal direction, then minimal XY routing to the final destination node 10 ROMM • Tries to load-balance traffic by randomly distributing traffic along all possible minimal paths. • Good that minimal number of hops is guaranteed. • But, turns out in the worst-case, ROMM performs about as bad as DOR. 11 O1TURN [2005] D S use both minimal XY and YX routing to the destination (0.5 XY + 0.5 YX) 12 O1TURN • Even though it only considers XY or YX path, not all possible paths in VAL or all possible minimal paths in ROMM, it is guaranteed to achieve 1/2 capacity for the even radix case, which has been shown to be optimal. • For the odd radix case, O1TURN is very near the optimal 1/2 capacity. • Unlike VAL, O1TURN only uses minimal routing paths, thus no penalty in hop count. 13 Comparison Even Radix : Opt * 1 Odd Radix : Opt * (1 - 1 / K2) 0.5 0.4 0.3 0.2 0.1 VAL DOR ROMM O1TURN [adapted from Seo et al., O1TURN talk, ISCA 2005] 14 Simulation Results Average Latency (cycle) • 4 x 4 2D MESH – Uniform Random Traffic Pattern 200 DOR ROMM 150 O1TURN DUATO 100 50 0 0 0.2 0.4 0.6 0.8 1 Throughput (flits / node / cycle) [adapted from Seo et al., O1TURN talk, ISCA 2005] 15 Simulation Results • 4 x 4 2D MESH – Matrix Transpose Traffic Pattern Average Latency (cycle) – One of the worst-case traffic pattern for DOR 200 DOR ROMM 150 O1TURN DUATO 100 50 0 0 0.2 0.4 0.6 0.8 1 Throughput (flits / node / cycle) [adapted from Seo et al., O1TURN talk, ISCA 2005] 16 Simulation Results • 4 x 4 2D MESH – Bit Complement Traffic Pattern Average Latency (cycle) – Already balanced traffic pattern 200 DOR ROMM 150 O1TURN DUATO 100 50 0 0 0.2 0.4 0.6 0.8 1 Throughput (flits / node / cycle) [adapted from Seo et al., O1TURN talk, ISCA 2005] 17 Simulation Results • 4 x 4 2D MESH – HOT SPOT Traffic Pattern – 2 nodes have 20% of traffic Average Latency (cycle) 200 DOR ROMM 150 O1TURN DUATO 100 50 0 0 0.2 0.4 0.6 0.8 1 Throughput (flits / node / cycle) [adapted from Seo et al., O1TURN talk, ISCA 2005] 18 Simulation Results • Delay penalty of adaptive routing – How the complexity of router implementation affects on latency – Hot Spot Traffic Pattern Average Latency (FO4) 2000 DOR ROMM 1500 O1TURN DUATO 1000 500 0 0 0.2 0.4 0.6 0.8 1 Throughput (flits / node / cycle) [adapted from Seo et al., O1TURN talk, ISCA 2005] 19 U2TURN [2012] • 1/2 capacity has been thought to be optimal worst-case throughput for both odd and even radices, and O1TURN is the state-of-the-art for achieving this for the even radix case. • But, turns out 1/2 capacity is not optimal for the odd radix case. 20 U2TURN • U2TURN considers all possible XYX and YXY 2-TURN paths, and selects these paths with equal probability. • XYX paths: randomly select a node on the same row and route to it, followed by minimal YX routing to final destination. • YXY paths: randomly select a node on the same column and route to it, followed by minimal XY routing to final destination. 21 Analytical Results • For the even radix case, worst-case capacity of U2TURN = 1/2, same as VAL and O1TURN, which is optimal. • But, for the odd radix case, worst-case capacity of U2TURN = (K+1)/(2K+1) > 1/2 which is better than any existing routing algorithm. 22 Worst-Case Throughput ROMM VAL DOR U2TURN O1TURN Optimal routing 23 Throughput Comparison for Odd Radix 3X3 mesh VAL DOR O1TURN U2TURN Worst-case 0.5 0.33 0.44 Average-case 0.5 Transpose 5X5 VAL DOR O1TURN U2TURN 0.57 0.5 0.3 0.48 0.55 0.405 0.477 0.604 0.5 0.44 0.53 0.632 0.5 0.33 0.67 0.8 0.5 0.3 0.6 0.75 Random 0.5 1 1 0.72 0.5 1 1 0.685 DOR-WC 0.5 0.33 0.67 0.8 0.5 0.3 0.6 0.75 Complement 0.5 0.67 0.67 0.57 0.5 0.6 0.6 0.55 Nearest-Neighbor 0.5 1.33 1.33 0.75 0.5 2.4 2.4 1.17 24 Throughput Comparison for Even Radix 4X4 mesh VAL DOR O1TURN U2TURN Worst-case 0.5 0.33 0.5 Average-case 0.5 0.48 Transpose 0.5 Random 6X6 VAL DOR O1TURN U2TURN 0.5 0.5 0.3 0.5 0.5 0.54 0.64 0.5 0.47 0.556 0.65 0.33 0.67 0.8 0.5 0.3 0.6 0.75 0.5 1 1 0.7 0.5 1 1 0.682 DOR-WC 0.5 0.33 0.67 0.8 0.5 0.3 0.6 0.75 Complement 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 Nearest-Neighbor 0.5 2 2 1.1 0.5 3 3 1.27 25 Latency • A potential concern about U2TURN is that it uses non-minimal routing paths. • However, U2TURN does a better job of loadbalancing traffic than O1TURN for difficult traffic patterns. • Hence, the queuing delay can be very high for O1TURN for difficult traffic patterns, hence longer latency despite fewer number of hops. • Surprisingly, latency better for both odd and even radix cases for difficult traffic patterns. 26 Latency for 7 x 7 Mesh 27 Latency for 8 x 8 Mesh 28 References • • • • Valiant [L.G.Valiant et. al, ACM 1981] ROMM [T.Nesson et. al, ACM 1995] O1TURN [D. Seo et. al, ISCA 2005] U2TURN [G. Sun et. Al, ICCD 2012] 29