
Lazy Man's Logic Synthesis

Wenlong Yang, Lingli Wang
State Key Lab of ASIC and System, Fudan University, Shanghai, China

Alan Mishchenko
Department of EECS, University of California, Berkeley

Outline
• Introduction
• Previous Work
• Lazy Man's Logic Synthesis (LMS)
• Experimental Results
• Conclusion & Future Work

Introduction

Goal of logic synthesis: deriving a circuit or improving an available circuit.

We propose a "lazy" approach that reuses optimal structures derived by other synthesis tools, based on a precomputed library.

[Figure: a function with N variables is converted into an AIG either by other tools or by LMS using a precomputed library]

Previous Work

Logic synthesis based on a precomputed library has been proposed in several papers, but they all differ from LMS.

Previous work
• Precomputes structures in terms of LUTs [Kennings, IWLS, 2010]
• Does not use preexisting benchmarks or tools [Bjesse, ICCAD, 2004]
• Looks at only 4-5 input functions [Li, IWLS, 2011]
• Only computes multiple structure choices [Chatterjee, TCAD, 2006]

LMS
• Precomputes structures in terms of AIGs
• Uses public benchmarks and existing tools
• Looks at 6-16 input functions
• Stores many equivalent structures

For each node:
• Compute several k-input cuts
• Perform delay-optimal tree balancing of the SOP
• Select the best one to replace the current structure

Example: an AIG subgraph found in benchmark s27.blif where SOP balancing loses to the proposed approach.
F  = !c*!b + !c*a
F' = !c*!(b*!a)

Lazy Man's Logic Synthesis (LMS)
• Equivalence Classes
• Library Representation/Construction
• Implementation

LMS is based on collecting, storing, and reusing circuit structures of Boolean functions with 6-16 input variables.

The total number of completely specified Boolean functions of N variables is 2^(2^N). Experiments show that even the number of functions that occur in practice can be very large.
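To give a sense of scale, the count 2^(2^N) can be evaluated directly. The snippet below is only an illustrative sketch; the helper names are hypothetical and are not part of LMS or ABC.

```python
def num_boolean_functions(n: int) -> int:
    """Number of completely specified Boolean functions of n variables."""
    return 2 ** (2 ** n)


def truth_table_bytes(n: int) -> int:
    """Bytes needed to store one n-input truth table at one bit per minterm."""
    return max(1, (2 ** n) // 8)


# Print the function count for a few small input sizes.
for n in (2, 4, 6):
    print(n, num_boolean_functions(n))

# Size in bytes of a single 16-input truth table.
print(truth_table_bytes(16))
```

Already at N = 6 the count is 2^64, and a single 16-input truth table takes 8192 bytes, so storing every function individually is not an option.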
To reduce the number of functions and the memory needed to store them in a library, a canonical form is used to group them into equivalence classes.

Two functions are NPN-equivalent if one of them can be obtained from the other by negation and/or permutation of the inputs and outputs.

Drawbacks of NPN computation:
• Time-consuming
• Complicated

A complete NPN canonical form is not affordable for LMS.

The idea is to order the input variables and choose the polarities of inputs/outputs using the number of positive minterms and cofactors w.r.t. each variable.

Input: truth table F
1. Determine the polarity of F by the number of 1s in the truth table
2. Determine the polarity of each variable by the number of 1s in the negative cofactor w.r.t. that variable
3. Sort the input variables by the number of 1s in their negative cofactors and permute the inputs accordingly
Output: canonicized truth table F

This is a reasonable trade-off between accuracy and speed.

An N-input library contains functions of up to N variables.
• Structures of all functions are represented as a shared AIG.
• Each output of the AIG is the root node of one logic structure.

When a library is loaded, the following actions are performed:
• A hash table is created to hash the outputs by their semi-canonical form.
• For each structure, the area and pin-to-output delays are computed and stored.

Example of using pin-to-output delays to compute structure delay:

Input                  a   b   c   d   e   f   g
Arrival time           3   3   3   5   5   4   1
Pin-to-output delay    3   2   4   5   2   3   1
Sum                    6   5   7  10   7   7   2

The structure delay is the largest of these sums, 10 in this example.

If the pin-to-output delays of one structure are worse than those of another with respect to every input, the structure is dominated.

The LUT mapper "if" in ABC is used as a structural cut browser to generate K-input cuts whose logic structures are added to the library.

Input: cut C
1. If cut C does not meet the requirements, return
2. Compute the Boolean function F of cut C as a truth table
3. Compute the semi-canonical form of F
4. Rebuild the structure of the cut in the library
5. If the structure already exists or is dominated, return
6. Add a new primary output to store the structure in the hash table

Input: And-Inverter Graph
For each node, in topological order:
    Compute several K-input cuts
    For each cut:
        ▪ Compute the truth table
        ▪ Look up the function in the library
        ▪ If there is no structure for this function,
          mark the cut to ensure it is not selected as the best cut
        ▪ Else if the best structure found leads to a smaller AIG level,
          save the cut as the best cut
    If there is an improvement in level, update the AIG

The LMS algorithm is implemented in ABC. The LUT mapper "if" in ABC is used as:
(a) A cut browser for computing the libraries
(b) A mapper in the case study on AIG level minimization

Commands related to library construction:
rec_start:  Starts the LMS recorder
rec_add:    Adds structures from benchmarks
rec_filter: Removes structures that occur infrequently
rec_merge:  Merges two previously computed libraries
rec_ps:     Prints statistics for the currently loaded library
rec_use:    Transforms the internal library to the current network in ABC
rec_stop:   Deletes the current library

Command used to perform LMS mapping:
if -y -K <num> -C <num>
• -y enables level optimization by LMS
• -K <num> is the cut size
• -C <num> is the number of cuts used at each node

Experimental Results
• Library Coverage
• 6-input Library
• Optimize Delay After LUT Mapping

This experiment was performed to show that LMS has practical memory requirements for functions of up to 12 inputs.

Semi-canonical classes of all functions appearing in the cuts of the benchmark circuits, without synthesis, were collected, and the frequency of their appearance was recorded.
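The collection step can be sketched in a few lines: each cut function is reduced to the semi-canonical form from the three-step procedure given earlier, and occurrences are tallied per class. This is a minimal sketch and not the ABC implementation. Truth tables are plain Python integers (bit m is the value on minterm m), all function names are hypothetical, and the polarity convention (prefer fewer 1s in the output and in each negative cofactor) is an assumption, since the slides do not state which direction is chosen.

```python
from collections import Counter


def ones(tt: int) -> int:
    """Number of 1s (minterms) in a truth table stored as an integer."""
    return bin(tt).count("1")


def neg_cofactor_ones(tt: int, n: int, v: int) -> int:
    """Number of minterms of tt in which variable v is 0."""
    return sum(1 for m in range(1 << n) if (tt >> m) & 1 and not (m >> v) & 1)


def flip_input(tt: int, n: int, v: int) -> int:
    """Truth table of the function with input v complemented."""
    out = 0
    for m in range(1 << n):
        if (tt >> m) & 1:
            out |= 1 << (m ^ (1 << v))
    return out


def permute_inputs(tt: int, n: int, order: list) -> int:
    """Reorder inputs so that new variable i is old variable order[i]."""
    out = 0
    for m in range(1 << n):
        if (tt >> m) & 1:
            new_m = 0
            for i, old in enumerate(order):
                if (m >> old) & 1:
                    new_m |= 1 << i
            out |= 1 << new_m
    return out


def semi_canonical(tt: int, n: int) -> int:
    """Semi-canonical form; ties are left unresolved, hence 'semi'."""
    # Step 1 (assumed convention): complement the output if more than half are 1s.
    if ones(tt) > (1 << (n - 1)):
        tt ^= (1 << (1 << n)) - 1
    # Step 2 (assumed convention): flip an input if its negative cofactor
    # has more 1s than its positive cofactor.
    for v in range(n):
        neg = neg_cofactor_ones(tt, n, v)
        if neg > ones(tt) - neg:
            tt = flip_input(tt, n, v)
    # Step 3: sort inputs by the number of 1s in their negative cofactors.
    order = sorted(range(n), key=lambda v: neg_cofactor_ones(tt, n, v))
    return permute_inputs(tt, n, order)


def tally_classes(functions, n: int) -> Counter:
    """Count how often each semi-canonical class occurs."""
    return Counter(semi_canonical(tt, n) for tt in functions)


def classes_to_cover(counts: Counter, fraction: float) -> int:
    """Smallest number of most frequent classes covering the given fraction."""
    total = sum(counts.values())
    covered = 0
    for k, (_, c) in enumerate(counts.most_common(), start=1):
        covered += c
        if covered >= fraction * total:
            return k
    return len(counts)


# x0 & (x1 | x2), an input-flipped, permuted, complemented variant, and x0 & x1 & x2.
f = 0xA8
g = permute_inputs(flip_input(f, 3, 1), 3, [2, 0, 1]) ^ 0xFF
counts = tally_classes([f, g, 0x80], 3)
print(len(counts), classes_to_cover(counts, 0.5))
```

In this toy run the two variants of x0 & (x1 | x2) fall into one class. The loops enumerate all 2^n minterms for readability; a practical implementation would use word-level bit operations on the truth table instead.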
[Plot: number of functions (0 to 2,000,000) versus occurrence frequency (25% to 100%)]
• ~2 M classes in total
• ~740 K classes account for 90% of function occurrences
• ~400 MB for truth tables

The goal of this experiment is to derive a 6-input library, which is used in the following case study of AIG level minimization.

The following ABC scripts are used to collect structures:
• read file; st; rec_add;
• dc2; rec_add;
• if -K 8; bidec; st; rec_add;
• if -K 8; mfs; st; rec_add;
• if -K 8; bidec; st; rec_add;
• if -g -K 6; st; rec_add;
• if -g -K 6; st; rec_add;

Statistics of the precomputed 6-input library:

Inputs    Classes #    Structures #    Ratio
2                 3               3     1.00
3                32              88     2.75
4             2,430          12,673     5.22
5            98,208         471,973     4.81
6         1,148,556       5,202,924     4.53
Total     1,249,229       5,687,661     4.55

• ~77 MB AIGER file

Two sets of benchmarks are used in this paper: 20 MCNC benchmarks and 10 large Altera benchmarks.

LUT mapping was performed by the following scripts:
Map:   st; resyn2; if -K 4 or 6
MapC:  st; resyn2; dch -f; if -K 4 or 6
SOPBC: st; if -gm -K 6; st; resyn2; dch -f; if -K 4 or 6
LMSC:  st; if -ym -K 6; st; resyn2; dch -f; if -K 4 or 6

Benchmarks were run on a workstation with an Intel Xeon quad-core CPU and 256 GB of RAM (~4 GB used for the experiment).

The resulting networks were verified with the command cec in ABC.
Design                                        4-LUT levels             4-LUT count
                                              Map  MapC SOPBC LMSC     Map    MapC   SOPBC  LMSC
carpat.blif                                    68   68    53    40    38856  39842  42092  42371
fp_operators.blif                             119  116    88    76    17902  17401  18538  18800
oc_video_compression_systems_dct_opt.blif      19   19    19    14     8995   9114  12221  11158
oc_video_compression_systems_jpeg_opt.blif     20   19    17    13    10967  10940  14590  14321
radar20_opt.blif                               39   38    23    16    16834  17216  17717  20663
screen_saver_cyclone.blif                      18   18    16    17    35627  35183  35614  35900
sudoku_check.blif                              11   11    10    10    20998  20774  21094  21286
top_rs_decode.blif                             43   43    31    24    31381  30729  30798  30926
umass_weather.blif                             38   38    25    17    15821  15734  18250  18292
uoft_raytracer.blif                            70   69    58    30    33294  33852  37118  40147
Ratio                                        1.00 0.99  0.80  0.63     1.00   1.00   1.11   1.13

LMSC reduced delay by 37% with an area increase of 13%.

Design                                        6-LUT levels             6-LUT count
                                              Map  MapC SOPBC LMSC     Map    MapC   SOPBC  LMSC
carpat.blif                                    35   35    35    27    29826  31098  32243  33321
fp_operators.blif                              67   66    57    50    10541  11118  12005  11982
oc_video_compression_systems_dct_opt.blif      10   10    12     9     7349   7566   8816   8606
oc_video_compression_systems_jpeg_opt.blif     10   10    12     9     7796   7822   8365   9537
radar20_opt.blif                               20   20    13    10    12351  12705  12871  14964
screen_saver_cyclone.blif                      13   12    12    12    27129  27113  27503  27373
sudoku_check.blif                               7    7     7     7    14542  14355  14707  15501
top_rs_decode.blif                             24   24    20    16    21271  21324  21668  21615
umass_weather.blif                             24   24    16    10    12196  11990  13287  14123
uoft_raytracer.blif                            36   35    31    19    26128  26666  29802  31356
Ratio                                        1.00 0.99  0.92  0.74     1.00   1.02   1.08   1.13

LMSC reduced delay by 26% with an area increase of 13%.

Design      4-LUT level              4-LUT count
            Map  MapC SOPBC LMSC     Map   MapC  SOPBC LMSC
alu4          7    7     7     7     694    701    702   714
apex2         8    8     8     8     871    867    874   890
b14          21   20    17    17    1761   1771   1913  1849
b15          22   22    21    21    3147   3103   3186  3233
b17          31   31    27    26    9676   9507   9527  9570
b20          23   22    19    19    3692   3587   3886  3829
b21          23   22    20    19    3768   3612   3847  3908
b22          23   23    19    19    5423   5280   5693  5729
clma         13   13    12    12    4016   4008   4189  4150
des           6    6     6     6    1228   1257   1249  1273
elliptic      8    8     8     8     431    432    442   443
ex5p          6    6     6     6     471    462    472   481
frisc        20   20    19    16    2279   2261   2332  2279
i10          14   14    13    12     746    741    743   741
pdc           9    8     8     8    1926   2047   1925  2075
s38584        9    9     8     8    4021   3978   3985  3980
s5378         6    6     5     5     459    451    470   468
seq           6    6     6     6     946    935    948   941
spla          9    9     9     8    1899   1803   1860  1928
tseng        13   13    12    10     756    800    743   809
Ratio      1.00 0.99  0.92  0.90    1.00   1.00   1.02  1.03

Design      6-LUT level              6-LUT count
            Map  MapC SOPBC LMSC     Map   MapC  SOPBC LMSC
alu4          5    5     5     5     503    525    520   532
apex2         6    6     6     6     691    683    728   711
b14          13   13    10    11    1275   1263   1517  1442
b15          15   15    14    13    2119   2211   2255  2419
b17          21   21    16    16    6510   6356   6667  6670
b20          15   15    12    12    2679   2619   3070  3044
b21          15   15    11    12    2701   2577   3114  3115
b22          15   15    12    11    3985   3847   4638  4677
clma          9    9     8     8    2975   2894   3145  3246
des           5    5     5     4     824    862    866   953
elliptic      6    6     6     6     317    317    327   333
ex5p          5    4     5     4     351    382    378   408
frisc        13   12    11     9    1807   1811   1883  1948
i10           9    9     9     9     598    608    575   583
pdc           7    7     6     7    1428   1350   1619  1416
s38584        6    6     6     6    2720   2802   2816  2831
s5378         4    4     4     4     356    355    369   358
seq           5    5     5     5     685    668    707   696
spla          7    7     6     6    1414   1361   1445  1455
tseng         8    8     6     6     648    694    689   731
Ratio      1.00 0.99  0.90  0.88    1.00   1.00   1.07  1.08

4-LUTs: LMSC reduced delay by 10% with an area increase of 3%.
6-LUTs: LMSC reduced delay by 12% with an area increase of 8%.

Conclusion

A new method to harvest and reuse circuit structures produced by different tools on benchmark circuits.

The "lazy" approach is made practical by:
• A semi-canonical form to reduce the number of equivalence classes
• Using AIGs to store precomputed libraries in memory and on disk
• Using truth tables to manipulate Boolean functions

As a case study, the proposed approach was applied to improve delay after FPGA mapping. For industrial benchmarks, compared to SOP balancing:
• The delay was reduced by 17% (18%) for LUT4 (LUT6)
• The area penalty was 2% (5%)

Future Work

Improving the implementation:
• Reducing memory by using a low-memory AIG
• Building libraries in terms of multi-input gates
• Filtering libraries based on their performance
• Giving the user control over the area increase

Continuing experiments:
• Performing case studies with larger functions
• Evaluating delay improvements after P&R

Authors' e-mail:
Wenlong Yang [email protected]
Lingli Wang [email protected]
Alan Mishchenko [email protected]

Deriving a circuit for a Boolean function or improving an available circuit are typical tasks solved by logic synthesis. Numerous algorithms in this area have been proposed and implemented over the last 50 years. This paper presents a "lazy" approach to logic synthesis based on the following observations: (a) optimal or near-optimal circuits for many practical functions are already derived by the tools, making it unnecessary to implement new algorithms or even run the old ones repeatedly; (b) larger circuits are composed of smaller ones, which are often isomorphic up to a permutation/negation of inputs/outputs. Experiments confirm these observations. Moreover, a case study shows that logic level minimization using lazy man's synthesis improves delay after LUT mapping into 4- and 6-input LUTs, compared to earlier work on high-effort delay optimization.