### An SMT Method for Optimizing Arithmetic Computations, March 14

```
An SMT-Based Method for
Optimizing Arithmetic Computations
in Embedded Software Code
Presented by: Kuldeep S. Meel
Adapted from slides by Hassan Eldib and Chao Wang
(Virginia Tech)
A Robotic Dream
• Having a tool that automatically synthesizes
the optimum version of a software program.
Embedded Software
Some bugs matter more than others
Some even more ……
Cost: US $350 million
The Age of Sci-Fi CS
• Verification
– Verify whether the code has an overflow/underflow
bug
• Synthesis
– Given a reference implementation, synthesize an
implementation that does not have any of these
bugs
Objective
• Synthesizing an optimal version of the C code
with fixed-point linear arithmetic computation
for embedded devices.
– Minimizing the bit-width.
– Maximizing the dynamic
range.
Motivating Example
• Compute the average of A and B on a microcontroller
with signed 8-bit fixed-point arithmetic
• Given: A, B ∈ [-20, 80].
• (A + B)/2 may have overflow errors
• A/2 + B/2 may have truncation errors
• A + (B - A)/2 has neither overflow nor truncation errors
Bit-width versus Range
• Larger range requires a larger bit-width.
• Decreasing the bit-width reduces the range.
Fixed-point Representation
Representations for 8-bit fixed-point numbers
• Range: -128 ↔ 127
• Resolution = 1
• Range : -16 ↔ 15.875
• Resolution = 1/8
Range ∝ Bit-width
Resolution ∝ Bit-width
Problem Statement
Program:
Optimized program:
Range & resolution of the input variables:
A: range [-1000, 3000], res. 1/4
B: range [-1000, 3000], res. 1/4
…
Problem Statement
• Given
– The C code with fixed-point linear arithmetic computation
• Fixed-point type: (s,N,m)
– The range and resolution of all input variables
• Synthesize the optimized C code with
– Reduced bit-width with same input range, or
– Larger input range with the same bit-width
Restrictions
• Bounded loops (via unrolling)
• Same bit-width for all the variables and
constants
– Can have different bit representations
SMT-based Inductive Program Synthesis
Step 1: Finding a Candidate Program
• Create the most general AST that can represent any
arithmetic equation, with reduced bit-width.
• Use an SMT solver to find a solution such that
– For some test inputs (samples),
– the output of the AST is the same as that of the desired computation
SMT-based Solution
Fig. General Equation AST.
• SMT encoding for the general equation AST structure
– Each Op node can be any operation from *, +, -, >> or <<.
– Each L node can be an input variable or a constant value.
• The SMT solver finds a solution by equating the AST output to that
of the desired program
SMT Encoding
• Ψ = Φ_prog ⋀ Φ_ast ⋀ Φ_sameI ⋀ Φ_sameO ⋀ Φ_tests ⋀ Φ_block
– Φ_prog: Desired input program to be optimized.
– Φ_ast: General AST with reduced bit-width.
– Φ_sameI: Same input values.
– Φ_sameO: Same output value.
– Φ_tests: Test cases (inputs).
– Φ_block: Blocked solutions.
SMT-based Solution (an example)
[Figure: AST of an example solution found by the SMT solver]
Step 2: Verifying the Solution
• Is the program good for all possible inputs?
– Yes, we found an optimized program
– No, block this (bad) solution, and try again
SMT Encoding
• Φ = Φ_prog ⋀ Φ_cand ⋀ Φ_sameI ⋀ Φ_diffO ⋀ Φ_range ⋀ Φ_res
– Φ_prog: Desired input program to be optimized.
– Φ_cand: Found candidate solution.
– Φ_sameI: Same input values.
– Φ_diffO: Different output value.
– Φ_range: Ranges of the input variables.
– Φ_res: Resolution of the input variables.
The Next Solution
[Figure: AST of the next candidate solution]
Scalability Problem
• Advantage of the SMT-based approach
– Finds an optimal solution within an AST depth bound
• Disadvantage
– Cannot scale up to larger programs
• Sketch tool by Solar-Lezama & Bodik (5 nodes)
• Our own tool based on Yices (9 nodes)
Our Incremental Approach
Incremental Optimization
• Combine static analysis and SMT-based
inductive synthesis.
• Apply the SMT solver only to small code regions
– Identify an instruction that causes overflow/underflow.
– Extract a small code region for optimization.
– Compute redundant LSBs (allowable truncation error).
– Optimize the code region.
– Iterate until no further optimization is possible.
Region for Optimization
• BFS order (Tree has return as root)
– Parent node doesn’t have overflow or underflow
– Parent node restores the value created in the
overflowing node back to the normal range in the original program
– A larger extracted region allows for more
opportunities
– AST with at most 5 levels, 63 AST nodes
Region for optimization
Detecting Overflow Errors
The parent nodes
Some sibling nodes
Some child nodes
• The addition of a and b may overflow
Truncation Error margins
Computing Redundant LSBs
• The redundant LSBs of a are computed as 4 bits
• The redundant LSBs of b are computed as 3 bits.
Truncation Error margins
• step(x*y) = step(x) + step(y)
• step(x+y) = min(step(x),step(y))
• step(x-y) = min(step(x),step(y))
• step(x << c) = step(x) + c
• step(x >> c) = max(step(x)-c,0)
Extracting Code Region
• Extract the code surrounding the overflow operation.
• The new code requires a smaller bit-width.
Inductive Generation of New Regions
– Extracted region of bit width (bw1).
– Desired solution with bit wi.
– Same input values.
– Same output value.
– Random tests.
– Blocked programs.
Verification
• Φ = Φ_prog ⋀ Φ_cand ⋀ Φ_sameI ⋀ Φ_diffO ⋀ Φ_range ⋀ Φ_res
– Φ_prog: Desired input program to be optimized.
– Φ_cand: Found candidate solution.
– Φ_sameI: Same input values.
– Φ_diffO: Different output value.
– Φ_range: Ranges of the input variables.
– Φ_res: Resolution of the input variables.
The Incremental Approach
Implementation
• Clang/LLVM + Yices SMT solver
• Bit-vector arithmetic theory
• Evaluated on a set of public benchmarks for
embedded control and DSP applications
Benchmarks (embedded control software)
Benchmark              Bits  LoC  Arithmetic operations
Sobel Image filter     32    42   28
Bicycle controller     32    37   27
Locomotive controller  64    42   38
IDCT (N=8)             32    131  114
Controller impl.       32    21   8
Differ. image filter   32    131  77
FFT (N=8)              32    112  82
IFFT (N=8)             32    112  90
Citation
Qureshi, 2005
Rupak, Saha & Zamani, 2012
Martinez, Majumdar, Saha &
Kim, Kum, & Sung, 1998
Martinez, Majumdar, Saha
Burger, & Burge, 2008
All benchmark examples are public-domain examples
Experiment (increase in range)
[Figure: input/output range increase per benchmark, log scale from 1 to 10000;
x-axis: Sobel Image, Bicycle, Locomotive, IDCT, Controller, Diff. Image, FFT, IFFT]
• Average increase in range is 307%
(602%, 194%, 5%, 40%, 32%, 1515%, 0%, 103%)
Experiment (decrease in bit-width)
• Required bit-width:
32-bit → 16-bit
64-bit → 32-bit
Experiment (scaling error)
[Figure panels: Original program | New program]
If we reduce the microcontroller’s bit-width, how much error will be introduced?
Experiment (runtime statistics)
Benchmark              Optimized code regions  Time
Sobel image filter     22                      2s
Bicycle controller     2                       5s
Locomotive controller  1                       5m 41s (64 bit)
IDCT (N=8)             3                       2.7s
Controller impl.       1                       46s
Differ. image filter   23                      10s
FFT (N=8)              14                      1m 9s
IFFT (N=8)             1                       4s
Related Work: Sketch
• CEGIS based
• Synthesis is over the whole program
• Solver not capable of efficiently handling
linear fixed-point arithmetic computations (?)
Related Work: Gulwani et al.
(PLDI 2011)
• User specified logical specification (e.g.,
unoptimized program)
• Uses CEGIS to find optimized version
• Slow (This technique performs worse than
enumeration/stochastic techniques)
Related Work: Rupak et al.
(ICES 12)
• Optimization of bit-vector representations
• Does not change structure of the program
• Uses Mixed Integer Linear Programming
(MILP) Solver
Conclusions
• We presented a new SMT-based method for optimizing
fixed-point linear arithmetic computations in
embedded software code
– Effective in reducing the required bit-width
– Scalable for practical use
• Future work
– Other aspects of the performance optimization, such as
execution time, power consumption, etc.
More on Related Work
• Solar-Lezama et al. Programming by sketching for bit-streaming
programs, ACM SIGPLAN’05.
– General program synthesis. Does not scale beyond 3-4 LoC for our application.
• Gulwani et al. Synthesis of loop-free programs, ACM SIGPLAN’11.
– Synthesizing bit-vector programs. Largest synthesized program has 16 LoC,
taking >45mins. Does not have incremental optimization.
• Jha. Towards automated system synthesis using sciduction, Ph.D.
dissertation, UC Berkeley, 2011.
– Computing the minimal required bit-width for fixed-point representation. Does
not change the code structure.
• Rupak et al. Synthesis of minimal-error control software, EMSOFT’12.
– Synthesizing fixed-point computation from floating-point computation. Again,
only computes minimal required bit-widths, without changing code structure.
```
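
The motivating example from the slides (averaging A and B in signed 8-bit arithmetic, with A, B ∈ [-20, 80]) can be checked exhaustively. The sketch below is an illustration rather than the authors' tool: the helper names (`wrap8`, `avg_sum_first`, `avg_halve_first`, `avg_rewritten`, `count_mismatches`) are hypothetical, it models 8-bit storage as two's-complement wraparound, and it assumes that division by 2 is an arithmetic right shift (rounding toward negative infinity). With a division that truncates toward zero, the third form could differ by one on some inputs where B < A.

```python
def wrap8(x):
    """Wrap an integer into the signed 8-bit range [-128, 127] (two's complement)."""
    return (x + 128) % 256 - 128


def avg_sum_first(a, b):
    # (A + B) / 2: the intermediate sum can exceed 127 and wrap around.
    return wrap8(wrap8(a + b) >> 1)


def avg_halve_first(a, b):
    # A/2 + B/2: each shift drops a low bit, so two odd inputs lose 1.
    return wrap8(wrap8(a >> 1) + wrap8(b >> 1))


def avg_rewritten(a, b):
    # A + (B - A)/2: every intermediate stays inside [-128, 127] for this range.
    return wrap8(a + wrap8(wrap8(b - a) >> 1))


def count_mismatches(f, lo=-20, hi=80):
    """Count input pairs where f differs from the exact floor((a + b) / 2)."""
    return sum(
        1
        for a in range(lo, hi + 1)
        for b in range(lo, hi + 1)
        if f(a, b) != (a + b) >> 1
    )


if __name__ == "__main__":
    for f in (avg_sum_first, avg_halve_first, avg_rewritten):
        print(f.__name__, count_mismatches(f))
```

Under these assumptions `avg_rewritten` has no mismatches over the whole input range, while the other two forms each fail on some inputs: `(80, 80)` overflows in `avg_sum_first`, and `(1, 1)` loses a unit in `avg_halve_first`. This is the same kind of check that the verification step performs symbolically with an SMT solver instead of by enumeration.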