Lecture 7
Advanced Topics in Testing
Mutation Testing
• Mutation testing evaluates test suites for
their inherent quality, i.e. their ability to reveal
errors.
• Need an objective method to determine quality
• Differs from structural coverage since it tries to
define “what is an error”
• Basic idea is to inject defined errors into the SUT
and evaluate whether a given test suite finds
them.
• “Killing a mutant”
Basic Idea
• We can statistically estimate (say) the number
of fish in a lake by releasing a number of
marked fish, and then counting the marked fish
in a subsequent small catch.
• Example: release 20 fish
• Catch 40 fish, 4 marked.
• Then 1 in 10 is marked, so estimate 200 fish
• Pursue the same idea with marked SW bugs?
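The capture–recapture arithmetic on the slide can be sketched in a few lines of Python (the function name is illustrative):

```python
# Capture-recapture estimate: if we release `marked` individuals and a
# later sample of size `caught` contains `recaptured` marked ones, the
# marked fraction of the sample estimates the marked fraction of the
# whole population, so population ~= marked * caught / recaptured.
def estimate_population(marked, caught, recaptured):
    return marked * caught / recaptured

# Slide example: release 20 marked fish; a catch of 40 contains 4 marked.
print(estimate_population(20, 40, 4))  # -> 200.0
```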
Mutations and Mutants
• The “marked fish” are injected errors, termed
mutations
• The mutated code is termed a mutant
• Example: replace < by > in a Boolean
expression
• if ( x < 0 ) then … becomes if ( x > 0 ) then …
• If the test suite detects the mutation we say this
particular mutant is killed
• Make a large set of mutants – typically using a
checklist of known mutation operators applied by
a mutation tool
Mutation score
• Idea is that if we can kill a mutant we can identify
a real bug too
• Mutants which are semantically equivalent to the
original code are called equivalents
• Write Q ≡ P if Q and P are equivalents
• Clearly cannot kill equivalents
• Mutation score % =
  (number of killed mutants /
   number of non-equivalent mutants) × 100
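The score formula translates directly into code; note that equivalents are excluded from the denominator (the numbers below are made up for illustration):

```python
def mutation_score(killed, total_mutants, equivalents):
    # Equivalent mutants can never be killed, so they are
    # excluded from the denominator.
    return 100.0 * killed / (total_mutants - equivalents)

# Illustrative numbers: 105 mutants generated, 5 equivalent, 88 killed.
print(mutation_score(killed=88, total_mutants=105, equivalents=5))  # 88.0
```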
Why should it work?
• Two assumptions are used in this field
Competent programmer hypothesis
i.e. “The program is mostly correct”
and the
Coupling Effect
Semantic Neighbourhoods
• Let Φ be the set of all programs semantically
close to P (defined in various ways)
• Φ is neighbourhood of P
• Let T be a test suite, f : D → D be a functional spec
of P
• Traditionally assume
∀t ∈ T. P.t = f(t) ⇒ ∀x ∈ D. P.x = f(x)
• i.e. T is a reliable test suite
• Requires exhaustive testing
Competent Programmer Hypothesis
• P is pathological iff P ∉ Φ
• Assume programmers have some competence
Mutation testing assumption
Either P is pathological or else
∀t ∈ T. P.t = f(t) ⇒ ∀x ∈ D. P.x = f(x)
• Can now focus on building a test suite T that
would distinguish P from all other programs in
Φ
Coupling Effect
• The competent programmer hypothesis limits
the problem from infinite to finite.
• But remaining problem is still too large
Coupling effect says that there is a small
subset μ ⊆ Φ such that:
We only need to distinguish P from all
programs in μ by tests in T
Problems
• Can we be sure the coupling effect holds? Do
simple syntactic changes define such a set μ?
• Can we detect and count equivalents? If we
can’t kill a mutant Q, is Q ≡ P or is Q just hard
to kill?
• How large is μ? May still be too large to be
practical?
Equivalent Mutants
• Offutt and Pan [1997] estimated that 9% of all
mutants are equivalent
• Bybro [2003] concurs with 8%
• Automatic detection algorithms (basically
static analysers) detect about 50% of these
• Use theorem proving (verification) techniques
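To see why equivalents are hard to detect, here is a sketch of one (the function is illustrative): the mutant below replaces `<` by `!=` in the loop guard, yet because `i` only ever increases by 1 from 0, the two guards behave identically on every input, so no test can kill it.

```python
def index_of(xs, target):          # original
    i = 0
    while i < len(xs):
        if xs[i] == target:
            return i
        i += 1
    return -1

def index_of_mutant(xs, target):   # mutant: < replaced by !=
    i = 0
    while i != len(xs):            # equivalent: i steps through 0..len(xs)
        if xs[i] == target:
            return i
        i += 1
    return -1

# Both agree on every input: the mutant is an equivalent.
print(index_of([3, 1, 2], 2), index_of_mutant([3, 1, 2], 2))  # 2 2
```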
Coupling Effect
• For Q an incorrect version of P
• Semantic error size = Pr[ Q.x ≠ P.x ]
(probability taken over inputs x ∈ D)
• If for every semantically large fault there is an
overlap with at least one small syntactic fault
then the coupling effect holds.
• Selective mutation based on a small set of
semantically small errors - “Hard to kill”
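Semantic error size can be estimated by sampling the input domain; the sketch below (illustrative names, not from the slides) measures it for a typical off-by-one mutation, showing it is semantically small:

```python
def semantic_error_size(p, q, domain_sample):
    """Monte Carlo estimate of Pr[Q.x != P.x] over a sample of the domain."""
    diffs = sum(1 for x in domain_sample if p(x) != q(x))
    return diffs / len(domain_sample)

p = lambda x: x < 0        # original predicate
q = lambda x: x <= 0       # mutant: < replaced by <=; differs only at x == 0
sample = list(range(-50, 51))   # 101 sample points including 0
print(semantic_error_size(p, q, sample))  # 1/101, i.e. about 0.0099
```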
Early Research: 22 Standard (Fortran)
mutation operators
AAR Array reference for array reference replacement
ABS Absolute value insertion
ACR Array reference for constant replacement
AOR Arithmetic operator replacement
ASR Array reference for scalar replacement
CAR Constant for array reference replacement
CNR Comparable array name replacement
CRP Constants replacement
CSR Constant for Scalar variable replacement
DER Do statement End replacement
DSA Data statement alterations
GLR Goto label replacement
LCR Logical connector replacement
ROR Relational operator replacement
RSR Return statement replacement
SAN Statement analysis
SAR Scalar for array replacement
SCR Scalar for constant replacement
SDL Statement deletion
SRC Source constant replacement
SVR Scalar variable replacement
UOI Unary operator insertion
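As a sketch of how a mutation tool applies such operator checklists, the toy generator below implements a fragment of ROR (relational operator replacement) on source text, producing one mutant per operator occurrence (a simplification: real tools work on parse trees, not strings):

```python
import re

# Toy ROR generator: one mutant per relational operator occurrence,
# each with that single operator swapped for its opposite.
ROR_SWAPS = {"<": ">", ">": "<", "<=": ">=", ">=": "<="}

def ror_mutants(source):
    mutants = []
    for m in re.finditer(r"<=|>=|<|>", source):
        op = m.group()
        mutants.append(source[:m.start()] + ROR_SWAPS[op] + source[m.end():])
    return mutants

print(ror_mutants("if x < 0 and y >= 1:"))
# ['if x > 0 and y >= 1:', 'if x < 0 and y <= 1:']
```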
Recent Research: Java Mutation
Operators
• First letter gives the operator category:
  A = access control
  E = common programming mistakes
  I = inheritance
  J = Java-specific features
  O = method overloading
  P = polymorphism
AMC Access modifier change
EAM Accessor method change
EMM Modifier method change
EOA Reference assignment and content assignment
replacement
EOC Reference comparison and content comparison
replacement
IHD Hiding variable deletion
IHI Hiding variable insertion
IOD Overriding method deletion
IOP Overriding method calling position change
IOR Overridden method rename
IPC Explicit call of parent’s constructor deletion
ISK super keyword deletion
JDC Java supported default constructor create
JID Member variable initialisation deletion
JSC static modifier change
JTD this keyword deletion
OAO Argument order change
OAN Argument number change
OMD Overloaded method deletion
OMR Overloaded method contents change
PMD Instance variable deletion with parent class type
PNC new method call with child class type
PPD Parameter variable declaration with child class type
PRV Reference assignment with other compatible type
Practical Example
• Triangle program
• Myers’ “complete” test suite (13 test cases)
• Bybro [2003] Java mutation tool and code
• 88% mutation score, 96% statement coverage
Status of Mutation Testing
• Various strategies: weak mutation, interface
mutation, specification-based mutation
• Our version is called strong mutation
• Many mutation tools available on the internet
• Cost of generating mutants and detecting
equivalents has come down
• Not yet widely used in industry
• Still considered “academic”, not understood?
Learning-based Testing
1. Specification-based Black-box Testing
2. Learning-based Testing paradigm (LBT)
- connections between learning and testing
- testing as a search problem
- testing as an identification problem
- testing as a parameter inference problem
3. Example frameworks:
1. Procedural systems
2. Boolean reactive systems
Specification-based Black-box Testing
1. System requirement (Sys-Req)
2. System under Test (SUT )
3. Test verdict pass/fail (Oracle step)
[Architecture diagram: the TCG (a constraint solver) derives a test
case from Sys-Req; the SUT (language runtime) executes it and produces
output; the Oracle (a constraint checker) issues the pass/fail verdict.]
Procedural System Example:
Newton’s Square root algorithm
• Precondition: x ≥ 0.0
• Postcondition: | y*y – x | ≤ ε
[Worked example: the TCG’s constraint solver picks the input x = 4.0,
which satisfies the precondition x ≥ 0.0; the SUT (the Newton code)
returns the output y = 2.0; the Oracle’s constraint checker confirms
that x = 4.0, y = 2.0 satisfies | y*y – x | ≤ ε, so the verdict is pass.]
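A minimal runnable sketch of this pipeline in Python (the code and tolerance are illustrative; the slides do not give the actual Newton implementation):

```python
def newton_sqrt(x, eps=1e-6):
    """SUT: Newton's iteration for sqrt, assuming precondition x >= 0.0."""
    if x == 0.0:
        return 0.0
    y = x if x >= 1 else 1.0       # starting guess
    while abs(y * y - x) > eps:
        y = (y + x / y) / 2.0      # Newton update step
    return y

# Oracle step: the constraint checker verifies the postcondition.
x = 4.0                            # test case satisfying x >= 0.0
y = newton_sqrt(x)
assert abs(y * y - x) <= 1e-6      # verdict: pass
print(round(y, 6))                 # close to 2.0
```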
Reactive System Example:
Coffee Machine
Sys-Req: always( in=$1 implies after(10, out=coffee) )
[Worked example: the TCG’s constraint solver generates the input
in0 := $1; the SUT (the coffee machine) produces the output
out11 := coffee; the Oracle’s constraint checker confirms that
in0 := $1, out11 := coffee satisfies
always( in=$1 implies after(10, out=coffee) ), so the verdict is pass.]
Key Problem: Feedback
Problem: How to modify this architecture to..
1.Improve next test case using previous test
outcomes
2.Execute a large number of good quality tests?
3.Obtain good coverage?
4.Find bugs quickly?
Learning-Based Testing
[Architecture diagram: as before, the TCG feeds inputs derived from
Sys-Req to the SUT and the Oracle issues pass/fail verdicts, but now a
Learner observes the input/output pairs and builds a Sys-Model that is
fed back into the TCG.]
“Model based testing without a model”
Basic Idea …
LBT is a search heuristic that:
1.Incrementally learns an SUT model
2.Uses generalisation to predict bugs
3.Uses best prediction as next test case
4.Refines model according to test outcome
Abstract LBT Algorithm
1. Use (i1, o1), … , (ik, ok) to learn model Mk
2. Model check Mk against Sys-Req
3. Choose the “best counterexample” ik+1 from step 2
4. Execute ik+1 on the SUT to produce ok+1
5. Check whether (ik+1, ok+1) violates Sys-Req
   a) Yes: terminate with ik+1 as a bug
   b) No: go to step 1
Difficulties lie in the technical details …
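The loop above can be sketched as a tiny concrete instance (everything here is illustrative: a toy SUT with a seeded bug, a table-based “model”, and a trivial stand-in for the model checker):

```python
# Minimal sketch of the abstract LBT loop.
# SUT: an integer function with one injected bug; Sys-Req: output == input * 2.

def sut(i):
    return i * 2 if i != 7 else 13        # injected bug at i == 7

def req(i, o):
    return o == i * 2

def learn(observations):
    """'Model' = the observation table plus a default guess for unseen inputs."""
    return lambda i: observations.get(i, i * 2)

def model_check(model, candidates, observations):
    """Stand-in for steps 2-3: return the next unexplored input to probe."""
    for i in candidates:
        if i not in observations:
            return i
    return None

observations, bug = {}, None
while True:
    model = learn(observations)                       # step 1: learn Mk
    i = model_check(model, range(20), observations)   # steps 2-3: next test
    if i is None:
        break
    o = sut(i)                                        # step 4: execute on SUT
    observations[i] = o                               # refine the model
    if not req(i, o):                                 # step 5: verdict
        bug = (i, o)
        break

print(bug)  # (7, 13): the injected bug is found
```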
General Problems
Difficulty is to find combinations of
models, requirements languages and checking
algorithms (M, L, A)
so that …
1. models M are:
- expressive,
- compact,
- partial and/or local (an abstraction method)
- easy to manipulate and learn
2. M and L are feasible to model check with A
Incremental Learning
• Real systems are too large to be completely
learned
• Complete learning is not necessary to test
many requirements (e.g. use cases)
• We use incremental (on-the-fly) learning
– Generate a sequence of refined models
• M0 ⊑ M1 ⊑ … ⊑ Mi ⊑ …
– Convergence in the limit
Example:
Boolean reactive systems
1. SUT: reactive systems
2. Model: deterministic Kripke structure
3. Requirements: propositional linear temporal
logic (PLTL)
4. Learning: IKL incremental learning algorithm
5. Model Checker: NuSMV
LBT Architecture
A Case Study: Elevator Model
[Figure: a learned Kripke-structure model of a three-button elevator.
State and transition labels are built from the call buttons C1–C3,
wait flags W1–W3, door status cl, Stop, floor indicators, and clock
ticks (ck, Tick).]
Elevator Results
Req     t first (sec)  t total (sec)  MCQ first  MCQ tot  PQ first  PQ tot   RQ first  RQ tot
Req 1   0.34           1301.3         1.9        81.7     1574      729570   1.9       89.5
Req 2   0.49           1146           3.9        99.6     2350      238311   2.9       98.6
Req 3   0.94           525            1.6        21.7     6475      172861   5.7       70.4
Req 4   0.052          1458           1.0        90.3     15        450233   0.0       91
Req 5   77.48          2275           1.2        78.3     79769     368721   20.5      100.3
Req 6   90.6           1301           2.0        60.9     129384    422462   26.1      85.4
Conclusions
• A promising approach …
• Flexible general heuristic: many models and
requirement languages seem possible
• Many SUT types might be testable:
procedural, reactive, real-time etc.
Open Questions
• Benchmarking?
• Scalability? (abstraction, infinite state?)
• Efficiency? (model checking and learning?)