Sequential Minimal Optimization
Advanced Machine Learning Course
2012 Fall Semester
Tsinghua University
• Implement the SMO algorithm to classify each document in the given
set into one of two classes, "+1" or "-1".
[Figure: points xi (each an N-dimensional vector) from Class -1 and Class +1, separated by the hyperplanes wT x - b = +1 and wT x - b = -1, with margin 2 / ||w||]
SMO implementation
The main parts of an SMO implementation are:
SMO overview
Data Processing
KKT conditions
Learned function
Heuristics to find alpha2
Updating w and b (for new Lagrange values)
Data Preprocessing
The dataset contains two newsgroups: one is baseball and the other is hockey.
For each document (feature selection), you can simply select the top 50 words (50
features) with the highest tf-idf values (see Ref 4). Of course, you can do more advanced
preprocessing such as stop word removal or word stemming, or define specific features yourself.
An example feature file is given :
+1 1:0 2:6.87475407647932 3:3.58860800796358 4:0 22:7.2725475021777 30:0.228447564110819 49:2.1380940541094 56:3.84284214634782 90:2.83603740693284 114:8.60474895145252 131:2.41152410385398
-1 1:0 4:0 30:0.228447564110819 78:0.915163336038847 116:2.4241825007259 304:3.47309512084174 348:13.941644886054 384:1.46427114637585 626:2.85545549278994 650:3.003091491596
where each line "[label] [keyword1]:[value1] [keyword2]:[value2] …" represents a document;
label (+1, -1) is the classification of the document;
keyword_i is the global sequence number of a word in the whole dataset;
value_i is the tf-idf of the word appearing in the document.
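A minimal sketch of reading one line of this format into a sparse vector. The function name parse_feature_line and the choice to drop explicit zero values are assumptions for illustration, not part of the assignment.

```python
def parse_feature_line(line):
    """Parse '[label] [keyword]:[value] ...' into (label, {keyword: value})."""
    tokens = line.split()
    label = int(tokens[0])              # int("+1") is 1, int("-1") is -1
    features = {}
    for token in tokens[1:]:
        keyword, value = token.split(":")
        value = float(value)
        if value != 0.0:                # assumption: skip zeros to keep the vector sparse
            features[int(keyword)] = value
    return label, features


label, x = parse_feature_line("+1 1:0 2:6.87475407647932 4:0 22:7.2725475021777")
print(label, sorted(x))
```

A dict keyed by keyword number keeps each document small even when the vocabulary is large.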
The goal of data preprocessing:
Generate a feature file for the whole dataset, and then split it into five equal sets of files,
say s1 to s5, for 5-fold cross-validation.
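One possible way to do the split, assuming the feature file has already been read into a list of lines. The name split_into_folds, the shuffling and the fixed seed are assumptions; the slides only ask for five equal sets.

```python
import random


def split_into_folds(lines, k=5, seed=0):
    """Shuffle the lines and deal them into k folds whose sizes differ by at most one."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)   # fixed seed so the split is reproducible
    return [shuffled[i::k] for i in range(k)]


folds = split_into_folds(["doc%d" % i for i in range(23)])
print([len(fold) for fold in folds])
```

Shuffling first avoids folds that contain only one newsgroup when the input file is sorted by label.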
Learning & Test Processing
Input: feature files s1 to s5
1) Set i=1, take si as the test set and the others as the training set
2) Take the training set, learn the SMO model (w and b for the data points) and store
the learnt weights.
3) Take the test set and, using the weights learnt in Step 2, classify the
documents as hockey or baseball.
4) Calculate Precision, Recall and F1 score.
5) i++, repeat Steps 1 to 4 (with the new si as the test set) until i > 5
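The loop above can be sketched as follows. The names cross_validate, train_fn and predict_fn are hypothetical placeholders for your own SMO training and prediction routines; each fold is assumed to be a list of (label, features) pairs.

```python
def cross_validate(folds, train_fn, predict_fn):
    """Return one (y_true, y_pred) pair per fold, using that fold as the test set."""
    results = []
    for i, test_set in enumerate(folds):
        # All folds except the i-th form the training set.
        train_set = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_fn(train_set)                       # e.g. returns (w, b)
        y_true = [label for label, _ in test_set]
        y_pred = [predict_fn(model, x) for _, x in test_set]
        results.append((y_true, y_pred))
    return results
```

Precision, Recall and F1 can then be computed per fold from each (y_true, y_pred) pair and averaged.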
Objective Function
• The lecture in class showed us that we can:
• Solve the dual more efficiently (fewer unknowns)
• Add parameter C to allow some misclassifications
• Replace xiTxj by a more general kernel term
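For reference, the soft-margin dual with a kernel can be written as below. This is the standard form in the minimization convention of Platt's report (an assumption about which convention the slides follow); maximizing the negated objective is equivalent.

```latex
\min_{\alpha}\; \Psi(\alpha)
  = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} y_i\, y_j\, K(x_i, x_j)\, \alpha_i\, \alpha_j
    \;-\; \sum_{i=1}^{n} \alpha_i
\qquad \text{subject to} \qquad
0 \le \alpha_i \le C \;\; (i = 1,\dots,n), \qquad
\sum_{i=1}^{n} \alpha_i\, y_i = 0 .
```

With the linear kernel, K(x_i, x_j) = x_i^T x_j.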
Intuitive Introduction to SMO
• SMO is essentially doing the same thing as the
Perceptron learning algorithm
– find a linear separator by adjusting weights on
misclassified examples
• Unlike the perceptron, SMO has to maintain the equality constraint Σ αi yi = 0
• Therefore, when SMO adjusts the weight (alpha) of one
example, it must also adjust the weight of
another example.
The number of alphas equals the number of training examples
SMO Algorithm
• Input:
C (say 0.5), kernel (linear),
error cache, epsilon (tolerance)
• Initialize b, w and all α's to 0
• Repeat until KKT satisfied (to within epsilon):
– Find an example ex1 that violates KKT (prefer unbound
examples (0 < αi < C) here, choose randomly among those)
– Choose a second example ex2.
# Prefer one to maximize step size (in practice, faster to just
maximize |E1 – E2|).
# If that fails to result in change, randomly choose an unbound example.
# If that fails, randomly choose any example. If that fails, re-choose ex1.
– Update α1 and α2 in one step
– Compute new threshold b
Karush-Kuhn-Tucker (KKT) Conditions
• It is necessary and sufficient for a solution to our objective that all α's
satisfy the following:
• An α is 0 iff that example is correctly labeled with room to spare
• An α is C iff that example is incorrectly labeled or in the margin
• An α is properly between 0 and C (is “unbound”) iff that example is
“barely” correctly labeled (is a support vector)
• We just check KKT to within some small epsilon, typically 10^-3
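A small sketch of the KKT check for a single example, following the test used in Platt's pseudo-code. It assumes the error is defined as E = u - y, where u = w·x - b is the current output; the function name violates_kkt is hypothetical.

```python
def violates_kkt(alpha, y, error, C, tol=1e-3):
    """True if the example breaks the KKT conditions by more than tol."""
    r = error * y                 # equals y*u - 1, because E = u - y and y*y = 1
    # r < -tol: the example is inside the margin or misclassified, so alpha should be C.
    # r >  tol: the example is correct with room to spare, so alpha should be 0.
    return (r < -tol and alpha < C) or (r > tol and alpha > 0)


print(violates_kkt(alpha=0.0, y=1, error=-0.5, C=0.5))   # margin violated but alpha is 0
print(violates_kkt(alpha=0.0, y=1, error=0.5, C=0.5))    # room to spare and alpha is 0
```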
Equality constraint (causes the Lagrange multipliers to lie on a diagonal line)
Inequality constraint (causes the Lagrange multipliers to lie in the box)
• SMO spends most of its time adjusting the alphas of the non-boundary examples, so an error cache is maintained for them.
error_cache -> collection that holds the error of each training example.
• i1 and i2 are the indexes corresponding to ex1 and ex2.
• Variables associated with i1 end with 1 (say alpha1 or alph1) and those
associated with i2 end with 2 (say alpha2 or alph2).
• w – global weights -> size is the number of unique words in the given data
• xi is the ith training example.
• eta – second derivative of the objective function along the constraint line
• a2 is the new alpha2 value.
• L & H -> lower and upper ends of the feasible range of the new alpha a2
(needed because the new alpha is found from the derivatives of the
objective function and must then be clipped to this range).
wxk-b -> learned_func
Error between predicted and actual
Common features between i1 and i2
Main Function
Given the first alpha, examineExample(i1) first
checks whether it violates the KKT conditions by more
than the tolerance; if so, it chooses a second alpha and jointly
optimizes the two alphas by calling the function takestep(i1,i2)
Heuristics to choose alpha2
Choose alpha2 such that |E1-E2|
is maximized.
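A minimal sketch of this heuristic, assuming the error cache is a list indexed by example and that only non-bound examples (0 < alpha < C) are candidates. The name choose_second is hypothetical.

```python
def choose_second(i1, errors, alphas, C):
    """Return the non-bound index i2 that maximizes |E1 - E2|, or None if there is none."""
    E1 = errors[i1]
    best_index, best_gap = None, 0.0
    for i, (alpha, E) in enumerate(zip(alphas, errors)):
        if i == i1 or not (0 < alpha < C):
            continue                          # skip ex1 itself and bound examples
        gap = abs(E1 - E)
        if gap > best_gap:
            best_index, best_gap = i, gap
    return best_index
```

When this returns None, or the resulting step makes no progress, the fallback choices listed in the SMO Algorithm slide apply.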
Feasibility range of alph2
Before computing the new alpha2 value, we have to find its feasibility range (L and H).
Refer to page 11 of Reference 3 for the derivation.
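The bounds from Platt's paper can be sketched as follows; the function name feasible_range is an assumption, while the formulas follow the box constraint 0 ≤ α ≤ C and the equality constraint.

```python
def feasible_range(alph1, alph2, y1, y2, C):
    """Return (L, H), the ends of the segment on which the new alpha2 may lie."""
    if y1 != y2:
        # alpha1 - alpha2 stays constant along the constraint line.
        L = max(0.0, alph2 - alph1)
        H = min(C, C + alph2 - alph1)
    else:
        # alpha1 + alpha2 stays constant along the constraint line.
        L = max(0.0, alph1 + alph2 - C)
        H = min(C, alph1 + alph2)
    return L, H
```

If L equals H there is no room to move and the pair can be skipped.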
New a2 (new alpha2)
Objective function
(page 9-11 of Reference 3 for derivation)
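A sketch of the update for the usual case, assuming the convention eta = k11 + k22 - 2*k12 > 0 and E = u - y (some reference codes use the opposite sign for eta). The name update_alpha_pair is hypothetical, and the function returns None when eta is not positive so that the caller can handle that case separately.

```python
def update_alpha_pair(alph1, alph2, y1, y2, E1, E2, k11, k12, k22, L, H):
    """Return the new (a1, a2), or None when eta <= 0."""
    eta = k11 + k22 - 2.0 * k12           # second derivative along the constraint line
    if eta <= 0:
        return None
    a2 = alph2 + y2 * (E1 - E2) / eta     # unconstrained optimum for alpha2
    a2 = min(max(a2, L), H)               # clip to the feasible range [L, H]
    s = y1 * y2
    a1 = alph1 + s * (alph2 - a2)         # keeps y1*a1 + y2*a2 unchanged
    return a1, a2
```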
In the indefinite case, SMO moves the Lagrange multipliers to
the end point that has the lowest value of the objective
function, evaluated at each end (L and H). Read page 8 of Ref 2.
Lobj and Hobj – page 22 of Ref 3
Updating threshold b
New b is calculated every time so that the KKT conditions are fulfilled for both alphas.
The threshold is updated from whichever of the new alph1 or new alph2 is non-boundary
(b1 is valid in that case because it forces x1 to give the output y1).
If both of them are at a boundary, then average b1 and b2. Refer to page 9 of Reference 2.
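A sketch of the threshold update, assuming the output convention u = w·x - b and E = u - y. The function name new_threshold and its argument order are assumptions.

```python
def new_threshold(b, E1, E2, y1, y2, alph1, alph2, a1, a2, k11, k12, k22, C):
    """Return the new b after alpha1, alpha2 move from (alph1, alph2) to (a1, a2)."""
    # b1 makes the output on x1 equal y1; b2 does the same for x2.
    b1 = E1 + y1 * (a1 - alph1) * k11 + y2 * (a2 - alph2) * k12 + b
    b2 = E2 + y1 * (a1 - alph1) * k12 + y2 * (a2 - alph2) * k22 + b
    if 0 < a1 < C:
        return b1
    if 0 < a2 < C:
        return b2
    return (b1 + b2) / 2.0                # both at a bound: take the midpoint
```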
Updating error cache using new lagrange multipliers
Change in alph1 and alpha2
Change in b
1) The error cache entries of all other non-bound training examples are updated.
2) The error cache entries of alph1 and alph2 are set to zero
(since they have just been optimized together)
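The two rules above can be sketched as follows, assuming E = u - y with u = w·x - b, t1 = y1*(a1 - alph1), t2 = y2*(a2 - alph2) and delta_b = new b - old b. The name update_error_cache and the precomputed kernel rows are assumptions.

```python
def update_error_cache(errors, alphas, C, i1, i2, t1, t2, delta_b, k_row1, k_row2):
    """Update the cache in place; k_row1[i] = K(x_i1, x_i), k_row2[i] = K(x_i2, x_i)."""
    for i, alpha in enumerate(alphas):
        if i in (i1, i2) or not (0 < alpha < C):
            continue                      # only the other non-bound examples
        errors[i] += t1 * k_row1[i] + t2 * k_row2[i] - delta_b
    # As stated on the slide, the two optimized examples get error zero.
    errors[i1] = 0.0
    errors[i2] = 0.0
```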
Updating w
t1 and t2 -> change in alpha1 and alpha2
The global w should be updated with respect to ex1 and ex2.
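For the linear kernel with sparse documents, the update can be sketched as below. It assumes w and the examples are dicts keyed by word number and that t1 = y1*(a1 - alph1), t2 = y2*(a2 - alph2); the name update_weights is hypothetical.

```python
def update_weights(w, x1, x2, t1, t2):
    """In place: w += t1 * x1 + t2 * x2, touching only the words present in ex1 or ex2."""
    for index, value in x1.items():
        w[index] = w.get(index, 0.0) + t1 * value
    for index, value in x2.items():
        w[index] = w.get(index, 0.0) + t2 * value
    return w


w = update_weights({}, {1: 2.0, 3: 1.0}, {3: 4.0}, t1=0.5, t2=-0.25)
print(w)
```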
SVM prediction with w and b
• Now we have w and b calculated.
For a new xi,
svm_prediction -> w·xi - b
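A minimal sketch of the prediction for sparse inputs, assuming w and xi are dicts keyed by word number. Mapping a non-negative output to +1 is an assumption about how ties at zero are broken.

```python
def svm_prediction(w, b, x):
    """Return w·x - b for a sparse example x."""
    return sum(w.get(index, 0.0) * value for index, value in x.items()) - b


def classify(w, b, x):
    """Return +1 or -1 from the sign of the SVM output."""
    return 1 if svm_prediction(w, b, x) >= 0 else -1
```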
Precision & Recall
Precision is the probability that a (randomly selected) retrieved document is relevant.
Recall is the probability that a (randomly selected) relevant document is retrieved in a search.
F1 Score
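F1 is the harmonic mean of Precision and Recall: F1 = 2 · Precision · Recall / (Precision + Recall). A small sketch of all three measures, assuming +1 is treated as the positive class and that a zero denominator yields 0.0 (both are assumptions for illustration):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```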
• Too small a value of eps (more accuracy) may
sometimes force SMO into a loop that never exits.
• After each alpha1 and alpha2 pair is optimized, the error
cache of these two examples should be set to
zero, and the entries of all other non-bound examples
should be updated.
• At each optimization step, w and b should be updated.
Reference Code
• Pseudo-code
– Reference 2
• C++ code
– Reference 3
• Implementation: Refer complete algorithm (for implementation) :
• Report Precision, Recall, F1 measure
– Report time cost
– Submit as .tar file, including
1) Source code with necessary comments, including
Data preprocessing
SMO training, testing and evaluation
2) ReadMe.txt explaining
How to extract features for the dataset; try different C values and explain the results.
The main functions, e.g., select alpha1 and alpha2, update alpha1 and alpha2, update w
and b.
How to run your code, from data preprocessing to training/testing/evaluation, step by step.
The final average precision, recall and F1 score, and the time cost.
1. SVM by Sequential Minimal Optimization (SMO). Algorithm by John
Platt. Lecture by David Page.
2. Platt(1998):
1. Sequential Minimal Optimization for SVM .
java code for SMO
Thank You!
