### y - École Centrale Paris

Some Useful Machine Learning Tools
M. Pawan Kumar
École Centrale Paris
École des Ponts ParisTech
INRIA Saclay, Île-de-France
Outline
• Part I : Supervised Learning
• Part II: Weakly Supervised Learning
Outline – Part I
• Introduction to Supervised Learning
• Probabilistic Methods
– Logistic regression
– Multiclass logistic regression
– Regularized maximum likelihood
• Loss-based Methods
– Support vector machine
– Structured output support vector machine
Image Classification
Is this an urban or rural area?
Input: x
Output: $y \in \{-1,+1\}$
Image Classification
Is this scan healthy or unhealthy?
Input: x
Output: $y \in \{-1,+1\}$
Image Classification
Which city is this?
Input: x
Output: $y \in \{1,2,\dots,C\}$
Image Classification
What type of tumor does this scan contain?
Input: x
Output: $y \in \{1,2,\dots,C\}$
Object Detection
Where is the object in the image?
Input: x
Output: $y \in \{\text{Pixels}\}$
Object Detection
Where is the rupture in the scan?
Input: x
Output: $y \in \{\text{Pixels}\}$
Segmentation
[Figure: image with regions labeled sky, tree, car, and grass]
What is the semantic class of each pixel?
Input: x
Output: $y \in \{1,2,\dots,C\}^{|\text{Pixels}|}$
Segmentation
What is the muscle group of each pixel?
Input: x
Output: $y \in \{1,2,\dots,C\}^{|\text{Pixels}|}$
A Simplified View of the Pipeline
Input $x$ → Extract Features → Features $\Phi(x)$ → Compute Scores → Scores $f(\Phi(x),y)$ → Prediction $y(f)$: $\max_y f(\Phi(x),y)$
Learn $f$
http://deeplearning.net
Learning Objective
Data distribution $P(x,y)$ (the distribution is unknown)
$f^* = \arg\min_f \mathbb{E}_{P(x,y)}[\mathrm{Error}(y(f), y)]$
Error is the measure of prediction quality, $y(f)$ is the prediction, and $y$ is the ground truth. The expectation is over the data distribution.
Learning Objective
Training data $\{(x_i,y_i), i = 1,2,\dots,n\}$
$f^* = \arg\min_f \mathbb{E}_{P(x,y)}[\mathrm{Error}(y(f), y)]$
Error is the measure of prediction quality, $y(f)$ is the prediction, and $y$ is the ground truth. The expectation is over the data distribution.
Learning Objective
Training data $\{(x_i,y_i), i = 1,2,\dots,n\}$ (finite samples)
$f^* = \arg\min_f \sum_i \mathrm{Error}(y_i(f), y_i)$
Error is the measure of prediction quality, $y_i(f)$ is the prediction, and $y_i$ is the ground truth. The expectation is over the empirical distribution.
Learning Objective
Training data $\{(x_i,y_i), i = 1,2,\dots,n\}$ (finite samples)
$f^* = \arg\min_f \sum_i \mathrm{Error}(y_i(f), y_i) + \lambda R(f)$
$R(f)$ is the regularizer and $\lambda$ is its relative weight (a hyperparameter).
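As a rough illustration of the regularized objective, the sketch below evaluates it for a linear binary classifier. The 0-1 error, the squared-norm regularizer, the toy data, and the name `regularized_risk` are assumptions made for the example, not part of the slides.

```python
import numpy as np

def regularized_risk(theta, features, labels, lam):
    """Sum of 0-1 errors of sign(theta^T phi) plus lam * ||theta||^2 (assumed choices)."""
    predictions = np.where(features @ theta >= 0, 1, -1)
    errors = np.sum(predictions != labels)
    return errors + lam * float(theta @ theta)

# Hypothetical toy data: four 2-D feature vectors with labels in {-1, +1}.
features = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
labels = np.array([1, 1, -1, -1])

risk_good = regularized_risk(np.array([1.0, 1.0]), features, labels, lam=0.1)
risk_bad = regularized_risk(np.array([-1.0, 1.0]), features, labels, lam=0.1)
print(risk_good, risk_bad)
```

Both parameter vectors have the same norm, so the regularizer contributes equally and the difference between the two values comes entirely from the empirical error term.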
Logistic Regression
Input: x
Features: Φ(x)
$f(\Phi(x),y) = y\,\theta^\top \Phi(x)$
Output: $y \in \{-1,+1\}$
Prediction: $\mathrm{sign}(\theta^\top \Phi(x))$
$P(y|x) = l(f(\Phi(x),y))$
Logistic function: $l(z) = 1/(1+e^{-z})$
Is the distribution normalized?
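One way to check the normalization question, using only the logistic function and the score defined on this slide, is to add the probabilities of the two labels:

```latex
\begin{align*}
P(+1|x) + P(-1|x) &= l(z) + l(-z), \qquad z = \theta^\top \Phi(x) \\
&= \frac{1}{1+e^{-z}} + \frac{1}{1+e^{z}} \\
&= \frac{1}{1+e^{-z}} + \frac{e^{-z}}{e^{-z}+1} = 1
\end{align*}
```

So the two probabilities sum to one for every $x$ and $\theta$.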
Logistic Regression
Training data {(xi,yi), i = 1,2,…,n}
$\min_\theta \sum_i -\log P(y_i|x_i) + \lambda R(\theta)$
The first term is the negative log-likelihood; $R(\theta)$ is the regularizer.
Logistic Regression
Training data {(xi,yi), i = 1,2,…,n}
$\min_\theta \sum_i -\log P(y_i|x_i) + \lambda \|\theta\|^2$
Convex optimization problem
Proof left as an exercise.
Hint: Prove that the Hessian $H$ is PSD, i.e. $a^\top H a \ge 0$ for all $a$
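Training data $\{(x_i,y_i), i = 1,2,\dots,n\}$
$\min_\theta \sum_i -\log P(y_i|x_i) + \lambda \|\theta\|^2$
Gradient descent update: $\theta_{t+1} \leftarrow \theta_t - \mu \left.\frac{dL(\theta)}{d\theta}\right|_{\theta_t}$, where $L(\theta)$ denotes the objective
Repeat until decrease in objective is below a threshold
[Figure: descent paths for small $\mu$ and large $\mu$]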
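Training data $\{(x_i,y_i), i = 1,2,\dots,n\}$
$\min_\theta \sum_i -\log P(y_i|x_i) + \lambda \|\theta\|^2$
Gradient descent update: $\theta_{t+1} \leftarrow \theta_t - \mu \left.\frac{dL(\theta)}{d\theta}\right|_{\theta_t}$, where $L(\theta)$ denotes the objective
Step size $\mu$: small constant or line search
Repeat until decrease in objective is below a threshold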
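The update rule can be sketched in a few lines of NumPy. This is a minimal illustration assuming a constant step size, the squared-norm regularizer, and a tiny hypothetical data set; the function names and default values are not from the slides.

```python
import numpy as np

def objective(theta, features, labels, lam):
    """Negative log-likelihood of logistic regression plus lam * ||theta||^2."""
    margins = labels * (features @ theta)
    # -log(1 / (1 + exp(-m))) = log(1 + exp(-m)), evaluated stably.
    return np.sum(np.logaddexp(0.0, -margins)) + lam * float(theta @ theta)

def gradient(theta, features, labels, lam):
    margins = labels * (features @ theta)
    # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m)); chain rule through m = y * theta^T phi.
    weights = -labels / (1.0 + np.exp(margins))
    return features.T @ weights + 2.0 * lam * theta

def gradient_descent(features, labels, lam=0.1, step=0.1, tol=1e-8, max_iters=10000):
    theta = np.zeros(features.shape[1])
    value = objective(theta, features, labels, lam)
    for _ in range(max_iters):
        theta = theta - step * gradient(theta, features, labels, lam)
        new_value = objective(theta, features, labels, lam)
        # Stop once the decrease in the objective falls below the threshold.
        if value - new_value < tol:
            break
        value = new_value
    return theta

# Hypothetical toy data.
features = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
labels = np.array([1.0, 1.0, -1.0, -1.0])
theta = gradient_descent(features, labels)
print(theta)
```

A constant step works here because the toy problem is well conditioned; a line search would replace the fixed `step` with a per-iteration choice.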
Newton’s Method
Minimize g(z)
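Solution at iteration $t$: $z_t$
Define $g_t(\Delta z) = g(z_t + \Delta z)$
Second-order Taylor series:
$g_t(\Delta z) \approx g(z_t) + g'(z_t)\Delta z + \tfrac{1}{2} g''(z_t)(\Delta z)^2$
Setting the derivative with respect to $\Delta z$ to 0 implies $g'(z_t) + g''(z_t)\Delta z = 0$
Solving for $\Delta z$ provides the learning rate: $\Delta z = -g'(z_t)/g''(z_t)$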
Newton’s Method
Training data {(xi,yi), i = 1,2,…,n}
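$\min_\theta \sum_i -\log P(y_i|x_i) + \lambda \|\theta\|^2$
$\theta_{t+1} \leftarrow \theta_t - \mu \left.\frac{dL(\theta)}{d\theta}\right|_{\theta_t}$, with $\mu^{-1} = \left.\frac{d^2L(\theta)}{d\theta^2}\right|_{\theta_t}$
Repeat until decrease in objective is below a threshold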
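To see the Newton step in action, here is a small one-dimensional sketch. The test function $g(z) = e^z - 2z$, the stopping rule on the step length, and the name `newton_minimize` are assumptions chosen for illustration.

```python
import math

def newton_minimize(first_derivative, second_derivative, z0, tol=1e-12, max_iters=50):
    """Repeat z <- z - g'(z) / g''(z); assumes g'' > 0 along the way."""
    z = z0
    for _ in range(max_iters):
        step = -first_derivative(z) / second_derivative(z)
        z += step
        # Assumed stopping rule: the step has become negligibly small.
        if abs(step) < tol:
            break
    return z

# g(z) = exp(z) - 2z is convex with g'(z) = exp(z) - 2 and g''(z) = exp(z);
# its minimizer is z = ln 2.
z_star = newton_minimize(lambda z: math.exp(z) - 2.0, lambda z: math.exp(z), z0=0.0)
print(z_star, math.log(2.0))
```

For a vector parameter $\theta$, the division by the second derivative becomes a linear solve with the Hessian.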
Logistic Regression
Input: x
Features: Φ(x)
Output: $y \in \{1,2,\dots,C\}$
Train C 1-vs-all logistic regression binary classifiers
Prediction: Maximum probability of +1 over C classifiers
Simple extension, easy to code
Loses the probabilistic interpretation
Multiclass Logistic Regression
Input: x
Features: Φ(x)
Output: $y \in \{1,2,\dots,C\}$
Joint feature vector of input and output: $\Psi(x,y)$
$\Psi(x,1) = [\Phi(x)\ 0\ 0\ \dots\ 0]$
$\Psi(x,2) = [0\ \Phi(x)\ 0\ \dots\ 0]$
…
$\Psi(x,C) = [0\ 0\ 0\ \dots\ \Phi(x)]$
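The block structure of the joint feature vector can be written out directly. In this sketch the function name `joint_feature` and the example values are hypothetical; labels are 1-based to match the slide.

```python
import numpy as np

def joint_feature(phi, label, num_classes):
    """Place phi in the block for `label` (1-based); every other block is zero."""
    d = phi.shape[0]
    psi = np.zeros(num_classes * d)
    psi[(label - 1) * d : label * d] = phi
    return psi

phi = np.array([0.5, -1.0])
psi = joint_feature(phi, label=2, num_classes=3)
print(psi)
```

With this layout, a single parameter vector holds one block of weights per class, and its inner product with the joint feature vector picks out the score of the chosen class.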
Multiclass Logistic Regression
Input: x
Features: Φ(x)
Output: $y \in \{1,2,\dots,C\}$
Joint feature vector of input and output: $\Psi(x,y)$
$f(\Psi(x,y)) = \theta^\top \Psi(x,y)$
Prediction: $\max_y \theta^\top \Psi(x,y)$
$P(y|x) = \exp(f(\Psi(x,y)))/Z(x)$
Partition function $Z(x) = \sum_y \exp(f(\Psi(x,y)))$
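A minimal sketch of these class probabilities follows, assuming the block form of the joint feature vector so that each class score is the inner product of one block of $\theta$ with $\Phi(x)$. The name `class_probabilities` and the numbers are illustrative only.

```python
import numpy as np

def class_probabilities(theta, phi, num_classes):
    d = phi.shape[0]
    # With the block joint feature vector, theta^T Psi(x, y) = theta_y^T phi.
    scores = theta.reshape(num_classes, d) @ phi
    # log Z(x) via log-sum-exp, which avoids overflow for large scores.
    log_z = np.logaddexp.reduce(scores)
    return np.exp(scores - log_z)

theta = np.array([1.0, 0.0, 0.0, 1.0, -1.0, -1.0])  # three classes, two features
phi = np.array([2.0, 1.0])
probs = class_probabilities(theta, phi, num_classes=3)
print(probs, int(np.argmax(probs)))
```

Dividing by the partition function is what makes the outputs sum to one across all classes, unlike the 1-vs-all construction.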
Multiclass Logistic Regression
Training data {(xi,yi), i = 1,2,…,n}
$\min_\theta \sum_i -\log P(y_i|x_i) + \lambda \|\theta\|^2$
Convex optimization problem
Gradient Descent, Newton’s Method, and many others
Regularized Maximum Likelihood
Input: x
Features: Φ(x)
Output: $y \in \{1,2,\dots,C\}^m$
Joint feature vector of input and output: $\Psi(x,y)$
$[\Psi(x,y_1); \Psi(x,y_2); \dots; \Psi(x,y_m)]$
$[\Psi(x,y_i)$ for all $i$; $\Psi(x,y_i,y_j)$ for all $i, j]$
Regularized Maximum Likelihood
Input: x
Features: Φ(x)
Output: $y \in \{1,2,\dots,C\}^m$
Joint feature vector of input and output: $\Psi(x,y)$
$[\Psi(x,y_1); \Psi(x,y_2); \dots; \Psi(x,y_m)]$
$[\Psi(x,y_i)$ for all $i$; $\Psi(x,y_i,y_j)$ for all $i, j]$
$[\Psi(x,y_i)$ for all $i$; $\Psi(x,y_c)$, where $c$ is a subset of variables$]$
Regularized Maximum Likelihood
Input: x
Features: Φ(x)
Output: $y \in \{1,2,\dots,C\}^m$
Joint feature vector of input and output: $\Psi(x,y)$
$f(\Psi(x,y)) = \theta^\top \Psi(x,y)$
Prediction: $\max_y \theta^\top \Psi(x,y)$
$P(y|x) = \exp(f(\Psi(x,y)))/Z(x)$
Partition function $Z(x) = \sum_y \exp(f(\Psi(x,y)))$
Regularized Maximum Likelihood
Training data {(xi,yi), i = 1,2,…,n}
$\min_\theta \sum_i -\log P(y_i|x_i) + \lambda \|\theta\|^2$
Partition function is expensive to compute
Approximate inference (Nikos Komodakis’ tutorial)
Multiclass SVM
Input: x
Features: Φ(x)
Output: $y \in \{1,2,\dots,C\}$
Joint feature vector of input and output: $\Psi(x,y)$
$\Psi(x,1) = [\Phi(x)\ 0\ 0\ \dots\ 0]$
$\Psi(x,2) = [0\ \Phi(x)\ 0\ \dots\ 0]$
…
$\Psi(x,C) = [0\ 0\ 0\ \dots\ \Phi(x)]$
Multiclass SVM
Input: x
Features: Φ(x)
Output: $y \in \{1,2,\dots,C\}$
Joint feature vector of input and output: $\Psi(x,y)$
$f(\Psi(x,y)) = w^\top \Psi(x,y)$
Prediction: $\max_y w^\top \Psi(x,y)$
Predicted output: $y(w) = \arg\max_y w^\top \Psi(x,y)$
Multiclass SVM
Training data {(xi,yi), i = 1,2,…,n}
Loss function for i-th sample
$\Delta(y_i, y_i(w))$
Minimize the regularized sum of loss over training data
Highly non-convex in w
Regularization plays no role (overfitting may occur)
Multiclass SVM
Training data {(xi,yi), i = 1,2,…,n}
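$w^\top\Psi(x_i,y_i(w)) + \Delta(y_i,y_i(w)) - w^\top\Psi(x_i,y_i(w))$
$\le w^\top\Psi(x_i,y_i(w)) + \Delta(y_i,y_i(w)) - w^\top\Psi(x_i,y_i)$
$\le \max_y \{ w^\top\Psi(x_i,y) + \Delta(y_i,y) \} - w^\top\Psi(x_i,y_i)$
The first line equals the loss $\Delta(y_i,y_i(w))$. The first inequality holds because $y_i(w)$ maximizes $w^\top\Psi(x_i,y)$ over $y$.
The final upper bound is sensitive to regularization of $w$ and is convex in $w$.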
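The bound can be checked numerically on a single sample. The sketch below assumes a 0-1 loss for $\Delta$, 0-based labels, and precomputed class scores standing in for $w^\top\Psi(x_i,y)$; all names and numbers are illustrative.

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    return 0.0 if y_true == y_pred else 1.0

def loss_and_upper_bound(class_scores, y_true):
    """class_scores[y] stands for w^T Psi(x_i, y); labels are 0-based here."""
    num_classes = len(class_scores)
    y_pred = int(np.argmax(class_scores))
    loss = zero_one_loss(y_true, y_pred)
    # Loss-augmented scores: w^T Psi(x_i, y) + Delta(y_i, y) for every y.
    augmented = [class_scores[y] + zero_one_loss(y_true, y) for y in range(num_classes)]
    bound = max(augmented) - class_scores[y_true]
    return loss, bound

loss, bound = loss_and_upper_bound(np.array([1.0, 1.5, -0.5]), y_true=0)
print(loss, bound)
```

Scaling all scores by a constant changes the bound but not the loss itself, which is one way to see why only the bound reacts to regularization of $w$.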
Multiclass SVM
Training data {(xi,yi), i = 1,2,…,n}
$\min_w \|w\|^2 + C \sum_i \xi_i$
subject to $w^\top\Psi(x_i,y) + \Delta(y_i,y) - w^\top\Psi(x_i,y_i) \le \xi_i$ for all $y$
Quadratic program with polynomial # of constraints
Specialized software packages freely available
http://www.cs.cornell.edu/People/tj/svm_light/svm_multiclass.html
Structured Output SVM
Input: x
Features: Φ(x)
Output: $y \in \{1,2,\dots,C\}^m$
Joint feature vector of input and output: $\Psi(x,y)$
$f(\Psi(x,y)) = w^\top \Psi(x,y)$
Prediction: $\max_y w^\top \Psi(x,y)$
Structured Output SVM
Training data {(xi,yi), i = 1,2,…,n}
$\min_w \|w\|^2 + C \sum_i \xi_i$
subject to $w^\top\Psi(x_i,y) + \Delta(y_i,y) - w^\top\Psi(x_i,y_i) \le \xi_i$ for all $y$
Quadratic program with exponential # of constraints
Many polynomial time algorithms
Cutting Plane Algorithm
Define working sets $W_i = \{\}$
REPEAT
  Update $w$ by solving the following problem
    $\min_w \|w\|^2 + C \sum_i \xi_i$
    subject to $w^\top\Psi(x_i,y) + \Delta(y_i,y) - w^\top\Psi(x_i,y_i) \le \xi_i$ for all $y \in W_i$
  Compute the most violated constraint for all samples
    $\hat{y}_i = \arg\max_y w^\top\Psi(x_i,y) + \Delta(y_i,y)$
  Update the working sets $W_i$ by adding $\hat{y}_i$
Cutting Plane Algorithm
Termination criterion: violation of $\hat{y}_i < \xi_i + \epsilon$, for all $i$
Number of iterations = $\max\{O(n/\epsilon), O(C/\epsilon^2)\}$
At each iteration, convex dual of problem increases.
Convex dual can be upper bounded.
Ioannis Tsochantaridis et al., JMLR 2005
http://svmlight.joachims.org/svm_struct.html
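The bookkeeping of one cutting-plane pass (finding the most violated label and testing the termination criterion) can be sketched as follows. This is not the svm_struct implementation: it omits the quadratic program that updates $w$, assumes a 0-1 loss and 0-based multiclass labels, and adds a label only when its constraint is violated by at least $\epsilon$. All function names are hypothetical.

```python
import numpy as np

def most_violated_label(class_scores, y_true):
    """argmax_y of w^T Psi(x_i, y) + Delta(y_i, y), with a 0-1 loss."""
    augmented = class_scores + (np.arange(len(class_scores)) != y_true)
    return int(np.argmax(augmented))

def violation(class_scores, y_true, y):
    delta = 0.0 if y == y_true else 1.0
    return class_scores[y] + delta - class_scores[y_true]

def update_working_sets(all_scores, labels, slacks, working_sets, eps):
    """Grow the working sets; return True when every violation is below slack + eps."""
    converged = True
    for i, (scores, y_true) in enumerate(zip(all_scores, labels)):
        y_hat = most_violated_label(scores, y_true)
        if violation(scores, y_true, y_hat) >= slacks[i] + eps:
            working_sets[i].add(y_hat)
            converged = False
    return converged

# Hypothetical scores w^T Psi(x_i, y) for two samples and three classes.
all_scores = np.array([[2.0, 0.0, 0.0], [0.0, 0.5, 0.0]])
labels = [0, 1]
working_sets = [set(), set()]
converged = update_working_sets(all_scores, labels, slacks=[0.0, 0.0],
                                working_sets=working_sets, eps=0.1)
print(converged, working_sets)
```

In a full solver this pass would alternate with re-solving the quadratic program restricted to the current working sets.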
Structured Output SVM
Training data {(xi,yi), i = 1,2,…,n}
$\min_w \|w\|^2 + C \sum_i \xi_i$
subject to $w^\top\Psi(x_i,y) + \Delta(y_i,y) - w^\top\Psi(x_i,y_i) \le \xi_i$ for all $y \in \{1,2,\dots,C\}^m$
Number of constraints = $nC^m$
Structured Output SVM
Training data {(xi,yi), i = 1,2,…,n}
$\min_w \|w\|^2 + C \sum_i \xi_i$
subject to $w^\top\Psi(x_i,y) + \Delta(y_i,y) - w^\top\Psi(x_i,y_i) \le \xi_i$ for all $y \in Y$
Structured Output SVM
Training data {(xi,yi), i = 1,2,…,n}
$\min_w \|w\|^2 + C \sum_i \xi_i$
subject to $w^\top\Psi(x_i,z_i) + \Delta(y_i,z_i) - w^\top\Psi(x_i,y_i) \le \xi_i$ for all $z_i \in Y$
Structured Output SVM
Training data {(xi,yi), i = 1,2,…,n}
$\min_w \|w\|^2 + C \sum_i \xi_i$
subject to $\sum_i \left( w^\top\Psi(x_i,z_i) + \Delta(y_i,z_i) - w^\top\Psi(x_i,y_i) \right) \le \sum_i \xi_i$ for all $Z = \{z_i, i=1,\dots,n\} \in Y^n$
Equivalent problem to structured output SVM
Number of constraints = $C^{mn}$
1-Slack Structured Output SVM
Training data {(xi,yi), i = 1,2,…,n}
$\min_w \|w\|^2 + C \xi$
subject to $\sum_i \left( w^\top\Psi(x_i,z_i) + \Delta(y_i,z_i) - w^\top\Psi(x_i,y_i) \right) \le \xi$ for all $Z = \{z_i, i=1,\dots,n\} \in Y^n$
Cutting Plane Algorithm
Define working set $W = \{\}$
REPEAT
  Update $w$ by solving the following problem
    $\min_w \|w\|^2 + C \xi$
    subject to $\sum_i \left( w^\top\Psi(x_i,z_i) + \Delta(y_i,z_i) - w^\top\Psi(x_i,y_i) \right) \le \xi$ for all $Z \in W$
  Compute the most violated constraint for all samples
    $z_i = \arg\max_y w^\top\Psi(x_i,y) + \Delta(y_i,y)$
  Update the working set $W$ by adding $\{z_i, i=1,\dots,n\}$
Cutting Plane Algorithm
Termination criterion: violation of $\{z_i\} < \xi + \epsilon$
Number of iterations = $O(C/\epsilon)$
At each iteration, convex dual of problem increases.
Convex dual can be upper bounded.
Thorsten Joachims et al., Machine Learning 2009
http://svmlight.joachims.org/svm_struct.html
Outline – Part II
• Introduction to Weakly Supervised Learning
– Two types of problems
• Probabilistic Methods
– Expectation maximization
• Loss-based Methods
– Latent support vector machine
– Dissimilarity coefficient learning
Computer Vision Data
[Figure: log(size) of available data versus level of annotation information. Segmentation: ~2000; bounding box: ~1M; image-level (e.g. “Car”, “Chair”): > 14M; noisy label: > 6B]
Data
Detailed annotation is expensive
Often, in medical imaging, annotation is impossible
Desired annotation keeps changing
Learn with missing information (latent variables)
Annotation Mismatch
Learn to classify an image
Image $x$, annotation $y$ = “Deer”, latent variable $h$
Desired output: $y$
Mismatch between desired and available annotations
Exact value of latent variable is not “important”
Annotation Mismatch
Learn to classify a DNA sequence
Latent Variables h
Sequence x
Annotation $y \in \{+1, -1\}$
Desired Output y
Mismatch between desired and possible annotations
Exact value of latent variable is not “important”
Output Mismatch
Learn to detect an object in an image
Image $x$, annotation $y$ = “Deer”, latent variable $h$
Desired output: $(y,h)$
Mismatch between output and available annotations
Exact value of latent variable is important
Output Mismatch
Learn to segment an image
[Figure: training pairs $(x, y)$ with annotations “Bird” and “Cow”, and the desired outputs $(y, h)$]
Mismatch between output and available annotations
Exact value of latent variable is important
Expectation Maximization
Input: x
Annotation: y
Latent Variables: h
Joint feature vector: Ψ(x,y,h)
$f(\Psi(x,y,h)) = \theta^\top \Psi(x,y,h)$
$P(y,h|x;\theta) = \exp(f(\Psi(x,y,h)))/Z(x;\theta)$
Partition function $Z(x;\theta) = \sum_{y,h} \exp(f(\Psi(x,y,h)))$
Prediction: $\max_y P(y|x;\theta) = \max_y \sum_h P(y,h|x;\theta)$
Expectation Maximization
Input: x
Annotation: y
Latent Variables: h
Joint feature vector: Ψ(x,y,h)
$f(\Psi(x,y,h)) = \theta^\top \Psi(x,y,h)$
$P(y,h|x;\theta) = \exp(f(\Psi(x,y,h)))/Z(x;\theta)$
Partition function $Z(x;\theta) = \sum_{y,h} \exp(f(\Psi(x,y,h)))$
Prediction: $\max_{y,h} P(y,h|x;\theta)$
Expectation Maximization
Training data {(xi,yi), i = 1,2,…,n}
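$\min_\theta \sum_i -\log P(y_i|x_i;\theta) + \lambda \|\theta\|^2$
Annotation Mismatch
$-\log P(y|x;\theta) = \mathbb{E}_{P(h|y,x;\theta')}[\log P(h|y,x;\theta)] - \mathbb{E}_{P(h|y,x;\theta')}[\log P(y,h|x;\theta)]$
The first term on the right is maximized at $\theta = \theta'$
Left as an exercise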
Expectation Maximization
Training data {(xi,yi), i = 1,2,…,n}
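$\min_\theta \sum_i -\log P(y_i|x_i;\theta) + \lambda \|\theta\|^2$
Annotation Mismatch
$\min_\theta -\log P(y|x;\theta)$, where
$-\log P(y|x;\theta) = \mathbb{E}_{P(h|y,x;\theta')}[\log P(h|y,x;\theta)] - \mathbb{E}_{P(h|y,x;\theta')}[\log P(y,h|x;\theta)]$
The first term on the right is maximized at $\theta = \theta'$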
Expectation Maximization
Training data {(xi,yi), i = 1,2,…,n}
$\min_\theta \sum_i -\log P(y_i|x_i;\theta) + \lambda \|\theta\|^2$
Annotation Mismatch
$\min_\theta -\log P(y|x;\theta)$
$\min_\theta -\mathbb{E}_{P(h|y,x;\theta')}[\log P(y,h|x;\theta)]$
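One possible route from the negative log-likelihood to the expected complete-data term, using only the product rule and Gibbs' inequality, is sketched here:

```latex
\begin{align*}
P(y,h|x;\theta) &= P(y|x;\theta)\,P(h|y,x;\theta) \\
\Rightarrow\quad -\log P(y|x;\theta) &= \log P(h|y,x;\theta) - \log P(y,h|x;\theta)
  \quad \text{for every } h \\
\Rightarrow\quad -\log P(y|x;\theta) &= \mathbb{E}_{P(h|y,x;\theta')}[\log P(h|y,x;\theta)]
  - \mathbb{E}_{P(h|y,x;\theta')}[\log P(y,h|x;\theta)] \\
&\le \mathbb{E}_{P(h|y,x;\theta')}[\log P(h|y,x;\theta')]
  - \mathbb{E}_{P(h|y,x;\theta')}[\log P(y,h|x;\theta)]
\end{align*}
```

The third line takes the expectation of the second over $h \sim P(h|y,x;\theta')$; the left-hand side does not depend on $h$ and so is unchanged. The last line uses Gibbs' inequality, with equality at $\theta = \theta'$. Its first term no longer depends on $\theta$, so minimizing the bound amounts to minimizing $-\mathbb{E}_{P(h|y,x;\theta')}[\log P(y,h|x;\theta)]$.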
Expectation Maximization
E-step: Compute $P(h|y,x;\theta_t)$
M-step: Obtain $\theta_{t+1}$ by solving the following problem
$\min_\theta \sum_i -\mathbb{E}_{P(h|y_i,x_i;\theta_t)}[\log P(y_i,h|x_i;\theta)] + \lambda \|\theta\|^2$
Repeat until convergence
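For the log-linear model used here, the E-step posterior has a simple form because $Z(x;\theta)$ cancels. The sketch below assumes a small discrete latent space and a precomputed score table standing in for $\theta^\top\Psi(x,y,h)$; the function name and numbers are hypothetical.

```python
import numpy as np

def posterior_over_latent(scores, y):
    """scores[y, h] stands for theta^T Psi(x, y, h); returns P(h | y, x; theta)."""
    row = scores[y]
    # P(h | y, x) = exp(score[y, h]) / sum_h' exp(score[y, h']); Z(x) cancels.
    return np.exp(row - np.logaddexp.reduce(row))

# Hypothetical scores for two labels and two latent values.
scores = np.array([[0.0, np.log(3.0)], [1.0, 1.0]])
posterior = posterior_over_latent(scores, y=0)
print(posterior)
```

The M-step would then weight the complete-data log-likelihood of each latent value by these probabilities.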
Latent SVM
Input $x$
Output $y \in Y$ (e.g. “Deer”)
Hidden variable $h \in H$
$Y$ = {“Bison”, “Deer”, “Elephant”, “Giraffe”, “Llama”, “Rhino”}
Latent SVM
Feature $\Psi(x,y,h)$ (HOG, BoW)
Parameters $w$
$(y(w),h(w)) = \arg\max_{y \in Y, h \in H} w^\top\Psi(x,y,h)$
Latent SVM
Training samples $x_i$
Ground-truth label $y_i$
Loss function $\Delta(y_i, y_i(w))$
Annotation Mismatch
Latent SVM
$(y(w),h(w)) = \arg\max_{y \in Y, h \in H} w^\top\Psi(x,y,h)$
$w^\top\Psi(x_i,y_i(w),h_i(w)) + \Delta(y_i, y_i(w)) - w^\top\Psi(x_i,y_i(w),h_i(w))$
“Very” non-convex
Latent SVM
$(y(w),h(w)) = \arg\max_{y \in Y, h \in H} w^\top\Psi(x,y,h)$
$w^\top\Psi(x_i,y_i(w),h_i(w)) + \Delta(y_i, y_i(w)) - \max_{h_i} w^\top\Psi(x_i,y_i,h_i)$
Upper bound
Latent SVM
$(y(w),h(w)) = \arg\max_{y \in Y, h \in H} w^\top\Psi(x,y,h)$
$\max_{y,h} w^\top\Psi(x_i,y,h) + \Delta(y_i, y) - \max_{h_i} w^\top\Psi(x_i,y_i,h_i)$
Upper bound
Latent SVM
$(y(w),h(w)) = \arg\max_{y \in Y, h \in H} w^\top\Psi(x,y,h)$
$\min_w \|w\|^2 + C \sum_i \xi_i$
subject to $\max_{h_i} w^\top\Psi(x_i,y_i,h_i) - w^\top\Psi(x_i,y,h) \ge \Delta(y_i, y) - \xi_i$
So is this convex?
Latent SVM
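$(y(w),h(w)) = \arg\max_{y \in Y, h \in H} w^\top\Psi(x,y,h)$
$\underbrace{\max_{y,h} w^\top\Psi(x_i,y,h) + \Delta(y_i, y)}_{\text{convex}} - \underbrace{\max_{h_i} w^\top\Psi(x_i,y_i,h_i)}_{\text{convex}}$
Difference-of-convex !!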
Concave-Convex Procedure
[Figure: the objective drawn as the sum of a convex part and a concave part]
Replace the concave part with a linear upper bound and minimize the resulting convex function
Repeat until convergence
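A one-dimensional toy problem makes the procedure concrete. The sketch assumes the objective $w^4 - 2w^2$, split as the convex part $u(w) = w^4$ minus the convex part $v(w) = 2w^2$; this function and the names `cccp_step` and `cccp` are illustrative choices, not from the slides.

```python
import numpy as np

def cccp_step(w_t):
    # Linearize the concave part -v(w) = -2 w^2 at w_t:
    #   -v(w) <= -v(w_t) - v'(w_t) (w - w_t), with v'(w_t) = 4 w_t.
    # The convex surrogate w^4 - 4 w_t w is minimized where 4 w^3 = 4 w_t.
    return float(np.cbrt(w_t))

def cccp(w0, tol=1e-12, max_iters=200):
    w = w0
    for _ in range(max_iters):
        w_next = cccp_step(w)
        if abs(w_next - w) < tol:
            return w_next
        w = w_next
    return w

print(cccp(0.1), cccp(-0.1), cccp(0.0))
```

Starting on either side of zero leads to the nearby minimizer at $+1$ or $-1$, while starting exactly at the stationary point $0$ stays there, which shows how the outcome depends on initialization.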
Latent SVM
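$(y(w),h(w)) = \arg\max_{y \in Y, h \in H} w^\top\Psi(x,y,h)$
$\max_{y,h} w^\top\Psi(x_i,y,h) + \Delta(y_i, y) - \max_{h_i} w^\top\Psi(x_i,y_i,h_i)$
Linear upper bound of the concave part at $w_t$: $-w^\top\Psi(x_i,y_i,h_i^*)$
$h_i^* = \arg\max_{h_i} w_t^\top\Psi(x_i,y_i,h_i)$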
Latent SVM
$(y(w),h(w)) = \arg\max_{y \in Y, h \in H} w^\top\Psi(x,y,h)$
$\min_w \|w\|^2 + C \sum_i \xi_i$
subject to $\max_{h_i} w^\top\Psi(x_i,y_i,h_i) - w^\top\Psi(x_i,y,h) \ge \Delta(y_i, y) - \xi_i$
Solve using CCCP
CCCP for Latent SVM
Update $h_i^* = \arg\max_{h_i \in H} w_t^\top\Psi(x_i,y_i,h_i)$
Update $w_{t+1}$ by solving a convex problem
$\min_w \|w\|^2 + C \sum_i \xi_i$
subject to $w^\top\Psi(x_i,y_i,h_i^*) - w^\top\Psi(x_i,y,h) \ge \Delta(y_i, y) - \xi_i$
http://webdocs.cs.ualberta.ca/~chunnam/
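The latent-variable update is a plain argmax over candidate values of $h_i$ with the label fixed to the ground truth. In this sketch each row of a small hypothetical array stands for $\Psi(x_i,y_i,h)$ for one candidate $h$; the function name `impute_latent` is an assumption.

```python
import numpy as np

def impute_latent(w, joint_features_for_true_label):
    """Row h of the array stands for Psi(x_i, y_i, h); returns argmax_h w^T Psi."""
    scores = joint_features_for_true_label @ w
    return int(np.argmax(scores))

w = np.array([1.0, -1.0])
candidates = np.array([[0.0, 1.0], [2.0, 0.5], [1.0, 0.0]])
h_star = impute_latent(w, candidates)
print(h_star)
```

With the latent values fixed this way, the remaining problem in $w$ has the same form as a structured output SVM.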
CCCP for Human Learning
[Cartoon: a learner is shown $1+1=2$, $1/3 + 1/6 = 1/2$, and $e^{i\pi}+1 = 0$; reaction: “Math is for losers !!”]
FAILURE … BAD LOCAL MINIMUM
Self-Paced Learning
[Cartoon: a learner is shown $1+1=2$, $1/3 + 1/6 = 1/2$, and $e^{i\pi}+1 = 0$; reaction: “Euler was a Genius!!”]
SUCCESS … GOOD LOCAL MINIMUM
Self-Paced Learning
Start with “easy” examples, then consider “hard” ones
Simultaneously estimate easiness and parameters
Easiness is property of data sets, not single instances
Easy vs. Hard
Expensive
Easy for human ≠ Easy for machine
CCCP for Latent SVM
Update $h_i^* = \arg\max_{h_i \in H} w_t^\top\Psi(x_i,y_i,h_i)$
Update $w_{t+1}$ by solving a convex problem
$\min_w \|w\|^2 + C \sum_i \xi_i$
subject to $w^\top\Psi(x_i,y_i,h_i^*) - w^\top\Psi(x_i,y,h) \ge \Delta(y_i, y) - \xi_i$
Self-Paced Learning
$\min_w \|w\|^2 + C \sum_i \xi_i$
subject to $w^\top\Psi(x_i,y_i,h_i^*) - w^\top\Psi(x_i,y,h) \ge \Delta(y_i, y, h) - \xi_i$
Self-Paced Learning
$v_i \in \{0,1\}$
$\min_{w,v} \|w\|^2 + C \sum_i v_i \xi_i$
subject to $w^\top\Psi(x_i,y_i,h_i^*) - w^\top\Psi(x_i,y,h) \ge \Delta(y_i, y, h) - \xi_i$
Trivial solution: $v_i = 0$ for all $i$
Self-Paced Learning
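$v_i \in \{0,1\}$
$\min_{w,v} \|w\|^2 + C \sum_i v_i \xi_i - \sum_i v_i/K$
subject to $w^\top\Psi(x_i,y_i,h_i^*) - w^\top\Psi(x_i,y,h) \ge \Delta(y_i, y, h) - \xi_i$
[Figure: samples selected for large $K$, medium $K$, and small $K$]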
Self-Paced Learning
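$v_i \in [0,1]$
$\min_{w,v} \|w\|^2 + C \sum_i v_i \xi_i - \sum_i v_i/K$
subject to $w^\top\Psi(x_i,y_i,h_i^*) - w^\top\Psi(x_i,y,h) \ge \Delta(y_i, y, h) - \xi_i$
Biconvex problem, solved by alternating convex search
[Figure: samples selected for large $K$, medium $K$, and small $K$]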
SPL for Latent SVM
Update
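$h_i^* = \arg\max_{h_i \in H} w_t^\top\Psi(x_i,y_i,h_i)$
Update $w_{t+1}$ by solving a biconvex problem
$\min_{w,v} \|w\|^2 + C \sum_i v_i \xi_i - \sum_i v_i/K$
subject to $w^\top\Psi(x_i,y_i,h_i^*) - w^\top\Psi(x_i,y,h) \ge \Delta(y_i, y) - \xi_i$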
Decrease $K \leftarrow K/\mu$
http://cvc.centrale-ponts.fr/personnel/pawan/
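For fixed $w$, the selection variables in the self-paced objective decouple across samples, which gives a simple thresholding rule. The sketch below assumes the objective $C\sum_i v_i\xi_i - \sum_i v_i/K$ with binary $v_i$; the function name and the slack values are hypothetical.

```python
import numpy as np

def select_easy_samples(slacks, C, K):
    """For fixed w, v_i = 1 lowers C*v_i*xi_i - v_i/K exactly when C*xi_i < 1/K."""
    return (C * np.asarray(slacks) < 1.0 / K).astype(int)

slacks = [0.05, 0.4, 2.0]                                # hypothetical slack values
v_large_k = select_easy_samples(slacks, C=1.0, K=10.0)   # threshold 1/K = 0.1
v_small_k = select_easy_samples(slacks, C=1.0, K=0.25)   # threshold 1/K = 4.0
print(v_large_k, v_small_k)
```

A large $K$ keeps only samples with small slack, and decreasing $K$ over iterations gradually admits harder samples until all are included.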
Outline – Part II
• Introduction to Weakly Supervised Learning
– Two types of problems
• Probabilistic Methods
– Expectation maximization
• Loss-based Methods
– Latent support vector machine
– Dissimilarity coefficient learning
(if time permits)