### M3N

```
Maximum Margin Markov Network
Ben Taskar
Carlos Guestrin
Daphne Koller
2004
Topics Covered

Main Idea.

Problem Setting.

Structure in classification problems.

Markov Model.

SVM

Combining SVM and Markov Network.

Generalization Bound.

Experiments and results.
Main Idea

Combining SVMs (a kernel-based approach) and Markov networks (a graphical
model) for sequential, structured learning.

SVM:
1) Able to use high-dimensional feature spaces.
2) Unable to exploit structure in the problem.

Markov network:
1) Able to represent correlations between labels by exploiting structure in
the problem.
2) Unable to deal with high-dimensional feature spaces.
Problem Setting

Multilabel Classification

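Training data as input: S = {(x^(i), y^(i) = t(x^(i)))}, i = 1, ..., m,
where each y = (y_1, ..., y_l) consists of l labels and each label takes one
of k classes.

The goal is to predict y given a new x.

We use OCR (handwriting recognition) data as the running example.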
Structure in classification problems

Feature function: f(x, y), mapping an input-output pair to a vector in R^n.

Hypothesis: h_w(x) = argmax_y w^T f(x, y).

For multilabel classification, the number of possible assignments to y is
exponential in the number of labels, which makes the arg max over y difficult
to compute.

An alternative approach is to use probabilistic graphical models.
Markov Model

Use a pairwise Markov network.

Defined as a graph G = (Y, E).

Each edge (i, j) ∈ E is associated with a potential function
ψ_ij(x, y_i, y_j).

The network encodes a conditional probability distribution as
P(y | x) ∝ ∏_{(i,j) ∈ E} ψ_ij(x, y_i, y_j).

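With log-linear potentials, log ψ_ij(x, y_i, y_j) = w^T f(x, y_i, y_j), and
with f(x, y) = Σ_{(i,j) ∈ E} f(x, y_i, y_j), we can take
argmax_y P_w(y | x) = argmax_y w^T f(x, y) to predict y for x.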
SVM
Combining SVM and Markov Network

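For single-label multi-class classification, Crammer and Singer provide an
extension of the SVM framework by maximizing the margin γ:
  max γ   s.t.  ||w|| ≤ 1;  w^T Δf_x(y) ≥ γ  for all x ∈ S and all y ≠ t(x),
where Δf_x(y) = f(x, t(x)) - f(x, y).

The constraints ensure that argmax_y w^T f(x, y) = t(x).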
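Here we are predicting multiple labels, so the loss function is not simply the
0-1 loss but a per-label loss.

More specifically, the margin between t(x) and y scales linearly with the
number of wrong labels in y:
  max γ   s.t.  ||w|| ≤ 1;  w^T [f(x, t(x)) - f(x, y)] ≥ γ Δt_x(y)
          for all x ∈ S and all y,
where Δt_x(y) = Σ_{i=1..l} I(y_i ≠ t(x)_i) is the number of labels in y that
differ from t(x).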

However, there is a problem with this approach, which is discussed in Taskar
et al.: it may give significant weight to outputs that are not even close to
the target, because every increase in the loss increases the required margin.

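Now, using a standard transformation to eliminate γ and introducing slack
variables ξ_x, we obtain the primal and dual:

Primal:
  min  (1/2) ||w||^2 + C Σ_x ξ_x
  s.t. w^T [f(x, t(x)) - f(x, y)] ≥ Δt_x(y) - ξ_x  for all x, y,
where Δt_x(y) counts the wrong labels in y.

Dual:
  max  Σ_{x,y} α_x(y) Δt_x(y)
       - (1/2) || Σ_{x,y} α_x(y) [f(x, t(x)) - f(x, y)] ||^2
  s.t. Σ_y α_x(y) = C for all x;  α_x(y) ≥ 0 for all x, y.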
Generalization Bound

Relates the training error to the test error.

Average per label loss:

margin per label loss:

with probability at least
where q is the maximum edge degree in the network, l is the number of
labels, K is a constant, and k is the number of classes per label.
Experiments and Results

Handwriting recognition:
1) The input corpus contains 6100 handwritten words.
2) The data set is divided into 10 folds of 600 training and 5500 testing
examples.
3) Accuracy results are averaged over the 10 folds.

Hypertext classification:
1) The dataset contains web pages from 4 different CS departments.
2) Each page is labelled as course, faculty, student, project, or other.
3) The model is learned from three schools and tested on the remaining one.
4) The error rate of M^3N is 40% lower than that of RMNs and 51% lower than
that of multi-class SVMs.
```
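
The prediction step argmax_y w^T f(x, y) and the per-label loss described in the slides can be illustrated for the simplest structure, a chain-shaped pairwise network such as the letters of one OCR word. What follows is a minimal sketch, not the authors' implementation. The function names, the use of NumPy, and the split of the score into a `node` table (one score per position and class) and a shared `edge` table (one score per pair of adjacent classes) are all assumptions made for illustration. Both tables stand in for w^T f evaluated on the corresponding parts of y.

```python
import numpy as np


def score(node, edge, y):
    """Total score of assignment y on a chain (stands in for w^T f(x, y)).

    node[i, a]: assumed score of giving class a to position i.
    edge[a, b]: assumed score of class a being followed by class b.
    """
    s = sum(node[i, y[i]] for i in range(len(y)))
    s += sum(edge[y[i], y[i + 1]] for i in range(len(y) - 1))
    return float(s)


def viterbi(node, edge):
    """Return argmax_y score(node, edge, y) in O(l * k^2) time.

    Dynamic programming avoids enumerating all k^l assignments.
    """
    l, k = node.shape
    best = node[0].copy()              # best[b]: best prefix score ending in b
    back = np.zeros((l, k), dtype=int)  # back[i, b]: best predecessor of b at i
    for i in range(1, l):
        cand = best[:, None] + edge     # cand[a, b]: prefix ends in a, then b
        back[i] = cand.argmax(axis=0)
        best = cand.max(axis=0) + node[i]
    y = [int(best.argmax())]
    for i in range(l - 1, 0, -1):       # follow back-pointers to position 0
        y.append(int(back[i, y[-1]]))
    return y[::-1]


def hamming(y, t):
    """Per-label loss: number of positions where y differs from target t."""
    return sum(a != b for a, b in zip(y, t))


def loss_augmented_viterbi(node, edge, t):
    """Return argmax_y [score(y) + hamming(y, t)].

    The Hamming loss is a sum over positions, so it can be folded into the
    node scores: every class except the target one gets a bonus of 1.
    """
    aug = node + 1.0
    aug[np.arange(len(t)), t] -= 1.0
    return viterbi(aug, edge)
```

Under these assumptions, `viterbi(node, edge)` plays the role of the hypothesis h_w(x), and `loss_augmented_viterbi(node, edge, t)` finds the assignment whose margin constraint (with the margin scaled by the number of wrong labels) is hardest to satisfy. The second function reuses the first unchanged, which is the practical benefit of a loss that decomposes over labels.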