### Online Algorithms

Lecturer: Yishay Mansour
Presented by: Alex Roitenberg
#### Introduction

- Up to now we received the entire input up front and worked with it offline
- Now suppose the input arrives a little at a time, and we need an instant response

#### Oranges example

- Suppose we are to build a robot that removes bad oranges from a kibbutz packaging line
- After each classification, the kibbutz worker looks at the orange and tells our robot whether its classification was correct
- And repeat indefinitely
- Our model:
  - Input: an unlabeled orange x_t
  - The algorithm predicts a classification
  - The algorithm then gets the correct classification c(x_t)

#### Introduction (cont.)

- At every step t, the algorithm predicts the classification based on some hypothesis h_t: the prediction is h_t(x_t)
- A mistake is an incorrect prediction: h_t(x_t) ≠ c(x_t)
- The goal is to build an algorithm with a bounded number of mistakes
- The number of mistakes should be independent of the input size

#### Linear Separators

- The goal: find a weight vector w defining a hyperplane w · x = 0
- All positive examples will be on one side of the hyperplane and all the negative ones on the other
- I.e., w · x > 0 for positive x only
- We will now look at several algorithms for finding the separator

#### Perceptron

- The idea: correct? Do nothing
- Wrong? Move the separator towards the mistake
- We'll scale all x's so that ‖x‖ = 1, since this doesn't affect which side of the plane they are on

#### The perceptron algorithm

1. Initialize w_1 = 0
2. Given x_t, predict positive iff w_t · x_t > 0
3. On a mistake:
   1. Mistake on positive: w_{t+1} ← w_t + x_t
   2. Mistake on negative: w_{t+1} ← w_t − x_t

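The update rule above fits in a few lines (a minimal sketch, not from the lecture; representing samples as `(vector, ±1-label)` pairs is my choice):

```python
def perceptron(samples):
    """Online perceptron.  `samples` is a list of (x, label) pairs where x is
    a unit-norm tuple of floats and label is +1 (positive) or -1 (negative)."""
    n = len(samples[0][0])
    w = [0.0] * n                              # w_1 = 0
    mistakes = 0
    for x, label in samples:
        dot = sum(wi * xi for wi, xi in zip(w, x))
        pred = 1 if dot > 0 else -1            # predict positive iff w.x > 0
        if pred != label:                      # mistake: move w towards/away from x
            mistakes += 1
            w = [wi + label * xi for wi, xi in zip(w, x)]
    return w, mistakes
```

On separable data with margin γ, the mistake counter stays below 1/γ², per the theorem below.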
#### The perceptron algorithm

- Suppose x_t is a positive sample
- If we misclassified x_t, then after the update we'll get w_{t+1} · x_t = (w_t + x_t) · x_t = w_t · x_t + 1 (using ‖x_t‖ = 1)
- c(x_t) was positive, but since we made a mistake w_t · x_t was negative, so the correction was made in the right direction

#### Mistake Bound Theorem

- Let S = ⟨(x_t, c(x_t))⟩ be consistent with some w*: w* · x_t > 0 ⟺ c(x_t) = 1
- M = |{t : h_t(x_t) ≠ c(x_t)}| is the number of mistakes
- Then M ≤ 1/γ², where

  γ = min_{x ∈ S} |w* · x| / (‖w*‖ · ‖x‖)

  is the margin of S: the minimal distance of the samples in S from the hyperplane w* · x = 0 (after normalizing both w* and the samples)

#### Mistake Bound Proof

- WLOG, the algorithm makes a mistake on every step (otherwise nothing happens)
- Claim 1: w_{t+1} · w* ≥ w_t · w* + γ (taking ‖w*‖ = 1, ‖x_t‖ = 1, so |x_t · w*| ≥ γ)
- Proof:
  - w* · x_t > 0: w_{t+1} · w* = (w_t + x_t) · w* = w_t · w* + x_t · w* ≥ w_t · w* + γ
  - w* · x_t < 0: w_{t+1} · w* = (w_t − x_t) · w* = w_t · w* − x_t · w* ≥ w_t · w* + γ

#### Proof Cont.

- Claim 2: ‖w_{t+1}‖² ≤ ‖w_t‖² + 1
- Proof:
  - Mistake on a positive example: ‖w_{t+1}‖² = ‖w_t + x_t‖² = ‖w_t‖² + 2 w_t · x_t + 1 ≤ ‖w_t‖² + 1, since w_t · x_t < 0 when the algorithm made a mistake
  - Mistake on a negative example: ‖w_{t+1}‖² = ‖w_t − x_t‖² = ‖w_t‖² − 2 w_t · x_t + 1 ≤ ‖w_t‖² + 1, since w_t · x_t > 0 when the algorithm made a mistake

#### Proof Cont.

- From Claim 1: w_{M+1} · w* ≥ Mγ
- From Claim 2: ‖w_{M+1}‖ ≤ √M
- Also: w_t · w* ≤ ‖w_t‖ · ‖w*‖ = ‖w_t‖
- Combining: Mγ ≤ w_{M+1} · w* ≤ ‖w_{M+1}‖ ≤ √M
- Therefore M ≤ 1/γ²

#### The world is not perfect

- What if there is no perfect separator?

#### The world is not perfect

- Claim 1 (reminder): w_{t+1} · w* ≥ w_t · w* + γ
- Previously we made γ progress on each mistake
- Now we might be making negative progress
- TD_γ = the total distance we would have to move the points to make them separable by a margin γ
- So: w_{M+1} · w* ≥ Mγ − TD_γ
- With Claim 2: Mγ − TD_γ ≤ w_{M+1} · w* ≤ ‖w_{M+1}‖ ≤ √M
- Solving for M: M ≤ 1/γ² + (2/γ) · TD_γ

#### The world is not perfect

- TD_γ is the total hinge loss of w*
- Alt. definition: for a sample (x, y), the hinge loss of w* is ℓ(x) = max(0, 1 − y (w* · x)/γ)
- (Hinge-loss illustration omitted: the loss is 0 for points with margin at least γ and grows linearly as a point crosses the margin towards the wrong side)

#### Perceptron for maximizing margins

- The idea: update whenever the correct classification margin is less than γ/2
- Number of steps polynomial in 1/γ
- Generalization: use the update margin γ → (1 − ε)γ
- Number of steps polynomial in 1/ε

#### Perceptron Algorithm (maximizing margin)

- Assuming ∀x ∈ S: ‖x‖ = 1; write ℓ(x) = ±1 for the label of x
- Init: w_1 ← ℓ(x_1) x_1
- Predict:
  - (w_t · x)/‖w_t‖ ≥ γ/2 → predict positive
  - (w_t · x)/‖w_t‖ ≤ −γ/2 → predict negative
  - (w_t · x)/‖w_t‖ ∈ (−γ/2, γ/2) → a margin mistake
- On a mistake (prediction or margin), update: w_{t+1} ← w_t + ℓ(x_t) x_t

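A sketch of this variant (assuming, as on the slides, unit-norm samples and ±1 labels; initializing with the first labeled example is my reading of the garbled init line):

```python
import math

def margin_perceptron(samples, gamma):
    """Margin perceptron sketch: update on any misclassified example and on
    any example whose normalized margin falls below gamma / 2."""
    x0, y0 = samples[0]
    w = [y0 * xi for xi in x0]                 # init with the first labeled example
    updates = 0                                # mistakes + margin mistakes
    for x, y in samples[1:]:
        norm = math.sqrt(sum(wi * wi for wi in w))
        margin = y * sum(wi * xi for wi, xi in zip(w, x)) / norm
        if margin < gamma / 2:                 # prediction mistake or margin mistake
            updates += 1
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w, updates
```

Per the theorem below, `updates` stays below 12/γ² on data separable with margin γ.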
#### Mistake Bound Theorem

- Let S = ⟨(x_t, c(x_t))⟩ be consistent with w*: w* · x_t > 0 ⟺ c(x_t) = 1, with ‖w*‖ = 1
- M = number of mistakes + number of margin mistakes
- Then M ≤ 12/γ², where γ = min_{x ∈ S} |w* · x| / ‖x‖ is the margin of S
- The proof is similar to the perceptron proof
- Claim 1 remains the same: w_{t+1} · w* ≥ w_t · w* + γ
- We only have to bound ‖w_{t+1}‖

#### Mistake bound proof

- WLOG, the algorithm makes a mistake (or a margin mistake) on every step
- Claim 2: ‖w_{t+1}‖ ≤ ‖w_t‖ + γ/2 + 1/(2‖w_t‖)
- Proof: ‖w_{t+1}‖² = ‖w_t + ℓ(x_t) x_t‖² = ‖w_t‖² + 2 ℓ(x_t)(w_t · x_t) + 1
- And since √(1 + z) ≤ 1 + z/2:

  ‖w_t‖ √(1 + γ/‖w_t‖ + 1/‖w_t‖²) ≤ ‖w_t‖ (1 + γ/(2‖w_t‖) + 1/(2‖w_t‖²)) = ‖w_t‖ + γ/2 + 1/(2‖w_t‖)

#### Proof Cont.

- Since the algorithm made a mistake (or a margin mistake) on step t: ℓ(x_t)(w_t · x_t) ≤ (γ/2)‖w_t‖
- And thus:

  ‖w_{t+1}‖² ≤ ‖w_t‖² + γ‖w_t‖ + 1 = ‖w_t‖² (1 + γ/‖w_t‖ + 1/‖w_t‖²)

- So: ‖w_{t+1}‖ ≤ ‖w_t‖ + γ/2 + 1/(2‖w_t‖)

#### Proof Cont.

- So: ‖w_{t+1}‖ ≤ ‖w_t‖ + γ/2 + 1/(2‖w_t‖)
- If ‖w_t‖ ≥ 2/γ: ‖w_{t+1}‖ ≤ ‖w_t‖ + γ/2 + γ/4 = ‖w_t‖ + 3γ/4
- Therefore: ‖w_{M+1}‖ ≤ 2/γ + 1 + (3γ/4) M
- From Claim 1, as before: γM ≤ w_{M+1} · w* ≤ ‖w_{M+1}‖ ≤ 2/γ + 1 + (3γ/4) M
- Solving, we get: M ≤ 8/γ² + 4/γ ≤ 12/γ²

#### The mistake bound model: CON Algorithm

- C_t ⊆ C: the set of concepts consistent on (x_1, c(x_1)), ..., (x_{t−1}, c(x_{t−1}))
- At step t:
  - Randomly choose a concept c ∈ C_t
  - Predict h_t(x_t) = c(x_t)

#### CON Algorithm

- Theorem: for any concept class C, CON makes at most |C| − 1 mistakes
- Proof: at first C_1 = C
- After each mistake |C_t| decreases by at least 1
- |C_t| ≥ 1, since the target concept is in C_t at any t
- Therefore the number of mistakes is bounded by |C| − 1

#### The bounds of CON

- This bound is too high!
- There are 2^(2^n) different functions on {0,1}^n
- We can do better!

#### HAL – the halving algorithm

- C_t ⊆ C: the set of concepts consistent on (x_1, c(x_1)), ..., (x_{t−1}, c(x_{t−1}))
- At step t:
  - Conduct a vote amongst all c ∈ C_t
  - Predict h_t(x_t) in accordance with the majority

#### HAL – the halving algorithm

- Theorem: for any concept class C, HAL makes at most log₂|C| mistakes
- Proof: C_1 = C. After each mistake |C_{t+1}| ≤ |C_t|/2, since a majority of the concepts were wrong
- Therefore the number of mistakes is bounded by log₂|C|

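The halving algorithm is easy to state in code (a sketch; representing the class as a dict of named predicate functions is my choice):

```python
def halving(concepts, samples):
    """HAL: predict by majority vote over the concepts still consistent.
    `concepts` maps names to functions x -> 0/1; `samples` is (x, c(x)) pairs."""
    version_space = dict(concepts)             # C_1 = C
    mistakes = 0
    for x, label in samples:
        votes = sum(c(x) for c in version_space.values())
        pred = 1 if 2 * votes >= len(version_space) else 0   # majority vote
        if pred != label:
            mistakes += 1                      # a majority was wrong: |C_t| halves
        version_space = {name: c for name, c in version_space.items()
                         if c(x) == label}     # keep only consistent concepts
    return mistakes, version_space
```

With the four boolean functions of one bit, for example, HAL makes at most log₂ 4 = 2 mistakes.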
#### Mistake Bound model and PAC

- The mistake bound model generates strong online algorithms
- In the past we have seen PAC
- The restrictions of the mistake bound model are much harsher than PAC's
- If we know that A learns C in the mistake bound model, can A learn C in the PAC model?

#### Mistake Bound model and PAC

- A – a mistake bound algorithm
- Our goal: to construct A_PAC, a PAC algorithm
- Assume that after A gets x_i it constructs hypothesis h_i
- Definition: a mistake bound algorithm A is conservative iff for every sample x_i, if c(x_i) = h_{i−1}(x_i), then in the i-th step the algorithm keeps h_i = h_{i−1}; i.e., it changes hypothesis only on a mistake

#### Conservative equivalent of a Mistake Bound Algorithm

- Let A be an algorithm whose number of mistakes is bounded by M
- A_k is A's hypothesis after it has seen {x_1, x_2, ..., x_k}
- Define A′:
  - Initially h_0 = A_0
  - At x_t update:
    - Guess h_{t−1}(x_t)
    - If c(x_t) = h_{t−1}(x_t): h_t = h_{t−1}
    - Else: feed x_t to A and set h_t to A's new hypothesis
- If we run A on S = {x_t : c(x_t) ≠ h_{t−1}(x_t)} it would make |S| mistakes ⇒ A′ makes at most as many mistakes as A

#### Building A_PAC

- m = (M/ε) ln(M/δ)
- A_PAC algorithm:
  - Run A′ over a sample of size (M/ε) ln(M/δ), divided into M equal blocks of size (1/ε) ln(M/δ), indexed 0 ≤ i ≤ M − 1
  - Build hypothesis h_{k_i} for each block
  - Run the hypothesis on the next block
  - If there are no mistakes on a block, output h_{k_i}
- (Diagram omitted: the M blocks are scanned left to right; A′ moves on after each inconsistent block and stops at the first consistent one)

#### Building A_PAC

- If A′ exhausts its M mistakes, its current hypothesis is perfect; since there are M blocks, A_PAC is guaranteed to finish
- → A_PAC outputs a perfect classifier
- What happens otherwise?
- Theorem: A_PAC learns C in the PAC model
- Proof: Pr(h_{k_i} is ε-bad and consistent on its test block) ≤ (1 − ε)^{(1/ε) ln(M/δ)} ≤ e^{−ln(M/δ)} = δ/M
- Pr(A_PAC outputs an ε-bad h) ≤ Pr(∃ 0 ≤ i ≤ M − 1 s.t. h_{k_i} is ε-bad and output) ≤ Σ_{i=0}^{M−1} Pr(h_{k_i} is ε-bad and consistent on its block) ≤ M · (δ/M) = δ

#### Disjunction of Conjunctions

- We have proven that every algorithm in the mistake bound model can be converted to PAC
- Let's look at some algorithms in the mistake bound model

#### Disjunction Learning

- Our goal: learn the class of disjunctions, e.g. x_1 ∨ x_2 ∨ x̄_6 ∨ x_8
- Let L be the literal set {x_1, x̄_1, x_2, x̄_2, ..., x_n, x̄_n}
- h = ⋁_{ℓ ∈ L} ℓ
- Given a sample y do:
  1. If our hypothesis makes a mistake (h_t(y) ≠ c_t(y)) then:
     1. S = {the literals of L satisfied by y}
     2. L ← L \ S
  2. Else do nothing

#### Example

- If we have only 2 variables:
  - L is {x_1, x̄_1, x_2, x̄_2}
  - h = x_1 ∨ x̄_1 ∨ x_2 ∨ x̄_2
- Assume the first sample is y = (1, 0):
  - h(y) = 1
  - If c(y) = 0:
    - we update L = {x̄_1, x_2}
    - h_{t+1} = x̄_1 ∨ x_2

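A sketch of the elimination algorithm (encoding the literal x_i as `(i, True)` and x̄_i as `(i, False)` is my choice):

```python
def learn_disjunction(samples, n):
    """Elimination algorithm for (non-monotone) disjunctions over n variables.
    `samples` is a list of (y, label) pairs, y a 0/1 tuple."""
    literals = {(i, b) for i in range(n) for b in (True, False)}  # L_0: all 2n literals
    mistakes = 0
    for y, label in samples:
        satisfied = {(i, b) for (i, b) in literals if (y[i] == 1) == b}
        pred = 1 if satisfied else 0       # h = OR of the remaining literals
        if pred != label:                  # mistakes happen only on negatives
            mistakes += 1
            literals -= satisfied          # eliminate every literal satisfied by y
    return literals, mistakes
```

Running it on the slide's example (n = 2, first sample y = (1, 0) labeled 0) leaves exactly {x̄_1, x_2}.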
#### Mistake Bound Analysis

- The number of mistakes is bounded by n + 1
- n is the number of variables
- Proof:
  - Let R be the set of literals in the target concept c
  - Let L_t be the hypothesis literal set after t samples

#### Mistake Bound Analysis

- We prove by induction that R ⊆ L_t
- For t = 0 it is obvious that R ⊆ L_0
- Assume after t − 1 samples R ⊆ L_{t−1}
  - If c(y) = 1 then some literal of R ⊆ L_{t−1} is satisfied by y, so h(y) = 1 = c(y) and we don't update
  - If c(y) = 0 then of course no literal of R is satisfied by y, so the removed set S and R don't intersect
  - Either way R ⊆ L_t
- Thus we can only make mistakes when c(y) = 0

#### Mistake analysis proof

- At the first mistake we eliminate n literals (y satisfies exactly one of x_i, x̄_i for each i)
- At any further mistake we eliminate at least 1 literal
- L_0 has 2n literals
- So we can have at most n + 1 mistakes

#### k-DNF

- Definition: k-DNF functions are functions that can be represented by a disjunction of conjunctions in which there are at most k literals
- E.g. a 3-DNF: (x_1 ∧ x_2 ∧ x̄_6) ∨ (x̄_1 ∧ x_3 ∧ x_5)
- The number of conjunctions of i literals is C(n, i) · 2^i:
  - We choose i variables (C(n, i) options), and for each of them we choose a sign (2^i options)

#### k-DNF classification

- We can learn this class by changing the previous algorithm to deal with terms (conjunctions) instead of variables
- Reducing the space X = {0,1}^n to X′ = {0,1}^{O(n^k)} gives a disjunction on the new variables
- 2 usable algorithms:
  - ELIM for PAC
  - The previous algorithm (in the mistake bound model), which makes O(n^k) mistakes

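The reduction can be sketched as a feature expansion (function name and encoding are mine):

```python
from itertools import combinations, product

def kdnf_features(x, k):
    """Map x in {0,1}^n to the truth value of every conjunction of at most k
    literals; a k-DNF over x becomes a plain disjunction over these features."""
    n = len(x)
    feats = {}
    for i in range(1, k + 1):
        for vars_ in combinations(range(n), i):             # choose i variables: C(n, i)
            for signs in product((True, False), repeat=i):  # a sign for each: 2^i
                val = all((x[v] == 1) == s for v, s in zip(vars_, signs))
                feats[tuple(zip(vars_, signs))] = 1 if val else 0
    return feats
```

For n = 3 and k = 2 this produces C(3,1)·2 + C(3,2)·4 = 18 features, on which the disjunction learner runs unchanged.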
#### Winnow

- Monotone disjunction: a disjunction containing only positive literals, e.g. x_1 ∨ x_3 ∨ x_5
- Purpose: to learn the class of monotone disjunctions in the mistake-bound model
- We look at Winnow, which is similar to the perceptron
- One main difference: it uses multiplicative updates

#### Winnow

- Same classification scheme as the perceptron, with threshold n:
  - w_t · x ≥ n ⇒ h_t(x) = 1
  - w_t · x < n ⇒ h_t(x) = 0
- Initialize w_0 = (1, 1, ..., 1)
- Update scheme:
  - On a positive misclassification (h(x) = 1, c(x) = 0): ∀i s.t. x_i = 1: w_i ← w_i / 2
  - On a negative misclassification (h(x) = 0, c(x) = 1): ∀i s.t. x_i = 1: w_i ← 2 w_i

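Winnow's update scheme as code (a minimal sketch; bit-vector tuples as input are my choice):

```python
def winnow(samples, n):
    """Winnow for monotone disjunctions over n variables: threshold n,
    multiplicative updates.  `samples` is a list of (x, label) pairs."""
    w = [1.0] * n                              # w_0 = (1, ..., 1)
    mistakes = 0
    for x, label in samples:
        total = sum(wi for wi, xi in zip(w, x) if xi == 1)
        pred = 1 if total >= n else 0          # predict positive iff w.x >= n
        if pred != label:
            mistakes += 1
            factor = 0.5 if pred == 1 else 2.0   # halve on false positive, double on false negative
            w = [wi * factor if xi == 1 else wi for wi, xi in zip(w, x)]
    return w, mistakes
```

With target x_1 over n = 4 variables, two doublings already push w_1 to the threshold, well within the 2 + 3r(log₂ n + 1) bound proved below.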
#### Mistake bound analysis

- Similarly to the perceptron, if the margin is bigger than γ then we can prove the number of mistakes is Θ(1/γ²)

#### Winnow Proof: Definitions

- Let R = {x_{i_1}, x_{i_2}, ..., x_{i_r}} be the set of relevant variables in the target concept
- I.e., c = x_{i_1} ∨ x_{i_2} ∨ ... ∨ x_{i_r}
- We define w_{i_1}, ..., w_{i_r} as the weights of the relevant variables
- Let w_i(t) be the weight w_i at time t
- Let TW(t) be the total weight at time t over all variables, both relevant and irrelevant

#### Winnow Proof: Positive Mistakes

- Let's look at the positive mistakes (h(x) = 0, c(x) = 1)
- Any mistake on a positive example doubles (at least) 1 of the relevant weights: ∃i ∈ R s.t. w_i(t + 1) = 2 w_i(t)
- If ∃i s.t. w_i ≥ n, then whenever x_i = 1 we get w · x ≥ n, and therefore always a positive classification
- So, ∀i: w_i can only be doubled at most 1 + log₂ n times
- Thus, we can bound the number of positive mistakes: M⁺ ≤ r (1 + log₂ n)

#### Winnow Proof: Positive Mistakes

- For a positive mistake:
  - h_t(x) = 0, so Σ_{i : x_i = 1} w_i(t) < n
  - TW(t + 1) = TW(t) + Σ_{i : x_i = 1} w_i(t)
  - ⇒ (1) TW(t + 1) < TW(t) + n

#### Winnow Proof: Negative Mistakes

- On negative examples none of the relevant weights change (all relevant x_i = 0)
- Thus ∀i ∈ R: w_i(t + 1) ≥ w_i(t)
- For a negative mistake to occur:
  - Σ_{i : x_i = 1} w_i(t) ≥ n
  - TW(t + 1) = TW(t) − (1/2) Σ_{i : x_i = 1} w_i(t)
  - ⇒ (2) TW(t + 1) ≤ TW(t) − n/2

#### Winnow Proof: Cont.

- Combining the equations (1), (2):

  (3) 0 < TW(t) ≤ TW(0) + n M⁺ − (n/2) M⁻

- At the beginning all weights are 1, so TW(0) = n
- (3) ⇒ (4) M⁻ < 2 + 2 M⁺ ≤ 2 + 2r (log₂ n + 1)
- ⇒ M⁻ + M⁺ ≤ 2 + 3r (log₂ n + 1)

#### What should we know? I

- Linear separators:
  - Perceptron algorithm: M ≤ 1/γ²
  - Margin perceptron: M ≤ 12/γ²
- The mistake bound model:
  - CON algorithm: M ≤ |C| − 1, but C may be very large!
  - HAL, the halving algorithm: M ≤ log₂|C|

#### What should you know? II

- The relation between PAC and the mistake bound model
- The basic algorithm for learning disjunctions of conjunctions
- Learning k-DNF functions
- Winnow algorithm: M ≤ 2 + 3r (log₂ n + 1)

#### Questions?