
Online Algorithms
Lecturer: Yishay Mansour
Scribes: Elad Walach, Alex Roitenberg

Introduction

Up until now, our algorithms started with the entire input available and worked on it. Suppose instead that the input arrives a little at a time, and an instant response is needed after each piece.

Oranges example: suppose we are to build a robot that removes bad oranges from a kibbutz packaging line. After each classification, the kibbutz worker looks at the orange and tells our robot whether its classification was correct, and this repeats indefinitely. Our model:
- Input: an unlabeled orange $x_t$.
- Output: a classification (good or bad).
- The algorithm then gets the correct classification $c(x_t)$.

At every step $t$, the algorithm predicts the classification based on some hypothesis $h_t$, and then receives the correct classification $c(x_t)$. A mistake is an incorrect prediction: $h_t(x_t) \neq c(x_t)$. The goal is to build an algorithm with a bounded number of mistakes, where the bound is independent of the input size.

Linear Separators

The goal: find $w$ and $w_0$ defining a hyperplane $w \cdot x = w_0$ such that all positive examples lie on one side of the hyperplane and all negative ones on the other, i.e. $w \cdot x > w_0$ for the positive examples only. For simplicity we take a hyperplane through the origin, $w_0 = 0$. We will now look at several algorithms that find such a separator.

Perceptron

The idea: prediction correct? Do nothing. Wrong? Move the separator towards the mistake. We scale all the $x$'s so that $\|x\| = 1$, since this does not affect which side of the plane they are on.

The perceptron algorithm:
1. Initialize $w_1 = 0$.
2. Given $x$, predict positive iff $w_t \cdot x > 0$.
3. On a mistake:
   - mistake on a positive example: $w_{t+1} \leftarrow w_t + x$;
   - mistake on a negative example: $w_{t+1} \leftarrow w_t - x$.
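The perceptron loop above can be sketched in a few lines of Python (a minimal sketch, not from the lecture; the $\pm 1$ label encoding and the function name are our own choices, and the examples are assumed to be pre-scaled to unit norm as in the lecture):

```python
def perceptron(samples):
    """Online perceptron. `samples` is a sequence of (x, y) pairs with
    y in {+1, -1} and each x pre-scaled to unit norm.
    Returns the final weight vector and the number of mistakes."""
    w = [0.0] * len(samples[0][0])          # w_1 = 0
    mistakes = 0
    for x, y in samples:
        dot = sum(wi * xi for wi, xi in zip(w, x))
        prediction = 1 if dot > 0 else -1   # predict positive iff w.x > 0
        if prediction != y:
            mistakes += 1
            # move the separator towards the mistake:
            # w <- w + x on a positive mistake, w <- w - x on a negative one
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w, mistakes
```

For example, on the sequence $((1,0),+1), ((-1,0),-1)$ the first prediction is a mistake (since $w_1 \cdot x = 0$), and the single update already yields a consistent separator.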
The perceptron algorithm: why the update helps

Suppose a positive sample $x$. If we misclassified $x$, then after the update we get
$w_{t+1} \cdot x = (w_t + x) \cdot x = w_t \cdot x + 1$.
The label was positive, but since we made a mistake $w_t \cdot x$ was negative, so the correction was made in the right direction.

Mistake Bound Theorem

Let $S = \langle (x_t, y_t) \rangle$ be consistent with $w^*$, i.e. $w^* \cdot x_t > 0 \iff y_t = 1$, and let $M = |\{t : h_t(x_t) \neq y_t\}|$ be the number of mistakes. Then
$M \le 1/\gamma^2$,
where $\gamma = \min_{x \in S} \frac{|w^* \cdot x|}{\|w^*\|\,\|x\|}$ is the margin: the minimal distance of the samples in $S$ from the hyperplane of $w^*$ (after normalizing both $w^*$ and the samples, so below we take $\|w^*\| = \|x\| = 1$).

Mistake Bound Proof

WLOG, the algorithm makes a mistake on every step (otherwise nothing happens).

Claim 1: $w_{t+1} \cdot w^* \ge w_t \cdot w^* + \gamma$.
Proof. If $y_t > 0$ then $w^* \cdot x_t \ge \gamma$, so
$w_{t+1} \cdot w^* = (w_t + x_t) \cdot w^* = w_t \cdot w^* + x_t \cdot w^* \ge w_t \cdot w^* + \gamma$.
If $y_t < 0$ then $w^* \cdot x_t \le -\gamma$, so
$w_{t+1} \cdot w^* = (w_t - x_t) \cdot w^* = w_t \cdot w^* - x_t \cdot w^* \ge w_t \cdot w^* + \gamma$.

Claim 2: $\|w_{t+1}\|^2 \le \|w_t\|^2 + 1$.
Proof. If $y_t > 0$:
$\|w_{t+1}\|^2 = \|w_t + x_t\|^2 = \|w_t\|^2 + 2\,w_t \cdot x_t + 1 \le \|w_t\|^2 + 1$,
since $w_t \cdot x_t < 0$ when the algorithm makes a mistake on a positive example.
If $y_t < 0$:
$\|w_{t+1}\|^2 = \|w_t - x_t\|^2 = \|w_t\|^2 - 2\,w_t \cdot x_t + 1 \le \|w_t\|^2 + 1$,
since $w_t \cdot x_t > 0$ when the algorithm makes a mistake on a negative example.

Proof (cont.)

From Claim 1 (and $w_1 = 0$): $w_{M+1} \cdot w^* \ge M\gamma$. From Claim 2: $\|w_{M+1}\| \le \sqrt{M}$. Also $w_{M+1} \cdot w^* \le \|w_{M+1}\|\,\|w^*\| = \|w_{M+1}\|$. Combining:
$M\gamma \le w_{M+1} \cdot w^* \le \|w_{M+1}\| \le \sqrt{M}$,
hence $M \le 1/\gamma^2$.

The world is not perfect

What if there is no perfect separator? Recall Claim 1: $w_{t+1} \cdot w^* \ge w_t \cdot w^* + \gamma$. Previously we made $\gamma$ progress on each mistake; now we might be making negative progress. Let $TD_\gamma$ be the total distance we would have to move the points to make them separable by a margin $\gamma$. Then
$w_{M+1} \cdot w^* \ge M\gamma - TD_\gamma$,
and together with Claim 2,
$M\gamma - TD_\gamma \le \|w_{M+1}\| \le \sqrt{M}$,
which solves to
$M \le \frac{1}{\gamma^2} + \frac{2}{\gamma}\,TD_\gamma$.

Alternatively, $TD_\gamma$ can be phrased via the total hinge loss of $w^*$, where the hinge loss of an example $(x, y)$ is $\ell(x) = \max\left(0,\; 1 - \frac{y\,(w^* \cdot x)}{\gamma}\right)$. [Figure: the hinge loss as a function of the signed margin with respect to $w^*$.]

Perceptron for maximizing margins

The idea: update whenever the correct classification's margin is less than $\gamma/2$; the number of steps is polynomial in $1/\gamma$. Generalization: update whenever the margin is less than $(1 - \varepsilon)\gamma$; the number of steps is polynomial in $1/\varepsilon$ and $1/\gamma$.
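The bound $M \le 1/\gamma^2$ can be checked numerically. The sketch below is our illustration, not from the lecture: the sample points and the separator $w^* = (1, 0)$ are invented for the demo. It computes the margin of a small separable sample and runs the perceptron over it repeatedly, normalizing each example to unit length as the lecture prescribes:

```python
import math

def margin(samples, w_star):
    """gamma = min over samples of |w* . x| / (||w*|| ||x||)."""
    norm_w = math.sqrt(sum(c * c for c in w_star))
    gammas = []
    for x, _ in samples:
        norm_x = math.sqrt(sum(c * c for c in x))
        gammas.append(abs(sum(a * b for a, b in zip(w_star, x))) / (norm_w * norm_x))
    return min(gammas)

def perceptron_mistakes(samples, passes=50):
    """Run the perceptron repeatedly over the sample, scaling each x so
    that ||x|| = 1, and count every mistake made along the way."""
    w = [0.0] * len(samples[0][0])
    mistakes = 0
    for _ in range(passes):
        for x, y in samples:
            norm = math.sqrt(sum(c * c for c in x))
            x = [c / norm for c in x]               # scale so ||x|| = 1
            predicted = 1 if sum(a * b for a, b in zip(w, x)) > 0 else -1
            if predicted != y:
                mistakes += 1
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return mistakes

# a small sample that w* = (1, 0) separates with margin gamma of about 0.94
samples = [((1.0, 0.2), 1), ((0.9, -0.1), 1), ((-1.0, 0.1), -1), ((-0.8, -0.3), -1)]
gamma = margin(samples, (1.0, 0.0))
bound = 1.0 / gamma ** 2    # the theorem allows at most this many mistakes
```

On this sample the perceptron errs only on the very first example, well within the bound $1/\gamma^2 \approx 1.14$.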
Perceptron Algorithm (maximizing margin)

Assume $\forall x \in S$, $\|x\| = 1$. Initialize $w_1 \leftarrow y_1 x_1$. Predict:
- $\frac{w_t \cdot x}{\|w_t\|} \ge \gamma/2 \Rightarrow$ positive;
- $\frac{w_t \cdot x}{\|w_t\|} \le -\gamma/2 \Rightarrow$ negative;
- $\frac{w_t \cdot x}{\|w_t\|} \in (-\gamma/2, \gamma/2) \Rightarrow$ a margin mistake.
On a mistake (prediction or margin), update: $w_{t+1} \leftarrow w_t + y_t x_t$.

Mistake Bound Theorem

Let $S = \langle (x_t, y_t) \rangle$ be consistent with $w^*$, $\|w^*\| = 1$: $w^* \cdot x_t > 0 \iff y_t = 1$. Let $M$ be the number of mistakes plus the number of margin mistakes. Then $M \le 12/\gamma^2$, where $\gamma = \min_{x \in S} |w^* \cdot x|$ is the margin of $S$.

The proof is similar to the perceptron proof. Claim 1 remains the same: $w_{t+1} \cdot w^* \ge w_t \cdot w^* + \gamma$, since even on a margin mistake the update is made in the correct direction. We only have to bound $\|w_{t+1}\|$.

Mistake bound proof

WLOG, the algorithm makes a mistake (prediction or margin) on every step.

Claim 2: $\|w_{t+1}\| \le \|w_t\| + \frac{\gamma}{2} + \frac{1}{2\|w_t\|}$.
Proof:
$\|w_{t+1}\| = \|w_t + y_t x_t\| = \left(\|w_t\|^2 + 2\,y_t\, w_t \cdot x_t + 1\right)^{1/2}$.
Since the algorithm made a mistake on step $t$, $y_t\, w_t \cdot x_t \le \frac{\gamma}{2}\|w_t\|$, thus
$\|w_{t+1}\| \le \|w_t\|\left(1 + \frac{\gamma}{\|w_t\|} + \frac{1}{\|w_t\|^2}\right)^{1/2} \le \|w_t\|\left(1 + \frac{\gamma}{2\|w_t\|} + \frac{1}{2\|w_t\|^2}\right) = \|w_t\| + \frac{\gamma}{2} + \frac{1}{2\|w_t\|}$,
using $\sqrt{1 + \alpha} \le 1 + \alpha/2$.

Proof (cont.)

If $\|w_t\| \ge 2/\gamma$ then $\frac{1}{2\|w_t\|} \le \frac{\gamma}{4}$, so $\|w_{t+1}\| \le \|w_t\| + \frac{3\gamma}{4}$. Hence
$\|w_{M+1}\| \le \frac{2}{\gamma} + 1 + \frac{3\gamma}{4}M$.
From Claim 1 as before, $M\gamma \le w_{M+1} \cdot w^* \le \|w_{M+1}\|$. Solving
$M\gamma \le \frac{2}{\gamma} + 1 + \frac{3\gamma}{4}M$
we get $M \le \frac{8}{\gamma^2} + \frac{4}{\gamma} \le \frac{12}{\gamma^2}$ (using $\gamma \le 1$).

The mistake bound model: the CON algorithm

Let $C_t \subseteq C$ be the set of concepts consistent with $x_1, x_2, \ldots, x_{t-1}$. At step $t$, randomly choose a concept $c \in C_t$ and predict $c(x_t)$.

Theorem: for any concept class $C$, CON makes at most $|C| - 1$ mistakes.
Proof: at first $C_1 = C$. After each mistake $|C_t|$ decreases by at least 1, and $|C_t| \ge 1$ since the target concept is in $C_t$ at any $t$. Therefore the number of mistakes is bounded by $|C| - 1$.

The bounds of CON

This bound is too high! There are $2^{2^n}$ different functions on $\{0,1\}^n$. We can do better!

HAL, the halving algorithm

Let $C_t \subseteq C$ be the set of concepts consistent with $x_1, x_2, \ldots, x_{t-1}$. At step $t$, conduct a vote amongst all $c \in C_t$ and predict in accordance with the majority.

Theorem: for any concept class $C$, HAL makes at most $\log_2 |C|$ mistakes.
Proof: $C_1 = C$. After each mistake $|C_{t+1}| \le \frac{1}{2}|C_t|$, since a majority of the concepts were wrong. Therefore the number of mistakes is bounded by $\log_2 |C|$.

Mistake Bound model and PAC

The mistake bound model generates strong online algorithms. In the past we have seen the PAC model, and the restrictions of the mistake bound model are much harsher than those of PAC. If we know that A learns C in the mistake bound model, can A be made to learn C in the PAC model?
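The halving algorithm above is easy to sketch when the concept class can be enumerated explicitly (a toy illustration; the threshold-concept class in the usage example is our own, not from the lecture):

```python
def halving(concepts, stream):
    """HAL: predict by majority vote of the concepts still consistent
    with everything seen so far; return the number of mistakes."""
    live = list(concepts)                  # C_1 = C
    mistakes = 0
    for x, label in stream:
        votes = sum(c(x) for c in live)
        prediction = 1 if 2 * votes >= len(live) else 0   # majority vote
        if prediction != label:
            mistakes += 1
        # keep only the concepts consistent with the revealed label
        live = [c for c in live if c(x) == label]
    return mistakes

# toy class: thresholds c_k(x) = 1 iff x >= k, for k = 0..7, so |C| = 8
concepts = [lambda x, k=k: int(x >= k) for k in range(8)]
mistakes = halving(concepts, [(3, 0), (6, 1), (4, 0), (5, 1)])
# at most log2(8) = 3 mistakes are possible on any sequence of labels
```

Each mistake here discards at least half of the surviving concepts, which is exactly the mechanism behind the $\log_2 |C|$ bound.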
Mistake Bound model and PAC

Let A be a mistake bound algorithm. Our goal: to construct A_PAC, a PAC algorithm. Assume that after A gets $x_i$ it constructs a hypothesis $h_i$.

Definition: a mistake bound algorithm A is conservative iff for every sample $x_i$, if $c(x_i) = h_{i-1}(x_i)$ then in the $i$-th step the algorithm makes the choice $h_i = h_{i-1}$. In other words, only a mistake makes it change its hypothesis.

Conservative equivalent of a Mistake Bound Algorithm

Let A be an algorithm whose number of mistakes is bounded by M, and let $A_k$ be A's hypothesis after it has seen $\{x_1, x_2, \ldots, x_k\}$. Define A':
- Initially $h_0 = A_0$.
- On $x_i$, guess $h_{i-1}(x_i)$.
- If $c(x_i) = h_{i-1}(x_i)$, set $h_i = h_{i-1}$; else feed $x_i$ to A and take $h_i$ to be A's new hypothesis.
If we ran A on $S' = \{x_i : c(x_i) \neq h_{i-1}(x_i)\}$, it would make $|S'|$ mistakes, so A' makes at most as many mistakes as A. Thus A' is conservative, with the same mistake bound M.

Building A_PAC

Set $m = \frac{M}{\varepsilon}\ln\frac{M}{\delta}$. The A_PAC algorithm:
- Run A' over a sample of size $m$, divided into M equal blocks.
- Build a hypothesis $h_{k_i}$ for each block.
- Run that hypothesis on the next block.
- If a block is completed with no mistakes, output the current $h_{k_i}$, $0 \le i \le M - 1$.
[Figure: the sample of size $\frac{M}{\varepsilon}\ln\frac{M}{\delta}$ is scanned block by block; hypotheses $h_{k_0}, h_{k_1}, \ldots$ are replaced while a block is inconsistent, and the first hypothesis consistent on a whole block is output.]

Building A_PAC (cont.)

If A' makes at most M mistakes, then A_PAC is guaranteed to finish: after A' exhausts its M mistakes it outputs a perfect classifier, so some block must pass without mistakes. What happens otherwise, i.e. can a bad hypothesis be output?

Theorem: A_PAC learns C in the PAC model.
Proof: a fixed $\varepsilon$-bad hypothesis is consistent with a whole block of $\frac{1}{\varepsilon}\ln\frac{M}{\delta}$ samples with probability at most $(1 - \varepsilon)^{\frac{1}{\varepsilon}\ln\frac{M}{\delta}} \le \frac{\delta}{M}$. Therefore
$\Pr(\text{A\_PAC outputs an } \varepsilon\text{-bad } h) \le \Pr(\exists\, 0 \le i \le M - 1 \text{ s.t. } h_{k_i} \text{ is } \varepsilon\text{-bad and consistent on its block}) \le \sum_{i=0}^{M-1} \frac{\delta}{M} = \delta.$

Disjunction of Conjunctions

We have proven that every algorithm in the mistake bound model can be converted to a PAC algorithm. Let us now look at some algorithms in the mistake bound model.

Disjunction Learning

Our goal: to learn the class of disjunctions, e.g. $x_1 \vee x_2 \vee \bar{x}_6 \vee x_8$. Let L be the literal set $\{x_1, \bar{x}_1, x_2, \bar{x}_2, \ldots, x_n, \bar{x}_n\}$, and let the hypothesis be $h = \bigvee_{\ell \in L} \ell$. Given a sample $y$, do:
1. If our hypothesis makes a mistake ($h_t(y) \neq c_t(y)$), then update $L \leftarrow L \setminus S(y)$, where $S(y) = \{\ell \in L : \ell \text{ is satisfied by } y\}$.
2. Else, do nothing.
3. Return to step 1 (updating L updates our hypothesis).

Example

If we have only 2 variables, L is $\{x_1, \bar{x}_1, x_2, \bar{x}_2\}$ and $h = x_1 \vee \bar{x}_1 \vee x_2 \vee \bar{x}_2$. Assume the first sample is $y = (1, 0)$, so $h(y) = 1$. If $c(y) = 0$, we update $L = \{\bar{x}_1, x_2\}$, i.e. $h_{t+1} = \bar{x}_1 \vee x_2$.

Mistake Bound Analysis

The number of mistakes is bounded by $n + 1$, where $n$ is the number of variables.
Proof: let R be the set of literals in the target concept $c$, and let $L_t$ be the hypothesis literal set after $t$ samples.

We prove by induction that $R \subseteq L_t$. For $t = 0$ it is obvious that $R \subseteq L_0$. Assume $R \subseteq L_{t-1}$ after $t - 1$ samples. If $c(y) = 1$ then some literal of R is satisfied by $y$, hence $h_t(y) = 1 = c(y)$ and we don't update. If $c(y) = 0$ then no literal of R is satisfied by $y$, so R and the removed set $S(y)$ don't intersect. Either way, $R \subseteq L_t$. In particular, mistakes can only happen on samples with $c(y) = 0$.

At the first mistake we eliminate $n$ literals: every assignment satisfies exactly one of $x_i, \bar{x}_i$ for each $i$, i.e. exactly $n$ of the $2n$ literals. At any further mistake we eliminate at least 1 literal. $L_0$ has $2n$ literals, so we can have at most $n + 1$ mistakes.

k-DNF

Definition: k-DNF functions are the functions that can be represented by a disjunction of conjunctions in which each conjunction has at most k literals, e.g. the 3-DNF $(x_1 \wedge \bar{x}_2 \wedge x_6) \vee (\bar{x}_1 \wedge x_3 \wedge x_5)$. The number of conjunctions of $i$ literals is $\binom{n}{i} 2^i$: we choose $i$ variables and a sign for each.

k-DNF classification

We can learn this class by changing the previous algorithm to deal with conjunctions (terms) instead of variables. Reducing the space: mapping $X = \{0,1\}^n$ to the space of term indicators $Y = \{0,1\}^{O(n^k)}$ turns a k-DNF over $X$ into a disjunction over $Y$, which gives us 2 usable algorithms: ELIM in the PAC model, and the previous mistake bound algorithm, which makes $O(n^k)$ mistakes.

Winnow

Monotone disjunctions are disjunctions containing only positive literals, e.g. $x_1 \vee x_3 \vee x_5$.
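The elimination algorithm for disjunctions described above can be sketched as follows (a minimal illustration; the `('pos', i)` / `('neg', i)` literal encoding and the usage example are our own):

```python
def learn_disjunction(n, stream):
    """Elimination algorithm for disjunctions over n boolean variables.

    ('pos', i) stands for the literal x_i, ('neg', i) for its negation.
    Start from the disjunction of all 2n literals; on a mistake (which
    can only be a false positive), delete every literal the example
    satisfies.  Returns (surviving literal set, number of mistakes)."""
    L = {('pos', i) for i in range(n)} | {('neg', i) for i in range(n)}

    def h(x):  # hypothesis: OR of the surviving literals
        return int(any(x[i] == 1 if sign == 'pos' else x[i] == 0
                       for sign, i in L))

    mistakes = 0
    for x, label in stream:
        if h(x) != label:
            mistakes += 1
            # remove exactly the literals that x satisfies
            L = {(sign, i) for sign, i in L
                 if not (x[i] == 1 if sign == 'pos' else x[i] == 0)}
    return L, mistakes
```

Running it on a stream labeled by the target $x_1$ over two variables, the surviving set shrinks to the single correct literal, and the mistake count respects the $n + 1$ bound.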
Purpose: to learn the class of monotone disjunctions in the mistake bound model. We will look at Winnow, which is similar to the perceptron; one main difference is that it uses multiplicative steps rather than additive ones.

Winnow

Winnow uses the same classification scheme as the perceptron:
- $w \cdot x \ge n \Rightarrow h(x) = 1$;
- $w \cdot x < n \Rightarrow h(x) = 0$.
Initialize $w(0) = (1, 1, \ldots, 1)$. Update scheme:
- On a positive misclassification ($h(x) = 0$, $c(x) = 1$): $\forall i$ such that $x_i = 1$, $w_i \leftarrow 2 w_i$.
- On a negative misclassification ($h(x) = 1$, $c(x) = 0$): $\forall i$ such that $x_i = 1$, $w_i \leftarrow w_i / 2$.

Mistake bound analysis

Similarly to the perceptron, if the margin is at least $\gamma$ one can prove a mistake bound of $\Theta(1/\gamma^2)$. Here we prove a bound in terms of $k$, the number of relevant variables.

Winnow Proof: Definitions

Let $\{x_{i_1}, x_{i_2}, \ldots, x_{i_k}\}$ be the set of relevant variables in the target concept, i.e. $c = x_{i_1} \vee x_{i_2} \vee \ldots \vee x_{i_k}$. We call $w_{i_1}, \ldots, w_{i_k}$ the relevant weights. Let $w_i(t)$ be the weight $w_i$ at time $t$, and let $TW(t)$ be the total weight at time $t$ of all the weights, of both relevant and irrelevant variables.

Winnow Proof: Positive Mistakes

Any mistake on a positive example doubles at least one of the relevant weights: $\exists j$ s.t. $x_{i_j} = 1$, and then $w_{i_j}(t+1) = 2 w_{i_j}(t)$. If ever $w_{i_j}(t) \ge n$, then any example with $x_{i_j} = 1$ gives $w \cdot x \ge n$ and is classified positive, so $w_{i_j}$ is never doubled again. Hence each relevant weight can be doubled at most $1 + \log_2 n$ times, and we can bound the number of positive mistakes:
$M^+ \le k\,(1 + \log_2 n)$.

Moreover, for a positive mistake $h(x) = 0$ means $\sum_{i : x_i = 1} w_i(t) < n$, and the update doubles exactly these weights, so
(1) $TW(t+1) = TW(t) + \sum_{i : x_i = 1} w_i(t) < TW(t) + n$.

Winnow Proof: Negative Mistakes

On negative examples none of the relevant weights change (all the relevant variables are 0), thus $\forall j$, $w_{i_j}(t+1) \ge w_{i_j}(t)$ at every step. For a negative mistake to occur, $\sum_{i : x_i = 1} w_i(t) \ge n$, and the update halves exactly these weights, so
(2) $TW(t+1) = TW(t) - \frac{1}{2}\sum_{i : x_i = 1} w_i(t) \le TW(t) - \frac{n}{2}$.

Winnow Proof: Cont.

Combining equations (1) and (2), with $M^+$ positive and $M^-$ negative mistakes:
(3) $0 < TW(t) \le TW(0) + n M^+ - \frac{n}{2} M^-$.
At the beginning all the weights are 1, so $TW(0) = n$. Therefore
(4) $M^- < 2 + 2M^+ \le 2 + 2k(\log_2 n + 1)$,
and hence
$M^- + M^+ \le 2 + 3k(\log_2 n + 1)$.

What should we know? (I)

Linear separators:
- Perceptron algorithm: $M \le 1/\gamma^2$.
- Margin perceptron: $M \le 12/\gamma^2$.
The mistake bound model:
- CON algorithm: $M \le |C| - 1$, but $C$ may be very large!
- HAL, the halving algorithm: $M \le \log_2 |C|$.
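As a recap, Winnow's multiplicative update scheme can be sketched as follows (a minimal illustration; the monotone target in the usage example is our own, not from the lecture):

```python
def winnow(n, stream):
    """Winnow for monotone disjunctions over n boolean variables.

    Threshold n: predict 1 iff w.x >= n.  Double the active weights
    after a missed positive, halve them after a false positive.
    Returns (final weights, number of mistakes)."""
    w = [1.0] * n                          # w(0) = (1, ..., 1)
    mistakes = 0
    for x, label in stream:
        dot = sum(wi for wi, xi in zip(w, x) if xi == 1)
        prediction = int(dot >= n)
        if prediction != label:
            mistakes += 1
            factor = 2.0 if label == 1 else 0.5    # promote or demote
            w = [wi * factor if xi == 1 else wi for wi, xi in zip(w, x)]
    return w, mistakes

# target: x_0 or x_1 over n = 4 variables (k = 2 relevant variables)
stream = [((1, 0, 0, 0), 1), ((1, 0, 0, 0), 1), ((1, 0, 0, 0), 1),
          ((0, 0, 1, 1), 0), ((1, 1, 0, 0), 1)]
w, mistakes = winnow(4, stream)
# the bound 2 + 3k(log2(n) + 1) = 20 comfortably covers the mistakes made
```

On this stream the relevant weight $w_0$ is doubled on each missed positive until it reaches the threshold, after which the target is classified correctly.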
What should you know? (II)

- The relation between PAC and the mistake bound model.
- The basic algorithm for learning disjunctions (and disjunctions of conjunctions).
- Learning k-DNF functions.
- The Winnow algorithm: $M \le 2 + 3k(\log_2 n + 1)$.

Questions?