Summer Course: Data Mining
Support Vector Machines and other penalization classifiers
Presenter: Georgi Nalbantov
August 2009

Contents
- Purpose
- Linear Support Vector Machines
- Nonlinear Support Vector Machines
- (Theoretical justifications of SVM)
- Marketing examples
- Other penalization classification methods
- Conclusion and Q & A (some extensions)

Purpose
Task to be solved (the classification task): classify cases (customers) into "type 1" or "type 2" on the basis of some known attributes (characteristics).
Chosen tool to solve this task: Support Vector Machines.

The Classification Task
Given data on explanatory and explained variables, where the explained variable can take two values {-1, +1}, find a function that gives the "best" separation between the "-1" cases and the "+1" cases:
  Given: (x_1, y_1), ..., (x_m, y_m), with x_i in R^n and y_i in {-1, +1}
  Find:  f : R^n -> {-1, +1}
The "best" function is the one whose expected error on unseen data (x_{m+1}, y_{m+1}), ..., (x_{m+k}, y_{m+k}) is minimal.
Existing techniques to solve the classification task: Linear and Quadratic Discriminant Analysis; Logit choice models (Logistic Regression); Decision Trees, Neural Networks, Least Squares SVM.

Support Vector Machines: Definition
- Support Vector Machines are a non-parametric tool for classification/regression.
- Support Vector Machines are used for prediction rather than description purposes.
- Support Vector Machines have been developed by Vapnik and co-workers.

Linear Support Vector Machines
A direct marketing company wants to sell a new book, "The Art History of Florence" (Nissan Levin and Jacob Zahavi, in Lattin, Carroll and Green, 2003).
Problem: how to identify buyers and non-buyers using two variables:
- months since last purchase
- number of art books purchased
[Figure: scatter plot of buyers (∆) and non-buyers (●) against the two variables.]

Linear SVM: Separable Case
Main idea of SVM: separate the groups by a line.
However, there are infinitely many lines that have zero training error. Which line shall we choose?

Linear SVM: Separable Case
SVM use the idea of a margin around the separating line.
The thinner the margin, the more complex the model.
The best line is the one with the largest margin.
[Figure: the separating line with its margin between buyers (∆) and non-buyers (●).]
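As a concrete illustration of the largest-margin idea, here is a minimal sketch in Python, assuming scikit-learn and a small made-up version of the book-buyer data (the numbers are invented for illustration, not taken from the course example):

```python
# A minimal sketch of the largest-margin (hard-margin) idea on invented data
# with the two attributes from the example: months since last purchase and
# number of art books purchased.
import numpy as np
from sklearn.svm import SVC

# columns: months since last purchase, number of art books purchased (made up)
X = np.array([[2, 4], [3, 5], [1, 3], [4, 4], [2, 5],        # buyers  (+1)
              [10, 0], [12, 1], [9, 0], [11, 2], [14, 1]])   # non-buyers (-1)
y = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1])

# A very large C approximates the hard-margin (separable) case.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]       # separating line w1*x1 + w2*x2 + b = 0
margin_width = 2.0 / np.linalg.norm(w)       # margin width = 2 / ||w||
print("w =", w, "b =", b, "margin width =", margin_width)
print("support vectors:", clf.support_vectors_)  # the points on the margin boundaries
```

With a separable sample like this one, the reported margin width equals 2/||w||, and only the listed support vectors determine the line.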
Linear SVM: Separable Case
The line having the largest margin is  w1*x1 + w2*x2 + b = 0,
where
  w1*xi1 + w2*xi2 + b >= +1  for every buyer i (∆)
  w1*xj1 + w2*xj2 + b <= -1  for every non-buyer j (●)
Note: x1 = months since last purchase, x2 = number of art books purchased.

Linear SVM: Separable Case
The width of the margin is given by
  margin = 2 / ||w||,  where ||w|| = sqrt(w1^2 + w2^2).
So maximizing the margin is equivalent to minimizing ||w||, or equivalently ||w||^2 / 2.

Linear SVM: Separable Case
The optimization problem for SVM is:
  minimize   L(w) = ||w||^2 / 2
  subject to w1*xi1 + w2*xi2 + b >= +1  for i ∈ ∆
             w1*xj1 + w2*xj2 + b <= -1  for j ∈ ●

Linear SVM: Separable Case
"Support vectors" are those points that lie on the boundaries of the margin.
The decision surface (line) is determined only by the support vectors; all other points are irrelevant.

Linear SVM: Nonseparable Case
Non-separable case: there is no line that separates the two groups without error.
(Training set: 1000 targeted customers.)
Here, SVM minimize
  L(w, C) = ||w||^2 / 2 + C * Σi ξi       (complexity + errors)
  subject to w1*xi1 + w2*xi2 + b >= +1 - ξi  for i ∈ ∆
             w1*xj1 + w2*xj2 + b <= -1 + ξj  for j ∈ ●
             ξi, ξj >= 0
The first term maximizes the margin; the second term minimizes the training errors.

Linear SVM: The Role of C
Bigger C: increased complexity (thinner margin), smaller number of training errors (better fit on the data).
Smaller C: decreased complexity (wider margin), bigger number of training errors (worse fit on the data).
C thus varies both complexity and empirical error by affecting the optimal w and the optimal number of training errors.
[Figure: two panels comparing the margins obtained with C = 5 and C = 1.]

Bias – Variance Trade-off
[Figure: bias-variance trade-off.]
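To see the role of C numerically, the following sketch (assuming scikit-learn and randomly generated, overlapping two-group data) fits the soft-margin problem above for several values of C and reports the resulting margin width and number of training errors:

```python
# A minimal sketch of the role of C in the soft-margin (non-separable) case,
# using made-up, overlapping data: larger C tends to give a thinner margin and
# fewer training errors, smaller C a wider margin and more errors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[2, 4], scale=1.5, size=(50, 2)),    # "buyers"
               rng.normal(loc=[10, 1], scale=1.5, size=(50, 2))])  # "non-buyers"
y = np.array([1] * 50 + [-1] * 50)

for C in (0.01, 1, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    margin_width = 2.0 / np.linalg.norm(w)          # 2 / ||w||
    train_errors = np.sum(clf.predict(X) != y)       # number of misclassified training points
    print(f"C={C:<6} margin width={margin_width:5.2f}  training errors={train_errors}")
```

On such data one typically observes the pattern from the slide: larger C yields a thinner margin and a better fit on the training data, smaller C a wider margin and a worse fit.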
From Regression into Classification
We have a linear model, such as y = b*x + const.
We have to estimate this relation using our training data set, having in mind the so-called "accuracy", or "0-1", loss function (our evaluation criterion).
The training data set we have consists of many observations, for instance:

  Output (y)   Input (x)
      -1          0.2
       1          0.5
       1          0.7
      ...         ...
      -1         -0.7

[Figure: the observations plotted along x, with the support vectors and the "margin" indicated.]

From Regression into Classification: Support Vector Machines
  y = b*x + const
A flatter line means greater penalization; equivalently, a smaller slope means a bigger margin.
[Figure: two fitted lines of different slope, with the resulting "margins".]

From Regression into Classification: Support Vector Machines
The same holds with two inputs:
  y = b1*x1 + b2*x2 + const
A flatter plane means greater penalization; equivalently, smaller slopes mean a bigger margin.
[Figure: two fitted planes over (x1, x2), with the resulting "margins".]

Nonlinear SVM: Nonseparable Case
Mapping into a higher-dimensional space: each point (x1, x2) is mapped to (x1^2, sqrt(2)*x1*x2, x2^2).
Optimization task: minimize
  L(w, C) = ||w||^2 / 2 + C * Σi ξi
  subject to w1*xi1^2 + w2*sqrt(2)*xi1*xi2 + w3*xi2^2 + b >= +1 - ξi  for i ∈ ∆
             w1*xj1^2 + w2*sqrt(2)*xj1*xj2 + w3*xj2^2 + b <= -1 + ξj  for j ∈ ●

Nonlinear SVM: Nonseparable Case
Map the data into the higher-dimensional space:
  (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)
  (-1, -1) -> (1,  sqrt(2), 1)
  (-1,  1) -> (1, -sqrt(2), 1)
  ( 1, -1) -> (1, -sqrt(2), 1)
  ( 1,  1) -> (1,  sqrt(2), 1)
[Figure: the four XOR-type points (∆ and ●), not linearly separable in the original (x1, x2) space.]

Nonlinear SVM: Nonseparable Case
Find the optimal hyperplane in the transformed space.
[Figure: the mapped points in the (x1^2, sqrt(2)*x1*x2, x2^2) space, where a separating hyperplane exists.]

Nonlinear SVM: Nonseparable Case
Observe the decision surface in the original space (optional).
[Figure: the nonlinear decision surface induced in the original (x1, x2) space.]

Nonlinear SVM: Nonseparable Case
Dual formulation of the (primal) SVM minimization problem:
  Primal:  min over w, b, ξ of  ||w||^2 / 2 + C * Σi ξi
           subject to  yi * (w · xi + b) >= 1 - ξi,   ξi >= 0,   yi ∈ {-1, +1}
  Dual:    max over α of  Σi αi - (1/2) * Σi Σj αi αj yi yj (xi · xj)
           subject to  0 <= αi <= C,   Σi αi yi = 0,   yi ∈ {-1, +1}

Nonlinear SVM: Nonseparable Case
In the dual, the data appear only through inner products. For the quadratic mapping
  φ(x) = (x1^2, sqrt(2)*x1*x2, x2^2),
we have
  φ(xi) · φ(xj) = xi1^2*xj1^2 + 2*xi1*xi2*xj1*xj2 + xi2^2*xj2^2 = (xi · xj)^2,
so the inner product in the transformed space can be computed directly in the original space via the kernel function
  K(xi, xj) = φ(xi) · φ(xj).

Nonlinear SVM: Nonseparable Case
The dual problem in the transformed space therefore becomes
  max over α of  Σi αi - (1/2) * Σi Σj αi αj yi yj φ(xi) · φ(xj)
              =  Σi αi - (1/2) * Σi Σj αi αj yi yj K(xi, xj)
  subject to  0 <= αi <= C,   Σi αi yi = 0,   yi ∈ {-1, +1}.
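The kernel identity above is easy to check numerically. The sketch below (plain NumPy, using the four XOR-type points from the example, with their labelling assumed as in the figure) verifies that the explicit quadratic map φ and the kernel (xi · xj)^2 give identical inner products, so the mapping never has to be carried out explicitly:

```python
# A minimal sketch of the kernel idea on the XOR-like points above: the
# quadratic map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2) need never be computed,
# because phi(x).phi(z) equals the kernel (x.z)^2.
import numpy as np

def phi(x):
    """Explicit quadratic feature map."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def kernel(x, z):
    """Equivalent kernel evaluated in the original two-dimensional space."""
    return np.dot(x, z) ** 2

points = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
labels = np.array([1, -1, -1, 1])  # XOR-type labelling (assumed): not linearly separable in 2D

for x in points:
    for z in points:
        assert np.isclose(np.dot(phi(x), phi(z)), kernel(x, z))

# In the transformed space the classes become linearly separable: the middle
# coordinate sqrt(2)*x1*x2 alone distinguishes the two labels.
for p, lab in zip(points, labels):
    print(lab, phi(p))
```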
Strengths and Weaknesses of SVM
Strengths of SVM:
- Training is relatively easy.
- No local minima.
- It scales relatively well to high-dimensional data.
- The trade-off between classifier complexity and error can be controlled explicitly via C.
- Robustness of the results.
- The "curse of dimensionality" is avoided.
Weaknesses of SVM:
- What is the best trade-off parameter C?
- Need a good transformation of the original space.

The Ketchup Marketing Problem
Two types of ketchup: Heinz and Hunts.
Seven attributes:
- Feature Heinz
- Feature Hunts
- Display Heinz
- Display Hunts
- Feature & Display Heinz
- Feature & Display Hunts
- Log price difference between Heinz and Hunts
Training data: 2498 cases (89.11% Heinz is chosen).
Test data: 300 cases (88.33% Heinz is chosen).

The Ketchup Marketing Problem
Choose a kernel mapping:
  Linear kernel:      K(xi, xj) = xi · xj
  Polynomial kernel:  K(xi, xj) = (xi · xj + 1)^d
  RBF kernel:         K(xi, xj) = exp( -||xi - xj||^2 / (2*σ^2) )
Do a (5-fold) cross-validation procedure to find the best combination of the manually adjustable parameters (here, for the RBF kernel: C and σ).
[Figure: grid of cross-validation mean squared errors over C and σ for the SVM with RBF kernel.]

The Ketchup Marketing Problem – Training Set
Model: Linear Discriminant Analysis (hit rate 89.51%)
  Original group   Predicted Hunts   Predicted Heinz    Total
  Hunts             68 (25.00%)       204 (75.00%)       272
  Heinz             58 (2.61%)       2168 (97.39%)      2226

Model: Logit Choice Model (hit rate 77.79%)
  Original group   Predicted Hunts   Predicted Heinz    Total
  Hunts            214 (78.68%)        58 (21.32%)       272
  Heinz            497 (22.33%)      1729 (77.67%)      2226

Model: Support Vector Machines (hit rate 99.08%)
  Original group   Predicted Hunts   Predicted Heinz    Total
  Hunts            255 (93.75%)        17 (6.25%)        272
  Heinz              6 (0.27%)       2220 (99.73%)      2226

Model: Majority Voting (hit rate 89.11%)
  Original group   Predicted Hunts   Predicted Heinz    Total
  Hunts              0 (0%)           272 (100%)         272
  Heinz              0 (0%)          2226 (100%)        2226

The Ketchup Marketing Problem – Test Set
Model: Linear Discriminant Analysis (hit rate 88.33%)
  Original group   Predicted Hunts   Predicted Heinz    Total
  Hunts              3 (8.57%)         32 (91.43%)        35
  Heinz              3 (1.13%)        262 (98.87%)       265

Model: Logit Choice Model (hit rate 77%)
  Original group   Predicted Hunts   Predicted Heinz    Total
  Hunts             29 (82.86%)         6 (17.14%)        35
  Heinz             63 (23.77%)       202 (76.23%)       265

Model: Support Vector Machines (hit rate 95.67%)
  Original group   Predicted Hunts   Predicted Heinz    Total
  Hunts             25 (71.43%)        10 (28.57%)        35
  Heinz              3 (1.13%)        262 (98.87%)       265
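A minimal sketch of the 5-fold cross-validation tuning described for the ketchup problem, assuming scikit-learn. The data generated below are invented placeholders standing in for the seven attributes and the brand choice (they are not the actual ketchup data), scikit-learn parametrizes the RBF kernel via gamma = 1/(2σ^2) rather than σ, and plain accuracy is used here as the selection criterion where the slides report cross-validation mean squared errors:

```python
# A sketch of 5-fold cross-validation over C and the RBF width for an SVM,
# on invented data with the same column layout as the ketchup example.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 500
# six 0/1 promotion indicators plus the log price difference (all made up)
X = np.column_stack([rng.integers(0, 2, size=(n, 6)),
                     rng.normal(0.0, 0.3, size=n)])
# made-up choice process; +1 = Heinz, -1 = Hunts (assumed coding)
logit = 1.5 - 2.0 * X[:, 6] + 0.8 * X[:, 0] - 0.8 * X[:, 1]
y = np.where(rng.random(n) < 1.0 / (1.0 + np.exp(-logit)), 1, -1)

param_grid = {"svc__C": [0.1, 1, 10, 100, 1000],
              "svc__gamma": [0.001, 0.01, 0.1, 1, 10]}   # gamma = 1 / (2 * sigma^2)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
search = GridSearchCV(model, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best 5-fold CV accuracy:", search.best_score_)
```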
Part II: Penalized Classification and Regression Methods
- Support Hyperplanes
- Nearest Convex Hull classifier
- Soft Nearest Neighbor
- Application: an example
- Support Vector Regression: a financial study
- Conclusion

Classification: Support Hyperplanes
Consider a (separable) binary classification case: training data (+, -) and a test point x.
There are infinitely many hyperplanes that are semi-consistent (= commit no error) with the training data.

Classification: Support Hyperplanes
For the classification of the test point x, use the farthest-away hyperplane that is semi-consistent with the training data: the support hyperplane of x.
The SH decision surface: each point on it has two support hyperplanes.

Classification: Support Hyperplanes
[Figure: toy-problem experiment comparing Support Hyperplanes and Support Vector Machines; panels show SH and SVM with a linear kernel, with an RBF kernel (σ = 5), and with an RBF kernel (σ = 35).]

Classification: penalization methods compared
[Figures: side-by-side decision surfaces for Support Vector Machines versus Support Hyperplanes, versus Nearest Convex Hull classification, and versus Soft Nearest Neighbor; for each of SH, NCH and SNN, a panel with bigger penalization; and, for the nonseparable case, panels for SVM, SH, NCH and SNN.]

Summary: Penalization Techniques for Classification
Penalization methods for classification: Support Vector Machines (SVM), Support Hyperplanes (SH), Nearest Convex Hull classification (NCH), and Soft Nearest Neighbour (SNN).
In all cases, the classification of a test point x is determined using a hyperplane h. Equivalently, x is labelled +1 (-1) if it is farther away from the set S- (S+). (A small sketch of the Nearest Convex Hull variant of this rule follows after the Conclusion.)

Conclusion
- Support Vector Machines (SVM) can be applied to binary and multi-class classification problems.
- SVM behave robustly in multivariate problems.
- Further research in various Marketing areas is needed to justify or refute the applicability of SVM.
- Support Vector Regressions (SVR) can also be applied.
http://www.kernel-machines.org
Email: [email protected]
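Finally, a minimal sketch of the Nearest Convex Hull rule referred to in the summary above, assuming NumPy and SciPy: a test point is assigned to the class whose convex hull it is closer to (equivalently, labelled +1 if it is farther from the hull of the -1 points). The data are made up, and the distance is computed by a straightforward constrained least-squares problem rather than the formulation used in the course:

```python
# A sketch of Nearest Convex Hull (NCH) classification on invented 2-D data.
import numpy as np
from scipy.optimize import minimize

def dist_to_convex_hull(x, points):
    """Distance from x to the convex hull of the rows of `points`."""
    m = len(points)

    def objective(lam):
        # squared distance from x to the convex combination lam @ points
        return np.sum((x - lam @ points) ** 2)

    cons = ({"type": "eq", "fun": lambda lam: np.sum(lam) - 1.0},)  # weights sum to 1
    bounds = [(0.0, 1.0)] * m                                        # weights are non-negative
    res = minimize(objective, np.full(m, 1.0 / m), bounds=bounds, constraints=cons)
    return np.sqrt(res.fun)

def nch_predict(x, X_pos, X_neg):
    """Assign x to the class with the nearer convex hull."""
    return +1 if dist_to_convex_hull(x, X_pos) <= dist_to_convex_hull(x, X_neg) else -1

X_pos = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0]])        # class +1 (made up)
X_neg = np.array([[-1.0, -1.0], [-2.0, -0.5], [-1.5, -2.0]])  # class -1 (made up)
print(nch_predict(np.array([0.8, 0.9]), X_pos, X_neg))        # expected output: 1
```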