### Time Series Prediction with Machine Learning Models

Machine Learning Based Models for Time Series Prediction
2014/3

Outline

- Support Vector Regression
- Neural Network
- Comparison
Support Vector Regression

Basic Idea

Given a dataset $D = \{(x_i, y_i) \mid 1 \le i \le n\}$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.

Our goal is to find a function $f(x)$ that deviates by at most $\varepsilon$ from the actual target $y_i$ for all training data.

In the linear case, $f$ takes the form $f(x) = \langle w, x \rangle + b$, where $\langle \cdot, \cdot \rangle$ denotes the dot product in $\mathbb{R}^d$.

"Flatness" in this case means a small $\|w\|$ (less sensitive to perturbations in the features).
Therefore, we can write the problem as follows:

$$\min_{w,b} \ \frac{1}{2}\|w\|^2$$

subject to

$$y_i - \langle w, x_i \rangle - b \le \varepsilon, \qquad \langle w, x_i \rangle + b - y_i \le \varepsilon.$$

[Figure: the regression line $f_2(x) = \langle w, x \rangle + b$ with its tube boundaries $f_1(x) = \langle w, x \rangle + b + \varepsilon$ and $f_3(x) = \langle w, x \rangle + b - \varepsilon$. Note that $+\varepsilon$ and $-\varepsilon$ are not an actual geometric distance.]

Soft Margin and Slack Variables

$f$ approximates all pairs $(x_i, y_i)$ with $\varepsilon$ precision; however, we may also want to allow some errors.

The soft-margin loss function and slack variables were introduced to the SVR:

$$\min_{w,b,\xi^+,\xi^-} \ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}(\xi_i^+ + \xi_i^-)$$

subject to

$$y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i^+, \qquad \langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^-, \qquad \xi_i^+, \xi_i^- \ge 0.$$

$C$ is the regularization parameter, which determines the trade-off between flatness and the tolerance of errors.

$\xi_i^+, \xi_i^-$ are slack variables, which measure how far a sample lies outside the $\varepsilon$-insensitive tube.

The $\varepsilon$-insensitive loss function:

$$|\xi|_\varepsilon = \begin{cases} 0, & |y - f(x)| \le \varepsilon \\ |y - f(x)| - \varepsilon, & \text{otherwise} \end{cases}$$

[Figure: the $\varepsilon$-insensitive tube around the regression function, with slack variables $\xi^+$ and $\xi^-$ measuring deviations above and below the tube, alongside the $\varepsilon$-insensitive loss function.]
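The piecewise loss above can be written directly in code. A minimal Python sketch (the function name and sample values are illustrative, not from the slides):

```python
def eps_insensitive_loss(y, f_x, eps):
    """epsilon-insensitive loss: zero inside the tube, linear outside."""
    dev = abs(y - f_x)
    return 0.0 if dev <= eps else dev - eps

# Inside the tube the loss vanishes; outside it grows linearly with the deviation.
print(eps_insensitive_loss(1.0, 1.05, 0.1))  # inside the tube: 0.0
print(eps_insensitive_loss(1.0, 1.30, 0.1))  # outside: deviation minus epsilon
```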

The key idea is to construct a Lagrange function from the objective (primal) and the corresponding constraints, by introducing a dual set of variables.

The dual function has a saddle point with respect to the primal and dual variables at the solution.

Lagrange Function

$$L(w, b, \xi^+, \xi^-) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}(\xi_i^+ + \xi_i^-) - \sum_{i=1}^{n}(\eta_i^+ \xi_i^+ + \eta_i^- \xi_i^-) - \sum_{i=1}^{n}\alpha_i^+\left(\varepsilon + \xi_i^+ - y_i + \langle w, x_i \rangle + b\right) - \sum_{i=1}^{n}\alpha_i^-\left(\varepsilon + \xi_i^- + y_i - \langle w, x_i \rangle - b\right)$$

subject to $\alpha_i^+, \alpha_i^- \ge 0$ and $\eta_i^+, \eta_i^- \ge 0$.

Taking the partial derivatives (saddle point condition), we get

$$\frac{\partial L}{\partial b} = \sum_{i=1}^{n}(\alpha_i^- - \alpha_i^+) = 0 \;\Rightarrow\; \sum_{i=1}^{n}(\alpha_i^+ - \alpha_i^-) = 0$$

$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{n}(\alpha_i^+ - \alpha_i^-)\,x_i = 0 \;\Rightarrow\; w = \sum_{i=1}^{n}(\alpha_i^+ - \alpha_i^-)\,x_i$$

$$\frac{\partial L}{\partial \xi_i^+} = 0 \;\Rightarrow\; C - \alpha_i^+ - \eta_i^+ = 0$$

$$\frac{\partial L}{\partial \xi_i^-} = 0 \;\Rightarrow\; C - \alpha_i^- - \eta_i^- = 0$$
The conditions for optimality yield the following dual problem:

$$\max_{\alpha^+,\alpha^-} \ -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}(\alpha_i^+ - \alpha_i^-)(\alpha_j^+ - \alpha_j^-)\langle x_i, x_j \rangle - \varepsilon\sum_{i=1}^{n}(\alpha_i^+ + \alpha_i^-) + \sum_{i=1}^{n} y_i(\alpha_i^+ - \alpha_i^-)$$

subject to

$$\sum_{i=1}^{n}(\alpha_i^+ - \alpha_i^-) = 0, \qquad 0 \le \alpha_i^+ \le C, \qquad 0 \le \alpha_i^- \le C.$$
Finally, we eliminate the dual variables $\eta_i^+, \eta_i^-$ by substituting the partial derivatives, and we get

$$w = \sum_{i=1}^{n}(\alpha_i^+ - \alpha_i^-)\,x_i, \qquad f(x) = \sum_{i=1}^{n}(\alpha_i^+ - \alpha_i^-)\langle x_i, x \rangle + b.$$

This is called the "Support Vector Expansion", in which $w$ can be completely described as a linear combination of the training patterns $x_i$.

The function is represented by the SVs; therefore it is independent of the dimensionality of the input space $\mathbb{R}^d$ and depends only on the number of SVs.

We will define the meaning of "Support Vector" later.

Computing $\alpha_i^+$ and $\alpha_i^-$ is a quadratic programming problem; popular methods include:

- Interior point algorithm
- Simplex algorithm

Computing $b$

The parameter $b$ can be computed from the KKT conditions (complementary slackness), which state that at the optimal solution the product between the dual variables and the constraints has to vanish:

$$\alpha_i^+\left(\varepsilon + \xi_i^+ - y_i + \langle w, x_i \rangle + b\right) = 0$$
$$\alpha_i^-\left(\varepsilon + \xi_i^- + y_i - \langle w, x_i \rangle - b\right) = 0$$
$$\eta_i^+ \xi_i^+ = (C - \alpha_i^+)\,\xi_i^+ = 0$$
$$\eta_i^- \xi_i^- = (C - \alpha_i^-)\,\xi_i^- = 0$$

KKT (Karush–Kuhn–Tucker) conditions:

KKT conditions extend the idea of Lagrange multipliers to handle inequality constraints.

Consider the following nonlinear optimization problem:

Minimize $F(x)$ subject to $g_i(x) \le 0$ and $h_j(x) = 0$, where $1 \le i \le m$ and $1 \le j \le p$.

To solve the problem with inequalities, we treat the constraints as equalities where critical points exist.

The following necessary conditions hold when $x^*$ is a local minimum: there exist constants $\mu_i$ and $\lambda_j$, called KKT multipliers, such that

Stationarity: $\nabla F(x^*) + \sum_{i=1}^{m}\mu_i \nabla g_i(x^*) + \sum_{j=1}^{p}\lambda_j \nabla h_j(x^*) = 0$
(This is the saddle point condition in the dual problem.)

Primal feasibility: $g_i(x^*) \le 0$ and $h_j(x^*) = 0$

Dual feasibility: $\mu_i \ge 0$

Complementary slackness: $\mu_i\, g_i(x^*) = 0$
(This condition enforces either $\mu_i$ to be zero or $g_i(x^*)$ to be zero.)
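To make the conditions concrete, here is a tiny hand-worked instance (an illustration, not from the slides): minimize $F(x) = x^2$ subject to $g(x) = 1 - x \le 0$. The constrained minimum is $x^* = 1$ with multiplier $\mu = 2$, and all four conditions can be checked numerically:

```python
# Toy problem: minimize F(x) = x^2  subject to  g(x) = 1 - x <= 0.
# Known solution: x* = 1, KKT multiplier mu = 2.
x_star, mu = 1.0, 2.0

dF = 2.0 * x_star        # F'(x*)
dg = -1.0                # g'(x*)
g = 1.0 - x_star         # g(x*)

stationarity = dF + mu * dg   # F'(x*) + mu * g'(x*), should be 0
primal_feasible = g <= 0.0
dual_feasible = mu >= 0.0
slackness = mu * g            # mu * g(x*), should be 0

print(stationarity, primal_feasible, dual_feasible, slackness)
```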


Original Problem:

$$\min \ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}(\xi_i^+ + \xi_i^-)$$

subject to

$$y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i^+, \qquad \langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^-, \qquad \xi_i^+, \xi_i^- \ge 0.$$

Standard Form for KKT

Objective:

$$\min \ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}(\xi_i^+ + \xi_i^-)$$

Constraints:

$$\varepsilon + \xi_i^+ - y_i + \langle w, x_i \rangle + b \ge 0 \;\rightarrow\; -(\varepsilon + \xi_i^+ - y_i + \langle w, x_i \rangle + b) \le 0$$

$$\varepsilon + \xi_i^- + y_i - \langle w, x_i \rangle - b \ge 0 \;\rightarrow\; -(\varepsilon + \xi_i^- + y_i - \langle w, x_i \rangle - b) \le 0$$

$$\xi_i^+, \xi_i^- \ge 0 \;\rightarrow\; -\xi_i^+, -\xi_i^- \le 0$$

Complementary slackness condition:

There exist KKT multipliers $\alpha_i^+, \alpha_i^-$ and $\eta_i^+, \eta_i^-$ (the Lagrange multipliers in $L$) that meet this condition:

$$\alpha_i^+\left(\varepsilon + \xi_i^+ - y_i + \langle w, x_i \rangle + b\right) = 0 \quad (1)$$
$$\alpha_i^-\left(\varepsilon + \xi_i^- + y_i - \langle w, x_i \rangle - b\right) = 0 \quad (2)$$
$$\eta_i^+ \xi_i^+ = (C - \alpha_i^+)\,\xi_i^+ = 0 \quad (3)$$
$$\eta_i^- \xi_i^- = (C - \alpha_i^-)\,\xi_i^- = 0 \quad (4)$$

From (1) and (2), we can get $\alpha_i^+ \alpha_i^- = 0$.

From (3) and (4), we can see that only for $\alpha_i^{+(-)} = C$ can the slack variables be nonzero.

Conclusion:

- Only samples $(x_i, y_i)$ with corresponding $\alpha_i^{+(-)} = C$ lie outside the $\varepsilon$-insensitive tube.
- $\alpha_i^+ \alpha_i^- = 0$, i.e. there can never be a set of dual variables $\alpha_i^+$ and $\alpha_i^-$ which are both simultaneously nonzero.


From the previous page, we can conclude:

$$\varepsilon - y_i + \langle w, x_i \rangle + b \ge 0 \ \text{ and } \ \xi_i^+ = 0, \quad \text{if } \alpha_i^+ < C$$
$$\varepsilon - y_i + \langle w, x_i \rangle + b \le 0, \quad \text{if } \alpha_i^+ > 0$$
$$\varepsilon + y_i - \langle w, x_i \rangle - b \ge 0 \ \text{ and } \ \xi_i^- = 0, \quad \text{if } \alpha_i^- < C$$
$$\varepsilon + y_i - \langle w, x_i \rangle - b \le 0, \quad \text{if } \alpha_i^- > 0$$

We can form bounds on $b$ by combining the two sets of inequalities above:

$$\max\{\, y_i - \langle w, x_i \rangle - \varepsilon \mid \alpha_i^+ < C \text{ or } \alpha_i^- > 0 \,\} \le b \le \min\{\, y_i - \langle w, x_i \rangle + \varepsilon \mid \alpha_i^+ > 0 \text{ or } \alpha_i^- < C \,\}$$

If some $\alpha_i^{+(-)} \in (0, C)$, the inequalities become equalities.

Sparseness of the Support Vectors

The previous conclusion shows that only for $|f(x_i) - y_i| \ge \varepsilon$ can the Lagrange multipliers be nonzero.

In other words, for all samples inside the $\varepsilon$-insensitive tube, $\alpha_i^+$ and $\alpha_i^-$ vanish.

$$f(x) = \sum_{i=1}^{n}(\alpha_i^+ - \alpha_i^-)\langle x_i, x \rangle + b$$

Therefore, we have a sparse expansion of $w$ in terms of $x_i$, and the samples that come with non-vanishing coefficients are called "Support Vectors".
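The sparse expansion can be evaluated using only the non-vanishing coefficients. A minimal Python sketch (the support vectors, coefficients, and bias below are made up for illustration):

```python
def svr_predict(x, support, coef, b):
    """Evaluate f(x) = sum_i (a_i+ - a_i-) <x_i, x> + b over support vectors only.
    `support` holds the SVs x_i; `coef` holds the differences a_i+ - a_i-."""
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    return sum(c * dot(xi, x) for xi, c in zip(support, coef)) + b

# Hypothetical 1-D example: two SVs with coefficients +0.5 and -0.25, b = 1.
support = [(2.0,), (4.0,)]
coef = [0.5, -0.25]
print(svr_predict((3.0,), support, coef, b=1.0))  # 0.5*6 - 0.25*12 + 1 = 1.0
```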

Kernel Trick

The next step is to make the SV algorithm nonlinear. This can be achieved by simply preprocessing the training patterns $x_i$ with a map $\Phi$:

$$\Phi(\cdot): \mathbb{R}^d \rightarrow \mathbb{R}^h, \qquad w \in \mathbb{R}^h$$

The dimensionality $h$ of this space is implicitly defined.

Example: $\Phi: \mathbb{R}^2 \rightarrow \mathbb{R}^3$, $\Phi(x_1, x_2) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$

This can easily become computationally infeasible: the number of different monomial features (polynomial mapping) of degree $p$ is $\binom{d + p - 1}{p}$.

The computationally cheaper way is to use a kernel function:

$$k(x, x') = \langle \Phi(x), \Phi(x') \rangle$$

In the end, the nonlinear function takes the form:

$$f(x) = \sum_{i=1}^{n}(\alpha_i^+ - \alpha_i^-)\langle \Phi(x_i), \Phi(x) \rangle + b = \sum_{i=1}^{n}(\alpha_i^+ - \alpha_i^-)\,k(x_i, x) + b$$

Possible kernel functions:

- Linear kernel: $k(x, x') = \langle x, x' \rangle$
- Polynomial kernel: $k(x, x') = (\langle x, x' \rangle + c)^p$
- Multi-layer perceptron kernel: $k(x, x') = \tanh(\kappa \langle x, x' \rangle + c)$
- Gaussian radial basis function kernel: $k(x, x') = \exp\!\left(-\dfrac{\|x - x'\|^2}{2\sigma^2}\right)$
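The four kernels can be sketched in a few lines of Python (the default parameter values are illustrative, not from the slides):

```python
import math

def k_linear(x, y):
    return sum(a * b for a, b in zip(x, y))

def k_poly(x, y, c=1.0, p=2):
    return (k_linear(x, y) + c) ** p

def k_mlp(x, y, kappa=1.0, c=0.0):
    return math.tanh(kappa * k_linear(x, y) + c)

def k_rbf(x, y, sigma=1.0):
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

x, y = (1.0, 2.0), (2.0, 0.0)
print(k_linear(x, y))  # 1*2 + 2*0 = 2.0
print(k_poly(x, y))    # (2 + 1)^2 = 9.0
print(k_rbf(x, x))     # 1.0 for identical inputs
```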
Neural Network

The most common approach uses feed-forward networks with a sliding window over the input sequence.

Each neuron consists of three parts: inputs, weights, and an (activated) output:

$$o = \varphi\!\left(w_0 + \sum_{i=1}^{n} w_i\, x_{i,t}\right)$$

Common activation functions:

- Sigmoid function: $\varphi(a) = \dfrac{1}{1 + e^{-a}}$
- Hyperbolic tangent: $\varphi(a) = \dfrac{e^{2a} - 1}{e^{2a} + 1}$

[Figure: a single neuron with inputs $x_{1,t}, x_{2,t}, \dots, x_{n,t}$, weights $w_0, w_1, \dots, w_n$, and an activated output.]
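A sketch of the sliding-window setup and a single sigmoid neuron in Python (the series, window length, and weights are arbitrary illustrations):

```python
import math

def sliding_windows(series, n):
    """Turn a sequence into (window, next value) training pairs."""
    return [(series[t:t + n], series[t + n]) for t in range(len(series) - n)]

def neuron(x, w):
    """Single neuron: sigmoid(w0 + sum_i w_i * x_i); w[0] is the bias."""
    a = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-a))

series = [0.1, 0.2, 0.3, 0.4, 0.5]
pairs = sliding_windows(series, n=3)
print(pairs[0])  # ([0.1, 0.2, 0.3], 0.4)
print(neuron(pairs[0][0], [0.0, 1.0, 1.0, 1.0]))
```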
Example: 2-layer feed-forward Neural Network

$$\hat{y}_t = \sum_{j=1}^{m} w_j^{(2)}\, \varphi\!\left(w_{0j}^{(1)} + \sum_{i=1}^{n} w_{ij}^{(1)}\, x_{i,t}\right)$$

[Figure: a 2-layer feed-forward network mapping the windowed inputs $x_{1,t}, \dots, x_{n,t}$ through hidden neurons to the output $\hat{y}_t$.]

Besides gradient-based learning, neural networks can also be trained with evolutionary methods.

Their simple implementation and the mostly local dependencies exhibited in the structure allow for fast, parallel implementations in hardware.

Learning (Optimization) Algorithm

Error function: $E = \sum_{t=1}^{N}\left(y_t - \hat{y}_t\right)^2$

Chain rule (gradient-descent updates):

$$\Delta w_2(j) = -\eta\, \frac{\partial E}{\partial w_2(j)}$$

$$\Delta w_1(i,j) = -\eta\, \frac{\partial E}{\partial w_1(i,j)} = -\eta\, \frac{\partial E}{\partial h_j}\, \frac{\partial h_j}{\partial w_1(i,j)}$$

where $1 \le i \le n$ and $1 \le j \le m$ ($i$-th input, $j$-th hidden neuron).
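The update rules above can be illustrated end to end on a toy network (one input, one sigmoid hidden neuron, linear output; the data, initial weights, and learning rate are made up):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Toy network: h = sigmoid(w1 * x), yhat = w2 * h, E = sum_t (y_t - yhat_t)^2
data = [(0.0, 0.25), (1.0, 0.4), (2.0, 0.45)]  # made-up samples
w1, w2, eta = 0.5, 0.5, 0.1

def error(w1, w2):
    return sum((y - w2 * sigmoid(w1 * x)) ** 2 for x, y in data)

e0 = error(w1, w2)
for _ in range(200):
    g1 = g2 = 0.0
    for x, y in data:
        h = sigmoid(w1 * x)
        dE_dyhat = -2.0 * (y - w2 * h)
        g2 += dE_dyhat * h                       # dE/dw2
        g1 += dE_dyhat * w2 * h * (1.0 - h) * x  # chain rule: (dE/dh)(dh/dw1)
    w1 -= eta * g1                               # delta_w = -eta * dE/dw
    w2 -= eta * g2

print(e0, error(w1, w2))  # the error shrinks as training proceeds
```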

Batch Learning and Online Learning using NN

The universal approximation theorem states that a feed-forward network with a single hidden layer can approximate continuous functions under mild assumptions on the activation function.

ANFIS (Adaptive Neuro-Fuzzy Inference System)

ANFIS combines the advantages of fuzzy logic and neural networks.

Fuzzy rules are generated by input space partitioning or fuzzy C-means clustering.

Gaussian membership function:

$$\mu(x) = \exp\!\left(-\left(\frac{x - c}{\sigma}\right)^2\right)$$

TSK-type fuzzy IF–THEN rules:

$$R_j:\ \text{IF } x_1 \text{ IS } A_{1j} \text{ AND } x_2 \text{ IS } A_{2j} \text{ AND } \dots \text{ AND } x_n \text{ IS } A_{nj} \text{ THEN } y_j = a_{0j} + a_{1j} x_1 + \dots + a_{nj} x_n$$
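The membership function and a TSK rule can be sketched in Python (the centers, widths, and consequent coefficients are hypothetical):

```python
import math

def gauss_mf(x, c, sigma):
    """Gaussian membership function mu(x) = exp(-((x - c) / sigma)^2)."""
    return math.exp(-((x - c) / sigma) ** 2)

# One hypothetical TSK rule: IF x1 IS A1 AND x2 IS A2 THEN y = a0 + a1*x1 + a2*x2
def rule_firing(x1, x2):
    # AND is realized as the product of the membership degrees.
    return gauss_mf(x1, c=0.0, sigma=1.0) * gauss_mf(x2, c=1.0, sigma=1.0)

def rule_consequent(x1, x2, a=(1.0, 0.5, -0.5)):
    return a[0] + a[1] * x1 + a[2] * x2

print(gauss_mf(0.0, 0.0, 1.0))    # 1.0 at the center
print(rule_firing(0.0, 1.0))      # 1.0 when both inputs sit at the MF centers
print(rule_consequent(2.0, 2.0))  # 1 + 0.5*2 - 0.5*2 = 1.0
```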

Input space partitioning

For 2-dimensional input data $(x_1, x_2)$:

[Figure: the $(x_1, x_2)$ plane partitioned into a 3×3 grid by the membership functions $A_{11}, A_{12}, A_{13}$ and $A_{21}, A_{22}, A_{23}$ on each axis; each cell corresponds to one fuzzy rule (rules 1–9).]

Fuzzy C-means clustering

For $C$ clusters, the degree of belonging $\mu_{ij}$ of sample $i$ to cluster $j$ satisfies

$$\sum_{j=1}^{C} \mu_{ij} = 1, \quad \forall i = 1, \dots, n \quad (5)$$

$$\mu_{ij} = \frac{1}{\sum_{k=1}^{C}\left(\dfrac{d_{ij}}{d_{ik}}\right)^{\frac{2}{m-1}}}, \qquad \text{where } d_{ij} = \|x_i - c_j\| \quad (6)$$

Objective function:

$$J(\mu, c_1, \dots, c_C) = \sum_{j=1}^{C}\sum_{i=1}^{n}(\mu_{ij})^m\, d(x_i, c_j)^2 \quad (7)$$

$$\bar{J}(\mu, c_1, \dots, c_C) = J(\mu, c_1, \dots, c_C) + \sum_{i=1}^{n}\lambda_i\left(\sum_{j=1}^{C}\mu_{ij} - 1\right) \quad (8)$$

To minimize $J$, we take the derivatives of $\bar{J}$ and get the mean of cluster $j$:

$$c_j = \frac{\sum_{i=1}^{n}(\mu_{ij})^m\, x_i}{\sum_{i=1}^{n}(\mu_{ij})^m} \quad (9)$$

Fuzzy C-means algorithm:

1. Randomly initialize $\mu$ so that it satisfies (5)
2. Calculate the mean $c_j$ of each cluster using (9)
3. Update $\mu$ according to the new $c_j$ using (6)
4. Stop when $J$ or the change in $\mu$ is small enough
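The algorithm above can be sketched compactly in pure Python, following equations (5), (6), and (9). The 1-D data, cluster count, fuzzifier $m = 2$, and iteration count are illustrative choices, not from the slides:

```python
import random

def fcm(points, C, m=2.0, iters=50, seed=0):
    """Fuzzy C-means on 1-D points, following equations (5), (6), and (9)."""
    rng = random.Random(seed)
    n = len(points)
    # Random memberships, each row normalized so the degrees sum to 1 -- eq. (5).
    u = []
    for _ in range(n):
        row = [rng.random() for _ in range(C)]
        s = sum(row)
        u.append([v / s for v in row])
    for _ in range(iters):
        # Cluster means -- eq. (9).
        centers = []
        for j in range(C):
            num = sum((u[i][j] ** m) * points[i] for i in range(n))
            den = sum(u[i][j] ** m for i in range(n))
            centers.append(num / den)
        # Membership update -- eq. (6), with d_ij = |x_i - c_j| in 1-D.
        for i in range(n):
            d = [abs(points[i] - c) for c in centers]
            if min(d) == 0.0:  # sample sits exactly on a center
                u[i] = [1.0 if dj == 0.0 else 0.0 for dj in d]
                continue
            for j in range(C):
                u[i][j] = 1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1.0))
                                    for k in range(C))
    return centers, u

# Two obvious 1-D clusters around 0 and 10.
centers, u = fcm([0.0, 0.1, 0.2, 9.9, 10.0, 10.1], C=2)
print(sorted(centers))  # the two means settle near the two groups
```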
[Figure: ANFIS network architecture. Inputs $x_1, \dots, x_n$ feed membership functions $\mu_{11}, \dots, \mu_{nJ}$; product nodes ($\Pi$) form the rule firing strengths, normalization nodes ($N$) follow, rules $R_1, \dots, R_J$ produce the consequents $C_1, \dots, C_J$, and a summation node ($\Sigma$) yields the output $y$.]

1st layer: fuzzification layer

$$O_{ij}^{(1)} = \exp\!\left(-\left(\frac{x_i - c_{ij}}{\sigma_{ij}}\right)^2\right), \qquad 1 \le i \le n,\ 1 \le j \le J$$

2nd layer: conjunction layer

$$O_j^{(2)} = \prod_{i=1}^{n} O_{ij}^{(1)}$$

3rd layer: normalization layer

$$O_j^{(3)} = \frac{O_j^{(2)}}{\sum_{k=1}^{J} O_k^{(2)}}$$

4th layer: inference layer

$$O_j^{(4)} = O_j^{(3)} \times (a_{0j} + a_{1j} x_1 + \dots + a_{nj} x_n)$$

5th layer: output layer

$$y = O^{(5)} = \sum_{j=1}^{J} O_j^{(4)}$$
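The five layers compose into a single forward pass. A minimal Python sketch (the two rules, their Gaussian parameters, and the TSK coefficients are hypothetical):

```python
import math

def anfis_forward(x, centers, sigmas, consequents):
    """Forward pass through the five ANFIS layers for input vector x.
    centers[j][i], sigmas[j][i]: Gaussian MF parameters of rule j, input i.
    consequents[j]: TSK coefficients (a0, a1, ..., an) of rule j."""
    # Layers 1 and 2: fuzzification, then conjunction (rule firing strengths).
    fire = []
    for c_j, s_j in zip(centers, sigmas):
        w = 1.0
        for xi, c, s in zip(x, c_j, s_j):
            w *= math.exp(-((xi - c) / s) ** 2)
        fire.append(w)
    # Layer 3: normalization.
    total = sum(fire)
    norm = [w / total for w in fire]
    # Layers 4 and 5: weighted TSK consequents, then summation.
    out = 0.0
    for nw, a in zip(norm, consequents):
        out += nw * (a[0] + sum(ai * xi for ai, xi in zip(a[1:], x)))
    return out

# Two hypothetical rules over a single input.
y = anfis_forward([0.0],
                  centers=[[0.0], [2.0]],
                  sigmas=[[1.0], [1.0]],
                  consequents=[(1.0, 0.0), (3.0, 0.0)])
print(y)  # mostly rule 1 fires at x = 0, so the output stays close to 1
```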
Comparison

Neural Network vs. SVR

- Local minimum vs. global minimum
- Choice of kernel/activation function
- Computational complexity
- Parallel computation of neural networks
- Online learning vs. batch learning

ANFIS vs. Neural Network

- Convergence speed
- Number of fuzzy rules

|                   | SVR                   | NN                                       | ANFIS                            |
|-------------------|-----------------------|------------------------------------------|----------------------------------|
| Parameters        | ε, C, kernel function | # of hidden neurons, activation function | # of rules, membership function  |
| Solution          | Global minimum        | Local minimum                            | Local minimum                    |
| Complexity        | High                  | Low                                      | Medium                           |
| Convergence speed | Slow                  | Slow                                     | Fast                             |
| Parallelism       | Infeasible            | Feasible                                 | Feasible                         |
| Online learning   | Infeasible            | Feasible                                 | Feasible                         |
Example: Function Approximation (1)

ANFIS

```matlab
x = (0:0.5:10)';
w = 5*rand;
b = 4*rand;
y = w*x + b;
trnData = [x y];
tic;
numMFs = 5;
mfType = 'gbellmf';
epoch_n = 20;
in_fis = genfis1(trnData,numMFs,mfType);
out_fis = anfis(trnData,in_fis,epoch_n);
time = toc;
h = evalfis(x,out_fis);
plot(x,y,x,h);
legend('Training Data','ANFIS Output');
RMSE = sqrt(sum((h-y).^2)/length(h));
disp(['Time = ',num2str(time),' RMSE = ',num2str(RMSE)])
```

[Figure: training data and ANFIS output overlapping on the line $y = wx + b$ over $x \in [0, 10]$.]

Time = 0.015707 RMSE = 5.8766e-06
NN

```matlab
x = (0:0.5:10)';
w = 5*rand;
b = 4*rand;
y = w*x + b;
trnData = [x y];
tic;
net = feedforwardnet(5,'trainlm');
model = train(net,trnData(:,1)',trnData(:,2)');
time = toc;
h = model(x')';
plot(x,y,x,h);
legend('Training Data','NN Output');
RMSE = sqrt(sum((h-y).^2)/length(h));
disp(['Time = ',num2str(time),' RMSE = ',num2str(RMSE)])
```

[Figure: training data and NN output over $x \in [0, 10]$.]

Time = 4.3306 RMSE = 0.00010074
SVR

```matlab
clear;
clc;
x = (0:0.5:10)';
w = 5*rand;
b = 4*rand;
y = w*x + b;
trnData = [x y];
tic;
model = svmtrain(y,x,'-s 3 -t 0 -c 2.2 -p 1e-7');
time = toc;
h = svmpredict(y,x,model);
plot(x,y,x,h);
legend('Training Data','SVR Output');
RMSE = sqrt(sum((h-y).^2)/length(h));
disp(['Time = ',num2str(time),' RMSE = ',num2str(RMSE)])
```

[Figure: training data and SVR output over $x \in [0, 10]$.]

Time = 0.00083499 RMSE = 6.0553e-08

Given function: w = 3.277389450887783 and b = 0.684746751246247

```matlab
% model struct
% model.SVs     : sparse matrix of SVs
% model.sv_coef : SV coefficients
% model.rho     : -b of f(x) = w*x + b
% for the linear kernel:
h_2 = full(model.SVs)'*model.sv_coef*x - model.rho;
```

Recovered coefficients: full(model.SVs)'\*model.sv_coef = 3.277389430887783 and -model.rho = 0.684746851246246
Example: Function Approximation (2)

ANFIS

```matlab
x = (0:0.1:10)';
y = sin(2*x)./exp(x/5);
trnData = [x y];
tic;
numMFs = 5;
mfType = 'gbellmf';
epoch_n = 20;
in_fis = genfis1(trnData,numMFs,mfType);
out_fis = anfis(trnData,in_fis,epoch_n);
time = toc;
h = evalfis(x,out_fis);
plot(x,y,x,h);
legend('Training Data','ANFIS Output');
RMSE = sqrt(sum((h-y).^2)/length(h));
disp(['Time = ',num2str(time),' RMSE = ',num2str(RMSE)])
```

[Figure: training data $y = \sin(2x)\,e^{-x/5}$ and ANFIS output over $x \in [0, 10]$.]

Time = 0.049087 RMSE = 0.042318
NN

```matlab
x = (0:0.1:10)';
y = sin(2*x)./exp(x/5);
trnData = [x y];
tic;
net = feedforwardnet(5,'trainlm');
model = train(net,trnData(:,1)',trnData(:,2)');
time = toc;
h = model(x')';
plot(x,y,x,h);
legend('Training Data','NN Output');
RMSE = sqrt(sum((h-y).^2)/length(h));
disp(['Time = ',num2str(time),' RMSE = ',num2str(RMSE)])
```

[Figure: training data and NN output over $x \in [0, 10]$.]

Time = 0.77625 RMSE = 0.012563
SVR

```matlab
clear;
clc;
x = (0:0.1:10)';
y = sin(2*x)./exp(x/5);
trnData = [x y];
tic;
model = svmtrain(y,x,'-s 3 -t 0 -c 2.2 -p 1e-7');
time = toc;
h = svmpredict(y,x,model);
plot(x,y,x,h);
legend('Training Data','SVR Output');
RMSE = sqrt(sum((h-y).^2)/length(h));
disp(['Time = ',num2str(time),' RMSE = ',num2str(RMSE)])
```

[Figure: training data and SVR output over $x \in [0, 10]$.]

Time = 0.0039602 RMSE = 0.0036972

[Figure: SVR outputs from two additional runs with different parameter settings.]

Time = 20.9686 RMSE = 0.34124

Time = 0.0038785 RMSE = 0.33304