Tutorial 1
General Introduction to SDA
Yin-Jing Tien (田銀錦)
Institute of Statistical Science
Academia Sinica
[email protected]
June 13, 2014
Symbolic Data Analysis (SDA)
(Diday 1987)
Texts:
Billard and Diday (2006):
Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley.
Diday, E., Noirhomme-Fraiture, M. (2008):
Symbolic Data Analysis and the SODAS Software. John Wiley & Sons Ltd., Chichester, England.
Symbolic data
(Diday 1987)
• Classical Data : Individuals: single value
Single player
age = 25, eye color = blue
• Symbolic Data : Symbolic units (Concept: groups)
Team
interval : age range = [20, 36]
multiple values: eye color = {blue,brown,black}
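The step from classical to symbolic data can be sketched in a few lines of Python. This is only an illustration: the player records, the field names, and the helper `to_symbolic` are hypothetical and not part of the tutorial. Each group of individuals is summarized by an interval for the numeric variable and a set of values for the categorical one.

```python
# Hypothetical player records; names and values are illustrative only.
players = [
    {"team": "A", "age": 20, "eye_color": "blue"},
    {"team": "A", "age": 36, "eye_color": "brown"},
    {"team": "A", "age": 25, "eye_color": "black"},
]


def to_symbolic(records, key):
    """Aggregate classical records into one symbolic description per group."""
    groups = {}
    for record in records:
        groups.setdefault(record[key], []).append(record)
    return {
        name: {
            # Interval-valued: (min, max) of the observed ages.
            "age": (min(r["age"] for r in rows), max(r["age"] for r in rows)),
            # Multi-valued: set of observed eye colors.
            "eye_color": {r["eye_color"] for r in rows},
        }
        for name, rows in groups.items()
    }


teams = to_symbolic(players, "team")
print(teams["A"])
```

Other summaries (histograms, weighted categories) could be produced in the same aggregation step.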
Symbolic data analysis
When?
• When we are interested in higher-level units
(concepts: groups/classes).
• When the initial data are already given as
symbolic data tables.
• When the data are big.
Symbolic data types
Symbolic data types (quantitative)
A multi-valued symbolic random variable Y takes
one or more values:
{12, 23, 20}
An interval-valued symbolic random variable Y
takes values in an interval:
[17, 25]
Modal multi-valued (each value is followed by its weight):
{0.5, 3/8; 1.5, 4/8; 2, 1/8}
$Y(u) = \{\xi_k, \pi_k;\ k = 1, 2, \dots, s_u\}$
Modal interval-valued (histogram; each subinterval is followed by its weight):
{[12, 40), 1/7; [40, 60), 2/7; [60, 80], 4/7}
$Y(u) = \{[a_{uk}, b_{uk}), p_{uk};\ k = 1, 2, \dots, s_u\}$
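As a rough illustration of the modal interval-valued (histogram) type, the sketch below bins raw observations into subintervals and attaches relative frequencies. The raw ages and the function name `modal_interval_valued` are assumptions made for the example; only the bin edges and the resulting weights mirror the histogram shown on the slide.

```python
from fractions import Fraction


def modal_interval_valued(values, edges):
    """Return [((a_k, b_k), p_k), ...] for bins [a_k, b_k); the last bin is closed."""
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for v in values:
        for k in range(n_bins):
            is_last = k == n_bins - 1
            if edges[k] <= v < edges[k + 1] or (is_last and v == edges[-1]):
                counts[k] += 1
                break
    total = sum(counts)  # values outside the edges are ignored
    return [((edges[k], edges[k + 1]), Fraction(counts[k], total))
            for k in range(n_bins)]


# Hypothetical raw observations for one concept.
ages = [15, 45, 50, 62, 70, 75, 80]
hist = modal_interval_valued(ages, [12, 40, 60, 80])
for (a, b), p in hist:
    print(f"[{a}, {b}): {p}")
```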
Symbolic data types (qualitative)
A multi-valued symbolic random variable Y takes
one or more values.
E.g., bird colors, Y = color
Modal multi-valued (each category is followed by its weight):
$Y(u) = \{\xi_k, \pi_k;\ k = 1, 2, \dots, s_u\}$
{single, 3/8; married, 5/8}
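A modal multi-valued description such as the one above can be obtained from raw categorical observations by counting. The following is a minimal sketch; the list of marital statuses is made up so that the weights come out as 3/8 and 5/8, and `modal_multi_valued` is a hypothetical helper name.

```python
from collections import Counter
from fractions import Fraction


def modal_multi_valued(values):
    """Map each observed category to its relative frequency."""
    counts = Counter(values)
    n = len(values)
    return {category: Fraction(count, n) for category, count in counts.items()}


# Hypothetical individual-level data for one concept (8 people).
status = ["single"] * 3 + ["married"] * 5
description = modal_multi_valued(status)
print(description)
```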
Basic Descriptive Statistics: Interval Value
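Let $Z_i = (I_{1i}, I_{2i}, \dots, I_{ki})^T$ be the interval data for the $i$th
variable with $k$ concepts, where $I_{ci} = [a_{ci}, b_{ci}]$, $c = 1, 2, \dots, k$.
Sample mean of $Z_i$ is
$$\bar{Z}_i = \frac{1}{2k} \sum_{c=1}^{k} (a_{ci} + b_{ci})$$
Sample variance of $Z_i$ is
$$S_i^2 = \frac{1}{3k} \sum_{c=1}^{k} \left( a_{ci}^2 + a_{ci} b_{ci} + b_{ci}^2 \right) - \frac{1}{4k^2} \left[ \sum_{c=1}^{k} (a_{ci} + b_{ci}) \right]^2$$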
Basic Descriptive Statistics: Interval Value
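Rewrite $S_i^2$ as
$$S_i^2 = \frac{1}{3k} \sum_{c=1}^{k} \left[ (a_{ci} - \bar{Z}_i)^2 + (a_{ci} - \bar{Z}_i)(b_{ci} - \bar{Z}_i) + (b_{ci} - \bar{Z}_i)^2 \right]$$
Total Variation = Within Variation + Between Variation
$$\text{Within Variation} = \frac{1}{k} \sum_{c=1}^{k} \frac{1}{3} \left[ (a_{ci} - \bar{Z}_{ci})^2 + (a_{ci} - \bar{Z}_{ci})(b_{ci} - \bar{Z}_{ci}) + (b_{ci} - \bar{Z}_{ci})^2 \right] = \frac{1}{k} \sum_{c=1}^{k} \frac{1}{12} (b_{ci} - a_{ci})^2$$
$$\text{Between Variation} = \frac{1}{k} \sum_{c=1}^{k} \left[ \frac{1}{2}(a_{ci} + b_{ci}) - \bar{Z}_i \right]^2$$
where
$$\bar{Z}_{ci} = \frac{1}{2}(a_{ci} + b_{ci}), \qquad \bar{Z}_i = \frac{1}{2k} \sum_{c=1}^{k} (a_{ci} + b_{ci})$$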
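The decomposition of the interval-valued sample variance into within-interval and between-interval parts can be checked numerically. The sketch below is one possible reading of the formulas on these slides; the three intervals are arbitrary illustrative values and the function names are assumptions, not taken from any SDA package.

```python
def interval_mean(intervals):
    """Sample mean: average of all interval midpoints."""
    k = len(intervals)
    return sum(a + b for a, b in intervals) / (2 * k)


def total_variation(intervals):
    """Sample variance S^2 in its rewritten (centered) form."""
    k = len(intervals)
    m = interval_mean(intervals)
    return sum((a - m) ** 2 + (a - m) * (b - m) + (b - m) ** 2
               for a, b in intervals) / (3 * k)


def within_variation(intervals):
    """Average internal spread: (b - a)^2 / 12 per interval."""
    k = len(intervals)
    return sum((b - a) ** 2 for a, b in intervals) / (12 * k)


def between_variation(intervals):
    """Spread of the interval midpoints around the overall mean."""
    k = len(intervals)
    m = interval_mean(intervals)
    return sum(((a + b) / 2 - m) ** 2 for a, b in intervals) / k


# Arbitrary illustrative intervals [a_c, b_c].
z = [(1.0, 3.0), (2.0, 6.0), (5.0, 9.0)]
total = total_variation(z)
within = within_variation(z)
between = between_variation(z)
print(total, within, between, within + between)
```

When every interval collapses to a single point (a = b), the within part vanishes and the total reduces to the classical population variance of the points.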
Similarity between Variables (interval-valued data)
(Billard and Diday (2006))
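The empirical covariance function between $Z_i$ and $Z_j$ is
$$\mathrm{Cov}(Z_i, Z_j) = \frac{1}{4k} \sum_{c=1}^{k} (a_{ci} + b_{ci})(a_{cj} + b_{cj}) - \frac{1}{4k^2} \left[ \sum_{c=1}^{k} (a_{ci} + b_{ci}) \right] \left[ \sum_{c=1}^{k} (a_{cj} + b_{cj}) \right]$$
The empirical correlation coefficient between $Z_i$ and $Z_j$ is
$$R(Z_i, Z_j) = \frac{\mathrm{Cov}(Z_i, Z_j)}{S_i S_j}$$
where $S_i$ and $S_j$ are the sample standard deviations of $Z_i$ and $Z_j$.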
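A small numerical sketch of the empirical covariance and correlation for interval-valued variables follows. It assumes the covariance is computed from the interval endpoints as on this slide and that the standard deviations come from the interval sample variance of the previous slides; the data and function names are illustrative only.

```python
import math


def interval_var(z):
    """Interval sample variance S^2 (centered form)."""
    k = len(z)
    m = sum(a + b for a, b in z) / (2 * k)
    return sum((a - m) ** 2 + (a - m) * (b - m) + (b - m) ** 2
               for a, b in z) / (3 * k)


def interval_cov(zi, zj):
    """Empirical covariance between two interval-valued variables."""
    k = len(zi)
    cross = sum((a1 + b1) * (a2 + b2) for (a1, b1), (a2, b2) in zip(zi, zj))
    sum_i = sum(a + b for a, b in zi)
    sum_j = sum(a + b for a, b in zj)
    return cross / (4 * k) - sum_i * sum_j / (4 * k * k)


def interval_corr(zi, zj):
    """Empirical correlation: covariance over the product of standard deviations."""
    return interval_cov(zi, zj) / math.sqrt(interval_var(zi) * interval_var(zj))


# Degenerate intervals (a = b) behave like classical data: zj = 2 * zi.
zi = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
zj = [(2.0, 2.0), (4.0, 4.0), (6.0, 6.0)]
print(interval_cov(zi, zj), interval_corr(zi, zj))
```

For genuinely wide intervals the within-interval spread enters the denominator but not the covariance, so the correlation is pulled toward zero compared with the correlation of the midpoints.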
Distance between concepts
Definition 7.6: The Cartesian join A⊕B between two sets A and B is their
componentwise union, A⊕B = (A1⊕B1, ..., Ap⊕Bp).
Definition 7.7: The Cartesian meet A⊗B between two sets A and B is their
componentwise intersection, A⊗B = (A1⊗B1, ..., Ap⊗Bp).
Distance between concept (Multi-valued)
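The Gowda-Diday dissimilarity measure (Gowda and Diday, 1991)
$$D(\omega_1, \omega_2) = \sum_{j=1}^{p} \left[ D_{1j}(\omega_1, \omega_2) + D_{2j}(\omega_1, \omega_2) \right]$$
$$D_{1j}(\omega_1, \omega_2) = \frac{|k_j^1 - k_j^2|}{k_j} \quad \text{(relative sizes)}$$
$$D_{2j}(\omega_1, \omega_2) = \frac{k_j^1 + k_j^2 - 2 k_j^*}{k_j} \quad \text{(relative content)}$$
where $k_j^1$ and $k_j^2$ are the numbers of values of $Y_j$ in $\omega_1$ and $\omega_2$, $k_j$ is the number of values in their join, and $k_j^*$ is the number of values in their meet.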
Distance between concept (Multi-valued)
Example: Color and Habitat of Birds (Table 7.2)
Y1 = Color, Y2 = Habitat
The Gowda-Diday dissimilarity
For Y1: D11(ω1, ω2) = |2 − 1|/2 = 1/2
D21(ω1, ω2) = (2 + 1 − 2×1)/2 = 1/2
For Y2: D12(ω1, ω2) = |2 − 1|/2 = 1/2
D22(ω1, ω2) = (2 + 1 − 2×1)/2 = 1/2
D(ω1, ω2) = (1/2 + 1/2) + (1/2 + 1/2) = 2
Normalized (adjusted for scale) with weights $1/|\mathcal{Y}_j|$, where $|\mathcal{Y}_1| = 3$, $|\mathcal{Y}_2| = 2$, and p = 2:
D(ω1, ω2) = (1/2 + 1/2)/3 + (1/2 + 1/2)/2 = 5/6
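The bird example can be reproduced with Python sets. Table 7.2 itself is not shown here, so the concrete colors and habitats below are hypothetical; they are chosen only so that the counts match the example (2 values for ω1, 1 value for ω2, 1 value in common, domain sizes 3 and 2).

```python
from fractions import Fraction


def gowda_diday_multi(x, y):
    """Gowda-Diday component for one multi-valued variable: D1j + D2j."""
    join = x | y  # Cartesian join: union of the two value sets
    meet = x & y  # Cartesian meet: intersection of the two value sets
    k = len(join)
    d1 = Fraction(abs(len(x) - len(y)), k)               # relative sizes
    d2 = Fraction(len(x) + len(y) - 2 * len(meet), k)    # relative content
    return d1 + d2


# Hypothetical values consistent with the counts used in the example.
omega1 = {"color": {"red", "black"}, "habitat": {"urban", "rural"}}
omega2 = {"color": {"red"}, "habitat": {"rural"}}
domain_size = {"color": 3, "habitat": 2}  # |Y_j|: number of possible values

per_variable = {v: gowda_diday_multi(omega1[v], omega2[v]) for v in omega1}
d_raw = sum(per_variable.values())
d_normalized = sum(d / domain_size[v] for v, d in per_variable.items())
print(d_raw, d_normalized)
```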
Distance between concept (Multi-valued)
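The Ichino-Yaguchi dissimilarity measure (Ichino and Yaguchi, 1994)
$$\phi_j(\omega_1, \omega_2) = |\omega_1 \oplus \omega_2| - |\omega_1 \otimes \omega_2| + \gamma \left( 2|\omega_1 \otimes \omega_2| - |\omega_1| - |\omega_2| \right)$$
For Y1: ϕ1(ω1, ω2) = 2 − 1 + γ(2×1 − 2 − 1) = 1 − γ
For Y2: ϕ2(ω1, ω2) = 2 − 1 + γ(2×1 − 2 − 1) = 1 − γ
Taking γ = 0.5 gives ϕ1 = ϕ2 = 0.5.
Unweighted Minkowski distance
$$D_q(\omega_1, \omega_2) = \left( 0.5^q + 0.5^q \right)^{1/q}$$
Weighted Minkowski distance (each ϕj scaled by $1/|\mathcal{Y}_j|$)
$$D_q(\omega_1, \omega_2) = \left( (0.5/3)^q + (0.5/2)^q \right)^{1/q}$$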
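The same calculation can be sketched for the Ichino-Yaguchi measure on multi-valued data, followed by the Minkowski combination over variables. As before, the concrete sets are hypothetical stand-ins that only match the counts in the example, and the function names are assumptions.

```python
def ichino_yaguchi_multi(x, y, gamma=0.5):
    """Ichino-Yaguchi component for one multi-valued variable."""
    join_size = len(x | y)
    meet_size = len(x & y)
    return join_size - meet_size + gamma * (2 * meet_size - len(x) - len(y))


def minkowski(phis, q, scales=None):
    """Minkowski distance of order q; each phi is divided by its scale if given."""
    if scales is None:
        scales = [1.0] * len(phis)
    return sum((p / s) ** q for p, s in zip(phis, scales)) ** (1.0 / q)


# Hypothetical values consistent with the counts used in the example.
colors = ({"red", "black"}, {"red"})
habitats = ({"urban", "rural"}, {"rural"})

phis = [ichino_yaguchi_multi(*colors), ichino_yaguchi_multi(*habitats)]
print(phis)
print(minkowski(phis, q=1), minkowski(phis, q=2))
print(minkowski(phis, q=1, scales=[3, 2]))
```

With γ = 0 the component reduces to the size of the join minus the size of the meet.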
Distance between concept (Interval-valued)
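Let $Z_i = (I_{1i}, I_{2i}, \dots, I_{ki})^T$ be the interval data for the $i$th variable with $k$
concepts, where $I_{ci} = [a_{ci}, b_{ci}]$, $c = 1, 2, \dots, k$.
The Gowda-Diday dissimilarity measure (Gowda and Diday, 1991)
$$D(\omega_1, \omega_2) = \sum_{j=1}^{p} D_j(\omega_1, \omega_2), \qquad D_j(\omega_1, \omega_2) = D_{j1}(\omega_1, \omega_2) + D_{j2}(\omega_1, \omega_2) + D_{j3}(\omega_1, \omega_2)$$
where $D_j(\omega_1, \omega_2)$ is the contribution of the variable $Y_j$ and
$$D_{j1}(\omega_1, \omega_2) = \frac{\left| |b_{1j} - a_{1j}| - |b_{2j} - a_{2j}| \right|}{k_j} \quad \text{(relative length)}$$
$$D_{j2}(\omega_1, \omega_2) = \frac{|b_{1j} - a_{1j}| + |b_{2j} - a_{2j}| - 2 I_j}{k_j} \quad \text{(relative content)}$$
$$D_{j3}(\omega_1, \omega_2) = \frac{|a_{1j} - a_{2j}|}{|\mathcal{Y}_j|} \quad \text{(relative position)}$$
$k_j = |\max(b_{1j}, b_{2j}) - \min(a_{1j}, a_{2j})|$ is the length of the entire distance spanned by ω1 and ω2.
$I_j = |\max(a_{1j}, a_{2j}) - \min(b_{1j}, b_{2j})|$ if the intervals overlap, and $I_j = 0$ otherwise; it is the length of the intersection.
$|\mathcal{Y}_j| = \max_u(b_{uj}) - \min_u(a_{uj})$ is the total length in $\mathcal{Y}_j$ covered by the observed values of $Y_j$.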
Distance between concept (Interval-valued)
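The Ichino-Yaguchi dissimilarity measure (Ichino and Yaguchi, 1994)
$$\phi_j(\omega_1, \omega_2) = |\omega_1 \oplus \omega_2| - |\omega_1 \otimes \omega_2| + \gamma \left( 2|\omega_1 \otimes \omega_2| - |\omega_1| - |\omega_2| \right)$$
$$\omega_1 \oplus \omega_2 = [\min(a_{1j}, a_{2j}), \max(b_{1j}, b_{2j})]$$
$$\omega_1 \otimes \omega_2 = [\max(a_{1j}, a_{2j}), \min(b_{1j}, b_{2j})] \quad \text{(empty if no intersection)}$$
The generalized Minkowski distance of order q ≥ 1 between two interval-valued
observations ξ(ω1) and ξ(ω2) is
$$d_q(\omega_1, \omega_2) = \left( \sum_{j=1}^{p} w_j^* \left[ \phi_j(\omega_1, \omega_2) \right]^q \right)^{1/q}$$
where ϕj(ω1, ω2) is the Ichino-Yaguchi distance and $w_j^*$ is a weight function
associated with variable Yj.
When q = 1: City Block distance
When q = 2: Euclidean distance
$$\phi_j(\omega_1, \omega_2) = |a_{1j} - a_{2j}| + |b_{1j} - b_{2j}|$$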
Distance between concept (Interval-valued)
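The Hausdorff Distance (Chavent and Lechevallier, 2002)
$$d(\omega_1, \omega_2) = \sum_{j=1}^{p} \phi_j(\omega_1, \omega_2), \qquad \phi_j(\omega_1, \omega_2) = \max\left( |a_{1j} - a_{2j}|, |b_{1j} - b_{2j}| \right)$$
The Euclidean Hausdorff Distance
$$d(\omega_1, \omega_2) = \left( \sum_{j=1}^{p} \left[ \phi_j(\omega_1, \omega_2) \right]^2 \right)^{1/2}$$
where ϕj(ω1, ω2) is the Hausdorff distance.
The Normalized Euclidean Hausdorff Distance
$$d(\omega_1, \omega_2) = \left( \sum_{j=1}^{p} \left[ \frac{\phi_j(\omega_1, \omega_2)}{H_j} \right]^2 \right)^{1/2}$$
where
$$H_j^2 = \frac{1}{2m^2} \sum_{u_1=1}^{m} \sum_{u_2=1}^{m} \left[ \phi_j(\omega_{u_1}, \omega_{u_2}) \right]^2$$
and m is the number of concepts.
The Span Normalized Euclidean Hausdorff Distance
$$d(\omega_1, \omega_2) = \left( \sum_{j=1}^{p} \left[ \frac{\phi_j(\omega_1, \omega_2)}{|\mathcal{Y}_j|} \right]^2 \right)^{1/2}$$
where the span is $|\mathcal{Y}_j| = \max_u(b_{uj}) - \min_u(a_{uj})$.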


Distance between concept (Interval-valued)
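Example: Take only the first 3 observations of the veterinary data
Gowda-Diday dissimilarity
$$k_1 = |\max(b_{11}, b_{21}) - \min(a_{11}, a_{21})| = 180 - 120 = 60$$
$$k_2 = 355 - 222.2 = 132.8$$
$$I_1 = |\max(a_{11}, a_{21}) - \min(b_{11}, b_{21})| = |158 - 160| = 2$$
$$I_2 = |322 - 354| = 32$$
$$|\mathcal{Y}_1| = \max_u(b_{u1}) - \min_u(a_{u1}) = 185 - 120 = 65$$
$$|\mathcal{Y}_2| = 355 - 117.2 = 237.8$$
$$D(\omega_1, \omega_2) = \sum_{j=1}^{2} \left[ D_{j1}(\omega_1, \omega_2) + D_{j2}(\omega_1, \omega_2) + D_{j3}(\omega_1, \omega_2) \right]$$
= [ |60 − 2|/60 + (60 + 2 − 2×2)/60 + |120 − 158|/65 ]   (Y1)
+ [ |131.8 − 33|/132.8 + (131.8 + 33 − 2×32)/132.8 + |222.2 − 322|/237.8 ]   (Y2)
= 4.44
where
$$D_{j1}(\omega_1, \omega_2) = \frac{\left| |b_{1j} - a_{1j}| - |b_{2j} - a_{2j}| \right|}{k_j}, \quad D_{j2}(\omega_1, \omega_2) = \frac{|b_{1j} - a_{1j}| + |b_{2j} - a_{2j}| - 2 I_j}{k_j}, \quad D_{j3}(\omega_1, \omega_2) = \frac{|a_{1j} - a_{2j}|}{|\mathcal{Y}_j|}$$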
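The arithmetic of this example can be replayed in code. The sketch below uses the endpoints that appear in the calculation (ω1 = [120, 180], [222.2, 354]; ω2 = [158, 160], [322, 355]) and the spans 65 and 237.8; the function name and the data layout are assumptions made for illustration.

```python
def gowda_diday_interval(x, y, span):
    """Gowda-Diday component Dj1 + Dj2 + Dj3 for one interval-valued variable.

    Assumes the two intervals are not both the same single point (k > 0).
    """
    (a1, b1), (a2, b2) = x, y
    k = abs(max(b1, b2) - min(a1, a2))            # length spanned by both intervals
    overlap = max(a1, a2) <= min(b1, b2)
    inter = abs(max(a1, a2) - min(b1, b2)) if overlap else 0.0
    d1 = abs((b1 - a1) - (b2 - a2)) / k           # relative length
    d2 = ((b1 - a1) + (b2 - a2) - 2 * inter) / k  # relative content
    d3 = abs(a1 - a2) / span                      # relative position
    return d1 + d2 + d3


omega1 = [(120.0, 180.0), (222.2, 354.0)]
omega2 = [(158.0, 160.0), (322.0, 355.0)]
spans = [65.0, 237.8]  # |Y_j| taken from the example

components = [gowda_diday_interval(x, y, s)
              for x, y, s in zip(omega1, omega2, spans)]
distance = sum(components)
print(components, round(distance, 2))
```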
Distance between concept (Interval-valued)
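The Ichino-Yaguchi dissimilarity
$$\phi_j(\omega_1, \omega_2) = |\omega_1 \oplus \omega_2| - |\omega_1 \otimes \omega_2| + \gamma \left( 2|\omega_1 \otimes \omega_2| - |\omega_1| - |\omega_2| \right)$$
$$\omega_1 \oplus \omega_2 = [\min(a_{1j}, a_{2j}), \max(b_{1j}, b_{2j})]$$
$$\omega_1 \otimes \omega_2 = [\max(a_{1j}, a_{2j}), \min(b_{1j}, b_{2j})] \quad \text{(empty if no intersection)}$$
ϕ1(ω1, ω2) = |180 − 120| − |160 − 158| + γ(2|160 − 158| − |180 − 120| − |160 − 158|)
= 58 + γ(−58)
ϕ2(ω1, ω2) = |355 − 222.2| − |354 − 322| + γ(2|354 − 322| − |354 − 222.2| − |355 − 322|)
= 100.8 + γ(−100.8)
The generalized Minkowski distance
$$d_q(\omega_1, \omega_2) = \left( \sum_{j=1}^{p} w_j^* \left[ \phi_j(\omega_1, \omega_2) \right]^q \right)^{1/q}$$
When q = 1: City Block distance
When q = 2: Euclidean distance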
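For interval-valued data the Ichino-Yaguchi component only needs the lengths of the join, the meet, and the two intervals. The following sketch recomputes the two components of this example and combines them with a Minkowski distance; equal weights of 1 and γ = 0.5 are assumptions made for the illustration, as are the function names.

```python
def ichino_yaguchi_interval(x, y, gamma=0.5):
    """Ichino-Yaguchi component for one interval-valued variable."""
    (a1, b1), (a2, b2) = x, y
    join_len = max(b1, b2) - min(a1, a2)
    meet_len = max(0.0, min(b1, b2) - max(a1, a2))  # 0 if the meet is empty
    return join_len - meet_len + gamma * (2 * meet_len - (b1 - a1) - (b2 - a2))


def minkowski(phis, q, weights=None):
    """Generalized Minkowski distance of order q with optional weights."""
    if weights is None:
        weights = [1.0] * len(phis)
    return sum(w * p ** q for w, p in zip(weights, phis)) ** (1.0 / q)


omega1 = [(120.0, 180.0), (222.2, 354.0)]
omega2 = [(158.0, 160.0), (322.0, 355.0)]

phis = [ichino_yaguchi_interval(x, y, gamma=0.5) for x, y in zip(omega1, omega2)]
city_block = minkowski(phis, q=1)
euclidean = minkowski(phis, q=2)
print(phis, city_block, euclidean)
```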
Distance between concept (Interval-valued)
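The Hausdorff Distance
$$\phi_j(\omega_1, \omega_2) = \max\left( |a_{1j} - a_{2j}|, |b_{1j} - b_{2j}| \right)$$
ϕ1(ω1, ω2) = max(|120 − 158|, |180 − 160|) = 38
ϕ2(ω1, ω2) = max(|222.2 − 322|, |354 − 355|) = 99.8
$$d(\omega_1, \omega_2) = \sum_{j=1}^{2} \phi_j(\omega_1, \omega_2) = 38 + 99.8 = 137.8$$
The Euclidean Hausdorff Distance
$$d(\omega_1, \omega_2) = \left( \sum_{j=1}^{2} \left[ \phi_j(\omega_1, \omega_2) \right]^2 \right)^{1/2} = (38^2 + 99.8^2)^{1/2} = 106.79$$
The Normalized Euclidean Hausdorff Distance
$$d(\omega_1, \omega_2) = \left( \sum_{j=1}^{2} \left[ \frac{\phi_j(\omega_1, \omega_2)}{H_j} \right]^2 \right)^{1/2} = 2.633$$
$$H_1^2 = \frac{1}{2m^2} \sum_{u_1=1}^{m} \sum_{u_2=1}^{m} \left[ \phi_1(\omega_{u_1}, \omega_{u_2}) \right]^2 = \frac{1}{2 \times 3^2} \left[ 38^2 + 55^2 + 27^2 \right] = 288.78$$
$$H_2^2 = 5150.39$$
The Span Normalized Euclidean Hausdorff Distance
$$d(\omega_1, \omega_2) = \left( \sum_{j=1}^{2} \left[ \frac{\phi_j(\omega_1, \omega_2)}{|\mathcal{Y}_j|} \right]^2 \right)^{1/2} = 0.720$$
$$|\mathcal{Y}_j| = \max_u(b_{uj}) - \min_u(a_{uj})$$
$$|\mathcal{Y}_1| = 185 - 120 = 65$$
$$|\mathcal{Y}_2| = 355 - 117.2 = 237.8$$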
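The Hausdorff variants differ only in how the per-variable components are scaled before they are combined. A minimal sketch of the four versions follows; the normalizing constants (H² values and spans) are taken as given from this example rather than recomputed, and the variable and function names are illustrative.

```python
import math


def hausdorff(x, y):
    """Hausdorff distance between two intervals [a1, b1] and [a2, b2]."""
    (a1, b1), (a2, b2) = x, y
    return max(abs(a1 - a2), abs(b1 - b2))


omega1 = [(120.0, 180.0), (222.2, 354.0)]
omega2 = [(158.0, 160.0), (322.0, 355.0)]
h_squared = [288.78, 5150.39]  # H_j^2 as quoted in the example
spans = [65.0, 237.8]          # |Y_j| as quoted in the example

phis = [hausdorff(x, y) for x, y in zip(omega1, omega2)]

d_sum = sum(phis)                                                     # Hausdorff
d_euclid = math.sqrt(sum(p ** 2 for p in phis))                       # Euclidean
d_norm = math.sqrt(sum(p ** 2 / h2 for p, h2 in zip(phis, h_squared)))  # normalized
d_span = math.sqrt(sum((p / s) ** 2 for p, s in zip(phis, spans)))    # span normalized

print(phis)
print(round(d_sum, 1), round(d_euclid, 2), round(d_norm, 3), round(d_span, 3))
```

Dividing by H_j or by the span makes the components unit-free, so a variable measured on a larger scale (here Y2) no longer dominates the distance.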
Distance between concept (group) of interval-valued data
Comparison of between-concept distance measures
Interval-valued symbolic data analysis
• Books (Bock and Diday (2000); Billard and Diday (2003, 2006); Diday and Noirhomme-Fraiture (2008))
• PCA (Chouakria, Cazes, and Diday (2000); Palumbo and Lauro (2003); Gioia and Lauro (2006); Hamada, Minami, and Mizuta (2008))
• Clustering analysis (Brito (2002); Souza and de Carvalho (2004); Chavent et al. (2006); Bock (2008))
• Discriminant analysis (Lauro, Verde, and Palumbo (2000); Duarte Silva and Brito (2006))
• MDS (Groenen et al. (2006); Minami and Mizuta (2008))
• Regression (Billard and Diday (2000); de Carvalho et al. (2004))
Visualization Tools for Symbolic Data (Analysis)
Symbolic Data Analysis Software
• SODAS (2003)
Free; developed by two European consortia
• SYR (2008)
A more professional tool from the SYROKKO company
www.syrokko.com
