Data - Issues

Individuals Study

Variables Study

Helps to Interpret

Multivariate Data Analysis

Special focus on Clustering and Multiway Methods

François Husson & Julie Josse
Applied mathematics department, Agrocampus Rennes

useR! 2010, July 20, 2010

Why a tutorial on Multivariate Data Analysis?

Our research focus is principal component methods
We teach multivariate data analysis
We have developed R packages:

FactoMineR to perform principal component methods
  PCA, correspondence analysis (CA), multiple correspondence analysis (MCA), multiple factor analysis (MFA)
  complementarity between clustering and principal component methods

missMDA to handle missing values in and with multivariate data analysis
  perform principal component methods (PCA, MCA) with missing values
  simple and multiple imputation based on principal component models for continuous and categorical data

Outline

Multivariate data analysis with a special focus on clustering and multiway methods
1. Principal Component Analysis (PCA)
2. Multiple Factor Analysis (MFA)
3. Complementarity between Clustering and Principal Component methods

Multidimensional descriptive methods
Graphical representations

Principal Component Analysis


1. Data - Issues - Preprocessing
2. Individuals Study
3. Variables Study
4. Helps to Interpret

Principal Component Analysis

Dimensionality reduction: describe the dataset with a smaller number of variables
Technique widely used for applications such as data compression, data reconstruction, preprocessing before clustering, and ...
Descriptive method

PCA deals with which kind of data?

PCA deals with continuous variables, but categorical variables can also be included in the analysis
Many examples:
  Sensory analysis: products - descriptors
  Ecology: plants - measurements; waters - physico-chemical analyses
  Economy: countries - economic indicators
  Microbiology: cheeses - microbiological analyses
  etc.

Figure: Data table in PCA

Wine data

10 individuals (rows): white wines from Val de Loire
30 variables (columns):
  27 continuous variables: sensory descriptors
  2 continuous variables: odour and overall preferences
  1 categorical variable: label of the wines (Vouvray - Sauvignon)

Table: extract of the wine data

                  O.fruity  O.passion  O.citrus  Sweetness  Acidity  Bitterness  Astringency  Aroma.intensity  Aroma.persistency  Visual.intensity  Odor.preferene  Overall.preference  Label
S Michaud              4.3        2.4       5.7        3.5      5.9         4.1          1.4              7.1                6.7               5.0             6.0                 5.0  Sauvignon
S Renaudie             4.4        3.1       5.3        3.3      6.8         3.8          2.3              7.2                6.6               3.4             5.4                 5.5  Sauvignon
S Trotignon            5.1        4.0       5.3        3.0      6.1         4.1          2.4              6.1                6.1               3.0             5.0                 5.5  Sauvignon
S Buisse Domaine       4.3        2.4       3.6        3.9      5.6         2.5          3.0              4.9                5.1               4.1             5.3                 4.6  Sauvignon
S Buisse Cristal       5.6        3.1       3.5        3.4      6.6         5.0          3.1              6.1                5.1               3.6             6.1                 5.0  Sauvignon
V Aub Silex            3.9        0.7       3.3        7.9      4.4         3.0          2.4              5.9                5.6               4.0             5.0                 5.5  Vouvray
V Aub Marigny          2.1        0.7       1.0        3.5      6.4         5.0          4.0              6.3                6.7               6.0             5.1                 4.1  Vouvray
V Font Domaine         5.1        0.5       2.5        3.0      5.7         4.0          2.5              6.7                6.3               6.4             4.4                 5.1  Vouvray
V Font Brûlés          5.1        0.8       3.8        3.9      5.4         4.0          3.1              7.0                6.1               7.4             4.4                 6.4  Vouvray
V Font Coteaux         4.1        0.9       2.7        3.8      5.1         4.3          4.3              7.3                6.6               6.3             6.0                 5.7  Vouvray

Problems - objectives

Individuals study: similarity between individuals with respect to all the variables; partition between individuals
Variables study: linear relationships between variables; visualization of the correlation matrix (denoted $S$); find synthetic variables
Link between the two studies: characterization of the groups of individuals by the variables; specific individuals to better understand links between variables

Two clouds of points

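Figure: the data table (I rows, K columns) seen in two ways. Individuals study: a cloud of I individuals (ind 1, ..., ind i) in $\mathbb{R}^K$. Variables study: a cloud of K variables (var 1, ..., var k) in $\mathbb{R}^I$.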

Preprocessing

Similarity between individuals: Euclidean distance

Choosing active variables

$$d^2(i, i') = \sum_{k=1}^{K} (x_{ik} - x_{i'k})^2$$

Variables are always centred

$$d^2(i, i') = \sum_{k=1}^{K} \left( (x_{ik} - \bar{x}_k) - (x_{i'k} - \bar{x}_k) \right)^2$$

Standardizing variables or not?

$$d^2(i, i') = \sum_{k=1}^{K} \frac{1}{s_k^2} (x_{ik} - x_{i'k})^2$$
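The three distances above can be checked numerically. The sketch below is only an illustration: it uses the built-in `iris` measurements as stand-in data (not the wine data) and R's default variance with denominator I - 1.

```r
# Stand-in data: the four numeric columns of the built-in iris data set
X <- as.matrix(iris[, 1:4])

# Squared Euclidean distance between individuals 1 and 2 on the raw data
d2_raw <- sum((X[1, ] - X[2, ])^2)

# Centring subtracts the same mean from both individuals: distance unchanged
Xc <- scale(X, center = TRUE, scale = FALSE)
d2_centred <- sum((Xc[1, ] - Xc[2, ])^2)

# Standardizing divides each squared difference by the variance s_k^2
s2 <- apply(X, 2, var)
d2_scaled <- sum((X[1, ] - X[2, ])^2 / s2)

# The same value is obtained from the standardized table
Xs <- scale(X, center = TRUE, scale = TRUE)
d2_scaled_check <- sum((Xs[1, ] - Xs[2, ])^2)
```

Centring leaves the distances untouched, whereas standardizing changes them: this is why scaling is a modelling choice and centring is not.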

Individuals cloud

Study the structure, i.e. the shape of the cloud of individuals
Individuals are in $\mathbb{R}^K$

Fit the individuals cloud

Find the subspace that best summarizes the data

Figure: Camel vs dromedary?

Closest representation by projection
Best representation of the diversity, variability

Fit the individuals cloud

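Figure: projection of an individual $x_{i.}$ onto the axis $u_1$, with coordinate $F_{i1}$

$$P_{u_1}(x_{i.}) = u_1 (u_1' u_1)^{-1} u_1' x_{i.} = \langle x_{i.}, u_1 \rangle \, u_1, \qquad F_{i1} = \langle x_{i.}, u_1 \rangle$$

Minimize the distance between individuals and their projections
Maximize the variance of the projected data

$$u_1 = \arg\max_{u_1 \in \mathbb{R}^K} \mathrm{var}(F_{.1}) = \arg\max_{u_1 \in \mathbb{R}^K} \mathrm{var}(X u_1) \quad \text{with } u_1' u_1 = 1$$

$u_1$ is the first eigenvector of the correlation matrix, associated with the largest eigenvalue $\lambda_1$: $S u_1 = \lambda_1 u_1$

$$\mathrm{var}(F_{.1}) = \mathrm{var}(X u_1) = \frac{1}{I} u_1' X' X u_1 = u_1' S u_1 = \lambda_1 u_1' u_1 = \lambda_1$$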
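A minimal numerical check of this result, assuming the built-in `iris` measurements as stand-in data. R's `scale()`, `cor()` and `var()` all use the denominator I - 1, so the identity var(F.1) = λ1 holds in that convention as well.

```r
# Centred and standardized stand-in data
X <- scale(as.matrix(iris[, 1:4]))
S <- cor(iris[, 1:4])                      # correlation matrix

e <- eigen(S, symmetric = TRUE)
u1 <- e$vectors[, 1]                       # first eigenvector, unit norm
lambda1 <- e$values[1]                     # largest eigenvalue

# Coordinates of the individuals on the first axis and their variance
F1 <- as.vector(X %*% u1)
var_F1 <- var(F1)

# Any other unit direction gives a projected variance no larger than lambda1
set.seed(1)
u <- rnorm(ncol(X))
u <- u / sqrt(sum(u^2))
var_random <- var(as.vector(X %*% u))
```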

Fit the individuals cloud

Additional axes are sequentially defined: each new direction maximizes the projected variance among all orthogonal directions
Q eigenvectors $u_1, \dots, u_Q$ associated with $\lambda_1, \dots, \lambda_Q$

Representation quality: dimensionality reduction means losing information

Total variance of the initial individuals cloud (total inertia):

$$\frac{1}{I} \sum_{i=1}^{I} \| x_{i.} - g \|^2 = \mathrm{tr}(S) = \sum_{k=1}^{K} \lambda_k \ (= K)$$

Variance of the projected individuals cloud (Q-dimensional representation):

$$\mathrm{var}(F_{.1}) + \mathrm{var}(F_{.2}) + \dots + \mathrm{var}(F_{.Q})$$

Percentage of variance explained:

$$\frac{\sum_{k=1}^{Q} \lambda_k}{\sum_{k=1}^{K} \lambda_k}$$
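As an illustration, the percentage of variance explained can be computed directly from the eigenvalues of the correlation matrix. The data below (built-in `iris` measurements) are a stand-in, not the wine data.

```r
# Eigenvalues of the correlation matrix of the stand-in data
S <- cor(iris[, 1:4])
lambda <- eigen(S, symmetric = TRUE, only.values = TRUE)$values

# Total inertia: trace of S, i.e. the number of standardized variables K
total_inertia <- sum(lambda)

# Percentage of variance per dimension and for the first Q dimensions
pct <- 100 * lambda / total_inertia
cum_pct <- cumsum(pct)

# Percentage explained by a two-dimensional representation (Q = 2)
pct_plane <- cum_pct[2]
```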

Example: wine data

Sensory descriptors are used as active variables: only these variables are used to construct the axes
Variables are (centred and) standardized

Example: graph of the individuals

Figure: individuals factor map, Dim 1 (43.48%) against Dim 2 (25.14%), with each wine labelled S (Sauvignon) or V (Vouvray)

Need variables to interpret the dimensions of variability


Individuals coordinates considered as variables

Figure: the individuals factor map (axes $u_1$ and $u_2$) next to the data table; the coordinates $F_{i1}$ and $F_{i2}$ of each individual are added to its row $x_{ik}$ as two new columns $F_{.1}$ and $F_{.2}$

Interpretation of the individuals graph with the variables

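Correlation between variable $x_{.k}$ and $F_{.1}$ (and $F_{.2}$)

Figure: correlation circle; a variable $x_{.k}$ (e.g. O.vanilla) is plotted at coordinates $(r(F_{.1}, x_{.k}),\ r(F_{.2}, x_{.k}))$, each between -1 and 1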

Interpretation of the individuals graph with the variables

Figure: variables factor map (correlation circle), Dim 1 (43.48%) against Dim 2 (25.14%)

Variables shown: Odor.Intensity.before.shaking, Expression, Odor.Intensity.after.shaking, Attack.intensity, Aroma.persistency, Bitterness, Aroma.intensity, Acidity, O.wooded, O.vanilla, O.alcohol, Astringency, Freshness, Visual.intensity, O.passion, Grade, O.citrus, Surface.feeling, O.flower, O.mushroom, Smoothness, O.plante, O.candied.fruit, Oxidation, O.fruity, Typicity, Sweetness

Cloud of variables

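Since variables are centred:

$$\cos(\theta_{kl}) = \frac{\langle x_{.k}, x_{.l} \rangle}{\| x_{.k} \| \, \| x_{.l} \|} = \frac{\sum_{i=1}^{I} x_{ik} x_{il}}{\sqrt{\left( \sum_{i=1}^{I} x_{ik}^2 \right) \left( \sum_{i=1}^{I} x_{il}^2 \right)}} = r(x_{.k}, x_{.l})$$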
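The identity between the cosine of the angle and the correlation coefficient is easy to verify. A short sketch, assuming two columns of the built-in `iris` data as stand-in variables:

```r
# Two stand-in variables
x <- iris$Sepal.Length
y <- iris$Petal.Length

# Centre both variables
xc <- x - mean(x)
yc <- y - mean(y)

# Cosine of the angle between the two centred vectors in R^I
cos_theta <- sum(xc * yc) / (sqrt(sum(xc^2)) * sqrt(sum(yc^2)))

# Pearson correlation coefficient
r <- cor(x, y)
```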

Fit the variables cloud

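Find $v_1$ (in $\mathbb{R}^I$, with $v_1' v_1 = 1$) which best fits the cloud

Figure: projection of a variable $x_{.k}$ onto $v_1$, with coordinate $G_{k1}$

$$P_{v_1}(x_{.k}) = v_1 (v_1' v_1)^{-1} v_1' x_{.k}, \qquad G_{k1} = \frac{1}{I} \langle v_1, x_{.k} \rangle = \frac{1}{I} v_1' x_{.k}$$

$$\arg\max_{v_1 \in \mathbb{R}^I} \sum_{k=1}^{K} G_{k1}^2 = \arg\max_{v_1 \in \mathbb{R}^I} \sum_{k=1}^{K} r(v_1, x_{.k})^2$$

$v_1$ is the best synthetic variable
$v_1, \dots, v_Q$ are the eigenvectors of $W = XX'$ (the inner product matrix) associated with the largest eigenvalues: $W v_q = \lambda_q v_q$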

Fit the variables cloud

Figure: variables factor map obtained by fitting the variables cloud, Dim 1 (43.48%) against Dim 2 (25.14%)

Same representation! What a wonderful result!


Projections...
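$r(A, B) = \cos(\theta_{A,B})$
$\cos(\theta_{A,B}) \approx \cos(\theta_{H_A,H_B})$ if variables are well projected

Figure: variables (such as A, D and E) and their projections $H_A$, $H_B$, $H_C$, $H_D$, $H_E$ on the plane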

Only well projected variables can be interpreted!


Link between the two representations: transition formulae


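Figure: the two factor maps and the data table $x_{ik}$. The coordinates of the individuals on $u_1$, $u_2$ ($F_{i1}$, $F_{i2}$) form the new columns $F_{.1}$, $F_{.2}$; the coordinates of the variables on $v_1$, $v_2$ ($G_{k1}$, $G_{k2}$) form the new rows $G_{.1}$, $G_{.2}$.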

Link between the two representations: transition formulae

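$$S u = X' X u = \lambda u$$
$$X X' X u = X \lambda u$$
$$W (X u) = \lambda (X u)$$
$$W F = \lambda F$$

and since $W v = \lambda v$, then $F$ and $v$ are collinear

Since $\| F \| = \sqrt{\lambda}$ and $\| v \| = 1$ we have:

$$v = \frac{1}{\sqrt{\lambda}} F, \qquad u = \frac{1}{\sqrt{\lambda}} G$$

$$G = X' v = \frac{1}{\sqrt{\lambda}} X' F, \qquad F = X u = \frac{1}{\sqrt{\lambda}} X G$$

$$F_{iq} = \frac{1}{\sqrt{\lambda_q}} \sum_{k=1}^{K} x_{ik} G_{kq}, \qquad G_{kq} = \frac{1}{\sqrt{\lambda_q}} \sum_{i=1}^{I} x_{ik} F_{iq}$$

$F_{.q}$: principal components, scores
$G_{.q}$: correlations between variables and principal components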
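The transition formulae can be verified numerically. The sketch below makes two assumptions: the built-in `iris` measurements stand in for the data, and the 1/I weights, which the slides leave implicit, are written out (variables standardized with the 1/I variance, and the sum over individuals divided by I).

```r
Xraw <- as.matrix(iris[, 1:4])
I <- nrow(Xraw)

# Standardize with the 1/I variance
m <- colMeans(Xraw)
s <- sqrt(colMeans(sweep(Xraw, 2, m)^2))
X <- sweep(sweep(Xraw, 2, m), 2, s, "/")

# Eigen decomposition of the correlation matrix S = X'X / I
S <- crossprod(X) / I
e <- eigen(S, symmetric = TRUE)
lambda <- e$values
U <- e$vectors

Fmat <- X %*% U            # principal components (scores of the individuals)
Gmat <- cor(Xraw, Fmat)    # correlations between variables and components

# Transition formulae: each set of coordinates rebuilt from the other one
F_from_G <- sweep(X %*% Gmat, 2, sqrt(lambda), "/")
G_from_F <- sweep(crossprod(X, Fmat) / I, 2, sqrt(lambda), "/")
```

Both reconstructions match the directly computed coordinates, which is what allows the individuals and variables graphs to be read together.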


Link between the two representations: transition formulae

$$F_{iq} = \frac{1}{\sqrt{\lambda_q}} \sum_{k=1}^{K} x_{ik} G_{kq}, \qquad G_{kq} = \frac{1}{\sqrt{\lambda_q}} \sum_{i=1}^{I} x_{ik} F_{iq}$$

What does it mean? An individual is on the same side as the variables for which it takes high values

Figure: individuals factor map and variables factor map side by side, Dim 1 (43.48%) against Dim 2 (25.14%)

Supplementary information

For the continuous variables: projection of supplementary variables on the dimensions
For the individuals: projection
For the categories: projection at the barycentre of the individuals who take the categories

Figure: individuals factor map with the supplementary categories Sauvignon and Vouvray placed at the barycentres of their wines, and variables factor map with the supplementary variables Odor.preferene and Overall.preference added, Dim 1 (43.48%) against Dim 2 (25.14%)

Supplementary information does not contribute to the construction of the dimensions
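A rough sketch of how supplementary elements are positioned, using base R rather than FactoMineR. The choice of data is an assumption made for illustration only: three `iris` measurements are active, `Petal.Width` plays the role of a supplementary continuous variable and `Species` that of a supplementary categorical variable.

```r
# PCA on the active variables only
active <- scale(as.matrix(iris[, 1:3]))
pc <- prcomp(active)
coord <- pc$x[, 1:2]                      # individuals on the first two dimensions

# Supplementary continuous variable: its correlations with the dimensions
sup_cor <- cor(iris$Petal.Width, coord)

# Supplementary categories: barycentre of the individuals taking each category
bary <- apply(coord, 2, function(f) tapply(f, iris$Species, mean))
```

Neither `Petal.Width` nor `Species` enters `prcomp()`, so the dimensions are the same with or without them.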

Choosing the number of components


Bar plot, test on eigenvalues, confidence interval, cross-validation (functions estim_ncpPCA and estim_ncp), etc.

Figure: bar plot of the eigenvalues

Two objectives:
  Interpretation
  Separate structure and noise

Figure: PCA splits the data ($x_{.1}, \dots, x_{.k}, \dots, x_{.K}$) into structure (components $F_1, \dots, F_Q$) and noise (remaining components, up to $F_K$)

Percentage of variance obtained under independence

Is there structure in my data?

Rows: number of variables. Columns: nbind (number of individuals).

        5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20    25    30    35    40    45    50   100
  4  96.5  93.3  90.5  88.1  86.1  84.5  82.8  81.5  80.0  79.0  78.1  77.3  76.5  75.5  75.1  74.1  72.0  69.8  68.5  67.5  66.4  65.6  60.9
  5  93.1  88.6  84.9  82.3  79.5  77.5  75.7  74.0  72.5  71.5  70.3  69.4  68.4  67.6  67.0  66.1  63.3  61.1  59.6  58.3  57.1  56.3  51.4
  6  90.2  84.8  80.9  77.2  74.8  72.3  70.3  68.6  67.2  65.7  64.6  63.5  62.6  61.8  60.9  60.1  57.1  55.1  53.3  52.0  50.8  49.9  44.9
  7  87.6  81.5  77.4  73.8  70.7  68.2  66.3  64.4  62.9  61.5  60.3  59.2  58.2  57.1  56.5  55.6  52.5  50.3  48.6  47.3  46.1  45.2  40.0
  8  85.5  79.1  74.4  70.7  67.4  65.0  62.9  61.2  59.4  58.1  57.0  55.6  54.7  53.7  52.8  52.1  48.9  46.7  44.9  43.4  42.4  41.4  36.3
  9  83.4  76.9  72.0  68.2  65.1  62.4  60.1  58.3  56.7  55.1  53.9  52.9  51.8  50.8  49.9  49.1  46.0  43.6  41.9  40.5  39.3  38.4  33.3
 10  81.9  75.1  70.1  66.1  62.9  60.1  58.0  55.8  54.4  52.8  51.5  50.3  49.3  48.4  47.4  46.6  43.4  41.1  39.5  38.0  36.9  35.9  31.0
 11  80.7  73.2  68.3  64.0  61.1  58.3  56.0  54.0  52.2  50.8  49.4  48.3  47.1  46.3  45.5  44.7  41.4  39.1  37.4  36.0  34.8  33.9  28.9
 12  79.4  72.2  67.0  62.8  59.4  56.5  54.4  52.4  50.5  49.0  47.8  46.6  45.5  44.6  43.7  42.9  39.6  37.3  35.6  34.1  33.1  32.1  27.2
 13  78.1  70.8  65.3  61.2  57.9  55.1  52.7  50.9  48.9  47.5  46.1  45.2  44.0  43.0  42.1  41.3  38.1  35.7  34.0  32.7  31.5  30.5  25.8
 14  77.4  69.8  64.3  60.0  56.5  53.7  51.3  49.3  47.7  46.2  44.9  43.6  42.6  41.6  40.7  39.8  36.7  34.4  32.7  31.3  30.2  29.2  24.5
 15  76.6  68.7  63.2  59.0  55.4  52.5  50.1  48.2  46.6  45.0  43.6  42.4  41.4  40.4  39.6  38.7  35.5  33.2  31.6  30.1  29.0  28.1  23.3
 16  75.5  68.0  62.2  58.0  54.3  51.5  49.2  47.2  45.4  44.0  42.5  41.4  40.3  39.3  38.4  37.5  34.5  32.1  30.4  29.1  27.9  27.0  22.3

Table: 95% quantile of the percentage of inertia on the first two dimensions, from 10000 PCAs on data with independent variables

Percentage of variance obtained under independence


Rows: number of variables. Columns: nbind (number of individuals).

        5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20    25    30    35    40    45    50   100
 17  74.9  67.0  61.3  57.0  53.6  50.6  48.1  46.2  44.4  42.9  41.6  40.4  39.4  38.3  37.4  36.7  33.5  31.2  29.5  28.1  27.0  26.1  21.5
 18  74.2  66.3  60.7  56.2  52.5  49.8  47.2  45.2  43.4  42.0  40.7  39.5  38.5  37.4  36.5  35.8  32.5  30.3  28.6  27.3  26.1  25.3  20.7
 19  73.5  65.6  59.7  55.4  51.8  49.0  46.5  44.4  42.8  41.3  39.8  38.7  37.6  36.7  35.8  34.9  31.8  29.5  27.9  26.5  25.4  24.6  19.9
 20  72.8  64.9  59.1  54.5  51.2  48.3  45.8  43.8  41.9  40.4  39.1  37.9  36.9  35.8  34.9  34.2  31.1  28.8  27.1  25.8  24.7  23.8  19.3
 25  70.7  62.3  56.4  51.8  48.1  45.2  42.8  40.7  39.0  37.4  36.2  35.0  33.8  32.9  32.0  31.3  28.1  26.0  24.3  23.0  21.9  21.1  16.7
 30  68.8  60.4  54.3  49.7  45.9  42.9  40.6  38.5  36.8  35.2  34.0  32.8  31.7  30.7  29.9  29.1  26.0  23.9  22.2  21.0  20.0  19.1  14.9
 35  67.4  58.9  52.6  47.8  44.4  41.4  39.0  36.9  35.1  33.6  32.4  31.1  30.1  29.1  28.3  27.5  24.5  22.3  20.7  19.5  18.5  17.7  13.6
 40  66.4  57.6  51.4  46.7  42.9  40.1  37.7  35.5  33.9  32.3  31.1  29.8  28.8  27.8  27.0  26.2  23.3  21.1  19.6  18.4  17.4  16.6  12.5
 50  64.7  55.8  49.5  44.6  41.0  38.0  35.6  33.5  31.8  30.4  29.0  27.9  26.8  25.9  25.1  24.3  21.4  19.3  17.8  16.6  15.7  14.9  11.0
 75  62.0  52.9  46.4  41.6  38.0  35.0  32.6  30.5  28.8  27.4  26.0  24.9  23.9  22.9  22.2  21.4  18.6  16.6  15.2  14.1  13.2  12.5   8.9
100  60.5  51.0  44.6  39.8  36.1  33.2  30.8  28.8  27.1  25.7  24.3  23.2  22.2  21.3  20.5  19.8  17.0  15.1  13.7  12.7  11.8  11.1   7.7
150  58.5  49.0  42.4  37.6  34.0  31.0  28.7  26.7  25.0  23.6  22.4  21.2  20.3  19.4  18.6  18.0  15.2  13.4  12.1  11.1  10.3   9.6   6.4
200  57.4  47.8  41.2  36.4  32.7  29.8  27.5  25.5  23.9  22.4  21.2  20.1  19.2  18.3  17.5  16.9  14.2  12.5  11.1  10.2   9.4   8.7   5.7

Table: 95% quantile of the percentage of inertia on the first two dimensions, from 10000 PCAs on data with independent variables
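One entry of such a table can be approximated by simulation. This is a sketch under assumptions that the slides do not state: the independent variables are drawn from a standard normal distribution, and only 1000 replications are used instead of 10000 to keep it fast. The helper name `inertia_two_dims` is hypothetical.

```r
set.seed(1)

# Percentage of inertia on the first two dimensions of a standardized PCA
# run on n_ind individuals and n_var independent variables
inertia_two_dims <- function(n_ind, n_var) {
  X <- matrix(rnorm(n_ind * n_var), nrow = n_ind)
  eig <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values
  100 * sum(eig[1:2]) / sum(eig)
}

# Case close to the wine data: 10 individuals, 25 variables
sims <- replicate(1000, inertia_two_dims(n_ind = 10, n_var = 25))
q95 <- quantile(sims, 0.95)
```

A PCA whose first plane explains clearly more than `q95` percent of the inertia is unlikely to come from independent variables.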

Quality of the representation: $\cos^2$

For the variables: only well projected variables (high $\cos^2$ between the variable and its projection) can be interpreted!

> round(res.pca$var$cos2,2)
                              Dim.1 Dim.2
Odor.Intensity.before.shaking  0.01  0.94
Odor.Intensity.after.shaking   0.01  0.89
Expression                     0.11  0.71

For the individuals: (same idea) distance between individuals can only be interpreted for well projected individuals

> round(res.pca$ind$cos2,2)
            Dim.1 Dim.2
S Michaud    0.62  0.07
S Renaudie   0.73  0.15
S Trotignon  0.78  0.07
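For readers who want to see where these values come from, here is a base R sketch of the cos² computation. It is not FactoMineR's code, and the built-in `iris` measurements are used as stand-in data.

```r
X <- scale(as.matrix(iris[, 1:4]))
pc <- prcomp(X)
scores <- pc$x                              # coordinates of the individuals

# Individuals: squared coordinate divided by the squared distance to the origin
ind_cos2 <- scores^2 / rowSums(scores^2)

# Variables: squared correlation between the variable and the component
var_cos2 <- cor(X, scores)^2

# Quality of representation on the first plane
ind_cos2_plane <- rowSums(ind_cos2[, 1:2])
var_cos2_plane <- rowSums(var_cos2[, 1:2])
```

Summed over all dimensions, the cos² of an individual or of a standardized variable equals 1, so values close to 1 on the plane indicate a well projected element.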

Contribution

Contribution to the construction of the dimension (percentage of variability):

for each individual:

$$\mathrm{Ctr}_q(i) = \frac{F_{iq}^2}{\sum_{i=1}^{I} F_{iq}^2} = \frac{F_{iq}^2}{I \lambda_q}$$

Individuals with a large coordinate contribute the most

> round(res.pca$ind$contrib,2)
            Dim.1 Dim.2
S Michaud   15.49  3.10
S Renaudie  15.56  5.56
S Trotignon 15.46  2.43

for each variable:

$$\mathrm{Ctr}_q(k) = \frac{G_{kq}^2}{\lambda_q} = \frac{r(x_{.k}, v_q)^2}{\lambda_q}$$

Variables highly correlated with the principal component contribute the most
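The contributions can be recomputed by hand. The sketch below uses base R on the built-in `iris` measurements (stand-in data) and expresses the results in percent, as in the output above. The ratio for individuals does not depend on whether variances use I or I - 1.

```r
X <- scale(as.matrix(iris[, 1:4]))
pc <- prcomp(X)
scores <- pc$x              # coordinates of the individuals
lambda <- pc$sdev^2         # eigenvalues of the correlation matrix

# Individuals: share of each squared coordinate in the column total
ind_contrib <- 100 * sweep(scores^2, 2, colSums(scores^2), "/")

# Variables: squared correlation with the component divided by the eigenvalue
var_contrib <- 100 * sweep(cor(X, scores)^2, 2, lambda, "/")
```

Within each dimension the contributions sum to 100, which makes them directly comparable across individuals or across variables.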

Description of the dimensions

By the continuous variables:
  correlation between each variable and the principal component of rank q is calculated
  correlation coefficients are sorted and significant ones are given

> dimdesc(res.pca)
$Dim.1$quanti
                  corr p.value
O.candied.fruit   0.93 9.5e-05
Grade             0.93 1.2e-04
Surface.feeling   0.89 5.5e-04
Typicity          0.86 1.4e-03
O.mushroom        0.84 2.3e-03
Visual.intensity  0.83 3.1e-03
...                ...     ...
O.plante         -0.87 1.0e-03
O.flower         -0.89 4.9e-04
O.passion        -0.90 4.5e-04
Freshness        -0.91 2.9e-04

$Dim.2$quanti
                               corr p.value
Odor.Intensity.before.shaking  0.97 3.1e-06
Odor.Intensity.after.shaking   0.95 3.6e-05
Attack.intensity               0.85 1.7e-03
Expression                     0.84 2.2e-03
Aroma.persistency              0.75 1.3e-02
Bitterness                     0.71 2.3e-02
Aroma.intensity                0.66 4.0e-02
Sweetness                     -0.78 8.0e-03
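The idea behind this description can be sketched in base R: test the correlation of each variable with a component, keep the significant ones and sort them. This is an illustration on the built-in `iris` measurements, not the `dimdesc` implementation, and the 5% threshold is an assumption.

```r
X <- iris[, 1:4]
pc <- prcomp(X, scale. = TRUE)
F1 <- pc$x[, 1]                               # first principal component

# Correlation test between each variable and the component
tests <- lapply(X, function(v) cor.test(v, F1))
res <- data.frame(
  corr    = sapply(tests, function(ct) unname(ct$estimate)),
  p.value = sapply(tests, function(ct) ct$p.value)
)

# Keep significant correlations and sort them in decreasing order
res <- res[res$p.value < 0.05, ]
res <- res[order(res$corr, decreasing = TRUE), ]
```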

Description of the dimensions

By the categorical variables: perform a one-way analysis of variance with the coordinates of the individuals ($F_{.q}$) explained by the categorical variable
  an F-test by variable
  for each category, a Student's t-test to compare the average of the category with the general mean

> dimdesc(res.pca)
$Dim.1$quali
         R2  p.value
Label 0.874 7.30e-05

$Dim.1$category
          Estimate  p.value
Vouvray      3.203 7.30e-05
Sauvignon   -3.203 7.30e-05
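A comparable analysis can be sketched with `lm()` in base R. The data are an assumption made for illustration: the first component of a PCA on the `iris` measurements, explained by `Species`, which plays the role of Label. Sum-to-zero contrasts give, for each category, the deviation from the mean of the category means.

```r
pc <- prcomp(iris[, 1:4], scale. = TRUE)
d <- data.frame(F1 = pc$x[, 1], Label = iris$Species)

# One-way analysis of variance: global F-test and R2
fit <- lm(F1 ~ Label, data = d)
R2 <- summary(fit)$r.squared
p_F <- anova(fit)[["Pr(>F)"]][1]

# Sum-to-zero contrasts: one estimate and one t-test per category
# (the last category is minus the sum of the others)
fit_sum <- lm(F1 ~ Label, data = d, contrasts = list(Label = "contr.sum"))
est <- coef(summary(fit_sum))
```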

Practice with R
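1. Choose active variables
2. Decide whether to scale the variables
3. Perform PCA
4. Choose the number of dimensions to interpret
5. Simultaneously interpret the individuals and variables graphs
6. Use indicators to enrich the interpretation

library(FactoMineR)
# Read the wine data; the first column of the file gives the row names
Expert <- read.table("http://factominer.free.fr/useR2010/Expert_wine.csv",
                     header=TRUE, sep=";", row.names=1)
# Standardized PCA; columns 29:30 and column 1 are supplementary
res.pca <- PCA(Expert, scale.unit=TRUE, quanti.sup=29:30, quali.sup=1)
res.pca                                   # list of the available results
x11()                                     # new graphics window (X11 only)
barplot(res.pca$eig[,1], main="Eigenvalues", names.arg=1:nrow(res.pca$eig))
plot.PCA(res.pca, habillage=1)            # individuals coloured by column 1
res.pca$ind$coord                         # coordinates of the individuals
res.pca$ind$cos2                          # quality of representation
res.pca$ind$contrib                       # contributions
plot.PCA(res.pca, axes=c(3,4), habillage=1)
dimdesc(res.pca)                          # description of the dimensions
write.infile(res.pca, file="my_FactoMineR_results.csv") # to export a list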

Practice with GUI


source("http://factominer.free.fr/install-facto.r")

Handling missing values: missMDA package

Obtain the principal components from observed data with an EM-type algorithm
Impute missing values with PCA using imputePCA function (tuning parameter: number of components)
Perform the usual PCA on the completed data set

library(FactoMineR)                          # provides PCA()
library(missMDA)
data(orange)
nb.dim <- estim_ncpPCA(orange, ncp.max=5)    # estimate the number of components
res.comp <- imputePCA(orange, ncp=2)         # impute with 2 components
res.pca <- PCA(res.comp$completeObs)         # PCA on the completed data set
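To give an idea of what an EM-type (iterative) PCA does, here is a deliberately simplified base R sketch. It is not the missMDA implementation: there is no scaling and no regularization, the function name `impute_pca_sketch` is hypothetical, and the data are the built-in `iris` measurements with 30 values removed at random.

```r
impute_pca_sketch <- function(X, ncp = 2, max_iter = 500, tol = 1e-8) {
  X <- as.matrix(X)
  miss <- is.na(X)
  # Initialization: replace each missing value by its column mean
  Xhat <- X
  Xhat[miss] <- colMeans(X, na.rm = TRUE)[col(X)[miss]]
  for (iter in seq_len(max_iter)) {
    # PCA of the completed table, reduced to ncp dimensions
    m <- colMeans(Xhat)
    sv <- svd(sweep(Xhat, 2, m), nu = ncp, nv = ncp)
    fitted <- sv$u %*% diag(sv$d[seq_len(ncp)], ncp, ncp) %*% t(sv$v)
    fitted <- sweep(fitted, 2, m, "+")
    # Update the missing cells only; observed cells are never modified
    Xnew <- Xhat
    Xnew[miss] <- fitted[miss]
    change <- sum((Xnew - Xhat)^2)
    Xhat <- Xnew
    if (change < tol) break
  }
  Xhat
}

set.seed(1)
X <- as.matrix(iris[, 1:4])
X_na <- X
X_na[sample(length(X), 30)] <- NA            # remove 30 cells at random
X_completed <- impute_pca_sketch(X_na, ncp = 2)
```

The completed table can then be passed to a standard PCA, which is the role of `res.comp$completeObs` in the missMDA code above.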

MCA: problems - objectives

Individuals study: similarity between individuals (for all the variables); partition between individuals
  Individuals are different if they don't take the same levels
Variables study: find some synthetic variables (continuous variables that sum up categorical variables); link between variables
Categories (levels) study:
  two levels of different variables are similar if individuals that take these levels are the same (ex: 65 years and retired)
  two levels are similar if individuals taking these levels behave the same way, i.e. they take the same levels for the other variables (ex: 60 years and 65 years)
Link between these studies: characterization of the groups of individuals by the levels (ex: executive dynamic women)

MCA: a PCA on an indicator matrix

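Binary coding of the factors: a factor with $K_j$ levels gives $K_j$ columns containing binary values, also called dummy variables

Figure: indicator matrix with the individuals in rows and one block of binary columns per variable (variable 1, ..., variable j, ..., variable J), e.g. 0 1 0 0 0 and 0 0 1 0

$$d^2(i, i') = \frac{I}{J} \sum_{j=1}^{J} \sum_{k=1}^{K_j} \frac{1}{I_k} (x_{ik} - x_{i'k})^2$$

with $I_k$ the number of individuals taking level $k$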
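The binary coding and the distance can be illustrated on a small example. The data frame below is hypothetical toy data (5 individuals, J = 2 categorical variables), and the helper names `indicator` and `mca_d2` are made up for this sketch.

```r
# Hypothetical toy data: 5 individuals, J = 2 categorical variables
d <- data.frame(
  tea   = factor(c("black", "green", "black", "Earl Grey", "green")),
  sugar = factor(c("sugar", "No.sugar", "sugar", "sugar", "No.sugar"))
)

# Binary coding of one factor: one dummy column per level
indicator <- function(f) {
  m <- outer(as.character(f), levels(f), "==") * 1
  colnames(m) <- levels(f)
  m
}

# Indicator matrix: the blocks of dummy columns side by side
Z <- do.call(cbind, lapply(d, indicator))

I  <- nrow(Z)          # number of individuals
J  <- ncol(d)          # number of categorical variables
Ik <- colSums(Z)       # number of individuals taking each level

# Squared distance between individuals i and ip
mca_d2 <- function(i, ip) (I / J) * sum((Z[i, ] - Z[ip, ])^2 / Ik)
```

Two individuals with the same levels everywhere (here individuals 1 and 3) are at distance zero, and differing on a rare level costs more than differing on a frequent one because of the 1/I_k weight.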

MCA: the superimposed representation

$$F_{iq} = \frac{1}{\sqrt{\lambda_q}} \sum_{k} \frac{x_{ik}}{J} G_{kq}, \qquad G_{kq} = \frac{1}{\sqrt{\lambda_q}} \sum_{i} \frac{x_{ik}}{I_k} F_{iq}$$

Individual i is at the barycentre of its levels
Level k is at the barycentre of the individuals who take this level

Figure: MCA factor map (Dim 2: 8.103%) with the individuals and the levels superimposed (e.g. tea shop, unpackaged, p_upscale, green, black, Earl Grey, tea bag, chain store, p_cheap, p_branded, p_private label, p_unknown)