Special focus on Clustering and Multiway Methods
François Husson & Julie Josse
Applied mathematics department, Agrocampus Rennes
Data - Issues
Individuals Study
Variables Study
Helps to Interpret
Our research focus is principal component methods
We teach multivariate data analysis
We have developed R packages, including FactoMineR and missMDA (both used in these slides)
Outline
Multivariate data analysis with a special focus on clustering and multiway methods
1. Principal Component Analysis (PCA)
2. Multiple Factor Analysis (MFA)
3. Complementarity between Clustering and Principal Component methods
Dimensionality reduction: describes the dataset with a smaller number of variables
Technique widely used for applications such as: data compression, data reconstruction, preprocessing before clustering, and ...
Descriptive methods
PCA deals with continuous variables, but categorical variables can also be included in the analysis
Many examples:
  Sensory analysis: products - descriptors
  Ecology: plants - measurements; waters - physico-chemical analyses
  Economy: countries - economic indicators
  Microbiology: cheeses - microbiological analyses
  etc.
Wine data
27 continuous variables: sensory descriptors
2 continuous variables: odour and overall preferences
1 categorical variable: label of the wines (Vouvray - Sauvignon)

Wine | O.fruity | O.passion | O.citrus | Sweetness | Acidity | Bitterness | Astringency | Aroma.intensity | Aroma.persistency | Visual.intensity | Odor.preferene | Overall.preference | Label
S Michaud | 4.3 | 2.4 | 5.7 | 3.5 | 5.9 | 4.1 | 1.4 | 7.1 | 6.7 | 5.0 | 6.0 | 5.0 | Sauvignon
S Renaudie | 4.4 | 3.1 | 5.3 | 3.3 | 6.8 | 3.8 | 2.3 | 7.2 | 6.6 | 3.4 | 5.4 | 5.5 | Sauvignon
S Trotignon | 5.1 | 4.0 | 5.3 | 3.0 | 6.1 | 4.1 | 2.4 | 6.1 | 6.1 | 3.0 | 5.0 | 5.5 | Sauvignon
S Buisse Domaine | 4.3 | 2.4 | 3.6 | 3.9 | 5.6 | 2.5 | 3.0 | 4.9 | 5.1 | 4.1 | 5.3 | 4.6 | Sauvignon
S Buisse Cristal | 5.6 | 3.1 | 3.5 | 3.4 | 6.6 | 5.0 | 3.1 | 6.1 | 5.1 | 3.6 | 6.1 | 5.0 | Sauvignon
V Aub Silex | 3.9 | 0.7 | 3.3 | 7.9 | 4.4 | 3.0 | 2.4 | 5.9 | 5.6 | 4.0 | 5.0 | 5.5 | Vouvray
V Aub Marigny | 2.1 | 0.7 | 1.0 | 3.5 | 6.4 | 5.0 | 4.0 | 6.3 | 6.7 | 6.0 | 5.1 | 4.1 | Vouvray
V Font Domaine | 5.1 | 0.5 | 2.5 | 3.0 | 5.7 | 4.0 | 2.5 | 6.7 | 6.3 | 6.4 | 4.4 | 5.1 | Vouvray
V Font Brûlés | 5.1 | 0.8 | 3.8 | 3.9 | 5.4 | 4.0 | 3.1 | 7.0 | 6.1 | 7.4 | 4.4 | 6.4 | Vouvray
V Font Coteaux | 4.1 | 0.9 | 2.7 | 3.8 | 5.1 | 4.3 | 4.3 | 7.3 | 6.6 | 6.3 | 6.0 | 5.7 | Vouvray
Problems - objectives
Individuals study: similarity between individuals with respect to all the variables; partition between individuals
Variables study: linear relationships between variables; visualization of the correlation matrix (denoted $S$); find synthetic variables
Link between the two studies: characterization of the groups of individuals by the variables; specific individuals to better understand links between variables
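Individuals study: the $I$ rows of the table (ind 1, ..., ind i, ...) form a cloud of $I$ points in $\mathbb{R}^K$
Variables study: the $K$ columns of the table (var 1, ..., var k, ...) form a cloud of $K$ points in $\mathbb{R}^I$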
Preprocessing
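Distance between two individuals $i$ and $i'$:
$d^2(i, i') = \sum_{k=1}^{K} (x_{ik} - x_{i'k})^2$
Centring the variables does not change the distance:
$d^2(i, i') = \sum_{k=1}^{K} \big((x_{ik} - \bar{x}_k) - (x_{i'k} - \bar{x}_k)\big)^2$
Standardizing gives the same weight to every variable ($s_k$: standard deviation of variable $k$):
$d^2(i, i') = \sum_{k=1}^{K} \frac{1}{s_k^2} (x_{ik} - x_{i'k})^2$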
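The effect of centring and standardizing on the distance between two individuals can be checked numerically. The sketch below is only an illustration: it uses the built-in USArrests table as a stand-in (an assumption, it is not the wine data), and the helper name d2 is hypothetical.

```r
X <- as.matrix(USArrests)          # stand-in data set (assumption)

# squared Euclidean distance between rows i and j of a matrix M
d2 <- function(M, i, j) sum((M[i, ] - M[j, ])^2)

Xc <- sweep(X, 2, colMeans(X))     # centred variables
s  <- sqrt(colMeans(Xc^2))         # standard deviations with weights 1/I
Xs <- sweep(Xc, 2, s, "/")         # centred and standardized variables

d2_raw      <- d2(X, 1, 2)
d2_centred  <- d2(Xc, 1, 2)        # centring leaves the distance unchanged
d2_std      <- d2(Xs, 1, 2)        # every variable now has the same weight
d2_weighted <- sum((X[1, ] - X[2, ])^2 / s^2)   # same value via 1 / s_k^2
```

Without standardization, the variable with the largest variance (here Assault) dominates the distance, which is the usual argument for scaling when units differ.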
Individuals cloud
Study the structure, i.e. the shape of the cloud of individuals
Individuals are in $\mathbb{R}^K$
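Each individual $x_{i.}$ is projected ($P$) onto the direction $u_1$; its coordinate on $u_1$ is $F_{i1}$
Minimize the distance between individuals and their projections, which is equivalent to: Maximize the variance of the projected data
$\max_{u_1} \mathrm{Var}(F_{.1})$ with $F_{.1} = X u_1$ and $u_1' u_1 = 1$
$u_1$: first eigenvector of the correlation matrix $S$, associated with the largest eigenvalue $\lambda_1$: $S u_1 = \lambda_1 u_1$
$\mathrm{Var}(F_{.1}) = u_1' X' X u_1 = u_1' S u_1 = \lambda_1 u_1' u_1 = \lambda_1$ (writing $S = X'X$, the weights $1/I$ of the individuals being absorbed in the standardized $X$)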
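A quick numerical check of this result, as a sketch under stated assumptions: USArrests stands in for the data, and the columns of Z are scaled so that Z'Z is exactly the correlation matrix (weights 1/I absorbed). The comparison with random unit directions is only an illustration, not a proof.

```r
X  <- as.matrix(USArrests)                     # stand-in data set (assumption)
Xc <- sweep(X, 2, colMeans(X))                 # centred variables
Z  <- sweep(Xc, 2, sqrt(colSums(Xc^2)), "/")   # columns of norm 1, so Z'Z = cor(X)

S  <- crossprod(Z)                             # correlation matrix
e  <- eigen(S, symmetric = TRUE)
u1 <- e$vectors[, 1]                           # first eigenvector, u1'u1 = 1
lambda1 <- e$values[1]                         # largest eigenvalue

F1     <- Z %*% u1                             # coordinates on the first axis
var_F1 <- sum(F1^2)                            # projected variance (weights absorbed in Z)

# projected variance along 200 random unit directions, for comparison
set.seed(1)
rand_var <- replicate(200, {
  u <- rnorm(ncol(Z))
  u <- u / sqrt(sum(u^2))
  sum((Z %*% u)^2)
})
```

Here var_F1 equals lambda1, and no random direction gives a larger projected variance.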
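Additional axes are sequentially defined: each new direction maximizes the projected variance among all orthogonal directions
$Q$ eigenvectors $u_1, ..., u_Q$ associated to $\lambda_1, ..., \lambda_Q$
Total variance of the cloud ($g$: centre of gravity): $\sum_{i=1}^{I} \|x_{i.} - g\|^2 = \mathrm{tr}(S) = \sum_{k=1}^{K} \lambda_k \; (= K)$
Variance of the projected individuals cloud ($Q$-dimensional representation): $\mathrm{var}(F_{.1}) + \mathrm{var}(F_{.2}) + ... + \mathrm{var}(F_{.Q})$
Percentage of variance explained: $\sum_{k=1}^{Q} \lambda_k \, / \, \sum_{k=1}^{K} \lambda_k$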
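The percentages printed on the axes of the factor maps are these ratios of eigenvalues. A minimal sketch, assuming a standardized PCA and using USArrests as a stand-in data set:

```r
S      <- cor(as.matrix(USArrests))          # correlation matrix (stand-in data)
lambda <- eigen(S, symmetric = TRUE)$values  # eigenvalues, in decreasing order
K      <- ncol(S)

total   <- sum(lambda)            # total variance = tr(S) = K for standardized data
pct     <- 100 * lambda / total   # percentage of variance of each dimension
cum_pct <- cumsum(pct)            # percentage kept by a Q-dimensional representation
```

With FactoMineR the same three quantities are the columns of res.pca$eig.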
Sensory descriptors are used as active variables: only these variables are used to construct the axes
Variables are (centred and) standardized
[Figure: individuals factor map, Dim 1 (43.48%) vs Dim 2 (25.14%); wines labelled S (Sauvignon) or V (Vouvray), e.g. V Font Brûlés, S Buisse Cristal, V Font Domaine, S Buisse Domaine, V Aub Silex]
[Figure: the rows $x_{ik}$ of the data table ($K$ columns) are projected onto $u_1$ and $u_2$; the coordinates $F_{i1}$, $F_{i2}$ (columns $F_{.1}$, $F_{.2}$) place each wine on the individuals factor map]
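Correlation circle
[Figure: each variable $x_{.k}$ (e.g. O.Vanilla) is plotted at coordinates $\big(r(F_{.1}, x_{.k}),\ r(F_{.2}, x_{.k})\big)$, with both axes running from -1 to 1]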
[Figure: variables factor map (correlation circle), Dim 1 (43.48%) vs Dim 2 (25.14%). Variables shown: Odor.Intensity.before.shaking, Expression, Odor.Intensity.after.shaking, Attack.intensity, Aroma.persistency, Bitterness, Aroma.intensity, Acidity, O.wooded, O.vanilla, O.alcohol, Astringency, Freshness, Visual.intensity, O.passion, Grade, O.citrus, Surface.feeling, O.flower, O.mushroom, Smoothness, O.plante, O.candied.fruit, Oxidation, O.fruity, Typicity, Sweetness]
Cloud of variables
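Variables are in $\mathbb{R}^I$; for centred variables, the cosine of the angle $\theta_{kl}$ between $x_{.k}$ and $x_{.l}$ is their correlation:
$\cos(\theta_{kl}) = \frac{\langle x_{.k}, x_{.l} \rangle}{\|x_{.k}\| \, \|x_{.l}\|} = \frac{\sum_{i=1}^{I} x_{ik} x_{il}}{\sqrt{\left(\sum_{i=1}^{I} x_{ik}^2\right)\left(\sum_{i=1}^{I} x_{il}^2\right)}} = r(x_{.k}, x_{.l})$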
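The identity between the cosine and the correlation coefficient is easy to verify once the variables are centred. A minimal sketch on two columns of the built-in USArrests table (a stand-in, not the wine data):

```r
x <- USArrests$Murder             # two stand-in variables (assumption)
y <- USArrests$Assault

xc <- x - mean(x)                 # centre both variables
yc <- y - mean(y)

# cosine of the angle between the two centred vectors in R^I
cos_theta <- sum(xc * yc) / sqrt(sum(xc^2) * sum(yc^2))
r_xy      <- cor(x, y)            # Pearson correlation
```

Both values coincide, which is why angles on the variables map can be read as correlations.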
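Each variable $x_{.k}$ is projected ($P$) onto the direction $v_1$; its coordinate on $v_1$ is $G_{k1}$
$v_1 = \arg\max_{v_1 \in \mathbb{R}^I} \sum_{k=1}^{K} r(v_1, x_{.k})^2$ is the best synthetic variable
$v_1, ..., v_Q$ are the eigenvectors of $W = XX'$, the inner product matrix, associated with the largest eigenvalues: $W v_q = \lambda_q v_q$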
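As a hedged check of this statement: the non-zero eigenvalues of the inner product matrix XX' are those of X'X, and the sum of squared correlations between the first eigenvector of XX' and the variables equals the first eigenvalue. The sketch assumes standardized variables with the weights 1/I absorbed, and uses USArrests as a stand-in.

```r
X  <- as.matrix(USArrests)                     # stand-in data set (assumption)
Xc <- sweep(X, 2, colMeans(X))
Z  <- sweep(Xc, 2, sqrt(colSums(Xc^2)), "/")   # Z'Z = correlation matrix
K  <- ncol(Z)

eS <- eigen(crossprod(Z),  symmetric = TRUE)   # S = Z'Z  (K x K)
eW <- eigen(tcrossprod(Z), symmetric = TRUE)   # W = ZZ'  (I x I)

v1     <- eW$vectors[, 1]                      # first eigenvector of W
r2_sum <- sum(cor(v1, Z)^2)                    # sum over k of r(v1, x.k)^2
```

Working with W is convenient when there are many more variables than individuals, since W is then the smaller matrix.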
[Figure: variables factor map (correlation circle) of the 27 sensory descriptors, Dim 1 (43.48%) vs Dim 2 (25.14%)]
Projections...
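$r(A, B) = \cos(\theta_{A,B})$
$\cos(\theta_{A,B}) \approx \cos(\theta_{H_A,H_B})$ if variables are well projected
[Figure: variables such as A, D, E and the projections $H_A$, $H_B$, $H_C$, $H_D$, $H_E$ on the plane]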
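[Figure: link between the two clouds. The rows $x_{ik}$ of the table ($I$ individuals, $K$ variables) projected on $u_1$, $u_2$ give the coordinates $F_{i1}$, $F_{i2}$ (columns $F_{.1}$, $F_{.2}$); the columns projected on $v_1$, $v_2$ give the coordinates $G_{k1}$, $G_{k2}$ (columns $G_{.1}$, $G_{.2}$)]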
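$S u = X'X u = \lambda u$
$\Rightarrow XX'(Xu) = \lambda (Xu)$, i.e. $W(Xu) = \lambda (Xu)$, so $WF = \lambda F$
and since $Wv = \lambda v$, $F$ and $v$ are collinear; as $\|F\|^2 = \lambda$: $v = \frac{1}{\sqrt{\lambda}} F$ and, in the same way, $u = \frac{1}{\sqrt{\lambda}} G$
$G = X'v = \frac{1}{\sqrt{\lambda}} X'F$ and $F = Xu = \frac{1}{\sqrt{\lambda}} XG$
Transition formulae:
$F_{iq} = \frac{1}{\sqrt{\lambda_q}} \sum_{k=1}^{K} x_{ik} G_{kq}$
$G_{kq} = \frac{1}{\sqrt{\lambda_q}} \sum_{i=1}^{I} x_{ik} F_{iq}$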
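The transition formulae can be verified numerically. This is a sketch under the same convention as the derivation (standardized variables, weights 1/I absorbed so that S = Z'Z), with USArrests as a stand-in data set; the object names Fmat, G, F_from_G and G_from_F are illustrative.

```r
X  <- as.matrix(USArrests)                     # stand-in data set (assumption)
Xc <- sweep(X, 2, colMeans(X))
Z  <- sweep(Xc, 2, sqrt(colSums(Xc^2)), "/")   # Z'Z = correlation matrix

e      <- eigen(crossprod(Z), symmetric = TRUE)
lambda <- e$values
U      <- e$vectors                            # u_1, ..., u_K in columns

Fmat <- Z %*% U                                # individuals: F = Z u
G    <- sweep(U, 2, sqrt(lambda), "*")         # variables:   G = sqrt(lambda) u

# transition formulae: each set of coordinates rebuilt from the other one
F_from_G <- sweep(Z %*% G,       2, sqrt(lambda), "/")
G_from_F <- sweep(t(Z) %*% Fmat, 2, sqrt(lambda), "/")

# the coordinates of the variables are their correlations with the components
R <- cor(Z, Fmat)
```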
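$F_{iq} = \frac{1}{\sqrt{\lambda_q}} \sum_{k=1}^{K} x_{ik} G_{kq}$
$G_{kq} = \frac{1}{\sqrt{\lambda_q}} \sum_{i=1}^{I} x_{ik} F_{iq}$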
What does it mean? An individual is on the same side as the variables for which it takes high values
[Figure: individuals factor map and variables factor map shown together for joint interpretation, Dim 1 (43.48%) vs Dim 2 (25.14%)]
Supplementary information
For the continuous variables: projection of supplementary variables on the dimensions
For the individuals: projection
For the categories: projection at the barycentre of the individuals who take the categories
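A minimal base-R sketch of how supplementary elements can be placed once the axes are built. The iris data are a stand-in (an assumption): the first three measurements are active, Petal.Width plays the role of a supplementary continuous variable and Species that of a supplementary categorical variable.

```r
X  <- as.matrix(iris[, 1:3])                   # active variables (stand-in data)
Xc <- sweep(X, 2, colMeans(X))
Zs <- sweep(Xc, 2, sqrt(colMeans(Xc^2)), "/")  # centred and standardized

e    <- eigen(cor(X), symmetric = TRUE)
Fmat <- Zs %*% e$vectors[, 1:2]                # coordinates of the individuals

# supplementary continuous variable: its correlations with the two dimensions
sup_coord <- cor(iris$Petal.Width, Fmat)

# supplementary categories: barycentre of the individuals taking each category
cat_coord <- apply(Fmat, 2, function(f) tapply(f, iris$Species, mean))
```

Neither Petal.Width nor Species enters the eigen decomposition, so they describe the axes without influencing them.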
[Figure: individuals factor map with the supplementary categories Sauvignon and Vouvray placed at the barycentre of their wines, and variables factor map with the supplementary variables Odor.preferene and Overall.preference added, Dim 1 (43.48%) vs Dim 2 (25.14%)]
Bar plot, test on eigenvalues, confidence interval, cross-validation (functions estim_ncpPCA and estim_ncp), etc.
[Figure: bar plot of the eigenvalues; diagram relating the variables $x_{.1}, ..., x_{.k}, ..., x_{.K}$ to the components $F_1, ..., F_Q, ..., F_K$]
Table: 95% quantile of the inertia on the first two dimensions of 10000 PCAs on data with independent variables
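Reference values of this kind can be approximated by simulation. The sketch below is an assumption-laden illustration: it draws independent standard normal variables (the distribution is not stated in the slides), uses 10 individuals and 27 variables to mirror the wine example, and runs only 1000 replicates to stay quick. The helper name inertia2 is hypothetical.

```r
set.seed(123)

# percentage of inertia on the first two dimensions of a standardized PCA
# performed on I individuals and K independent normal variables
inertia2 <- function(I, K) {
  X      <- matrix(rnorm(I * K), nrow = I, ncol = K)
  lambda <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values
  100 * sum(lambda[1:2]) / K
}

sims <- replicate(1000, inertia2(I = 10, K = 27))
q95  <- quantile(sims, 0.95)      # threshold exceeded by chance 5% of the time
```

An observed percentage on the first plane well above q95 suggests that the first two dimensions carry real structure.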
For the variables: only well projected variables (high $\cos^2$ between the variable and its projection) can be interpreted!
round(res.pca$var$cos2,2)
                              Dim.1 Dim.2
Odor.Intensity.before.shaking  0.01  0.94
Odor.Intensity.after.shaking   0.01  0.89
Expression                     0.11  0.71
For the individuals: (same idea) distance between individuals can only be interpreted for well projected individuals
round(res.pca$ind$cos2,2)
            Dim.1 Dim.2
S Michaud    0.62  0.07
S Renaudie   0.73  0.15
S Trotignon  0.78  0.07
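For readers who want to see where these numbers come from, here is a base-R sketch of the squared cosines for a standardized PCA. USArrests is a stand-in data set (an assumption), so the values differ from the wine output above.

```r
X  <- as.matrix(USArrests)                     # stand-in data set (assumption)
Xc <- sweep(X, 2, colMeans(X))
Zs <- sweep(Xc, 2, sqrt(colMeans(Xc^2)), "/")  # centred and standardized

e    <- eigen(cor(X), symmetric = TRUE)
Fmat <- Zs %*% e$vectors                       # coordinates on all K dimensions

# individuals: squared coordinate over squared distance to the centre of gravity
dist2    <- rowSums(Zs^2)
cos2_ind <- Fmat^2 / dist2                     # row i is divided by dist2[i]

# variables: squared correlation with each component
cos2_var <- cor(X, Fmat)^2
```

Summed over all dimensions, the squared cosines of an individual or of a variable equal 1; the quality on a plane is the sum over its two dimensions.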
Contribution
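Contribution of individual $i$ to dimension $q$:
$Ctr_q(i) = \frac{F_{iq}^2}{\sum_{i=1}^{I} F_{iq}^2} = \frac{F_{iq}^2}{\lambda_q}$
Contribution of variable $k$ to dimension $q$:
$Ctr_q(k) = \frac{G_{kq}^2}{\lambda_q} = \frac{r(x_{.k}, v_q)^2}{\lambda_q}$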
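A short sketch of these two contributions, assuming standardized variables with the weights 1/I absorbed (so that the sum of squared coordinates on a dimension equals its eigenvalue) and using USArrests as a stand-in data set:

```r
X  <- as.matrix(USArrests)                     # stand-in data set (assumption)
Xc <- sweep(X, 2, colMeans(X))
Z  <- sweep(Xc, 2, sqrt(colSums(Xc^2)), "/")   # Z'Z = correlation matrix

e      <- eigen(crossprod(Z), symmetric = TRUE)
lambda <- e$values
Fmat   <- Z %*% e$vectors                          # individuals
G      <- sweep(e$vectors, 2, sqrt(lambda), "*")   # variables

ctr_ind <- sweep(Fmat^2, 2, colSums(Fmat^2), "/")  # F_iq^2 / sum_i F_iq^2
ctr_var <- sweep(G^2,    2, lambda,          "/")  # G_kq^2 / lambda_q
```

Each column sums to 1; multiplying by 100 gives contributions as percentages.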
By the continuous variables:
  correlation between each variable and the principal component of rank q is calculated
  correlation coefficients are sorted and significant ones are given
> dimdesc(res.pca)
$Dim.1$quanti
                  corr p.value
O.candied.fruit   0.93 9.5e-05
Grade             0.93 1.2e-04
Surface.feeling   0.89 5.5e-04
Typicity          0.86 1.4e-03
O.mushroom        0.84 2.3e-03
Visual.intensity  0.83 3.1e-03
...                ...     ...
O.plante         -0.87 1.0e-03
O.flower         -0.89 4.9e-04
O.passion        -0.90 4.5e-04
Freshness        -0.91 2.9e-04
$Dim.2$quanti
                               corr p.value
Odor.Intensity.before.shaking  0.97 3.1e-06
Odor.Intensity.after.shaking   0.95 3.6e-05
Attack.intensity               0.85 1.7e-03
Expression                     0.84 2.2e-03
Aroma.persistency              0.75 1.3e-02
Bitterness                     0.71 2.3e-02
Aroma.intensity                0.66 4.0e-02
Sweetness                     -0.78 8.0e-03
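The same kind of table can be approximated in base R. This is a sketch of the idea rather than the package's code: it uses prcomp on the stand-in USArrests data, cor.test for the p-values, and an assumed 5% significance threshold.

```r
X  <- USArrests                                # stand-in data set (assumption)
pc <- prcomp(X, scale. = TRUE)                 # standardized PCA
F1 <- pc$x[, 1]                                # first principal component

# correlation of each variable with F1 and the associated p-value
res <- t(sapply(X, function(v) {
  ct <- cor.test(v, F1)
  c(corr = unname(ct$estimate), p.value = ct$p.value)
}))

res <- res[order(-res[, "corr"]), ]            # sort by decreasing correlation
sig <- res[res[, "p.value"] < 0.05, , drop = FALSE]   # keep significant ones
```

The sign of a principal component is arbitrary, so the whole column of correlations may be flipped from one run or one software to another.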
By the categorical variables:
  perform a one-way analysis of variance with the coordinates of the individuals ($F_{.q}$) explained by the categorical variable
  an F-test by variable
  for each category, a Student's t-test to compare the average of the category with the general mean
> dimdesc(res.pca)
$Dim.1$quali
         R2  p.value
Label 0.874 7.30e-05
$Dim.1$category
          Estimate  p.value
Vouvray      3.203 7.30e-05
Sauvignon   -3.203 7.30e-05
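The analysis of variance behind this output can be sketched in base R. The iris data are a stand-in (an assumption), with Species playing the role of the wine label; the estimates are computed as category mean minus general mean, which matches the balanced case only.

```r
pc <- prcomp(iris[, 1:4], scale. = TRUE)       # standardized PCA (stand-in data)
F1 <- pc$x[, 1]                                # coordinates on the first dimension

fit <- lm(F1 ~ Species, data = iris)           # one-way analysis of variance
R2  <- summary(fit)$r.squared                  # share of F1 explained by the factor
p_F <- anova(fit)[["Pr(>F)"]][1]               # p-value of the global F-test

# deviation of each category mean from the general mean (balanced groups)
estimate <- tapply(F1, iris$Species, mean) - mean(F1)
```

A positive estimate means that the individuals of this category have, on average, positive coordinates on the dimension.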
Practice with R
1. Choose active variables
2. Scale or not the variables
3. Perform PCA
4. Choose the number of dimensions to interpret
5. Simultaneously interpret the individuals and variables graphs
6. Use indicators to enrich the interpretation
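library(FactoMineR)
# wine data: column 1 is the label, columns 29-30 are the two preference scores
Expert <- read.table("http://factominer.free.fr/useR2010/Expert_wine.csv",
                     header=TRUE, sep=";", row.names=1)
# standardized PCA; preferences and label are supplementary
res.pca <- PCA(Expert, scale.unit=TRUE, quanti.sup=29:30, quali.sup=1)
res.pca                                        # list of the available results
x11()                                          # open a new graphics window
barplot(res.pca$eig[,1], main="Eigenvalues", names.arg=1:nrow(res.pca$eig))
plot.PCA(res.pca, habillage=1)                 # individuals coloured by label
res.pca$ind$coord                              # coordinates of the individuals
res.pca$ind$cos2                               # quality of representation
res.pca$ind$contrib                            # contributions
plot.PCA(res.pca, axes=c(3,4), habillage=1)    # dimensions 3 and 4
dimdesc(res.pca)                               # description of the dimensions
write.infile(res.pca, file="my_FactoMineR_results.csv")  # to export a list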
EM-type algorithm
Impute missing values with PCA using the imputePCA function (tuning parameter: number of components)
Perform the usual PCA on the completed data set
library(missMDA)
library(FactoMineR)                            # provides PCA()
data(orange)
nb.dim <- estim_ncpPCA(orange, ncp.max=5)      # estimate the number of components
res.comp <- imputePCA(orange, ncp=2)           # impute with 2 components
res.pca <- PCA(res.comp$completeObs)           # PCA on the completed data set
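To give an idea of what an EM-type PCA imputation does, here is a deliberately simplified base-R sketch. It is not the imputePCA implementation: it has no regularization and no scaling step, the function name impute_pca and its arguments are hypothetical, and iris with 30 artificially removed cells is a stand-in data set.

```r
impute_pca <- function(X, ncp = 2, maxit = 500, tol = 1e-8) {
  X    <- as.matrix(X)
  miss <- is.na(X)
  Xhat <- X
  # initialization: replace each missing cell by its column mean
  Xhat[miss] <- colMeans(X, na.rm = TRUE)[col(X)[miss]]
  k <- seq_len(ncp)
  for (it in seq_len(maxit)) {
    mu  <- colMeans(Xhat)
    sv  <- svd(sweep(Xhat, 2, mu))             # PCA of the completed table
    # rank-ncp reconstruction, then add the means back
    fit <- sv$u[, k, drop = FALSE] %*% diag(sv$d[k], ncp) %*%
           t(sv$v[, k, drop = FALSE])
    fit <- sweep(fit, 2, mu, "+")
    Xnew <- Xhat
    Xnew[miss] <- fit[miss]                    # update the missing cells only
    delta <- sum((Xnew - Xhat)^2)
    Xhat  <- Xnew
    if (delta < tol) break                     # stop when imputations stabilize
  }
  Xhat
}

set.seed(42)
X    <- as.matrix(iris[, 1:4])
X_na <- X
X_na[sample(length(X), 30)] <- NA              # remove 30 cells at random
X_imp <- impute_pca(X_na, ncp = 2)
```

Observed cells are never modified; only the missing cells are updated at each iteration, using the current low-rank fit.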
Individuals study: similarity between individuals (for all the variables); partition between individuals
  Individuals are different if they don't take the same levels
Variables study: find some synthetic variables (continuous variables that sum up categorical variables); link between variables; levels study
Categories study:
  two levels of different variables are similar if individuals that take these levels are the same (ex: 65 years and retired)
  two levels are similar if individuals taking these levels behave the same way, they take the same levels for the other variables (ex: 60 years and 65 years)
Link between these studies: characterization of the groups of individuals by the levels (ex: executive dynamic women)
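Indicator matrix: individuals in rows; each variable $j$ ($j = 1, ..., J$) is coded by $K_j$ columns of 0/1, one per level (e.g. 01000, 0010)
$d^2(i, i') = \frac{I}{J} \sum_{j=1}^{J} \sum_{k=1}^{K_j} \frac{1}{I_k} (x_{ik} - x_{i'k})^2$, where $I_k$ is the number of individuals taking level $k$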
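A small sketch of the indicator coding and of this distance. The four-row data frame is a toy example invented for illustration (its level names are borrowed from the tea survey), and the helper names indicator and d2_mca are hypothetical.

```r
# toy categorical data (hypothetical values, for illustration only)
df <- data.frame(
  tea   = factor(c("black", "green", "black", "Earl Grey")),
  sugar = factor(c("sugar", "No.sugar", "sugar", "sugar"))
)

# indicator coding of one factor: one 0/1 column per level
indicator <- function(f) {
  m <- outer(as.character(f), levels(f), "==") * 1
  colnames(m) <- levels(f)
  m
}
Xind <- do.call(cbind, lapply(df, indicator))  # complete indicator matrix

I  <- nrow(Xind)                               # number of individuals
J  <- ncol(df)                                 # number of variables
Ik <- colSums(Xind)                            # individuals taking each level

# squared distance between individuals i and j, rare levels weighing more
d2_mca <- function(i, j) (I / J) * sum((Xind[i, ] - Xind[j, ])^2 / Ik)

d_same <- d2_mca(1, 3)                         # identical answers
d_diff <- d2_mca(1, 2)                         # two different answers
```

Two individuals giving the same answers are at distance 0, and differing on a rare level moves individuals further apart than differing on a common one.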
$F_{iq} = \frac{1}{\sqrt{\lambda_q}} \sum_{k} \frac{x_{ik}}{J} G_{kq}$: individual $i$ at the barycenter of its levels ($J$: number of variables)
$G_{kq} = \frac{1}{\sqrt{\lambda_q}} \sum_{i} \frac{x_{ik}}{I_k} F_{iq}$: level $k$ at the barycenter of the individuals who take this level ($I_k$: number of individuals taking level $k$)
[Figure: MCA factor map (Dim 2: 8.103%) showing the individuals and the levels, e.g. tea shop, unpackaged, p_upscale, green, black, Earl Grey, lemon, milk, tea bag, chain store, p_branded, p_private label, p_cheap, p_unknown]