[Figure: expression matrix with Gene 1 to Gene 6 as rows and experiments as columns, with a plot of log2(Cy5/Cy3) for Experiment 1 and Experiment 2]
Black indicates a log ratio of zero, i.e., Cy5 and Cy3 are very close in value.
Green indicates a negative log ratio, i.e., Cy5 < Cy3.
Distance and Similarity
-the ability to calculate a distance (or similarity,
its inverse) between two expression vectors is
fundamental to clustering algorithms
-distance between vectors is the basis upon which
decisions are made when grouping similar patterns
of expression
-selection of a distance metric defines the concept
of distance
Distance: a measure of similarity between genes.
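As a minimal sketch of how the choice of metric changes what "distance" means, the code below compares Euclidean distance with a Pearson-based distance (1 - r) on two expression vectors. The function names and the example vectors are illustrative assumptions, not part of the original slides.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pearson_distance(x, y):
    """Distance derived from correlation: 0 when the patterns match in shape."""
    return 1.0 - pearson_r(x, y)

# Hypothetical log-ratio profiles: same shape, different magnitude.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]

print(euclidean_distance(gene_a, gene_b))  # far apart in magnitude
print(pearson_distance(gene_a, gene_b))    # essentially identical in shape
```

The two genes are far apart by Euclidean distance but almost indistinguishable by Pearson distance, which is why the metric has to be chosen with the biological question in mind.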
3. Pearson correlation
Distance is Defined by a Metric
[Figure: distributions of Population 1 and Population 2; σ = std. deviation of the distribution]
Less than a 5% chance that the sample with mean s came from population 1,
i.e., s is significantly different from “mean 1” at the p < 0.05 significance level.
But we cannot reject the hypothesis that the sample came from population 2.
Many biological variables, such as height and weight, can
reasonably be assumed to approximate the normal distribution.
Even when parametric statistical tests (the ones that make use
of normal and other distributions) are valid, randomization tests
are still useful.
Outline of a randomization test - 1
1. Compute the value of interest (i.e., the test-statistic s) from
your data set.
[Figure: the test statistic s from the original data set, and a “fake” s computed from each of the randomized data sets]
Outline of a randomization test - 2
Rationale
Our “fake” data sets are our best guess about this behavior (i.e.,
if we had been pulling data at random from an infinitely large
population, we might expect to get a distribution similar to what
we get by pulling random sub-samples, or by reshuffling the
order of the data in our sample)
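The rationale above can be sketched as a small two-group randomization test. This is an illustration under stated assumptions: the test statistic (absolute difference of group means), the number of shuffles, and the function names are choices made here, since the slides only say "the value of interest".

```python
import random

def mean(values):
    return sum(values) / len(values)

def randomization_p_value(group_a, group_b, n_shuffles=2000, seed=0):
    """Proportion of reshuffled ("fake") statistics at least as large
    as the statistic s computed from the real grouping."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))  # the real s
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reshuffle the order of the data
        fake = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if fake >= observed:
            hits += 1
    return hits / n_shuffles

# Hypothetical measurements for two well-separated groups.
low = [0.0, 0.1, 0.2, 0.1, 0.0, 0.2]
high = [5.0, 5.1, 5.2, 5.1, 5.0, 5.2]
print(randomization_p_value(low, high))
```

If the real s sits far out in the tail of the "fake" distribution, the returned proportion is small, which is the randomization analogue of a small parametric p-value.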
The problem of multiple testing
(adapted from presentation by Anja von Heydebreck, Max–Planck–Institute for Molecular Genetics,
Dept. Computational Molecular Biology, Berlin, Germany
http://www.bioconductor.org/workshops/Heidelberg02/mult.pdf)
*Kempthorne, O., and T.E. Doerfler. 1969. The behaviour of some significance tests under experimental
randomization. Biometrika 56:231-248, as cited in Manly, B.F.J. 1997. Randomization, bootstrap and Monte Carlo
methods in biology: pg. 1. Chapman and Hall / CRC
Pearson correlation coefficient – r
p < 0.05 range, i.e., reject the null hypothesis that the variables
are not correlated, since the sample correlation coefficient is in
the rejection range of the correlation coefficient distribution
that has a mean = 0
• This is the p-value for the null hypothesis that the X and Y data for our sample
come from a population in which their correlation is zero, i.e., the null hypothesis
is that there is no linear relationship between X and Y.
• If p is sufficiently small (often p < 0.05), we can reject the null hypothesis, i.e., we
conclude that there is indeed a linear relationship between X and Y.
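For reference, the sample correlation coefficient and the statistic commonly used to obtain this p-value can be written as below. The symbols x_i, y_i and n (the number of paired observations) are notation chosen here, and the t-distribution result assumes the data are approximately bivariate normal, which the slides do not state explicitly.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
t = r \sqrt{\frac{n - 2}{1 - r^2}}
```

Under the null hypothesis of zero population correlation, t follows a Student's t-distribution with n - 2 degrees of freedom, so a large |t| corresponds to a small p-value.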
Pearson correlation - 3
(HCL-1)
Hierarchical Clustering
g1 g2 g3 g4 g5 g6 g7 g8
g1 is most like g8
g1 g8 g2 g3 g4 g5 g6 g7
g1 g8 g4 g2 g3 g5 g6 g7
(HCL-2)
Hierarchical Clustering
g1 g8 g4 g2 g3 g5 g6 g7
g5 is most like g7
g1 g8 g4 g2 g3 g5 g7 g6
g1 g8 g4 g5 g7 g2 g3 g6
(HCL-3)
Hierarchical Tree
g1 g8 g4 g5 g7 g2 g3 g6
(HCL-4)
Hierarchical Clustering
(HCL-5)
Agglomerative Linkage Methods
Linkage methods are rules or metrics that return a value that
can be used to determine which elements (clusters) should be
linked.
• Single Linkage
• Average Linkage
• Complete Linkage
(HCL-6)
Single Linkage
Cluster-to-cluster distance is defined as the minimum distance
between members of one cluster and members of another
cluster. Single linkage tends to create ‘elongated’ clusters with
individual genes chained onto clusters.
[Figure: D_AB, the single-linkage distance between clusters A and B]
(HCL-7)
Average Linkage
Cluster-to-cluster distance is defined as the average distance
between all members of one cluster and all members of
another cluster. Average linkage has a slight tendency to
produce clusters of similar variance.
(HCL-8)
Complete Linkage
Cluster-to-cluster distance is defined as the maximum distance
between members of one cluster and members of another
cluster. Complete linkage tends to create clusters of similar
size and variability.
[Figure: D_AB, the complete-linkage distance between clusters A and B]
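The three linkage rules differ only in how they summarize the pairwise distances between two clusters. A minimal sketch, assuming Euclidean distance and hypothetical one-dimensional clusters (any other metric could be passed in):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pairwise_distances(cluster_a, cluster_b, dist=euclidean):
    """All distances between a member of A and a member of B."""
    return [dist(a, b) for a in cluster_a for b in cluster_b]

def single_linkage(cluster_a, cluster_b, dist=euclidean):
    return min(pairwise_distances(cluster_a, cluster_b, dist))

def average_linkage(cluster_a, cluster_b, dist=euclidean):
    d = pairwise_distances(cluster_a, cluster_b, dist)
    return sum(d) / len(d)

def complete_linkage(cluster_a, cluster_b, dist=euclidean):
    return max(pairwise_distances(cluster_a, cluster_b, dist))

# Hypothetical clusters of one-dimensional expression values.
cluster_a = [[0.0], [1.0]]
cluster_b = [[3.0], [5.0]]
print(single_linkage(cluster_a, cluster_b))    # 2.0
print(average_linkage(cluster_a, cluster_b))   # 3.5
print(complete_linkage(cluster_a, cluster_b))  # 5.0
```

At each agglomerative step, the pair of clusters with the smallest linkage value is merged, so the same data can yield different trees under different rules.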
(HCL-9)
Comparison of Linkage Methods
[Figure: the same six genes (Gene 1 to Gene 6) clustered with each of the linkage methods]
Jackknifing (ST)
Jackknifing – resampling without replacement
Original expression matrix: Gene 1 to Gene 6 (rows) by Exp 1 to Exp 6 (columns).
Jackknifed matrices: the same genes with one experiment left out, e.g., without Exp 2
(Exp 1, Exp 3, Exp 4, Exp 5, Exp 6) or without Exp 5 (Exp 1, Exp 2, Exp 3, Exp 4, Exp 6).
Analysis of Bootstrapped and Jackknifed Support Trees
• Each time, hierarchical trees are created from the resampled matrices.
• The trees are compared to the tree obtained from the original data set.
• The more frequently a given cluster from the original tree is found in the
resampled trees, the stronger the support for the cluster.
• As each resampled matrix lacks some of the original data, high support for a
cluster means that the clustering is not biased by a small subset of the data.
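A deliberately simplified sketch of the support idea: instead of comparing whole trees, it checks how often the first merge of an agglomerative tree (the closest pair of genes) survives when one experiment is left out. Comparing only the first merge, using Euclidean distance, and the example matrix are all assumptions made here for brevity.

```python
import math
from itertools import combinations

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def closest_pair(matrix):
    """Indices of the two genes (rows) that would be merged first."""
    pairs = combinations(range(len(matrix)), 2)
    return min(pairs, key=lambda p: euclidean(matrix[p[0]], matrix[p[1]]))

def jackknife_matrices(matrix):
    """Yield one resampled matrix per experiment, with that column removed."""
    n_experiments = len(matrix[0])
    for dropped in range(n_experiments):
        yield [[v for j, v in enumerate(row) if j != dropped] for row in matrix]

def jackknife_support(matrix):
    """Fraction of jackknifed matrices that reproduce the original first merge."""
    original = closest_pair(matrix)
    resampled = list(jackknife_matrices(matrix))
    hits = sum(1 for m in resampled if closest_pair(m) == original)
    return original, hits / len(resampled)

# Hypothetical matrix: 3 genes (rows) by 4 experiments (columns).
expression = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 3.1, 4.1],
    [4.0, 3.0, 2.0, 1.0],
]
print(jackknife_support(expression))
```

A support value near 1 means the grouping does not depend on any single experiment; a full support tree applies the same counting to every cluster in the original tree.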
K-Means / K-Medians Clustering (KMC)– 1
[Figure: genes G1 to G13 partitioned into clusters]
5. Repeat steps 3 and 4 until genes cannot be shuffled around any more,
OR a user-specified number of iterations has been reached.
K-Means / K-Medians is most useful when the user has an a-priori hypothesis
about the number of clusters the genes should group into.
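A minimal K-Means sketch follows. The earlier steps are not reproduced in this text, so the initialization used here (k randomly chosen genes as starting centroids) and the squared Euclidean distance are assumptions; only the stopping rule mirrors step 5 above.

```python
import random

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def k_means(genes, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Assumed initialization: k distinct genes serve as starting centroids.
    centroids = [list(g) for g in rng.sample(genes, k)]
    assignment = None
    for _ in range(max_iterations):  # user-specified iteration cap
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(g, centroids[c])) for g in genes
        ]
        if new_assignment == assignment:  # no gene changed cluster: stop
            break
        assignment = new_assignment
        for c in range(k):
            members = [g for g, a in zip(genes, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = centroid(members)
    return assignment, centroids

# Hypothetical genes forming two obvious groups.
genes = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
         [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centers = k_means(genes, k=2)
print(labels)
```

For K-Medians, the `centroid` function would take the component-wise median instead of the mean.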
Principal Components (PCAG and PCAE) – 1
[Figure: “cloud” of data points (e.g., genes) in 3-dimensional space, with axes x, y, and z]
x-axis could mean a continuum from over- to under-expression (“blue” and “green”
genes over-expressed, yellow genes under-expressed)
y-axis could mean that “gray” genes are over-expressed in the first five expts and under-expressed in
the remaining expts, while “brown” genes are under-expressed in the first five expts and
over-expressed in the remaining expts.
z-axis might represent different cyclic patterns, e.g., “red” genes might be over-expressed in
odd-numbered expts and under-expressed in even-numbered ones, whereas the opposite is true
for “purple” genes.
Unassigned genes
4. Update the affinities of all the genes (new affinity of a gene =
its previous affinity + its similarity to the gene(s) newly added to the cluster C1)
ADD GENES:
5. While there exists an unassigned gene whose affinity to the cluster C1 exceeds the
user-specified threshold affinity, pick the unassigned gene whose affinity is the highest,
and add it to cluster C1. Update the affinities of all the genes accordingly.
CAST – 2
REMOVE GENES:
6. When there are no more unassigned high-affinity genes, check to see if cluster C1
contains any elements whose affinity is lower than the current threshold. If so, remove
the lowest-affinity gene from C1. Update the affinities of all genes by subtracting from
each gene’s affinity, its similarity to the removed gene.
9. Form a new cluster with the genes that were not assigned to cluster C1, repeating steps
1-8.
10. Keep forming new clusters following steps 1-9, until all genes have been assigned
to a cluster
QT-Clust (from Heyer et al. 1999) (HJC) - 1
2. Choose a gene as the seed for a new cluster. Add the gene which increases cluster
diameter the least. Continue adding genes until additional genes will exceed the
specified cluster diameter limit.
[Figure: a “seed” gene and its current cluster, surrounded by the currently unassigned genes (G1 to G12)]
3. Repeat step 2 for every gene, so that each gene has the chance to be the seed of a
new cluster. All clusters are provisional at this point.
QT-Clust – 2
4. Choose the largest cluster obtained from steps 2 and 3. In case of a tie, pick one of the
largest clusters at random.
[Figure: provisional clusters grown from three different “seed” genes]
5. All genes that are not in the cluster selected above are treated as currently unassigned.
Repeat steps 2-4 on these unassigned genes.
6. Stop when the last cluster thus formed has fewer genes than a user-specified number.
All genes that are not in a cluster at this point are treated as unassigned.
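Steps 2 to 6 can be sketched as follows. This is an illustration, not the published implementation: Euclidean distance stands in for whatever metric is chosen, cluster diameter is taken as the largest pairwise distance within the cluster, and ties between equally large clusters go to the first one found rather than a random one.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def grow_cluster(seed, candidates, vectors, max_diameter):
    """Step 2: repeatedly add the gene that increases the diameter least."""
    cluster, diameter = [seed], 0.0
    remaining = [g for g in candidates if g != seed]
    while remaining:
        best, best_diameter = None, None
        for g in remaining:
            d = max(diameter,
                    max(euclidean(vectors[g], vectors[m]) for m in cluster))
            if best_diameter is None or d < best_diameter:
                best, best_diameter = g, d
        if best_diameter > max_diameter:  # next gene would exceed the limit
            break
        cluster.append(best)
        remaining.remove(best)
        diameter = best_diameter
    return cluster

def qt_clust(vectors, max_diameter, min_cluster_size=2):
    unassigned = list(vectors)
    clusters = []
    while unassigned:
        # Step 3: every unassigned gene gets a turn as the seed.
        provisional = [grow_cluster(s, unassigned, vectors, max_diameter)
                       for s in unassigned]
        largest = max(provisional, key=len)  # step 4 (first one wins ties)
        if len(largest) < min_cluster_size:  # step 6: stop
            break
        clusters.append(largest)
        unassigned = [g for g in unassigned if g not in largest]  # step 5
    return clusters, unassigned

# Hypothetical one-dimensional expression values.
vectors = {"G1": [0.0], "G2": [0.5], "G3": [1.0],
           "G4": [10.0], "G5": [10.4], "G6": [50.0]}
clusters, leftover = qt_clust(vectors, max_diameter=1.5)
print(clusters, leftover)
```

Unlike K-Means, the number of clusters is not fixed in advance; it follows from the diameter limit and the minimum cluster size.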
SOTA - 1
SOTA Characteristics
• Divisive clustering, allowing high level hierarchical
structure to be revealed without having to completely
partition the data set down to single gene vectors
• Data set is reduced to clusters arranged in a binary tree
topology
• The number of resulting clusters is not fixed before
clustering
• Neural network approach which has advantages similar to
SOMs such as handling large data sets that have large
amounts of ‘noise’
SOTA - 3
SOTA Topology
[Figure: a parent node (p) with its member genes and centroid vector, and its two offspring cells, the winning cell (w) and the sister cell (s)]
Adaptation Overview
-each gene vector associated with the parent is compared
to the centroid vector of its offspring cells.
Growth Termination
Expansion stops when the most diverse cell’s diversity falls below a threshold.
SOTA - 8
[Figure: plot against adaptation epoch number (0 to 500), with vertical axis from 0 to 0.1]
Self-organizing maps (SOMs) – 1
1. Specify the number of nodes (clusters) desired, and also specify a 2-D
geometry for the nodes, e.g., rectangular or hexagonal
[Figure: nodes N1 to N6 (N = nodes) laid out among genes G1 to G29 (G = genes)]
SOMs – 2
2. Choose a random gene, e.g., G9
3. Move the nodes in the direction of G9. The node closest to G9 (N2) is moved
the most, and the other nodes are moved by smaller varying amounts. The
further away the node is from N2, the less it is moved.
[Figure: nodes N1 to N6 shifted toward gene G9, with N2 moved the most]
SOM Neighborhood Options
[Figure: bubble neighborhood (with its radius) and Gaussian neighborhood, each shown for nodes N1 to N6 and genes G7 to G11]
5. Finally, each node will “nestle” among a cluster of genes, and a gene will be
considered to be in the cluster if its distance to the node in that cluster is less than its
distance to any other node
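One SOM update and the final assignment can be sketched as below, using a Gaussian neighborhood. The learning rate, the radius, the grid coordinates and all function names are illustrative assumptions; a real run would repeat the update over many randomly chosen genes while shrinking the learning rate and radius.

```python
import math

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def som_update(nodes, grid, gene, learning_rate=0.5, radius=1.0):
    """Move every node toward `gene`; nodes close to the winner on the
    2-D grid move more, distant nodes move less (Gaussian neighborhood)."""
    winner = min(range(len(nodes)), key=lambda i: sq_dist(nodes[i], gene))
    for i, weights in enumerate(nodes):
        grid_d2 = ((grid[i][0] - grid[winner][0]) ** 2
                   + (grid[i][1] - grid[winner][1]) ** 2)
        influence = math.exp(-grid_d2 / (2.0 * radius ** 2))
        nodes[i] = [w + learning_rate * influence * (g - w)
                    for w, g in zip(weights, gene)]
    return winner

def assign_to_nodes(genes, nodes):
    """Final step: each gene belongs to the cluster of its nearest node."""
    return [min(range(len(nodes)), key=lambda i: sq_dist(nodes[i], g))
            for g in genes]

# Hypothetical setup: three nodes on a 1 x 3 grid, one randomly chosen gene.
nodes = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
grid = [(0, 0), (0, 1), (0, 2)]
chosen_gene = [1.2, 1.2]
winner = som_update(nodes, grid, chosen_gene)
print(winner, nodes)
```

A bubble neighborhood would replace `influence` with 1 for nodes within the radius of the winner and 0 for all others.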
[Figure: final node positions, with each of N1 to N6 nestled among a cluster of genes]
Template Matching
-template matching allows one to find expression vectors
which match a provided template
-a template can be derived from
- a gene known to be central to the area of study
- a sample or set of samples of a particular type
- a cluster with a mean pattern of interest
- a pattern constructed to reveal trends based on
knowledge of the experimental design
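A minimal sketch of template matching, assuming that a match is scored by Pearson correlation against the template and accepted above a threshold. The threshold `min_r`, the gene names and the step-like template are hypothetical; a p-value cutoff on the correlation could be used instead.

```python
import math

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def match_template(expression, template, min_r=0.9):
    """Names of genes whose profile correlates with the template."""
    return [name for name, profile in expression.items()
            if pearson_r(profile, template) >= min_r]

# Hypothetical template built from the experimental design:
# low in the first two experiments, high in the last two.
template = [0.0, 0.0, 1.0, 1.0]
expression = {
    "geneA": [0.1, 0.0, 0.9, 1.0],   # follows the template
    "geneB": [1.0, 1.0, 0.0, 0.0],   # opposite pattern
    "geneC": [0.5, 0.4, 0.6, 0.5],   # weak resemblance
}
print(match_template(expression, template))
```

The same function works when the template is the profile of a known gene or the mean pattern of a cluster.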
PTM-2
Gene Shaving
-repeat until only one gene remains
-choose cluster of appropriate size as determined by gap statistic calculation
-orthogonalize expression matrix with respect to the average gene in the cluster and
repeat shaving procedure
Gap statistic calculation (choosing cluster size)
Quality measure for clusters: R^2 = between variance / within variance
-between variance: variance of mean gene across experiments
-within variance: variance of each gene about the cluster average
Large R^2 implies a tight cluster of coherent genes.
A. Theoretical t-distribution
OR
B. Permutation tests.
TTEST - Between subjects - 3
Permutation tests
ii) Randomly shuffle the values of the gene between groups A and B,
such that the reshuffled groups A and B respectively have the same
number of elements as the original groups A and B.
Original grouping: Group A = Exp 1, Exp 2, Exp 5; Group B = Exp 3, Exp 4, Exp 6
Randomized grouping: Group A = Exp 3, Exp 2, Exp 6; Group B = Exp 4, Exp 5, Exp 1
OR
Use Bonferroni corrections to reduce the probability of
erroneously classifying non-significant genes as significant.
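A sketch of the standard Bonferroni correction, in which each gene's p-value is compared with the critical p-value divided by the number of tests. The function name and the example p-values are illustrative, and the slides do not say which Bonferroni variant is intended.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each test as significant only if p < alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Hypothetical p-values for four genes: the corrected threshold is 0.0125.
gene_p_values = [0.001, 0.02, 0.04, 0.2]
print(bonferroni_significant(gene_p_values))
```

Three of these genes pass an uncorrected 0.05 cutoff, but only the first survives the correction, which is the intended reduction in false positives.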
1. Used to test if the mean expression of a gene over all experiments is
different from a hypothesized mean.
7. By permutation:
7a. Randomly pick some elements of the gene vector, and change their values,
such that the new value of the changed element is
Thus, if the original gene values are: 0.5 -1.3 2.4 1.2 -0.2 0.8
the randomized gene values could be: -0.5 -1.3 2.4 -1.2 0.2 -0.8
7c. Repeat 7a and 7b as many times as desired. If all permutations are chosen,
then every possible combination of elements in the gene vector is chosen for
flipping.
7d. The p-value = 1 – (the proportion of times that the original absolute
t-value exceeds the randomized absolute t-value over all the permutations
conducted).
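The permutation procedure in step 7 can be sketched for the case where all sign-flip combinations are enumerated. The hypothesized mean is assumed to be zero here, so that changing an element means negating it, as in the example values above; the function names are illustrative.

```python
import itertools
import math

def one_sample_t(values, hypothesized_mean=0.0):
    n = len(values)
    m = sum(values) / n
    variance = sum((v - m) ** 2 for v in values) / (n - 1)
    return (m - hypothesized_mean) / math.sqrt(variance / n)

def sign_flip_p_value(values):
    """Step 7d: p = 1 - (proportion of flips whose |t| the original exceeds).
    Assumes a hypothesized mean of 0 and enumerates every flip combination."""
    observed = abs(one_sample_t(values))
    exceeded = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(values)):
        flipped = [s * v for s, v in zip(signs, values)]
        total += 1
        if observed > abs(one_sample_t(flipped)):
            exceeded += 1
    return 1.0 - exceeded / total

# The gene values used in the example above.
gene = [0.5, -1.3, 2.4, 1.2, -0.2, 0.8]
print(sign_flip_p_value(gene))
```

With n elements there are 2^n combinations, so full enumeration is only practical for small n; otherwise a random subset of flips is drawn, as step 7a describes.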
[Figure: expression matrix for Gene 1 to Gene 7, with experiments ordered Ex 1, Ex 2, Ex 7, Ex 4, Ex 5, Ex 9, Ex 3, Ex 6, Ex 8]
2. Question: Is mean expression level of a gene the same across all groups?
ANOVA - 2
The larger the value of F, the greater the difference among the group means
relative to the sampling error variability (which is the within groups variability).
i.e., the larger the value of F, the more likely it is that the differences among the
group means reflect “real” differences among the means of the populations they
are drawn from, rather than being due to random sampling error.
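To make the ratio concrete, here is a minimal one-way F computation: the between-groups mean square divided by the within-groups mean square. The function name and the three small groups are illustrative.

```python
def mean(values):
    return sum(values) / len(values)

def f_statistic(groups):
    """One-way ANOVA F = between-groups mean square / within-groups mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical expression values of one gene in three groups of experiments.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
print(f_statistic(groups))  # 13.0
```

Here the group means (2, 3 and 6) differ by much more than the spread inside each group, so F is large.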
ANOVA - 3
Therefore, the smaller the p-value, the less likely it is that the
null hypothesis is valid, i.e., the differences among group
means are more likely to reflect real population differences as
p-values decrease in magnitude.
ANOVA - 4
TFA - 2
[Figure: gene expression of females and males across strains 1, 2, and 3; one panel shows no interaction, the other shows interaction]
TFA - 3
Significance analysis of microarrays (SAM)
iii) Randomly shuffle the values of the genes between groups A and B,
such that the reshuffled groups A and B respectively have the same
number of elements as the original groups A and B. Compute the
d-value for each randomized gene
Original grouping: Group A = Exp 1, Exp 2, Exp 5; Group B = Exp 3, Exp 4, Exp 6
Randomized grouping (shown for Gene 1): Group A = Exp 3, Exp 2, Exp 6; Group B = Exp 4, Exp 5, Exp 1
SAM Two-Class Unpaired - 3
iv) Rank the permuted d-values of the genes in ascending order
v) Repeat steps iii) and iv) many times, so that each gene has many
randomized d-values corresponding to its rank from the observed
(unpermuted) d-value. Take the average of the randomized d-values
for each gene. This is the expected d-value of that gene.
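Steps iii) to v) can be sketched as follows. The d-value formula is not given in this text, so the version below (difference of group means divided by a pooled standard error plus a small constant `s0`) is an assumption, as are the function names and the random example data.

```python
import math
import random

def mean(values):
    return sum(values) / len(values)

def d_value(group_a, group_b, s0=0.1):
    """Assumed d-statistic: (mean B - mean A) / (pooled std. error + s0)."""
    n_a, n_b = len(group_a), len(group_b)
    m_a, m_b = mean(group_a), mean(group_b)
    ss = (sum((x - m_a) ** 2 for x in group_a)
          + sum((x - m_b) ** 2 for x in group_b))
    s = math.sqrt((1.0 / n_a + 1.0 / n_b) * ss / (n_a + n_b - 2))
    return (m_b - m_a) / (s + s0)

def expected_d_values(matrix, n_a, n_permutations=100, seed=0):
    """Average, rank by rank, of the sorted d-values from shuffled groupings."""
    rng = random.Random(seed)
    columns = list(range(len(matrix[0])))
    rank_sums = [0.0] * len(matrix)
    for _ in range(n_permutations):
        rng.shuffle(columns)                      # step iii: reshuffle groups
        a_cols, b_cols = columns[:n_a], columns[n_a:]
        permuted = sorted(                        # step iv: rank ascending
            d_value([row[c] for c in a_cols], [row[c] for c in b_cols])
            for row in matrix
        )
        for rank, d in enumerate(permuted):
            rank_sums[rank] += d
    return [total / n_permutations for total in rank_sums]  # step v

# Hypothetical matrix: 4 genes (rows) by 6 experiments (first 3 are group A).
data_rng = random.Random(1)
matrix = [[data_rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(4)]
expected = expected_d_values(matrix, n_a=3)
print(expected)
```

The observed d-values, sorted the same way, are then plotted against these expected values, and genes far from the “observed = expected” line are candidates for significance.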
Tuning parameter “delta” limits can be dynamically changed by using the slider bar or
entering a value in the text field.
Significant negative genes (i.e., mean expression of group A > mean expression of group B)
are shown in green.
The more a gene deviates from the “observed = expected” line, the more likely it is to be
significant. Any gene beyond the first gene in the +ve or –ve direction on the x-axis
(including the first gene), whose observed exceeds the expected by at least delta, is
considered significant.
SAM Two-Class Unpaired – 5
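For each permutation of the data, compute the number
of positive and negative significant genes for a given
delta as explained in the previous slide. The median
number of significant genes from these permutations
is the median number of falsely significant genes; dividing it
by the number of genes called significant in the observed
data gives the median False Discovery Rate.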
[Figure: samples A and B forming an A-B pair]
SAM Two-Class Paired - 2
•or, groups A and B could represent two strains for which samples were
collected at several time points over a time course study. A sample
collected from each of strain A and B at the same time point could form
an AB pair.
[Figure: expression matrix for Gene 1 to Gene 7, with experiments ordered Ex 1, Ex 2, Ex 7, Ex 4, Ex 5, Ex 9, Ex 3, Ex 6, Ex 8]
• positive genes are those whose means are greater than the specified
mean, while negative genes have means smaller than the specified
mean
Support Vector Machines
(SVM)
[Figure: SVM Training produces Weights, which feed SVM Classification; its outputs are Elements In Classification and Elements Out of Classification]
SVM Classification
• SVM attempts to find an optimal separating
hyperplane between members of the two
initial classifications.
[Figure: separating hyperplane for the separation problem]
5. If the p-value for a vector < the user-specified p-value, that vector
is retained for further analysis.
6. Steps 1-5 are repeated for every vector in the set to be classified.
KNNC – 4 - Classification parameters
• Let v be a vector that needs to be classified,
and T = {t1, t2, …, t10} be the set of training vectors.
[Figure: vector v plotted among the training vectors t1 to t10 of set T, labeled by class]
KNNC –5 - Classification
[Figure: vector v and the training vectors t1 to t10 of set T, labeled Class 1 to Class 4]
• Suppose v’s 5 nearest neighbors in set T (by Euclidean distance) are
t1, t4, t8, t2, and t5.
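A minimal sketch of the nearest-neighbor vote, assuming the usual KNN rule that v receives the class held by the majority of its k nearest training vectors. The training vectors, class labels and function names are hypothetical.

```python
import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(v, training, k=5):
    """training: list of (vector, class_label) pairs.
    Returns the most common label among the k vectors nearest to v."""
    neighbors = sorted(training, key=lambda item: euclidean(v, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training set T with two classes.
training = [
    ([0.0, 0.0], "Class 1"), ([0.0, 1.0], "Class 1"), ([1.0, 0.0], "Class 1"),
    ([5.0, 5.0], "Class 2"), ([6.0, 5.0], "Class 2"), ([5.0, 6.0], "Class 2"),
]
print(knn_classify([0.5, 0.5], training, k=5))  # 3 votes to 2: Class 1
```

Choosing an odd k reduces, but with more than two classes does not eliminate, the chance of tied votes.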
[Figure: a cluster of 12 genes, 8 of which belong to the theme]
The frequency of the theme within the cluster is 8/12 = 67%
AND
               Cluster
               in    out
Theme   in      8      2
        out     4     26
Assigning Significance to the Findings
The Fisher’s Exact Test permits us to determine if there are
non-random associations between the two variables, expression
based cluster membership and membership to a particular
biological theme.
               Cluster
               in    out
Theme   in      8      2
        out     4     26
p ≈ .0002
$$
p \;=\; \frac{\dfrac{(a+c)!}{a!\,c!}\cdot\dfrac{(b+d)!}{b!\,d!}}{\dfrac{n!}{(a+b)!\,(c+d)!}}
\;=\; \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{n!\,a!\,b!\,c!\,d!}
$$
Probability Computation
For our matrix (first row 8, 2; second row 4, 26), we are not only
interested in the probability of getting exactly
8 annotation hits in the cluster but rather the probability
of having 8 or more hits. In this case the probabilities
of each of the possible matrices are summed:
 8  2      9  1     10  0
 4 26      3 27      2 28
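The summation can be checked with a short sketch. The function names are illustrative; the table layout follows the slides (rows are theme in/out, columns are cluster in/out), and the row and column totals are held fixed while the top-left count increases.

```python
from math import comb

def table_probability(a, b, c, d):
    """Probability of one 2x2 table [[a, b], [c, d]] with fixed margins."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

def fisher_one_sided(a, b, c, d):
    """Probability of `a` or more hits, summing over all more extreme tables."""
    row_in, col_in, n = a + b, a + c, a + b + c + d
    total = 0.0
    for hits in range(a, min(row_in, col_in) + 1):
        b_i = row_in - hits
        c_i = col_in - hits
        d_i = n - hits - b_i - c_i
        total += table_probability(hits, b_i, c_i, d_i)
    return total

# The matrix from the slides: 8 or more annotation hits in the cluster.
p_value = fisher_one_sided(8, 2, 4, 26)
print(p_value)
```

The three terms correspond to the three matrices listed above, and the sum is consistent with the p of about .0002 reported for this table.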
• Algorithm scripting
• Discriminant analysis
• Chromosome Viewers
etc.