
Analysis of Multiple Experiments

TIGR Multiple Experiment Viewer (MeV)
Advanced Course Coverage
• Introduction
-fundamental concepts, expression vectors and
distance metrics
-fundamental statistical concepts encountered in
MeV analysis modules
• Algorithm Coverage
-Lecture / Hands on Exercises
(refer to algorithm handout for order…)
Microarray Data Flow
[Figure: data-flow diagram of the TIGR microarray software pipeline. Labeled components: Microarray Printers, Scheduler (machine scheduling, machine control) and Microarray Scanners (machines shown: IAS-1, IAS-2, Lucidea, Axon-1, Axon-2, MD, MD3, SliTrack, ScanArray, others); Exp Designer; MABCOS (barcode system); PCR Score; MADAM (data manager) with data entry pages, reports and query windows; Spotfinder (image analysis of .tiff image files); Miner (.tav file creator); GenePix Converter; MIDAS (normalization, raw .tav file to normalized .tav file); MeV (data analysis); MUSAGE; MAD database; MAGE-ML.]
TIGR: The Institute for Genomic Research
The Expression Matrix is a representation of data from multiple
microarray experiments.
Each element is a log ratio
(usually log2(Cy5 / Cy3))

[Figure: heat map of an expression matrix with rows Gene 1 to Gene 6 and columns Exp 1 to Exp 6.]

Color legend:
• Black indicates a log ratio of zero, i.e., Cy5 and Cy3 are very close in value
• Green indicates a negative log ratio, i.e., Cy5 < Cy3
• Red indicates a positive log ratio, i.e., Cy5 > Cy3
• Gray indicates missing data
Expression Vectors
-Gene Expression Vectors
encapsulate the expression of a gene over a
set of experimental conditions or sample
types.

[Figure: log2(Cy5/Cy3) (y-axis, -2 to 2) plotted against experiments 1 to 8 for one gene. The expression vector is:]

-0.8 1.5 1.8 0.5 -0.4 -1.3 0.8 1.5


Expression Vectors As Points in ‘Expression Space’
      Exp 1   Exp 2   Exp 3
G1    -0.8    -0.3    -0.7
G2    -0.4    -0.8    -0.7
G3    -0.6    -0.8    -0.4
G4     0.9     1.2     1.3
G5     1.3     0.9    -0.6
[Figure: the genes plotted as points on the axes Experiment 1, Experiment 2 and Experiment 3; points lying close together are labeled “Similar Expression”.]
Distance and Similarity
-the ability to calculate a distance (or similarity,
its inverse) between two expression vectors is
fundamental to clustering algorithms
-distance between vectors is the basis upon which
decisions are made when grouping similar patterns
of expression
-selection of a distance metric defines the concept
of distance
Distance: a measure of dissimilarity between genes (the smaller the distance, the more similar the genes).

          Exp 1   Exp 2   Exp 3   Exp 4   Exp 5   Exp 6
Gene A    x1A     x2A     x3A     x4A     x5A     x6A
Gene B    x1B     x2B     x3B     x4B     x5B     x6B

Some distances: (MeV provides 11 metrics)

1. Euclidean: $d_{AB} = \sqrt{\sum_{i=1}^{6} (x_{iA} - x_{iB})^2}$

2. Manhattan: $d_{AB} = \sum_{i=1}^{6} |x_{iA} - x_{iB}|$

3. Pearson correlation
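The three metrics listed above can be sketched in a few lines of Python. This is only an illustration, not MeV's implementation: the function names are made up here, and the vectors are rows G1, G2 and G4 of the ‘Expression Space’ table. The Pearson value is multiplied by -1 to turn a similarity into a distance, following the "r*-1" convention used on the next slide.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson_r(a, b):
    """Pearson correlation coefficient between two vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(ss_a * ss_b)

# G1, G2 and G4 from the 'Expression Space' table (Exp 1 to Exp 3).
g1 = [-0.8, -0.3, -0.7]
g2 = [-0.4, -0.8, -0.7]
g4 = [0.9, 1.2, 1.3]

for name, other in (("G2", g2), ("G4", g4)):
    # Pearson similarity is negated so that smaller means "more alike".
    print(name, euclidean(g1, other), manhattan(g1, other), -pearson_r(g1, other))
```

Under both Euclidean and Manhattan distance, G1 is closer to G2 than to G4, which matches the grouping suggested by the table.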
Distance is Defined by a Metric
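[Figure: pairs of expression profiles plotted as log2(Cy5/Cy3), with the distance D between each pair under two metrics:]

Distance Metric:   Euclidean   Pearson (r × -1)
D                  1.4         -0.90
D                  4.2         -1.00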
Statistical Concepts
Probability distributions

The probability of an event is the likelihood of its occurring. It is sometimes computed as a relative frequency (rf), where

$rf = \frac{\text{the number of “favorable” outcomes for an event}}{\text{the total number of possible outcomes for that event}}$

The probability of an event can sometimes be inferred from a theoretical probability distribution, such as a normal distribution.
Normal distribution

[Figure: normal curve centered at X = μ (mean of the distribution), with σ = standard deviation of the distribution.]

[Figure: two normal curves, Population 1 (Mean 1) and Population 2 (Mean 2), with a sample mean “s” marked.]

Less than a 5% chance that the sample with mean s came from population 1,
i.e., s is significantly different from “mean 1” at the p < 0.05 significance level.
But we cannot reject the hypothesis that the sample came from population 2.
Many biological variables, such as height and weight, can reasonably be assumed to approximate the normal distribution.

But expression measurements? Probably not.

Fortunately, many statistical tests are considered to be fairly robust to violations of the normality assumption, and of other assumptions used in these tests.

Randomization / resampling based tests can be used to get around the violation of the normality assumption.

Even when parametric statistical tests (the ones that make use of normal and other distributions) are valid, randomization tests are still useful.
Outline of a randomization test - 1
1. Compute the value of interest (i.e., the test statistic s) from your original data set.

2. Make “fake” data sets from your original data, by taking a random sub-sample of the data, or by re-arranging the data in a random fashion.

3. Re-compute s from the “fake” data set.

[Figure: the original data set yields s; each randomized data set yields its own “fake” s.]
Outline of a randomization test - 2

4. Repeat steps 2 and 3 many times (often several hundred to several thousand times). Keep a record of the “fake” s values from step 3.

5. Draw inferences about the significance of your original s value by comparing it with the distribution of the randomized (“fake”) s values.

[Figure: distribution of randomized s values, with its range marked. The original s value could be significant, as it exceeds most of the randomized s values.]


Outline of a randomization test - 3

Rationale

Ideally, we want to know the “behavior” of the larger population from which the sample is drawn, in order to make statistical inferences.

Here, we don’t know that the larger population “behaves” like a normal distribution, or some other idealized distribution. All we have to work with are the data in hand.

Our “fake” data sets are our best guess about this behavior (i.e., if we had been pulling data at random from an infinitely large population, we might expect to get a distribution similar to what we get by pulling random sub-samples, or by reshuffling the order of the data in our sample).
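The five-step outline can be sketched as follows. This is a minimal illustration under stated assumptions: the test statistic s is taken to be the Pearson correlation between two vectors, the "fake" data sets are made by reshuffling one vector, and the data values and function names are invented for the example.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient between two vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def randomization_test(x, y, n_iter=1000, seed=0):
    rng = random.Random(seed)
    s = pearson_r(x, y)                      # step 1: statistic on the real data
    shuffled = list(y)
    fake_values = []
    for _ in range(n_iter):                  # step 4: repeat many times
        rng.shuffle(shuffled)                # step 2: make a "fake" data set
        fake_values.append(pearson_r(x, shuffled))  # step 3: "fake" s
    # Step 5: how often is a "fake" s at least as extreme as the original?
    n_extreme = sum(1 for fake in fake_values if abs(fake) >= abs(s))
    return s, n_extreme / n_iter

# Hypothetical data with a strong linear trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.2]
s, p = randomization_test(x, y)
print(s, p)
```

The returned proportion plays the role of a p-value: the smaller it is, the more the original s stands out from the randomized values.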
The problem of multiple testing
(adapted from presentation by Anja von Heydebreck, Max–Planck–Institute for Molecular Genetics,
Dept. Computational Molecular Biology, Berlin, Germany
http://www.bioconductor.org/workshops/Heidelberg02/mult.pdf)

• Let’s imagine there are 10,000 genes on a chip, AND

• None of them is differentially expressed.

• Suppose we use a statistical test for differential expression, where we consider a gene to be differentially expressed if it meets the criterion at a p-value of p < 0.05.
The problem of multiple testing – 2

• Let’s say that applying this test to gene “G1” yields a p-value of p = 0.01.

• Remember that a p-value of 0.01 means that, if the gene were not differentially expressed, there would be only a 1% chance of obtaining a test result at least this extreme.

• Even though we conclude that the gene is differentially expressed (because p < 0.05), a gene that is not differentially expressed would still give such a result about 1% of the time, so our conclusion could be wrong.

• We might be willing to live with such a low probability of being wrong
BUT .....
The problem of multiple testing – 3

• We are testing 10,000 genes, not just one!!!

• Even though none of the genes is differentially expressed, we expect about 5% of the genes (i.e., 500 genes) to be erroneously concluded to be differentially expressed, because we have decided to “live with” a p-value of 0.05.

• If only one gene were being studied, a 5% margin of error might not be a big deal, but 500 false conclusions in one study? That doesn’t sound too good.
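The 500-gene figure can be checked with a small simulation. The sketch below assumes that, when no gene is differentially expressed, each gene's p-value is uniformly distributed between 0 and 1; the seed and variable names are arbitrary.

```python
import random

rng = random.Random(42)
n_genes = 10_000
alpha = 0.05

# Under the null hypothesis, p-values are uniform on [0, 1].
p_values = [rng.random() for _ in range(n_genes)]

false_positives = sum(1 for p in p_values if p < alpha)
expected = n_genes * alpha

print("expected false positives:", expected)
print("simulated false positives:", false_positives)
```

The simulated count varies from run to run but stays close to the expected 500.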
The problem of multiple testing - 4

• There are “tricks” we can use to reduce the severity of this problem.

• They all involve “slashing” the p-value for each test (i.e., gene), so that while the critical p-value for the entire data set might still equal 0.05, each gene will be evaluated at a lower p-value.

• We’ll go into some of these techniques later.

• Don’t get too hung up on p-values.

• Ultimately, what matters is biological relevance. P-values should help you evaluate the strength of the evidence, rather than being used as an absolute yardstick of significance. Statistical significance is not necessarily the same as biological significance.
• i.e., you don’t want to belong to “that
group of people whose aim in life is to be
wrong 5% of the time”!!! *

*Kempthorne, O., and T.E. Doerfler. 1969. The behaviour of some significance tests under experimental randomization. Biometrika 56:231-248, as cited in Manly, B.F.J. 1997. Randomization, bootstrap and Monte Carlo methods in biology: pg. 1. Chapman and Hall / CRC
Pearson correlation coefficient – r

• Indicates the degree to which a linear relationship can be approximated between two variables.

• Can range from (–1.0) to (+1.0).

• Positive r between two variables X and Y: as X increases, so does Y on the whole.

• Negative r: as X increases, Y generally decreases.

• The higher the magnitude of r (in the positive or negative direction), the more linear the relationship.
Pearson correlation - 2
• Sometimes, a p-value is associated with the correlation coefficient r.

• This p-value is computed from a theoretical distribution of the correlation coefficient, similar to the normal distribution.

[Figure: distribution of the sample correlation coefficient r when the population correlation coefficient = 0 (mean = 0). A sample r in the p < 0.05 range lies in the rejection range of this distribution, i.e., we reject the null hypothesis that the variables are not correlated.]

• This is the p-value for the null hypothesis that the X and Y data for our sample come from a population in which their correlation is zero, i.e., the null hypothesis is that there is no linear relationship between X and Y.

• If p is sufficiently small (often p < 0.05), we can reject the null hypothesis, i.e., we conclude that there is indeed a linear relationship between X and Y.
Pearson correlation - 3

The square of the Pearson correlation, r², also known as the coefficient of determination, is a measure of the “strength” of the linear relationship between X and Y.

It is the proportion of the total variation in Y that is explained by a linear relationship with X (and, symmetrically, of the variation in X explained by Y).
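The "proportion of variation explained" reading of r² can be checked numerically. The sketch below fits a least-squares line to hypothetical data and compares 1 - (residual variation / total variation) with the square of Pearson's r; the data and function names are invented for the example.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def explained_fraction(x, y):
    """1 - SS_residual / SS_total for the least-squares line y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    ss_residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_total = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_residual / ss_total

# Hypothetical measurements.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

print(pearson_r(x, y) ** 2, explained_fraction(x, y))
```

The two printed numbers agree up to floating-point rounding.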
Algorithms…
Hierarchical Clustering (HCL)

HCL is an agglomerative clustering method which joins similar genes into groups. The iterative process continues with the joining of resulting groups based on their similarity until all groups are connected in a hierarchical tree.

(HCL-1)
Hierarchical Clustering
g1 g2 g3 g4 g5 g6 g7 g8

g1 is most like g8

g1 g8 g2 g3 g4 g5 g6 g7

g4 is most like {g1, g8}

g1 g8 g4 g2 g3 g5 g6 g7

(HCL-2)
Hierarchical Clustering

g1 g8 g4 g2 g3 g5 g6 g7

g5 is most like g7

g1 g8 g4 g2 g3 g5 g7 g6

{g5,g7} is most like {g1, g4, g8}

g1 g8 g4 g5 g7 g2 g3 g6

(HCL-3)
Hierarchical Tree

g1 g8 g4 g5 g7 g2 g3 g6

(HCL-4)
Hierarchical Clustering

During construction of the hierarchy, decisions must be made to determine which clusters should be joined. The distance or similarity between clusters must be calculated. The rules that govern this calculation are linkage methods.

(HCL-5)
Agglomerative Linkage Methods
Linkage methods are rules or metrics that return a value that
can be used to determine which elements (clusters) should be
linked.

Three linkage methods that are commonly used are:

• Single Linkage
• Average Linkage
• Complete Linkage

(HCL-6)
Single Linkage
Cluster-to-cluster distance is defined as the minimum distance between members of one cluster and members of another cluster. Single linkage tends to create ‘elongated’ clusters with individual genes chained onto clusters.

$D_{AB} = \min_{i,j} \, d(u_i, v_j)$

where $u_i \in A$ and $v_j \in B$, for all $i = 1, \dots, N_A$ and $j = 1, \dots, N_B$

(HCL-7)
Average Linkage
Cluster-to-cluster distance is defined as the average distance between all members of one cluster and all members of another cluster. Average linkage has a slight tendency to produce clusters of similar variance.

$D_{AB} = \frac{1}{N_A N_B} \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} d(u_i, v_j)$

where $u_i \in A$ and $v_j \in B$, for all $i = 1, \dots, N_A$ and $j = 1, \dots, N_B$

(HCL-8)
Complete Linkage
Cluster-to-cluster distance is defined as the maximum distance between members of one cluster and members of another cluster. Complete linkage tends to create clusters of similar size and variability.

$D_{AB} = \max_{i,j} \, d(u_i, v_j)$

where $u_i \in A$ and $v_j \in B$, for all $i = 1, \dots, N_A$ and $j = 1, \dots, N_B$

(HCL-9)
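The three linkage rules differ only in how they summarize the list of pairwise distances between two clusters. A minimal sketch, assuming Euclidean distance as d and two small made-up clusters A and B:

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pairwise_distances(cluster_a, cluster_b, dist=euclidean):
    """All d(u_i, v_j) with u_i in A and v_j in B."""
    return [dist(u, v) for u in cluster_a for v in cluster_b]

def single_linkage(cluster_a, cluster_b):
    return min(pairwise_distances(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)  # len == N_A * N_B

def complete_linkage(cluster_a, cluster_b):
    return max(pairwise_distances(cluster_a, cluster_b))

# Hypothetical clusters of 2-experiment expression vectors.
A = [[0.0, 0.0], [1.0, 0.0]]
B = [[3.0, 0.0], [5.0, 0.0]]

print(single_linkage(A, B), average_linkage(A, B), complete_linkage(A, B))
```

Here the pairwise distances are 3, 5, 2 and 4, so the three rules give 2, 3.5 and 5 respectively.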
Comparison of Linkage Methods

[Figure: side-by-side clustering results for single, average (Ave.) and complete linkage.]


(HCL-10)
Bootstrapping (ST)
Bootstrapping – resampling with replacement
Original expression matrix (rows Gene 1 to Gene 6), with columns:
Exp 1 Exp 2 Exp 3 Exp 4 Exp 5 Exp 6

Various bootstrapped matrices (by experiments), each with the same rows Gene 1 to Gene 6, e.g., with columns:
Exp 2 Exp 2 Exp 3 Exp 4 Exp 4 Exp 4
Exp 1 Exp 1 Exp 3 Exp 5 Exp 5 Exp 6
Jackknifing (ST)
Jackknifing – resampling without replacement
Original expression matrix (rows Gene 1 to Gene 6), with columns:
Exp 1 Exp 2 Exp 3 Exp 4 Exp 5 Exp 6

Various jackknifed matrices (by experiments), each with the same rows Gene 1 to Gene 6, e.g., with columns:
Exp 1 Exp 3 Exp 4 Exp 5 Exp 6
Exp 1 Exp 2 Exp 3 Exp 4 Exp 6
Analysis of Bootstrapped and Jackknifed Support Trees

• Bootstrapped or jackknifed expression matrices are created many times by randomly resampling the original expression matrix, using either the bootstrap or jackknife procedure.

• Each time, hierarchical trees are created from the resampled matrices.

• The trees are compared to the tree obtained from the original data set.

• The more frequently a given cluster from the original tree is found in the
resampled trees, the stronger the support for the cluster.

• As each resampled matrix lacks some of the original data, high support for a
cluster means that the clustering is not biased by a small subset of the data.
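Resampling by experiments amounts to choosing which columns of the expression matrix to keep. The sketch below is an illustration only: the matrix values are invented, the function names are hypothetical, and the jackknife is assumed to leave out exactly one experiment, as in the examples on the jackknifing slide.

```python
import random

def bootstrap_by_experiments(matrix, rng):
    """Draw as many columns as the original has, with replacement."""
    n_experiments = len(matrix[0])
    columns = sorted(rng.randrange(n_experiments) for _ in range(n_experiments))
    return columns, [[row[c] for c in columns] for row in matrix]

def jackknife_by_experiments(matrix, rng):
    """Leave out one randomly chosen column (no column is repeated)."""
    n_experiments = len(matrix[0])
    left_out = rng.randrange(n_experiments)
    columns = [c for c in range(n_experiments) if c != left_out]
    return columns, [[row[c] for c in columns] for row in matrix]

# Hypothetical expression matrix: 2 genes x 6 experiments.
matrix = [
    [0.5, -1.3, 2.4, 1.2, -0.2, 0.8],
    [-0.8, 1.5, 1.8, 0.5, -0.4, -1.3],
]

rng = random.Random(1)
boot_columns, boot_matrix = bootstrap_by_experiments(matrix, rng)
jack_columns, jack_matrix = jackknife_by_experiments(matrix, rng)
print("bootstrap columns:", boot_columns)
print("jackknife columns:", jack_columns)
```

To build a support tree, each resampled matrix would be clustered again and the resulting trees compared with the tree from the original matrix.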
K-Means / K-Medians Clustering (KMC)– 1

1. Specify number of clusters, e.g., 5.

2. Randomly assign genes to clusters.

[Figure: genes G1 to G13 randomly assigned to clusters.]
K-Means Clustering – 2
3. Calculate mean / median expression profile of each cluster.
4. Shuffle genes among clusters such that each gene is now in the cluster
whose mean / median expression profile (calculated in step 3) is the
closest to that gene’s expression profile.

[Figure: genes G1 to G13 regrouped among the clusters after reassignment.]

5. Repeat steps 3 and 4 until genes cannot be shuffled around any more,
OR a user-specified number of iterations has been reached.

K-Means / K-Medians is most useful when the user has an a-priori hypothesis
about the number of clusters the genes should group into.
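Steps 1 to 5 can be sketched as a short K-Means routine. This is a simplified illustration, not MeV's code: it uses means only (not medians), squared Euclidean distance, and a shuffled round-robin start so that no cluster begins empty. The gene vectors are made up.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_profile(vectors):
    """Mean expression profile of a list of gene vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def k_means(genes, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 2: random initial assignment of genes to clusters.
    order = list(range(len(genes)))
    rng.shuffle(order)
    labels = [0] * len(genes)
    for position, gene_index in enumerate(order):
        labels[gene_index] = position % k
    centroids = [None] * k
    for _ in range(max_iter):  # step 5: stop after a fixed number of iterations
        # Step 3: mean profile of each cluster (an emptied cluster keeps
        # its previous centroid).
        for c in range(k):
            members = [g for g, label in zip(genes, labels) if label == c]
            if members:
                centroids[c] = mean_profile(members)
        # Step 4: move each gene to the cluster with the closest mean profile.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(g, centroids[c]))
            for g in genes
        ]
        if new_labels == labels:  # step 5: no gene moved
            break
        labels = new_labels
    return labels

# Hypothetical genes forming two obvious groups.
genes = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
         [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
labels = k_means(genes, k=2)
print(labels)
```

A K-Medians variant would replace `mean_profile` with a per-experiment median.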
Principal Components (PCAG and PCAE) – 1

1. PCA simplifies the “views” of the data.

2. Suppose we have measurements for each gene on multiple experiments.

3. Suppose some of the experiments are correlated.

4. PCA will ignore the redundant experiments, and will take a weighted average of some of the experiments, thus possibly making the trends in the data more interpretable.

5. The components can be thought of as axes in n-dimensional space, where n is the number of components. Each axis represents a different trend in the data.
PCAG and PCAE - 2

[Figure: a “cloud” of data points (e.g., genes) in 3-dimensional space, resolved along 3 principal component axes x, y and z.]

In this example,

x-axis could mean a continuum from over- to under-expression (“blue” and “green” genes over-expressed, yellow genes under-expressed)

y-axis could mean that “gray” genes are over-expressed in the first five expts and under-expressed in the remaining expts, while “brown” genes are under-expressed in the first five expts, and over-expressed in the remaining expts.

z-axis might represent different cyclic patterns, e.g., “red” genes might be over-expressed in odd-numbered expts and under-expressed in even-numbered ones, whereas the opposite is true for “purple” genes.

Interpretation of components is somewhat subjective.


Cluster Affinity Search
Technique (CAST)
-uses an iterative approach to segregate
elements with ‘high affinity’ into a cluster
-the process iterates through two phases
-addition of high affinity elements to the
cluster being created
-removal or clean-up of low affinity
elements from the cluster being created
Cluster Affinity Search Technique (CAST)-1
Affinity = a measure of similarity between a gene and all the genes in a cluster.
Threshold affinity = user-specified criterion for retaining a gene in a cluster, defined as a
percentage of the maximum affinity at that point
1. Create a new empty cluster C1.
2. Set initial affinity of all genes to zero
3. Move the two most similar genes into the new cluster.
[Figure: unassigned genes G1 to G15 next to the empty cluster C1.]
4. Update the affinities of all the genes (new affinity of a gene =
its previous affinity + its similarity to the gene(s) newly added to the cluster C1)
ADD GENES:
5. While there exists an unassigned gene whose affinity to the cluster C1 exceeds the
user-specified threshold affinity, pick the unassigned gene whose affinity is the highest,
and add it to cluster C1. Update the affinities of all the genes accordingly.
CAST – 2
REMOVE GENES:
6. When there are no more unassigned high-affinity genes, check to see if cluster C1
contains any elements whose affinity is lower than the current threshold. If so, remove
the lowest-affinity gene from C1. Update the affinities of all genes by subtracting from
each gene’s affinity, its similarity to the removed gene.

7. Repeat step 6 while C1 contains a low-affinity gene.


[Figure: the current cluster C1 and the remaining unassigned genes.]
8. Repeat steps 5-7 as long as changes occur to the cluster C1.

9. Form a new cluster with the genes that were not assigned to cluster C1, repeating steps
1-8.

10. Keep forming new clusters following steps 1-9, until all genes have been assigned
to a cluster
QT-Clust (from Heyer et al. 1999) (HJC) - 1

1. Compute a jackknifed distance between all pairs of genes
(Jackknifed distance: The data from one experiment are excluded from both genes, and the
distance is calculated. Each experiment is thus excluded in turn, and the maximum distance
between the two genes (over all exclusions) is the jackknifed distance. This is a conservative
estimate of distance that accounts for bias that might be introduced by single outlier experiments.)

2. Choose a gene as the seed for a new cluster. Add the gene which increases cluster
diameter the least. Continue adding genes until additional genes will exceed the
specified cluster diameter limit.

[Figure: a “seed” gene and its current cluster, surrounded by the currently unassigned genes.]
3. Repeat step 2 for every gene, so that each gene has the chance to be the seed of a
new cluster. All clusters are provisional at this point.
QT-Clust – 2
4. Choose the largest cluster obtained from steps 2 and 3. In case of a tie, pick one of the
largest clusters at random.
[Figure: provisional clusters grown from different “seed” genes; the largest one is marked “Pick this cluster”.]

5. All genes that are not in the cluster selected above are treated as currently unassigned.
Repeat steps 2-4 on these unassigned genes.

6. Stop when the last cluster thus formed has fewer genes than a user-specified number.
All genes that are not in a cluster at this point are treated as unassigned.
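The jackknifed distance from step 1 can be sketched as below. The slide does not say which underlying distance is used, so this example assumes 1 - Pearson r; the function names and gene values are invented. The two genes are unrelated except for one shared outlier experiment.

```python
import math

def pearson_r(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(ss_a * ss_b)

def pearson_distance(a, b):
    # Assumed underlying metric: 1 - r.
    return 1.0 - pearson_r(a, b)

def jackknifed_distance(a, b):
    """Exclude each experiment in turn; keep the maximum distance."""
    distances = []
    for skip in range(len(a)):
        a_rest = a[:skip] + a[skip + 1:]
        b_rest = b[:skip] + b[skip + 1:]
        distances.append(pearson_distance(a_rest, b_rest))
    return max(distances)

# Hypothetical genes: the last experiment is an outlier shared by both.
gene_a = [0.1, -0.2, 0.0, 0.2, -0.1, 3.0]
gene_b = [-0.1, 0.1, 0.2, -0.2, 0.0, 3.0]

plain = pearson_distance(gene_a, gene_b)
jackknifed = jackknifed_distance(gene_a, gene_b)
print(plain, jackknifed)
```

With all six experiments the genes look almost identical, but once the outlier is excluded they are far apart, and the jackknifed distance reports that larger value.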
SOTA - 1

Self Organizing Tree Algorithm

• Dopazo, J., and J.M. Carazo. Phylogenetic reconstruction using an unsupervised growing neural network that adopts the topology of a phylogenetic tree. J. Mol. Evol. 44:226-233, 1997.

• Herrero, J., A. Valencia, and J. Dopazo. A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics, 17(2):126-136, 2001.
SOTA - 2

SOTA Characteristics
• Divisive clustering, allowing high level hierarchical
structure to be revealed without having to completely
partition the data set down to single gene vectors
• Data set is reduced to clusters arranged in a binary tree
topology
• The number of resulting clusters is not fixed before
clustering
• Neural network approach which has advantages similar to
SOMs such as handling large data sets that have large
amounts of ‘noise’
SOTA - 3

SOTA Topology
[Figure: a parent node (p) with two offspring cells, the winning cell (w) and the sister cell (s); the diagram also labels the centroid vector and the members of a node.]

migration factor (s < p < w)


SOTA - 4

Adaptation Overview
-each gene vector associated with the parent is compared to the centroid vector of its offspring cells.

-the most similar cell’s centroid and its neighboring cells are adapted using the appropriate migration weights.
SOTA - 5

-following the presentation of all genes to the system, a measure of system diversity is used to determine if training has found an optimal position for the offspring.

-if the system diversity improves (decreases), then another training epoch is started; otherwise training ends and a new cycle starts with a cell division.
SOTA - 6

The most ‘diverse’ cell is selected for division at the start of the next training cycle.
SOTA - 7

Growth Termination

Expansion stops when the most diverse cell’s diversity falls below a threshold.
SOTA - 8

Each training cycle ends when the overall tree diversity ‘stabilizes’. This triggers a cell division and possibly a new training cycle.

[Figure: tree diversity (y-axis, 0 to 0.2) plotted against adaptation epoch number (x-axis, 0 to 500).]
Self-organizing maps (SOMs) – 1
1. Specify the number of nodes (clusters) desired, and also specify a 2-D
geometry for the nodes, e.g., rectangular or hexagonal
[Figure: genes G1 to G29 (G = genes) scattered in expression space, with nodes N1 to N6 (N = nodes) laid out in a 2-D grid.]
SOMs – 2
2. Choose a random gene, e.g., G9
3. Move the nodes in the direction of G9. The node closest to G9 (N2) is moved
the most, and the other nodes are moved by smaller varying amounts. The
further away the node is from N2, the less it is moved.

[Figure: the same genes G1 to G29 and nodes N1 to N6, with the nodes moving in the direction of G9.]
SOM Neighborhood Options
Bubble neighborhood (defined by a neighborhood radius): some nodes move, alpha is constant.
Gaussian neighborhood: all nodes move, alpha is scaled.
[Figure: nodes N1 to N6 near genes G7 to G11, shown for each neighborhood option.]
SOMs – 3
4. Steps 2 and 3 (i.e., choosing a random gene and moving the nodes towards it) are
repeated many (usually several thousand) times. However, with each iteration, the amount
that the nodes are allowed to move is decreased.

5. Finally, each node will “nestle” among a cluster of genes, and a gene will be
considered to be in the cluster if its distance to the node in that cluster is less than its
distance to any other node
[Figure: nodes N1 to N6, each nestled among a cluster of the genes G1 to G29.]
Template Matching
-template matching allows one to find expression vectors
which match a provided template
-a template can be derived from
- a gene known to be central to the area of study
- a sample or set of samples of a particular type
- a cluster with a mean pattern of interest
- a pattern constructed to reveal trends based on
knowledge of the experimental design
PTM-2

-Sometimes it is useful to identify elements that have complementary patterns by selecting to use the absolute value of r.
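Template matching on r, with and without the absolute-value option, might look like the sketch below. It is an illustration only: the cutoff on r (0.9), the function name `match_template` and the gene vectors are assumptions made for the example.

```python
import math

def pearson_r(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(ss_a * ss_b)

def match_template(genes, template, threshold=0.9, use_absolute=False):
    """Return names of genes whose r (or |r|) with the template meets the cutoff."""
    matches = []
    for name, vector in genes.items():
        r = pearson_r(vector, template)
        score = abs(r) if use_absolute else r
        if score >= threshold:
            matches.append(name)
    return matches

# Hypothetical genes and a rising template.
genes = {
    "up": [0.1, 0.5, 1.0, 1.6, 2.0],
    "down": [2.0, 1.4, 1.1, 0.4, 0.0],
    "noise": [0.3, -0.2, 0.4, -0.3, 0.1],
}
template = [0.0, 0.5, 1.0, 1.5, 2.0]

print(match_template(genes, template))
print(match_template(genes, template, use_absolute=True))
```

Without the absolute value only the rising gene matches; with it, the gene with the complementary (falling) pattern is picked up as well.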
K-Means / K-Medians Support (KMS)

Because of the random initialization of K-Means / K-Medians, clustering results may vary somewhat between successive runs on the same dataset. KMS helps us validate the clustering results obtained from K-Means / K-Medians.

1. Run K-Means / K-Medians multiple times.

2. The KMS module generates clusters in which the member genes frequently group together in the same clusters (“consensus clusters”) across multiple runs of K-Means / K-Medians.

3. The consensus clusters consist of genes that clustered together in at least x% of the K-Means / K-Medians runs, where x is the threshold percentage input by the user.
Gene Shaving
1. Compute the first principal component of the expression matrix.
2. Shave off a percentage (default 10%) of genes with the lowest values of dot product with the 1st principal component.
3. Repeat until only one gene remains. This results in a series of nested clusters.
4. Choose the cluster of appropriate size as determined by the gap statistic calculation.
5. Orthogonalize the expression matrix with respect to the average gene in the cluster and repeat the shaving procedure.
Gene Shaving
Gap statistic calculation
(choosing cluster size)

Quality measure for clusters: $R^2 = \frac{\text{between variance}}{\text{within variance}}$

• between variance: variance of the mean gene across experiments
• within variance: variance of each gene about the cluster average
• Large R² implies a tight cluster of coherent genes

1. Create random permutations of the expression matrix and calculate R² for each.
2. Compare R² of each cluster to that of the entire expression matrix.
3. Choose the cluster whose R² is furthest from the average R² of the permuted expression matrices.

The final cluster contains a set of genes that are greatly affected by the experimental conditions in a similar way.
Relevance Networks

Set of genes whose expression profiles are predictive of one another.

Can be used to identify negative correlations between genes.

Genes with low entropy (least variable across experiments) are excluded from analysis:

$H = -\sum_{x=1}^{10} p(x) \log_2 p(x)$
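The entropy filter can be sketched as follows. The slide gives only the formula with ten terms, so the binning here is an assumption: each gene's own range of values is split into 10 equal-width bins, and p(x) is the fraction of that gene's values falling in bin x. The function name and data are invented.

```python
import math

def entropy(values, n_bins=10):
    """H = -sum p(x) log2 p(x) over equal-width bins of the gene's range."""
    low, high = min(values), max(values)
    if high == low:
        return 0.0  # a constant gene occupies a single bin
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land in bin n_bins; clamp it into the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Hypothetical genes measured over 10 experiments.
spread_out = list(range(10))                 # one value in every bin
spiky = [0, 0, 0, 0, 0, 0, 0, 0, 0, 5]       # nearly all values in one bin

print(entropy(spread_out), entropy(spiky))
```

The evenly spread gene reaches the maximum possible entropy for 10 bins, log2(10), while the spiky gene scores far lower and would be a candidate for exclusion.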
Relevance Networks

[Figure: genes A to E drawn as a fully connected graph, each edge labeled with a correlation coefficient (.75, .92, .15, .37, .02, .28, .63, .51, .40, .11); after thresholding with Tmin = 0.50 and Tmax = 0.90, only some of the edges remain.]

1. The expression pattern of each gene is compared to that of every other gene.
2. The ability of each gene to predict the expression of each other gene is assigned a correlation coefficient.
3. Correlation coefficients outside the boundaries defined by the minimum and maximum thresholds (here Tmin = 0.50 and Tmax = 0.90) are eliminated.
4. The remaining relationships between genes define the subnets.
T-Tests (TTEST) – Between subjects (or unpaired) - 1
1. Assign experiments to two groups, e.g., in the expression matrix below, assign experiments 1, 2 and 5 to group A, and experiments 3, 4 and 6 to group B.

Original column order:   Exp 1  Exp 2  Exp 3  Exp 4  Exp 5  Exp 6
Group A:                 Exp 1  Exp 2  Exp 5
Group B:                 Exp 3  Exp 4  Exp 6
(rows in both layouts: Gene 1 to Gene 6)

2. Question: Is mean expression level of a gene in group A significantly different from mean expression level in group B?
TTEST – Between subjects - 2

3. Calculate t-statistic for each gene

4. Calculate probability value of the t-statistic for each gene either from:

A. Theoretical t-distribution

OR

B. Permutation tests.
TTEST - Between subjects - 3

Permutation tests

i) For each gene, compute t-statistic

ii) Randomly shuffle the values of the gene between groups A and B,
such that the reshuffled groups A and B respectively have the same
number of elements as the original groups A and B.
Original grouping (Gene 1):     Group A: Exp 1, Exp 2, Exp 5    Group B: Exp 3, Exp 4, Exp 6
Randomized grouping (Gene 1):   Group A: Exp 3, Exp 2, Exp 6    Group B: Exp 4, Exp 5, Exp 1


TTEST - Between subjects - 4

Permutation tests - continued


iii) Compute t-statistic for the randomized gene

iv) Repeat steps ii-iii n times (where n is specified by the user).

v) Let x = the number of times the absolute value of the original t-statistic exceeds the absolute value of the randomized t-statistic over n randomizations.

vi) Then, the p-value associated with the gene = 1 – (x/n)
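The permutation procedure can be sketched for a single gene as follows. The slides do not give the exact t-statistic formula, so this example assumes the unequal-variance (Welch) form; the data values, seed and function names are invented.

```python
import math
import random
import statistics

def t_statistic(a, b):
    """Two-sample t-statistic (unequal-variance form, assumed here)."""
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.mean(a) - statistics.mean(b)) / standard_error

def permutation_p_value(a, b, n=1000, seed=0):
    rng = random.Random(seed)
    original = abs(t_statistic(a, b))            # i) original t-statistic
    pooled = list(a) + list(b)
    x = 0
    for _ in range(n):                           # iv) repeat n times
        rng.shuffle(pooled)                      # ii) reshuffle between groups
        fake_a, fake_b = pooled[:len(a)], pooled[len(a):]
        fake = abs(t_statistic(fake_a, fake_b))  # iii) randomized t-statistic
        if original > fake:                      # v) count how often original wins
            x += 1
    return 1 - x / n                             # vi) p-value

# Hypothetical log ratios for one gene.
group_a = [2.1, 1.8, 2.4]      # e.g., Exp 1, Exp 2, Exp 5
group_b = [-0.3, 0.2, -0.1]    # e.g., Exp 3, Exp 4, Exp 6

p = permutation_p_value(group_a, group_b)
print(p)
```

With three experiments per group there are only 20 distinct ways to split the six values, and two of them (the original and its mirror image) give the same absolute t-statistic, so the p-value cannot fall much below 2/20 = 0.1 however large the difference. Larger groups are needed for smaller permutation p-values.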


TTEST - Between subjects - 5
5. Determine whether a gene’s expression levels are significantly
different between the two groups by one of three methods:

A) Just alpha: If the calculated p-value for a gene is less than or equal to the user-input alpha (critical p-value), the gene is considered significant.

OR
Use Bonferroni corrections to reduce the probability of erroneously classifying non-significant genes as significant.

B) Standard Bonferroni correction: The user-input alpha is divided by the total number of genes to give a critical p-value that is used as above.
TTEST - Between subjects – 6
5C) Adjusted Bonferroni:
i) The t-values for all the genes are ranked in descending order.
ii) For the gene with the highest t-value, the critical p-value becomes (alpha / N), where N is the total number of genes; for the gene with the second-highest t-value, the critical p-value will be (alpha / (N-1)), and so on.
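The three significance rules (5A to 5C) can be compared side by side. The sketch below applies the critical p-values exactly as described, gene by gene; whether testing should stop at the first non-significant gene is not stated on the slides and is not modeled. The p-values, t-values and function names are made up.

```python
def just_alpha(p_values, alpha):
    return [p <= alpha for p in p_values]

def standard_bonferroni(p_values, alpha):
    cutoff = alpha / len(p_values)
    return [p <= cutoff for p in p_values]

def adjusted_bonferroni(p_values, t_values, alpha):
    n = len(p_values)
    # Rank genes by t-value, highest first.
    order = sorted(range(n), key=lambda i: t_values[i], reverse=True)
    significant = [False] * n
    for rank, gene in enumerate(order):
        cutoff = alpha / (n - rank)   # alpha/N, alpha/(N-1), ...
        significant[gene] = p_values[gene] <= cutoff
    return significant

# Hypothetical results for four genes.
p_values = [0.001, 0.012, 0.02, 0.3]
t_values = [6.0, 3.5, 3.0, 1.1]
alpha = 0.05

print(just_alpha(p_values, alpha))
print(standard_bonferroni(p_values, alpha))
print(adjusted_bonferroni(p_values, t_values, alpha))
```

In this example the standard correction (cutoff 0.0125 for every gene) rejects the third gene, while the adjusted correction (cutoffs 0.0125, 0.0167, 0.025, 0.05) keeps it, which shows that the adjusted version is the less conservative of the two.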
TTEST – 1-class (or One-sample t-test) - 1

1. Used to test if the mean expression of a gene over all experiments is different from a hypothesized mean.

[Figure: expression matrix with columns Exp 1 to Exp 6; each row (Gene 1, Gene 2, Gene 3) is one gene vector (Vector 1, Vector 2, Vector 3).]

2. Question: Is the mean of the values of a given gene vector significantly different from a hypothesized mean?
TTEST- 1 Class - 2
3. Often, the hypothesized mean in gene expression studies is zero, meaning that we are looking for genes whose mean log2 ratio across all experiments is significantly different from zero.

4. Using 1-sample t-tests, we can select genes which, on average, show differential expression across all experiments (since genes with no differential expression should have a mean log2 ratio of zero across all expts).

5. Calculate t-value, where

$t = \frac{\text{observed mean of gene vector} - \text{hypothesized mean of gene vector}}{\text{standard error of the mean of the gene vector}}$
TTEST – 1 class - 3
6. Calculate p-value from a theoretical t-distribution, OR

7. By permutation:
7a. Randomly pick some elements of the gene vector, and change their values,
such that the new value of the changed element is

[original value – 2 x (original value - hypothesized mean)]

(i.e., “flip” the element’s deviation around the hypothesized mean)

Thus, if the original gene values are:   0.5  -1.3  2.4   1.2  -0.2   0.8

and the hypothesized mean is zero, then

the randomized gene values could be:    -0.5  -1.3  2.4  -1.2   0.2  -0.8

(Here the first, fourth, fifth and sixth elements were randomly chosen and flipped around zero, the hypothesized mean.)
TTEST – 1 class - 4

7b. Calculate t-value from the randomized gene

7c. Repeat 7a and 7b as many times as desired. If all permutations are chosen,
then every possible combination of elements in the gene vector is chosen for
flipping.

7d. The p-value = 1 – (the proportion of times that the original absolute
t-value exceeds the randomized absolute t-value over all the permutations
conducted).

8. If a gene’s p-value is less than or equal to the user-specified critical p-value, the gene’s mean expression over all experiments is significantly different from the hypothesized mean.

9. Bonferroni and adjusted Bonferroni corrections may be applied just as in the two-sample t-test.
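The one-class permutation scheme (steps 5 to 7d) can be sketched as below, using the gene values from the flipping example. The function names are invented, all 2^6 = 64 flip combinations are enumerated ("all permutations"), and the guard for a zero standard error is an addition made for this sketch.

```python
import itertools
import math
import statistics

def one_sample_t(values, hypothesized_mean=0.0):
    """t = (observed mean - hypothesized mean) / standard error of the mean."""
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    difference = statistics.mean(values) - hypothesized_mean
    if standard_error == 0:
        # All values identical: avoid dividing by zero.
        return 0.0 if difference == 0 else math.copysign(math.inf, difference)
    return difference / standard_error

def flip(value, hypothesized_mean=0.0):
    """Reflect a value around the hypothesized mean (step 7a)."""
    return value - 2 * (value - hypothesized_mean)

def sign_flip_p_value(values, hypothesized_mean=0.0):
    original = abs(one_sample_t(values, hypothesized_mean))
    exceeded = 0
    total = 0
    # Every possible combination of elements is chosen for flipping (step 7c).
    for mask in itertools.product((False, True), repeat=len(values)):
        randomized = [flip(v, hypothesized_mean) if chosen else v
                      for v, chosen in zip(values, mask)]
        total += 1
        if original > abs(one_sample_t(randomized, hypothesized_mean)):
            exceeded += 1
    return 1 - exceeded / total                  # step 7d

original_values = [0.5, -1.3, 2.4, 1.2, -0.2, 0.8]
mask = (True, False, False, True, True, True)    # elements flipped in the example
flipped = [flip(v) if chosen else v for v, chosen in zip(original_values, mask)]

print(flipped)
print(sign_flip_p_value(original_values))
```

The `flipped` list reproduces the randomized gene values shown in the example for step 7a.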
One Way Analysis of Variance (ANOVA)
1. Assign experiments to > 2 groups, e.g., for an expression matrix with rows Gene 1 to Gene 7 and columns Ex 1 to Ex 9:

Group 1: Ex 1, Ex 2, Ex 7
Group 2: Ex 4, Ex 5, Ex 9
Group 3: Ex 3, Ex 6, Ex 8

2. Question: Is mean expression level of a gene the same across all groups?
ANOVA - 2

3. Calculate an F-ratio for each gene, where

$F = \frac{\text{Mean square (groups)}}{\text{Mean square (error)}}$, which is a measure of $\frac{\text{between-groups variability}}{\text{within-groups variability}}$

The larger the value of F, the greater the difference among the group means
relative to the sampling error variability (which is the within groups variability).

i.e., the larger the value of F, the more likely it is that the differences among the
group means reflect “real” differences among the means of the populations they
are drawn from, rather than being due to random sampling error.
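For one gene, the F-ratio can be computed directly from the group values. A minimal sketch with made-up data for three groups of three experiments each (the function name is hypothetical):

```python
def f_ratio(groups):
    """One-way ANOVA F = mean square (groups) / mean square (error)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-groups sum of squares: group means around the grand mean.
    ss_groups = sum(len(g) * (m - grand_mean) ** 2
                    for g, m in zip(groups, group_means))
    # Within-groups sum of squares: values around their own group mean.
    ss_error = sum((v - m) ** 2
                   for g, m in zip(groups, group_means) for v in g)

    ms_groups = ss_groups / (k - 1)
    ms_error = ss_error / (n_total - k)
    return ms_groups / ms_error

# Hypothetical expression values for one gene in three groups.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value = f_ratio(groups)
print(f_value)
```

Here the between-groups mean square is 13 and the within-groups mean square is 1, giving F = 13: the group means differ by much more than the scatter within each group.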
ANOVA - 3

4. The p-value associated with an F-value is the probability that an F-value at least that large would be obtained if there were no differences among group means (i.e., given the null hypothesis).

Therefore, the smaller the p-value, the stronger the evidence against the null hypothesis, i.e., the differences among group means are more likely to reflect real population differences as p-values decrease in magnitude.
ANOVA - 4

5. P-values can be obtained for the F-values from a theoretical F-distribution, assuming that the populations from which the data are obtained
• are normally distributed, and
• have homogeneous variances.

The test is considered robust to violations of these assumptions, provided sample sizes are relatively large and similar across groups.
ANOVA – 5

6. P-values can be obtained from permutation tests (just like in t-tests), if one does not want to rely on the assumptions needed for using the F-distribution.

P-values can also be corrected for multiple comparisons (using Bonferroni or other procedures).

These features will soon be implemented in MeV.
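A permutation p-value for the F-ratio and a Bonferroni correction might look like the sketch below. It is an assumption-laden illustration, not the MeV implementation: the helper names, the number of permutations, the fixed random seed and the example data are all hypothetical.

```python
import random


def f_ratio(groups):
    """One-way ANOVA F-ratio for one gene (list of lists of log ratios)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_error = ss_within / (n_total - k)
    if ms_error == 0:
        return float("inf")  # no within-group variability at all
    return (ss_between / (k - 1)) / ms_error


def permutation_p_value(groups, n_perm=1000, seed=0):
    """Proportion of random regroupings whose F is at least the observed F."""
    rng = random.Random(seed)
    observed = f_ratio(groups)
    sizes = [len(g) for g in groups]
    pooled = [x for g in groups for x in g]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Deal the shuffled values back into groups of the original sizes.
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if f_ratio(shuffled) >= observed:
            hits += 1
    return hits / n_perm


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


gene = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
p = permutation_p_value(gene)
print(p, bonferroni([p, 0.5]))
```

The permutation approach trades the normality and equal-variance assumptions for computing time: the null distribution of F is built from the data themselves.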


Two-factor ANOVA (TFA)

- Can be used to find genes whose expression is significantly different over two factors (e.g., sex and strain), as well as to look for genes with a significant interaction for these two factors.
[Figure: design grid with rows Male and Female and columns Strain A, Strain B and Strain C]
TFA - 2
[Figure: two panels plotting gene expression against strain (1, 2, 3), with separate lines for Male and Female. Left panel: no interaction. Right panel: interaction.]
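One way to see what "interaction" means numerically is to remove the two main effects from a table of cell means and look at what is left. The sketch below does that for a hypothetical 2 x 3 (sex x strain) table; the function name and all values are invented for illustration, and it computes interaction effects only, not the full two-factor F-tests.

```python
def interaction_effects(cell_means):
    """cell_means[i][j]: mean expression for level i of factor A (e.g., sex)
    and level j of factor B (e.g., strain).

    Returns cell mean - row mean - column mean + grand mean for every cell.
    All zeros means the two factors act additively (no interaction).
    """
    rows, cols = len(cell_means), len(cell_means[0])
    grand = sum(sum(r) for r in cell_means) / (rows * cols)
    row_means = [sum(r) / cols for r in cell_means]
    col_means = [sum(cell_means[i][j] for i in range(rows)) / rows
                 for j in range(cols)]
    return [[cell_means[i][j] - row_means[i] - col_means[j] + grand
             for j in range(cols)]
            for i in range(rows)]


# Rows: male, female. Columns: strains 1, 2, 3 (made-up cell means).
no_interaction = [[1.0, 2.0, 3.0],
                  [2.0, 3.0, 4.0]]   # same strain trend in both sexes
interaction = [[1.0, 2.0, 3.0],
               [3.0, 2.0, 1.0]]      # strain trend reverses between sexes

print(interaction_effects(no_interaction))
print(interaction_effects(interaction))
```

In the first table the sex difference is the same in every strain, so every interaction effect is zero; in the second the strain trend depends on sex, so the effects are nonzero.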
TFA - 3

• Ideally, the design should be balanced, i.e., equal numbers of samples in each factor A – factor B combination.

• If the design is unbalanced, the analysis can still be conducted, but F-tests will be somewhat biased. It may be necessary to use a smaller p-value threshold.

• A design can be balanced with no replication, i.e., one sample per factor A – factor B combination (see below). In this case, interaction cannot be tested.
[Figure: design grid with rows Male and Female and columns Strain A, Strain B and Strain C]
Significance analysis of microarrays (SAM)

• SAM can be used to pick out significant genes based on differential expression between sets of samples.

Currently implemented for the following designs:
- two-class unpaired
- two-class paired
- multi-class
- censored survival
- one-class
SAM -2
• SAM gives estimates of the False Discovery Rate (FDR),
which is the proportion of genes likely to have been wrongly
identified by chance as being significant.

• It is a very interactive algorithm – allows users to dynamically change thresholds for significance (through the tuning parameter delta) after looking at the distribution of the test statistic.

• The ability to dynamically alter the input parameters based on immediate visual feedback, even before completing the analysis, should make the data-mining process more sensitive.
SAM designs

Two-class unpaired: to pick out genes whose mean expression level is significantly different between two groups of samples (analogous to between subjects t-test).

Two-class paired: samples are split into two groups, and there is a 1-to-1 correspondence between a sample in group A and one in group B (analogous to paired t-test).
SAM designs - 2

Multi-class: picks up genes whose mean expression is different across > 2 groups of samples (analogous to one-way ANOVA)

Censored survival: picks up genes whose expression levels are correlated with duration of survival.

One-class: picks up genes whose mean expression across experiments is different from a user-specified mean.
SAM Two-Class Unpaired
1. Assign experiments to two groups, e.g., in the expression matrix
below, assign Experiments 1, 2 and 5 to group A, and
experiments 3, 4 and 6 to group B.
[Figure: expression matrix of Genes 1-6 by Exp 1-6, shown before and after reordering the columns into Group A (Exp 1, Exp 2, Exp 5) and Group B (Exp 3, Exp 4, Exp 6)]

2. Question: Is mean expression level of a gene in group A significantly different from mean expression level in group B?
SAM Two-Class Unpaired– 2
Permutation tests
i) For each gene, compute d-value (analogous to t-statistic). This is
the observed d-value for that gene.

ii) Rank the genes in ascending order of their d-values.

iii) Randomly shuffle the values of the genes between groups A and B, such that the reshuffled groups A and B respectively have the same number of elements as the original groups A and B. Compute the d-value for each randomized gene.

Gene 1, original grouping: Group A = Exp 1, Exp 2, Exp 5; Group B = Exp 3, Exp 4, Exp 6
Gene 1, randomized grouping: Group A = Exp 3, Exp 2, Exp 6; Group B = Exp 4, Exp 5, Exp 1
SAM Two-Class Unpaired - 3
iv) Rank the permuted d-values of the genes in ascending order.

v) Repeat steps iii) and iv) many times, so that each rank position accumulates many randomized d-values. For each gene, take the average of the randomized d-values at the rank that gene holds in the observed (unpermuted) ordering. This is the expected d-value of that gene.

vi) Plot the observed d-values vs. the expected d-values.
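Steps i) to v) can be sketched as follows. This is a rough illustration under stated assumptions rather than MeV's code: the d-value formula follows the published SAM statistic (difference of group means over a pooled standard error plus a small constant `s0`), `s0` is fixed at an arbitrary value instead of being estimated from the data, whole experiment columns are shuffled so every gene sees the same regrouping, and the matrix values are made up.

```python
import random
from math import sqrt


def d_value(a, b, s0=0.1):
    """SAM-style relative difference for one gene (group B minus group A)."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    s = sqrt((1 / na + 1 / nb) * ss / (na + nb - 2))
    return (mean_b - mean_a) / (s + s0)   # s0 is an assumed constant


def observed_and_expected(matrix, group_a, group_b, n_perm=200, seed=0):
    """matrix: rows are genes, columns are experiments.
    group_a, group_b: column indices of the two groups.
    Returns (sorted observed d-values, expected d-value at each rank)."""
    rng = random.Random(seed)

    def d_all(cols_a, cols_b):
        return [d_value([row[j] for j in cols_a], [row[j] for j in cols_b])
                for row in matrix]

    observed = sorted(d_all(group_a, group_b))            # steps i and ii
    columns = list(group_a) + list(group_b)
    sums = [0.0] * len(matrix)
    for _ in range(n_perm):                               # step v
        rng.shuffle(columns)                              # step iii
        permuted = sorted(d_all(columns[:len(group_a)],
                                columns[len(group_a):]))  # step iv
        sums = [s + d for s, d in zip(sums, permuted)]
    expected = [s / n_perm for s in sums]                 # rank-wise average
    return observed, expected


# Hypothetical log ratios: 6 genes x 6 experiments.
matrix = [
    [-0.8, -0.3, 1.2, 0.9, -0.7, 1.3],
    [0.4, 0.8, -0.7, -0.9, 0.6, -0.5],
    [0.1, -0.2, 0.0, 0.2, -0.1, 0.1],
    [1.1, 0.9, 1.0, 1.2, 0.8, 1.1],
    [-0.6, -0.8, 0.7, 0.5, -0.4, 0.9],
    [0.3, -0.1, 0.2, -0.3, 0.0, 0.1],
]
group_a = [0, 1, 4]   # Experiments 1, 2 and 5
group_b = [2, 3, 5]   # Experiments 3, 4 and 6
observed, expected = observed_and_expected(matrix, group_a, group_b)
for obs, exp in zip(observed, expected):
    print(round(obs, 2), round(exp, 2))
```

Plotting `observed` against `expected` gives the SAM plot of step vi); genes with no real group difference fall near the diagonal.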


SAM Two-Class Unpaired – 4
[Figure: SAM plot of observed vs. expected d-values, showing the “Observed d = expected d” line and the tuning parameter “delta” limits. Significant positive genes (i.e., mean expression of group B > mean expression of group A) are shown in red; significant negative genes (i.e., mean expression of group A > mean expression of group B) are shown in green.]

The tuning parameter “delta” limits can be dynamically changed by using the slider bar or entering a value in the text field.

The more a gene deviates from the “observed = expected” line, the more likely it is to be significant. Moving outward along the x-axis in the +ve or –ve direction, the first gene whose observed d-value differs from its expected d-value by at least delta, and every gene beyond it, is considered significant.
SAM Two-Class Unpaired – 5
For each permutation of the data, compute the number of positive and negative significant genes for a given delta as explained in the previous slide. The median number of significant genes from these permutations is the median number of falsely significant genes. Dividing it by the number of genes called significant in the real data gives the median False Discovery Rate.

The rationale behind this is that any genes designated as significant from the randomized data are being picked up purely by chance (i.e., “falsely” discovered). Therefore, the median number picked up over many randomizations is a good estimate of the number of false discoveries.
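The delta rule and the false discovery estimate can be sketched together. The sketch assumes, following the published SAM procedure, that the d-value cutoffs found on the real data are applied unchanged to each permuted data set; it leaves out further refinements of that procedure. Function names and all numbers are hypothetical toy values.

```python
from statistics import median


def cutoffs(observed, expected, delta):
    """observed, expected: d-values sorted ascending and paired by rank.
    Returns (lower, upper) d-value cutoffs; None if no gene qualifies."""
    upper = lower = None
    # Move outward in the positive direction from the origin.
    for obs, exp in zip(observed, expected):
        if exp >= 0 and obs - exp >= delta:
            upper = obs
            break
    # Move outward in the negative direction from the origin.
    for obs, exp in zip(reversed(observed), reversed(expected)):
        if exp <= 0 and exp - obs >= delta:
            lower = obs
            break
    return lower, upper


def count_called(d_values, lower, upper):
    """Number of genes at or beyond either cutoff."""
    return sum(1 for d in d_values
               if (upper is not None and d >= upper)
               or (lower is not None and d <= lower))


def estimated_fdr(observed, expected, permuted, delta):
    """permuted: one list of d-values per permutation of the data."""
    lower, upper = cutoffs(observed, expected, delta)
    called = count_called(observed, lower, upper)
    false_calls = median(count_called(p, lower, upper) for p in permuted)
    fdr = false_calls / called if called else 0.0
    return called, false_calls, fdr


# Toy values for six genes and three permutations.
observed = [-4.0, -1.0, -0.2, 0.3, 1.1, 5.0]
expected = [-1.5, -0.8, -0.2, 0.2, 0.8, 1.5]
permuted = [
    [-1.6, -0.9, -0.1, 0.2, 0.7, 1.4],
    [-4.2, -0.7, -0.3, 0.1, 0.9, 1.6],
    [-1.4, -0.8, -0.2, 0.3, 0.8, 5.1],
]
print(estimated_fdr(observed, expected, permuted, delta=1.0))
```

Raising delta moves both cutoffs outward, which calls fewer genes significant and usually lowers the estimated false discovery rate.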
SAM Two-Class Paired
•Samples fall into two groups

•Each member of group A is associated with a member of group B in a 1-to-1 relationship
[Figure: samples in groups A and B linked into A-B pairs]
SAM Two-Class Paired - 2

•e.g., groups A and B could respectively represent “before” and “after” a drug treatment, and each A-B pair of samples could come from the same patient before and after the treatment.

•or, groups A and B could represent two strains for which samples were collected at several time points over a time course study. A sample collected from each of strain A and B at the same time point could form an A-B pair.

• The rest of the analysis is similar to two-class unpaired SAM. Positive significant genes are those for which Mean(Group B) is significantly larger than Mean(Group A), and the reverse is true for negative significant genes.
SAM Multi-Class
• Extension of SAM two-class unpaired to more than 2 groups

• Experiments belong to one of at least three groups

• Analogous to one-way between subjects ANOVA

[Figure: expression matrix of Genes 1-7 by Ex 1-9, shown before and after reordering the columns into Group 1 (Ex 1, Ex 2, Ex 7), Group 2 (Ex 4, Ex 5, Ex 9) and Group 3 (Ex 3, Ex 6, Ex 8)]


SAM Multi-Class - 2

• This analysis yields only positive significant genes

• These are genes whose means are significantly different across some combination of the groups of experiments.
SAM Censored Survival
• Each experiment (sample) is associated with an observation time, and a state at the time of observation.

• The state is either “dead” or “censored”

• “Censored” means that the subject survived beyond the time point at which the sample was taken.

• A positive score means that a higher expression level for that gene implies shorter survival (i.e., higher risk), whereas a negative score means that higher expression implies longer survival.
SAM One-Class

• used to pick up genes whose mean expression across experiments is different from a user-specified mean.

• analogous to one-sample t-test

• positive genes are those whose means are greater than the specified mean, while negative genes have means smaller than the specified mean
Support Vector Machines
(SVM)

• supervised learning technique


• uses supplied information such as
presumptive biological relationships
between a set of elements, and the
expression profiles of elements to produce a
binary classification of elements.
Supervised Learning
-begins with the definition of a class which specifies in advance which elements should cluster together.
-e.g., genes may code for enzymes in a common pathway or be part of a regulatory system, or samples may be of a tissue type or from a particular strain.
-this information is used to train the SVM to discriminate members from non-members
SVM Process Overview
[Figure: flow diagram with the boxes Initial Classification, Data, SVM Training, Weights, SVM Classification, Elements In Classification and Elements Out of Classification]
SVM Classification
• SVM attempts to find an optimal separating
hyperplane between members of the two
initial classifications.

[Figure: two groups of points divided by a separating hyperplane]
Separation Problem

-an optimal hyperplane partitions the initial classification correctly and maximizes distance from the plane to elements on either ‘side’, positive and negative examples.
-when the training examples (initial classification) consist of very diverse expression patterns, finding an optimal hyperplane can be impossible…
SVM Kernel Construction
The expression data can be transformed to a higher
dimensional space (feature space) by applying a kernel
function. This transformation can have the effect of allowing a
separating hyperplane to be found.
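A toy example may make the idea concrete. Below, one-dimensional values that no single threshold can separate become separable once each value x is mapped to the two-dimensional feature space (x, x^2). The feature map, the hand-picked hyperplane and the data are all assumptions chosen for illustration; this is not the kernel MeV uses, and no SVM training is performed.

```python
def feature_map(x):
    """Map a 1-D value to the 2-D feature space (x, x^2)."""
    return (x, x * x)


def classify(x, w=(0.0, 1.0), b=-4.0):
    """Side of the hyperplane w . phi(x) + b = 0 in feature space.
    With the default w and b the hyperplane is x^2 = 4 (hand-picked)."""
    phi = feature_map(x)
    score = w[0] * phi[0] + w[1] * phi[1] + b
    return 1 if score > 0 else -1


# Positive examples sit on both sides of the negative ones, so no single
# cut point on the original axis separates the two classes.
positives = [-3.0, -2.5, 2.5, 3.0]
negatives = [-1.0, -0.5, 0.5, 1.0]

print([classify(x) for x in positives])
print([classify(x) for x in negatives])
```

A kernel function achieves the same effect without computing the mapped coordinates explicitly, which is what makes high-dimensional feature spaces practical.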
Practical SVM Issues
• Results depend heavily on the input
parameters.
• Using a high degree kernel function risks
artificial separation of the data.
• An iterative approach to increasing the
kernel power is advisable.
SVM Results
• Two classes are produced
– Positive Class: contains elements with expression
patterns similar to those in the positive examples in the
training set.
– Negative Class: contains all other members of the
input set.
• Each of these classes has elements that fall in two groups
– Those initially in the class (true positives and true
negatives)
– Those recruited into the class (false positives and false
negatives)
K-Nearest Neighbor Classification – KNNC - 1

• supervised classification scheme

• user specifies the number of expected classes

• a training set of vectors is provided as input

• user specifies classes of training vectors

• training set should contain example of each class


KNNC – 2 – pre-classification filters

• Prior to classification, variance filtering can optionally be applied to all vectors (training set + vectors to be classified). This will filter out genes with low variance across experiments. Note that this might filter out some genes in the training set as well.

• Correlation filtering can also be applied to the vectors to be classified. This would filter out those vectors in the set to be classified that are not significantly correlated with any gene in the training set.

• Significance for correlation filtering is determined by a permutation test.
KNNC – 3 - correlation filtering randomization test
1. The Pearson correlation coefficient r is computed between a given vector to be classified, and each member of the training set.

2. The maximum such r is called the rmax for that vector.

3. The vector is randomized a user-specified number of times, and each time, an rmax is calculated using the randomized vector (call it rmax*), just as in steps 1 and 2.

4. The proportion of times rmax* exceeds rmax over all randomizations is the p-value for that vector.

5. If the p-value for a vector < the user-specified p-value, that vector is retained for further analysis.

6. Steps 1-5 are repeated for every vector in the set to be classified.
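The randomization test in the steps above can be sketched for a single vector as follows. This is an illustrative sketch, not MeV's code: the function names, the number of randomizations, the fixed seed and the example vectors are assumptions.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector has no defined correlation
    return sxy / sqrt(sxx * syy)


def r_max(vector, training_set):
    """Steps 1 and 2: largest r between the vector and any training vector."""
    return max(pearson(vector, t) for t in training_set)


def correlation_p_value(vector, training_set, n_perm=1000, seed=0):
    """Steps 3 and 4: proportion of randomizations whose rmax* exceeds rmax."""
    rng = random.Random(seed)
    observed = r_max(vector, training_set)
    shuffled = list(vector)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if r_max(shuffled, training_set) > observed:
            exceed += 1
    return exceed / n_perm


# Hypothetical training vectors and one vector to be classified.
training_set = [
    [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5],
    [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
]
vector = [-0.9, -0.4, 0.1, 0.4, 1.1, 1.4, 2.1, 2.4]

p = correlation_p_value(vector, training_set)
keep = p < 0.05   # step 5, with an assumed user-specified threshold of 0.05
print(p, keep)
```

Step 6 would simply call `correlation_p_value` once for every vector in the set to be classified.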
KNNC – 4 - Classification parameters
• Let v be a vector that needs to be classified,
and T = {t1, t2, …, t10} be the set of training vectors.

• The user specifies the classes of each element of T. Say, there are 4 classes.

• The user also specifies the number of neighbors k. Say, k = 5.

[Figure: the training set T, with vectors t1 to t10 marked by class (Class 1 to Class 4), and the vector v to be classified]
KNNC –5 - Classification
[Figure: the training set T, with vectors t1 to t10 marked by class (Class 1 to Class 4), and the vector v to be classified]

• Suppose v’s 5 nearest neighbors in set T (by Euclidean distance) are t1, t4, t8, t2, and t5.

• Since class 1 is most frequently represented in v’s nearest neighbors, v is assigned to class 1.

• If there is a tie in frequency of classes represented among nearest neighbors, the vector remains unassigned.
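The classification rule, including the tie case, fits in a short sketch. The function name and the two-dimensional example vectors are hypothetical; real expression vectors would have one coordinate per experiment.

```python
from collections import Counter
from math import dist


def knn_classify(v, training, k):
    """training: list of (vector, class_label) pairs.
    Returns the majority class among the k nearest training vectors
    (Euclidean distance), or None if the top classes are tied."""
    neighbors = sorted(training, key=lambda item: dist(v, item[0]))[:k]
    counts = Counter(label for _, label in neighbors).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None   # tie: the vector remains unassigned
    return counts[0][0]


# Made-up training vectors in four classes.
training = [
    ((0.0, 0.0), 1), ((0.1, 0.2), 1), ((0.2, 0.1), 1),
    ((1.0, 1.0), 2), ((1.1, 0.9), 2),
    ((3.0, 3.0), 3), ((3.1, 2.9), 3),
    ((-2.0, 2.0), 4),
]

print(knn_classify((0.3, 0.3), training, k=5))   # three class-1 neighbors win
print(knn_classify((0.6, 0.6), training, k=4))   # two of class 1, two of class 2
```

Choosing an odd k reduces ties when there are only two classes, but with more classes ties remain possible, which is why the unassigned outcome is needed.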
EASE
(Expression Analysis Systematic Explorer)

EASE analysis identifies prevalent biological themes within gene clusters.

The significance of each identified theme is determined by its prevalence in the cluster and in the population of genes from which the cluster was created.
Diverse Biological Roles
Consider a population of genes representing a
diverse set of biological roles or themes shown
below as different colors.
Many algorithms can be applied to expression data to
partition genes based on expression profiles over
multiple conditions.

Many of these techniques work solely on expression data and disregard biological information.
Consider a particular cluster…
-What are some of the predominant biological themes represented in the cluster and how should significance be assigned to a discovered biological theme?
Example:

Population Size: 40 genes

Cluster size: 12 genes

10 genes, shown in green, have a common biological theme and 8 occur within the cluster.
Consider the Outcome
The frequency of the theme in the population is 10/40 = 25%

The frequency of the theme within the cluster is 8/12 = 67%

AND

* 80% of the genes related to the theme in the population ended up within the relatively small cluster.
Contingency Matrix
A 2x2 contingency matrix is typically used to capture the relationships between cluster membership and membership in a biological theme.

Contingency Matrix
                Cluster
                in    out
Theme   in       8      2
        out      4     26
Assigning Significance to the Findings
Fisher’s Exact Test permits us to determine if there are non-random associations between the two variables, expression-based cluster membership and membership in a particular biological theme.

                Cluster
                in    out
Theme   in       8      2
        out      4     26

p ≈ .0002

( 2x2 contingency matrix )


Hypergeometric Distribution
                 a      b     | a+b
                 c      d     | c+d
                 a+c    b+d   | n

The probability of any particular matrix occurring by random selection, given no association between the two variables, is given by the hypergeometric rule, where n = a + b + c + d.

$$P = \frac{\dfrac{(a+c)!}{a!\,c!}\cdot\dfrac{(b+d)!}{b!\,d!}}{\dfrac{n!}{(a+b)!\,(c+d)!}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{n!\,a!\,b!\,c!\,d!}$$
Probability Computation
For our matrix, [8 2; 4 26] (rows separated by semicolons), we are interested not only in the probability of getting exactly 8 annotation hits in the cluster but rather in the probability of having 8 or more hits. In this case the probabilities of each of the possible matrices are summed.

[8 2; 4 26]   [9 1; 3 27]   [10 0; 2 28]

.0002207 + 7.27x10^-6 + 7.79x10^-8 ≈ .000228
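The sum above can be reproduced with a few lines of Python. This sketch implements only the one-sided hypergeometric sum described on the slides; the function names are hypothetical, and it is not a reimplementation of the EASE software.

```python
from math import comb


def table_probability(a, b, c, d):
    """Hypergeometric probability of one 2x2 table with fixed margins.
    a: in theme and in cluster, b: in theme and out of cluster,
    c: out of theme and in cluster, d: out of theme and out of cluster."""
    n = a + b + c + d
    return comb(a + c, a) * comb(b + d, b) / comb(n, a + b)


def one_sided_p_value(a, b, c, d):
    """Probability of a or more theme hits in the cluster, margins fixed."""
    total = 0.0
    while b >= 0 and c >= 0:
        total += table_probability(a, b, c, d)
        # Move one theme gene into the cluster, keeping all margins fixed.
        a, b, c, d = a + 1, b - 1, c - 1, d + 1
    return total


print(table_probability(8, 2, 4, 26))    # exactly 8 hits
print(table_probability(9, 1, 3, 27))    # exactly 9 hits
print(table_probability(10, 0, 2, 28))   # exactly 10 hits
print(one_sided_p_value(8, 2, 4, 26))    # 8 or more hits
```

The loop stops once all 10 theme genes are inside the cluster, since no table with more hits is possible for these margins.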


EASE Results
• Consider all of the Results

EASE reports all themes represented in a cluster, and although some themes may not meet statistical significance, it may still be important to note that particular biological roles or pathways are represented in the cluster.

• Independently Verify Roles

Once found, biological themes should be independently verified using annotation resources.
Basic EASE Requirements

Annotation keys: identifiers for each gene must be loaded with the data into MeV.

EASE file system: EASE uses a file system to link annotation keys to biological themes.
EASE File System
EASE
(Expression Analysis Systematic Explorer)

Hosack et al. Identifying biological themes within lists of genes with EASE. Genome Biol., 4:R70-R70.8, 2003.

NIAID graciously provided the foundation Java classes upon which the MeV version was built.
Coming Attractions

• Algorithm scripting
• Discriminant analysis
• Chromosome Viewers
etc.
