
Computational Genomics

Lecture 10
Hidden Markov Models (HMMs)

Ydo Wexler & Dan Geiger (Technion) and Nir Friedman (HU)

Modified by Benny Chor (TAU)

Outline

Finite, or Discrete, Markov Models


Hidden Markov Models
Three major questions:
Q1: Computing the probability of a given observation.
A1: The Forward-Backward (Baum-Welch) dynamic programming algorithm.
Q2: Computing the most probable sequence of states, given an observation.
A2: Viterbi's dynamic programming algorithm.
Q3: Learning the best model, given an observation.
A3: Expectation Maximization (EM): a heuristic.

Markov Models

A discrete (finite) system:


N distinct states.
Begins (at time t=1) in some initial state(s).
At each time step (t = 1, 2, …) the system moves from the current state to the next state (possibly the same as the current state) according to transition probabilities associated with the current state.
This kind of system is called a finite, or discrete, Markov model,
after Andrei Andreyevich Markov (1856-1922).

Outline
Markov Chains (Markov Models)
Hidden Markov Chains (HMMs)
Algorithmic Questions
Biological Relevance

Discrete Markov Model: Example

A discrete Markov model with 5 states.
Each aij represents the probability of moving from state i to state j.
The aij are given in a matrix A = {aij}.
The probability to start in a given state i is πi; the vector π represents these start probabilities.

Markov Property
Markov Property: The state of the system at time t+1
depends only on the state of the system at time t

P[X_{t+1} = x_{t+1} | X_t = x_t, X_{t-1} = x_{t-1}, …, X_1 = x_1, X_0 = x_0] = P[X_{t+1} = x_{t+1} | X_t = x_t]

(Diagram: a chain X_{t=1} → X_{t=2} → X_{t=3} → X_{t=4} → X_{t=5}, each state depending only on its predecessor.)

Markov Chains
Stationarity Assumption
Probabilities independent of t when process is stationary
So,

for all t:  P[X_{t+1} = x_j | X_t = x_i] = p_ij

This means that if the system is in state i, the probability that it will next move to state j is p_ij, no matter what the value of t is.

Simple Minded Weather Example


raining today → rain tomorrow:      p_rr = 0.4
raining today → no rain tomorrow:   p_rn = 0.6
no rain today → rain tomorrow:      p_nr = 0.2
no rain today → no rain tomorrow:   p_nn = 0.8

Simple Minded Weather Example


Transition matrix for our example

P = | 0.4  0.6 |
    | 0.2  0.8 |
Note that rows sum to 1
Such a matrix is called a Stochastic Matrix
If both the rows and the columns of a matrix sum to 1, we have a Doubly Stochastic Matrix.
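
A minimal numeric sketch of the one-step evolution (not part of the original slides; Python with numpy assumed), using the weather transition matrix above:

```python
import numpy as np

# Transition matrix of the weather example: row/column 0 = rain, 1 = no rain.
P = np.array([[0.4, 0.6],
              [0.2, 0.8]])

# Each row of a stochastic matrix sums to 1.
assert np.allclose(P.sum(axis=1), 1.0)

# If today's distribution over {rain, no rain} is [0.3, 0.7] (an illustrative
# choice), tomorrow's distribution is the row vector multiplied by P.
today = np.array([0.3, 0.7])
tomorrow = today @ P
print(tomorrow)  # [0.26 0.74]
```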

Coke vs. Pepsi (a central cultural dilemma)


Given that a person's last cola purchase was Coke, there is a 90% chance that her next cola purchase will also be Coke.
If that person's last cola purchase was Pepsi, there is an 80% chance that her next cola purchase will also be Pepsi.

(Diagram: two-state chain, coke → coke 0.9, coke → pepsi 0.1, pepsi → pepsi 0.8, pepsi → coke 0.2.)

Coke vs. Pepsi


Given that a person is currently a Pepsi purchaser,
what is the probability that she will purchase Coke
two purchases from now?
The transition matrix (corresponding to one purchase ahead) is:

P = | 0.9  0.1 |
    | 0.2  0.8 |

Two purchases ahead:

P^2 = | 0.9  0.1 | | 0.9  0.1 | = | 0.83  0.17 |
      | 0.2  0.8 | | 0.2  0.8 |   | 0.34  0.66 |

so the answer is P^2[pepsi, coke] = 0.34.

Coke vs. Pepsi


Given that a person is currently a Coke
drinker, what is the probability that she will
purchase Pepsi three purchases from now?

P^3 = P · P^2 = | 0.9  0.1 | | 0.83  0.17 | = | 0.781  0.219 |
                | 0.2  0.8 | | 0.34  0.66 |   | 0.438  0.562 |

so the answer is P^3[coke, pepsi] = 0.219.
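
These matrix powers are easy to verify numerically; a small sketch (again not part of the slides, numpy assumed) reproducing the P^2 and P^3 values above:

```python
import numpy as np

# Coke/Pepsi chain: state 0 = Coke, state 1 = Pepsi.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

P2 = np.linalg.matrix_power(P, 2)   # two purchases ahead
P3 = np.linalg.matrix_power(P, 3)   # three purchases ahead

print(P2)        # [[0.83 0.17] [0.34 0.66]]
print(P3)        # [[0.781 0.219] [0.438 0.562]]
print(P2[1, 0])  # P(Coke two purchases from now | Pepsi now)   = 0.34
print(P3[0, 1])  # P(Pepsi three purchases from now | Coke now) = 0.219
```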

Coke vs. Pepsi


Assume each person makes one cola purchase
per week. Suppose 60% of all people now drink
Coke, and 40% drink Pepsi.
What fraction of people will be drinking Coke
three weeks from now?
Let (Q0,Q1)=(0.6,0.4) be the initial probabilities.
We will regard Coke as 0 and Pepsi as 1
We want to find P(X3=0)

P = | 0.9  0.1 |
    | 0.2  0.8 |

P(X3 = 0) = Σ_i Qi · p_i0^(3) = Q0 · p_00^(3) + Q1 · p_10^(3) = 0.6 · 0.781 + 0.4 · 0.438 = 0.6438
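
The same computation in code, continuing the numpy sketch above (illustrative only, not from the slides):

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
Q = np.array([0.6, 0.4])            # initial distribution: 60% Coke, 40% Pepsi

Q3 = Q @ np.linalg.matrix_power(P, 3)
print(Q3[0])                        # P(X3 = Coke) = 0.6*0.781 + 0.4*0.438 = 0.6438
```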

Equilibrium (Stationary) Distribution


Suppose 60% of all people now drink Coke, and 40%
drink Pepsi. What fraction will be drinking Coke
10, 100, 1000, 10000 weeks from now?
For each week, the probability is well defined. But does it converge to some equilibrium distribution [p0, p1]?

If it does, then the equations 0.9·p0 + 0.2·p1 = p0 and 0.1·p0 + 0.8·p1 = p1 must hold, yielding p0 = 2/3, p1 = 1/3.

Equilibrium (Stationary) Distribution


Whether or not there is a stationary distribution, and
whether or not it is unique if it does exist, are determined
by certain properties of the process. Irreducible means that
every state is accessible from every other state. Aperiodic
means that there exists at least one state for which the
transition from that state to itself is possible. Positive
recurrent means that the expected return time is finite for
every state.

http://en.wikipedia.org/wiki/Markov_chain

Equilibrium (Stationary) Distribution

If the Markov chain is positive recurrent, there


exists a stationary distribution. If it is positive
recurrent and irreducible, there exists a unique
stationary distribution, and furthermore the process
constructed by taking the stationary distribution as
the initial distribution is ergodic. Then the average
of a function f over samples of the Markov chain is
equal to the average with respect to the stationary
distribution.
http://en.wikipedia.org/wiki/Markov_chain

Equilibrium (Stationary) Distribution


Writing P for the transition matrix, a stationary distribution is a vector π which satisfies the equation π P = π.
In this case, the stationary distribution π is a left eigenvector of the transition matrix, associated with the eigenvalue 1.

http://en.wikipedia.org/wiki/Markov_chain
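
One way to find π numerically is to take the eigenvector of P transposed for eigenvalue 1 and normalize it; a sketch for the Coke/Pepsi chain (numpy assumed, not from the slides):

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# pi P = pi  <=>  P^T pi^T = pi^T, so pi is the eigenvector of P^T for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigvals - 1.0))
pi = np.real(eigvecs[:, idx])
pi = pi / pi.sum()                  # normalize so the entries sum to 1

print(pi)                           # [0.6667 0.3333], i.e. [2/3, 1/3]
```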

Discrete Markov Model - Example

States: Rainy (1), Cloudy (2), Sunny (3)

Matrix A

Problem: given that the weather on day 1 (t=1) is sunny (3), what is the probability of the observation O:

Discrete Markov Model Example (cont.)

The answer is -

Types of Models

Ergodic model:
Strongly connected: there is a directed path with positive transition probabilities from each state i to each state j (but not necessarily a complete directed graph).

Third Example: A Friendly Gambler


The game starts with 10$ in the gambler's pocket.
At each round, one of the following happens:
the gambler wins 1$ with probability p, or
the gambler loses 1$ with probability 1-p.
The game ends when the gambler goes broke (0$ left), or accumulates a capital of 100$ (including the initial capital).
Both 0$ and 100$ are absorbing states.
(Diagram: a chain on states 0, 1, 2, …, N-1, N; from each interior state the gambler moves up with probability p and down with probability 1-p; the game starts at 10$.)

Fourth Example: A Friendly Gambler


Irreducible means that every state is accessible from every other
state. Aperiodic means that there exists at least one state for
which the transition from that state to itself is possible. Positive
recurrent means that the expected return time is finite for every
state. If the Markov chain is positive recurrent, there exists a
stationary distribution.
Is the gambler's chain positive recurrent? Does it have a stationary distribution (independent of the initial distribution)?

Let Us Change Gear


Enough with these simple Markov chains.
Our next destination: Hidden Markov chains.


(Diagram: a fair/loaded coin model. From Start, the chain enters the Fair or the Loaded state with probability 1/2 each. The Fair coin emits head or tail with probability 1/2 each; the Loaded coin emits head with probability 3/4 and tail with probability 1/4. Each state stays put with probability 0.9 and switches to the other state with probability 0.1.)

Hidden Markov Models


(probabilistic finite state automata)
Often we face scenarios where states cannot be
directly observed.
We need an extension: Hidden Markov Models
(Diagram: a four-state automaton with state transition probabilities a_ij between the states and observation probabilities b_ik from each state to the observed phenomenon.)

aij are state transition probabilities.


bik are observation (output) probabilities.
b11 + b12 + b13 + b14 = 1,
b21 + b22 + b23 + b24 = 1, etc.

Hidden Markov Models - HMM


(Diagram: the hidden variables H1 → H2 → … → Hi → … → HL-1 → HL form a Markov chain; each hidden state Hi emits an observed data point Xi.)

Example: Dishonest Casino

Actually, what is hidden in this model?



Coin-Tossing Example
(Diagram: the fair/loaded coin model drawn as an HMM. The hidden chain H1 → H2 → … → HL records Fair/Loaded for each of the L tosses; each toss emits an observed Head/Tail symbol Xi. Transition and emission probabilities are as in the coin diagram above.)

Loaded Coin Example


(Same fair/loaded coin HMM diagram as on the previous slide.)

Q1: What is the probability of the observed sequence of outcomes (e.g. HHHTHTTHHT), given the model?

HMMs Question I

Given an observation sequence O = (O1 O2 O3 … OL), and a model M = {A, B, π}, how do we efficiently compute P(O|M), the probability that the given model M produces the observation O in a run of length L?

This probability can be viewed as a measure of the quality of the model M. Viewed this way, it enables discrimination/selection among alternative models.

C-G Islands Example


C-G islands: DNA parts which are very rich in C and G
(Diagram: a two-regime model with a 'Regular DNA' state and a 'C-G island' state connected by 'change' transitions; within each regime the four nucleotides are emitted with different probabilities, C and G being much more likely inside the island.)

Example: CpG islands


In the human genome, CG dinucleotides are relatively rare.
CG pairs undergo a process called methylation that modifies the C nucleotide.
A methylated C mutates (with relatively high chance) to a T.
Promoter regions are CG-rich.
These regions are not methylated, and thus mutate less often.
These are called CpG islands.

CpG Islands
We construct Markov chains for CpG-rich and CpG-poor regions.
Using maximum likelihood estimates from 60K nucleotides, we get two models.

Ratio Test for CpG islands

Given a sequence X1,…,Xn we compute the likelihood ratio

S(X1,…,Xn) = log [ P(X1,…,Xn | +) / P(X1,…,Xn | -) ] = Σ_i log [ A+_{Xi,Xi+1} / A-_{Xi,Xi+1} ]

where A+ and A- are the transition matrices of the CpG-island (+) and background (-) models.
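
A small Python sketch of this score. The function name and the dictionaries A_plus / A_minus (mapping dinucleotides such as 'CG' to transition probabilities of the + and - models) are illustrative assumptions, not part of the slides:

```python
import math

def cpg_log_ratio(seq, A_plus, A_minus):
    """S(X) = sum_i log( A+[X_i X_{i+1}] / A-[X_i X_{i+1}] ).

    A positive score favours the CpG-island (+) model, a negative score
    favours the background (-) model.
    """
    score = 0.0
    for a, b in zip(seq, seq[1:]):
        score += math.log(A_plus[a + b] / A_minus[a + b])
    return score
```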

Empirical Evaluation

Finding CpG islands


Simple-minded approach (a sliding-window sketch follows below):
Pick a window of size N (N = 100, for example).
Compute the log-ratio for the sequence in the window, and classify based on that.
Problems:
How do we select N?
What do we do when the window intersects the boundary of a CpG island?
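
Reusing the hypothetical cpg_log_ratio helper above, the simple-minded window classifier could look like this sketch (N is the arbitrary window size discussed above):

```python
def window_scores(seq, A_plus, A_minus, N=100):
    """Score every length-N window of seq with the log-likelihood ratio.

    Windows with a positive score would be classified as CpG-island-like.
    """
    return [cpg_log_ratio(seq[i:i + N], A_plus, A_minus)
            for i in range(len(seq) - N + 1)]
```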

Alternative Approach

Build a model that includes + states and - states.

A state remembers the last nucleotide and the type of region.

A transition from a - state to a + state marks the start of a CpG island.

A Different C-G Islands Model


(Diagram: an HMM for C-G islands. The hidden chain H1 → H2 → … → HL tracks whether each position lies inside a C-G island, with 'change' transitions between the island and non-island regimes; each position emits an observed nucleotide Xi ∈ {A, C, G, T}.)

HMM Recognition (question I)

For a given model M = {A, B, π} and a given state sequence Q1 Q2 Q3 … QL, the probability of an observation sequence O1 O2 O3 … OL is

P(O|Q,M) = b_{Q1,O1} · b_{Q2,O2} · b_{Q3,O3} ··· b_{QL,OL}

For a given hidden Markov model M = {A, B, π}, the probability of the state sequence Q1 Q2 Q3 … QL is (the initial probability of Q1 is taken to be π_{Q1})

P(Q|M) = π_{Q1} · a_{Q1,Q2} · a_{Q2,Q3} · a_{Q3,Q4} ··· a_{QL-1,QL}

So, for a given HMM M, the probability of an observation sequence O1 O2 O3 … OL is obtained by summing over all possible state sequences.

HMM Recognition (cont.)


P(O|M) = Σ_Q P(O|Q,M) · P(Q|M) = Σ_{Q1,…,QL} π_{Q1} · b_{Q1,O1} · a_{Q1,Q2} · b_{Q2,O2} ··· a_{QL-1,QL} · b_{QL,OL}

Requires summing over exponentially many paths


Can this be made more efficient?

HMM Recognition (cont.)

Why isn't it efficient? It costs O(2L · Q^L) operations.

For a given state sequence of length L we have about 2L calculations:
P(Q|M) = π_{Q1} · a_{Q1,Q2} · a_{Q2,Q3} ··· a_{QL-1,QL}
P(O|Q,M) = b_{Q1,O1} · b_{Q2,O2} ··· b_{QL,OL}

There are Q^L possible state sequences.
So, if Q = 5 and L = 100, the algorithm requires on the order of 200 · 5^100 computations.
We can use the forward-backward (F-B) algorithm to do things efficiently.

The Forward-Backward Algorithm

A whiteboard presentation; a code sketch of the forward pass follows.
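
For concreteness, here is a compact Python/numpy sketch of the forward pass (an illustration, not the whiteboard derivation itself), with the parameters read off the fair/loaded coin diagram earlier. It computes P(O|M) in O(L·Q^2) time instead of enumerating Q^L paths:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: P(O | M) for an HMM M = (A, B, pi).

    pi  : (Q,)    initial state probabilities
    A   : (Q, Q)  transition probabilities, A[i, j] = P(next = j | current = i)
    B   : (Q, S)  emission probabilities,   B[i, o] = P(symbol o | state i)
    obs : list of observation symbol indices, length L
    """
    f = pi * B[:, obs[0]]                 # f_k(1) = pi_k * b_k(O_1)
    for o in obs[1:]:
        f = (f @ A) * B[:, o]             # f_l(t) = (sum_k f_k(t-1) * a_kl) * b_l(O_t)
    return f.sum()                        # P(O | M) = sum_k f_k(L)

# Fair/loaded coin model from the earlier diagram (state 0 = fair, 1 = loaded;
# symbol 0 = head, 1 = tail).
pi = np.array([0.5, 0.5])
A  = np.array([[0.9, 0.1],
               [0.1, 0.9]])
B  = np.array([[0.50, 0.50],
               [0.75, 0.25]])
obs = [0, 0, 0, 1, 0, 1, 1, 0, 0, 1]      # HHHTHTTHHT
print(forward(pi, A, B, obs))
```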

The F-B Algorithm (cont.)


Option 1) The likelihood is measured using any sequence of states of length T.
This is known as the Any Path Method.
Option 2) We can choose an HMM by the probability generated using the best possible sequence of states.
We'll refer to this method as the Best Path Method.

HMM Question II (Harder)

Given an observation sequence O = (O1 O2 … OT), and a model M = {A, B, π}, how do we efficiently compute the most probable sequence(s) of states Q?
Namely, the sequence of states Q = (Q1 Q2 … QT) which maximizes P(O|Q,M), the probability that the given model M produces the given observation O when it goes through the specific sequence of states Q.
Recall that given a model M, a sequence of observations O, and a sequence of states Q, we can efficiently compute P(O|Q,M) (we should watch out for numeric underflows).

Most Probable States Sequence (Q. II)


Idea:
If we know the identity of Qi, then the most probable sequence of states on positions i+1,…,n does not depend on observations before time i.

A whiteboard presentation of Viterbi's algorithm; a code sketch follows.
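
A matching sketch of Viterbi's algorithm, using the same parameterization as the forward sketch above and working in log space to avoid the numeric underflow mentioned earlier (illustrative, not the whiteboard presentation):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most probable state path for obs under the HMM (A, B, pi)."""
    logA, logB = np.log(A), np.log(B)
    L, Q = len(obs), len(pi)
    v = np.log(pi) + logB[:, obs[0]]          # v_k(1)
    ptr = np.zeros((L, Q), dtype=int)         # back-pointers
    for t in range(1, L):
        scores = v[:, None] + logA            # scores[k, l] = v_k(t-1) + log a_kl
        ptr[t] = scores.argmax(axis=0)        # best predecessor of each state l
        v = scores.max(axis=0) + logB[:, obs[t]]
    # Trace the best path back from the best final state.
    path = [int(v.argmax())]
    for t in range(L - 1, 0, -1):
        path.append(int(ptr[t, path[-1]]))
    return list(reversed(path)), float(v.max())

# Example (parameters from the forward sketch above):
# path, logp = viterbi(pi, A, B, obs)
```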

Dishonest Casino (again)

Computing the posterior probability of 'fair' at each point in a long sequence:

HMM Question III (Hardest)

Given an observation sequence O = (O1 O2 … OL), and a class of models, each of the form M = {A, B, π}, which specific model best explains the observations?
A solution to Question I enables the efficient computation of P(O|M) (the probability that a specific model M produces the observation O).
Question III can be viewed as a learning problem: we want to use the sequence of observations in order to train an HMM and learn the optimal underlying model parameters (transition and output probabilities).

Learning
Given a sequence x1,…,xn together with its hidden states h1,…,hn, how do we learn Akl and Bka?

We want to find parameters that maximize the likelihood P(x1,…,xn, h1,…,hn).

We simply count:
Nkl - number of times hi=k & hi+1=l

Nka - number of times hi=k & xi = a


Akl = Nkl / Σ_{l'} Nkl'
Bka = Nka / Σ_{a'} Nka'
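
A small sketch of this counting in Python; the function name and the assumption that states and symbols are given as iterables are illustrative, not from the slides:

```python
from collections import Counter

def supervised_mle(xs, hs, states, symbols):
    """ML estimates of A_kl and B_ka from an observed sequence xs with known
    hidden states hs, obtained by normalizing the counts N_kl and N_ka."""
    Nkl = Counter(zip(hs, hs[1:]))            # times h_i = k and h_{i+1} = l
    Nka = Counter(zip(hs, xs))                # times h_i = k and x_i = a
    # max(1, ...) guards against division by zero for states never visited.
    A = {k: {l: Nkl[k, l] / max(1, sum(Nkl[k, l2] for l2 in states))
             for l in states} for k in states}
    B = {k: {a: Nka[k, a] / max(1, sum(Nka[k, a2] for a2 in symbols))
             for a in symbols} for k in states}
    return A, B

# Example: supervised_mle("HTHH", "FFLL", states="FL", symbols="HT")
```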

Learning
Given only the sequence x1,…,xn, how do we learn Akl and Bka?

We want to find parameters that maximize the likelihood P(x1,…,xn).

Problem:
Counts are inaccessible since we do not observe hi

If we have Akl and Bka we can compute

P(Hi = k, Hi+1 = l | x1,…,xn)
  = P(Hi = k, Hi+1 = l, x1,…,xn) / P(x1,…,xn)
  = P(Hi = k, x1,…,xi) · Akl · P(xi+1 | Hi+1 = l) · P(xi+2,…,xn | Hi+1 = l) / P(x1,…,xn)
  = fk(i) · Akl · B_{l,xi+1} · bl(i+1) / P(x1,…,xn)

where fk(i) and bl(i+1) are the forward and backward messages.

Expected Counts

We can compute the expected number of times hi = k & hi+1 = l:

E[Nkl] = Σ_i P(Hi = k, Hi+1 = l | x1,…,xn)

Similarly:

E[Nka] = Σ_{i : xi = a} P(Hi = k | x1,…,xn)

Expectation Maximization (EM)

Choose initial Akl and Bka.

E-step:
Compute expected counts E[Nkl], E[Nka].
M-step:
Re-estimate:

A'kl = E[Nkl] / Σ_{l'} E[Nkl']
B'ka = E[Nka] / Σ_{a'} E[Nka']

Reiterate. (A code sketch of one iteration follows.)
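
A sketch of one such E/M iteration for a single sequence, reusing the array conventions of the forward sketch earlier. It is illustrative only: unscaled probabilities are used for clarity (a real implementation would rescale or work in log space), and re-estimation of the initial distribution is omitted:

```python
import numpy as np

def em_iteration(pi, A, B, obs):
    """One EM (Baum-Welch style) iteration: E-step via forward/backward
    messages, M-step by normalizing the expected counts."""
    L, Q = len(obs), len(pi)
    f = np.zeros((L, Q))                       # forward messages f_k(i)
    b = np.zeros((L, Q))                       # backward messages b_k(i)
    f[0] = pi * B[:, obs[0]]
    for t in range(1, L):
        f[t] = (f[t - 1] @ A) * B[:, obs[t]]
    b[L - 1] = 1.0
    for t in range(L - 2, -1, -1):
        b[t] = A @ (B[:, obs[t + 1]] * b[t + 1])
    px = f[L - 1].sum()                        # P(x_1, ..., x_n)

    # E-step: E[N_kl] = sum_i f_k(i) * A_kl * B_{l,x_{i+1}} * b_l(i+1) / P(x).
    ENkl = np.zeros((Q, Q))
    for t in range(L - 1):
        ENkl += np.outer(f[t], B[:, obs[t + 1]] * b[t + 1]) * A / px
    # E[N_ka] = sum over positions i with x_i = a of P(H_i = k | x).
    post = f * b / px                          # posterior P(H_i = k | x)
    ENka = np.zeros_like(B)
    for t, o in enumerate(obs):
        ENka[:, o] += post[t]

    # M-step: re-estimate by normalizing the expected counts.
    A_new = ENkl / ENkl.sum(axis=1, keepdims=True)
    B_new = ENka / ENka.sum(axis=1, keepdims=True)
    return A_new, B_new
```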

EM - basic properties

P(x1,…,xn : A'kl, B'ka) ≥ P(x1,…,xn : Akl, Bka)

The likelihood grows in each iteration.
If P(x1,…,xn : A'kl, B'ka) = P(x1,…,xn : Akl, Bka), then Akl, Bka is a stationary point of the likelihood:
either a local maximum, a local minimum, or a saddle point.

Complexity of E-step
Compute forward and backward messages
Time & Space complexity: O(nL)
Accumulate expected counts
Time complexity O(nL2)
Space complexity O(L2)

EM - problems
Local maxima:
Learning can get stuck in local maxima.
Sensitive to initialization.
Requires some method for escaping such maxima.
Choosing L:
We often do not know how many hidden values we should have or can learn.

Communication Example
