
National Taiwan University

Department of Computer Science


and Information Engineering

Linkage Disequilibrium and Recent Studies of Haplotypes and SNPs

Speaker: Yao-Ting Huang


Advisor: Kun-Mao Chao

Algorithms and Computational Biology Lab.


Dept. of Computer Science & Information Engineering
National Taiwan University

Variations in DNA Sequence


• Variants in the human genome include
  - single nucleotide polymorphisms (SNPs),
  - deletions (e.g., loss of heterozygosity), and
  - insertions.
• SNPs have become the preferred DNA markers for association studies because of
  - their high abundance (e.g., ~1 SNP per 1,000 base pairs), and
  - high-throughput genotyping technology, which allows large SNP databases to be built (e.g., the International HapMap Project).

SNPs Arise from Mutations


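[Figure: mutations accumulating over time, from a common ancestor to the present, and the resulting variations observed in a population. Labels: "Mutations over time", "Variations observed in a population", "Common Disease Mutation", "Ancestor", "time", "present".]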

Haplotype
• A haplotype is a set of closely linked SNPs located on one chromosome.

DNA sequences       Haplotype (SNP 1, SNP 2, SNP 3)
GATATTCGTACGGA-T    AG-
GATGTTCGTACTGAAT    GTA
GATATTCGTACGGA-T    AG-
GATATTCGTACGGAAT    AGA
GATGTTCGTACTGAAT    GTA
GATGTTCGTACTGAAT    GTA

Haplotype frequencies: AG- 2/6, GTA 3/6, AGA 1/6
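The haplotype frequencies on this slide can be reproduced with a small Python sketch. It is not part of the original slides. It assumes the three SNPs are exactly the columns where the six aligned sequences differ (0-based positions 3, 11, and 14), and the function names are illustrative.

```python
from collections import Counter
from fractions import Fraction

# The six aligned DNA sequences shown on the slide.
sequences = [
    "GATATTCGTACGGA-T",
    "GATGTTCGTACTGAAT",
    "GATATTCGTACGGA-T",
    "GATATTCGTACGGAAT",
    "GATGTTCGTACTGAAT",
    "GATGTTCGTACTGAAT",
]

def variable_sites(seqs):
    """Return the 0-based columns where the aligned sequences differ."""
    return [i for i in range(len(seqs[0])) if len({s[i] for s in seqs}) > 1]

def haplotype_frequencies(seqs):
    """Read off each sequence's alleles at the variable sites and count them."""
    sites = variable_sites(seqs)
    haplotypes = ["".join(s[i] for i in sites) for s in seqs]
    counts = Counter(haplotypes)
    return {hap: Fraction(n, len(seqs)) for hap, n in counts.items()}

sites = variable_sites(sequences)
freqs = haplotype_frequencies(sequences)
for hap, freq in sorted(freqs.items()):
    print(hap, freq)
```

Fractions are printed in lowest terms, so 2/6 appears as 1/3 and 3/6 as 1/2.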

Factors Affecting Haplotypes


• Chromosomal recombination breaks up and reorganizes haplotypes.
• If SNPs are closely linked, they tend to be inherited together as a haplotype.
  - Recombination is less likely to occur between them.
• Linkage disequilibrium (LD) is a measure of the non-random association of alleles at linked loci.


Linkage Disequilibrium
• Consider only two SNPs, one with alleles A/a and one with alleles B/b.
• There are 4 possible haplotypes: AB, Ab, aB, and ab.
• The probabilities of each haplotype:

                     SNP 1
                 B          b          Total
SNP 2   A        $P_{AB}$   $P_{Ab}$   $P_A$
        a        $P_{aB}$   $P_{ab}$   $P_a$
        Total    $P_B$      $P_b$      1.0

Linkage Equilibrium
  - $P_{AB} = P_A P_B$
  - $P_{Ab} = P_A P_b = P_A (1 - P_B)$
  - $P_{aB} = P_a P_B = (1 - P_A) P_B$
  - $P_{ab} = P_a P_b = (1 - P_A)(1 - P_B)$

                     SNP 1
                 B          b          Total
SNP 2   A        $P_{AB}$   $P_{Ab}$   $P_A$
        a        $P_{aB}$   $P_{ab}$   $P_a$
        Total    $P_B$      $P_b$      1.0
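As an illustration of the four equilibrium identities, the following Python sketch (not from the original slides; the function name is hypothetical) fills in the haplotype table from the two allele frequencies alone, which is only valid when the SNPs are in linkage equilibrium.

```python
from fractions import Fraction

def equilibrium_haplotype_freqs(p_A, p_B):
    """Expected haplotype frequencies when the two SNPs are independent."""
    p_a, p_b = 1 - p_A, 1 - p_B
    return {
        "AB": p_A * p_B,
        "Ab": p_A * p_b,
        "aB": p_a * p_B,
        "ab": p_a * p_b,
    }

table = equilibrium_haplotype_freqs(Fraction(1, 2), Fraction(1, 3))
for hap, p in table.items():
    print(hap, p)
```

The row and column totals of the resulting table recover $P_A$, $P_a$, $P_B$, and $P_b$, and the four entries sum to 1.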

Linkage Disequilibrium
  - $P_{AB} \ne P_A P_B$
  - $P_{Ab} \ne P_A P_b = P_A (1 - P_B)$
  - $P_{aB} \ne P_a P_B = (1 - P_A) P_B$
  - $P_{ab} \ne P_a P_b = (1 - P_A)(1 - P_B)$

                     SNP 1
                 B          b          Total
SNP 2   A        $P_{AB}$   $P_{Ab}$   $P_A$
        a        $P_{aB}$   $P_{ab}$   $P_a$
        Total    $P_B$      $P_b$      1.0

An Example of Linkage Disequilibrium

Before mutation                        After mutation
-- A -- -- -- G -- -- --               -- A -- -- -- G -- -- --
-- C -- -- -- G -- -- --               -- C -- -- -- G -- -- --
                                       -- C -- -- -- C -- -- --
SNP 1: $P_A = 1/2$, $P_C = 1/2$        SNP 1: $P_A = 1/3$, $P_C = 2/3$
SNP 2: $P_G = 1$                       SNP 2: $P_G = 2/3$, $P_C = 1/3$

• After the mutation there are only three haplotypes: AG, CG, and CC.
• There is no AC haplotype, i.e., $P_{AC} = 0$.
• However, $P_A P_C = 1/3 \times 1/3 = 1/9$ (A at the first SNP, C at the second), thus $P_A P_C \ne P_{AC}$.
• These two SNPs are in linkage disequilibrium.
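The numbers in this example can be checked with a short Python sketch. It is not part of the original slides. It assumes each haplotype is written as a two-letter string (first SNP, then second SNP), and the helper name is illustrative.

```python
from collections import Counter
from fractions import Fraction

def frequencies(haplotypes):
    """Return haplotype, first-SNP allele, and second-SNP allele frequencies."""
    n = len(haplotypes)
    hap = {h: Fraction(c, n) for h, c in Counter(haplotypes).items()}
    snp1 = {a: Fraction(c, n) for a, c in Counter(h[0] for h in haplotypes).items()}
    snp2 = {a: Fraction(c, n) for a, c in Counter(h[1] for h in haplotypes).items()}
    return hap, snp1, snp2

# The three haplotypes present after the mutation.
hap, snp1, snp2 = frequencies(["AG", "CG", "CC"])

observed_AC = hap.get("AC", Fraction(0))   # haplotype AC never occurs
expected_AC = snp1["A"] * snp2["C"]        # product of the allele frequencies

print("observed P_AC =", observed_AC)
print("expected P_A * P_C =", expected_AC)
print("linkage disequilibrium:", observed_AC != expected_AC)
```

The observed frequency of AC differs from the product of the allele frequencies, which is the definition of linkage disequilibrium used on the slide.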

An Example of Linkage Equilibrium

Before recombination                   After recombination
-- A -- -- -- G -- -- --               -- A -- -- -- G -- -- --
-- C -- -- -- G -- -- --               -- C -- -- -- G -- -- --
-- C -- -- -- C -- -- --               -- C -- -- -- C -- -- --
                                       -- A -- -- -- C -- -- --
                                       SNP 1: $P_A = 1/2$, $P_C = 1/2$
                                       SNP 2: $P_G = 1/2$, $P_C = 1/2$

• After recombination,
  - $P_{AG} = P_A P_G = 1/4$,
  - $P_{CG} = P_C P_G = 1/4$,
  - $P_{CC} = P_C P_C = 1/4$, and
  - $P_{AC} = P_A P_C = 1/4$.
• Thus, these two SNPs are in linkage equilibrium.

D Coefficient
• We can measure the non-randomness of two loci by means of a deviation, D, defined as follows:
  - $D = P_{AB} - P_A P_B$, or equivalently $D = P_{AB} P_{ab} - P_{Ab} P_{aB}$
• $P_{AB} = P_A P_B + D$
• $P_{Ab} = P_A (1 - P_B) - D$
• $P_{aB} = (1 - P_A) P_B - D$
• $P_{ab} = (1 - P_A)(1 - P_B) + D$
• These two SNPs are in linkage equilibrium iff D = 0.
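A minimal Python sketch of the D coefficient follows. It is not part of the original slides. The function name is hypothetical, and the example maps A/a to the alleles A/C of the first SNP and B/b to the alleles C/G of the second SNP in the two worked examples.

```python
from fractions import Fraction

def d_coefficient(p_AB, p_Ab, p_aB, p_ab):
    """Compute D from the four haplotype frequencies using both definitions."""
    p_A = p_AB + p_Ab          # marginal frequency of allele A
    p_B = p_AB + p_aB          # marginal frequency of allele B
    d = p_AB - p_A * p_B
    # The cross-product form gives the same value when the frequencies sum to 1.
    assert d == p_AB * p_ab - p_Ab * p_aB
    return d

third, quarter = Fraction(1, 3), Fraction(1, 4)

# Disequilibrium example: haplotypes AG, CG, CC each with frequency 1/3, AC absent.
d_ld = d_coefficient(p_AB=Fraction(0), p_Ab=third, p_aB=third, p_ab=third)

# Equilibrium example: all four haplotypes with frequency 1/4.
d_le = d_coefficient(quarter, quarter, quarter, quarter)

print("D (disequilibrium example) =", d_ld)
print("D (equilibrium example)    =", d_le)
```

Exact fractions are used so that the equality check between the two definitions is not affected by floating-point rounding.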


Standardization of D Coefficient
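• The D coefficient can be standardized in many ways.
• D' = D / D_max, where D_max is the maximum possible absolute value of D:

$$
D' =
\begin{cases}
\dfrac{D}{\min(P_A P_B,\; P_a P_b)}, & \text{if } D < 0; \\[1ex]
\dfrac{D}{\min(P_A P_b,\; P_a P_B)}, & \text{if } D > 0.
\end{cases}
$$

• The bounds on D follow from requiring every haplotype probability to be non-negative:
  - $P_A P_B + D = P_{AB} \ge 0 \;\Rightarrow\; D \ge -P_A P_B$
  - $P_a P_b + D = P_{ab} \ge 0 \;\Rightarrow\; D \ge -P_a P_b$
  - $P_A P_b - D = P_{Ab} \ge 0 \;\Rightarrow\; D \le P_A P_b$
  - $P_a P_B - D = P_{aB} \ge 0 \;\Rightarrow\; D \le P_a P_B$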

Interpretation of D’
• D' is constrained between -1 and +1.
  - D' = 1 (perfect positive LD between SNP alleles)
  - D' = 0 (linkage equilibrium between SNP alleles)
  - D' = -1 (perfect negative LD between SNP alleles)
  - D' = 0.87 (strong positive LD between SNP alleles)
  - D' = 0.12 (weak positive LD between SNP alleles)
• Other measures based on the D coefficient:
  - $r^2$ or $\Delta^2$: $r^2 = \dfrac{D^2}{P_A (1 - P_A)\, P_B (1 - P_B)}$
  - Chi-square test.
  - P value.
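The $r^2$ formula can be evaluated with a few lines of Python. This sketch is not from the original slides. The function name is hypothetical, and it raises an error when either SNP has only one allele, because the denominator is then zero.

```python
from fractions import Fraction

def r_squared(p_AB, p_A, p_B):
    """r^2 = D^2 / (P_A (1 - P_A) P_B (1 - P_B))."""
    d = p_AB - p_A * p_B
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("r^2 is undefined when a SNP has only one allele")
    return d * d / denom

# One haplotype absent: P_AB = 0, P_A = P_B = 1/3.
r2_missing = r_squared(Fraction(0), Fraction(1, 3), Fraction(1, 3))

# Only AB and ab present, each at frequency 1/2.
r2_perfect = r_squared(Fraction(1, 2), Fraction(1, 2), Fraction(1, 2))

print(r2_missing, r2_perfect)
```

In the first case D' equals -1 while $r^2$ is only 1/4, so the two measures can rank the same pair of SNPs quite differently.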

Decay of LD over Time


• Recombination decreases LD over time, so the loci should eventually reach linkage equilibrium.
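The slide does not give a formula for the decay. A common textbook model, assumed here, is $D_t = (1 - c)^t D_0$ under random mating, where c is the recombination fraction between the two loci per generation. The Python sketch below uses that model; the function and parameter names are illustrative.

```python
def ld_after_generations(d0, recomb_fraction, generations):
    """Expected D after the given number of generations: D_t = (1 - c)^t * D_0.

    Assumes random mating and a constant recombination fraction c per generation.
    """
    return d0 * (1 - recomb_fraction) ** generations

# Unlinked loci (c = 0.5) halve D every generation.
print(ld_after_generations(0.25, 0.5, 2))

# Tightly linked loci (c = 0.01) lose LD much more slowly.
for t in (0, 10, 100, 1000):
    print(t, ld_after_generations(0.25, 0.01, t))
```

Under this model D approaches 0 for any c > 0, which matches the slide's statement that the loci eventually reach equilibrium.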


Haplotype Blocks in Human Genome

• The human genome has been shown to contain regions of high LD interspersed with regions of low LD.
  - Recombination occurs frequently in low-LD regions.
  - High-LD regions can form haplotype blocks.
• The International HapMap Project aims to build a haplotype map of the human genome.

[Figure: a chromosome divided into haplotype blocks (high-LD regions) separated by recombination hot spots (low-LD regions).]

Genotype Data vs. Haplotype Data

• The use of the haplotype map has been limited because the human genome is diploid.
  - Genotype data, rather than haplotype data, are obtained.
• Phase problem: the information about which chromosome each base lies on is lost.
  - e.g., with alleles G/T at SNP1 and A/C at SNP2, we don't know whether the haplotypes are (GA, TC) or (GC, TA).

[Figure: a diploid individual carrying alleles G and T at SNP1 and alleles A and C at SNP2.]
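The phase ambiguity can be made concrete by enumerating every haplotype pair consistent with an unphased genotype. The Python sketch below is not from the original slides. It assumes a genotype is given as a list of unordered allele pairs, one per SNP, and the function name is hypothetical.

```python
from itertools import product

def compatible_haplotype_pairs(genotype):
    """List the distinct unordered haplotype pairs consistent with a genotype.

    genotype: list of (allele, allele) tuples, one per SNP, with unknown phase.
    """
    pairs = set()
    # At each SNP, either allele may sit on the first chromosome.
    for choice in product(*[[(a, b), (b, a)] for a, b in genotype]):
        first = "".join(x for x, _ in choice)
        second = "".join(y for _, y in choice)
        pairs.add(tuple(sorted((first, second))))
    return sorted(pairs)

# The slide's example: G/T at SNP1 and A/C at SNP2.
two_snps = compatible_haplotype_pairs([("G", "T"), ("A", "C")])
print(two_snps)

# A third heterozygous SNP doubles the number of candidate pairs.
three_snps = compatible_haplotype_pairs([("G", "T"), ("A", "C"), ("A", "G")])
print(len(three_snps))
```

With k heterozygous SNPs there are 2^(k-1) candidate pairs, which is why phasing becomes hard as the number of SNPs grows.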

Haplotype Reconstruction with Pedigree

• Haplotype reconstruction with pedigree (Li and Jiang, 2004).
  - Assumption: no mutations occur within a pedigree, only recombination events.
  - Given a pedigree and genotype data for each member of the pedigree, find a haplotype configuration for the pedigree that requires the minimum number of recombinants.

[Figure: example pedigree whose members are labeled with haplotype pairs: 1|2 1|2 2|2, 3|1 1|3 2|2, 1|2 1|2, 1|2 3|2.]

Haplotype Block Partition and Tag SNP Selection Using Genotype Data

• Zhang et al. (2004) combine a dynamic programming algorithm with an EM algorithm to partition haplotype blocks.
  - The EM algorithm infers the haplotypes for a range of SNPs.
  - The dynamic programming algorithm minimizes the number of tag SNPs used in the haplotype block partition.
• The experiments examine the factors that affect the block partition and the tag SNPs used, which include
  - the number of haplotypes,
  - the density of SNPs,
  - the minor allele frequency of SNPs,
  - missing data, and
  - the genotyping error rate.

Thoughts
• How can the tag SNP selection algorithm be modified to process genotype data?
  - The naïve approach is to infer haplotype data with existing algorithms and then find tag SNPs.
• Is it possible to determine tag SNPs directly from genotype data?
  - Assume 0: homozygous wild type, 1: homozygous mutant, 2: heterozygous.

      P1  P2  P3  P4
S1    1   1   0   0
S2    1   0   1   0
S3    1   2   0   1
S4    1   2   0   1
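The 0/1/2 genotype coding can be illustrated with a short Python sketch. It is not part of the original slides. It assumes haplotypes are written as strings of "0" (wild type) and "1" (mutant), and the function name is hypothetical.

```python
def genotype_from_haplotypes(hap1, hap2):
    """Encode a pair of haplotypes with the slide's genotype codes.

    0: homozygous wild type, 1: homozygous mutant, 2: heterozygous.
    """
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes must cover the same SNPs")
    codes = []
    for a, b in zip(hap1, hap2):
        if a != b:
            codes.append(2)      # one wild-type and one mutant allele
        elif a == "0":
            codes.append(0)      # both wild type
        else:
            codes.append(1)      # both mutant
    return codes

# Two different haplotype pairs give the genotype column 1, 0, 2, 2.
print(genotype_from_haplotypes("1011", "1000"))
print(genotype_from_haplotypes("1010", "1001"))
```

Both pairs yield the same genotype, so a method that selects tag SNPs directly from genotype data has to cope with this loss of phase information.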

The Relation Between Minor Allele Frequency and Tag SNPs

• The minor allele frequency ranges from 0% to 50%.
• The higher the minor allele frequency, the more useful a SNP is as a tag SNP.
  - 0000000011 -> 20%.
  - 0010010011 -> 40%; this SNP can distinguish more haplotype patterns.
• What is the relation between the minor allele frequency and the number of tag SNPs?
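The two percentages above can be reproduced with a small Python sketch. It is not from the original slides. It assumes each SNP is given as a string of "0"/"1" alleles, one character per haplotype, and the function name is illustrative.

```python
from fractions import Fraction

def minor_allele_frequency(alleles):
    """Frequency of the rarer allele in a string of '0'/'1' alleles."""
    freq_one = Fraction(alleles.count("1"), len(alleles))
    return min(freq_one, 1 - freq_one)

for snp in ("0000000011", "0010010011"):
    maf = minor_allele_frequency(snp)
    print(snp, f"{float(maf):.0%}")
```

Taking the minimum of the two allele frequencies keeps the result in the 0% to 50% range regardless of which allele is coded as 1.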


Block-Free Selection of Tagging SNPs


• Bafna et al. (2004) propose algorithms for selecting tag SNPs without considering haplotype block structure.
• They define a new measure called “informativeness,” which measures how well one set of SNPs can predict another set of SNPs.
  - Find a subset of SNPs that has the maximum informativeness.
  - The total number of tag SNPs used across the whole genome is smaller than with block-dependent approaches.
