
BIOTOOLS AND DATABASES

BY C. GAYATHRI
(I M.Sc. BIOINFORMATICS)
 Needleman-Wunsch Algorithm
 Smith-Waterman Algorithm
 Published in 1970 by SAUL NEEDLEMAN
and CHRISTIAN WUNSCH
 General algorithm for sequence
comparison
 Commonly used in bioinformatics to align
protein or nucleotide sequences
 Example of dynamic programming, and was
the first application of dynamic
programming to biological sequence
comparison.
 Scores for aligned characters are
specified by a SIMILARITY MATRIX. Here,
S(i, j) is the similarity of characters i and
j. It uses a LINEAR GAP PENALTY, here
called d.
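As a minimal sketch of how a similarity function S(i, j) and a linear gap penalty d combine into one alignment score, the Python below scores an alignment that has already been written out with gaps. The names `similarity` and `score_alignment` and the default values are illustrative assumptions, not part of the original slides.

```python
def similarity(x, y, match=1, mismatch=0):
    """S(i, j): similarity of two characters (simple match/mismatch scheme, assumed)."""
    return match if x == y else mismatch


def score_alignment(row_a, row_b, d=1, gap="-"):
    """Sum S(i, j) over aligned pairs and subtract d for every gapped column."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            total -= d  # linear gap penalty: each gap position costs d
        else:
            total += similarity(x, y)
    return total


print(score_alignment("GCC-UCG", "GCCAUUG", d=1))  # 1+1+1-1+1+0+1 = 4
```

With d = 0 the same alignment scores 5, because the gapped column no longer costs anything.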
 Maximizes a similarity score, to give
‘MAXIMUM MATCH’
 Maximum match = largest number of
residues of one sequence that can be
matched with another allowing for all
possible deletions
 Finds the best GLOBAL alignment of any
two sequences
 N-W involves an iterative matrix method of
calculation
 All possible pairs of residues (bases or amino
acids) - one from each sequence - are
represented in a 2-dimensional array
 All possible alignments (comparisons) are
represented by pathways through this array
 Three main steps
1. Assign similarity values
2. For each cell, give the maximum possible scoring value, allowing insertions and deletions
3. Construct an alignment (pathway) back from the highest scoring cell
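The three steps above can be sketched in Python roughly as follows. This uses the common three-neighbour form of the recurrence with a linear gap penalty d, which is an assumption on my part and not necessarily the exact formulation on the slides. The function name and defaults (match = 1, mismatch = 0, d = 0) are illustrative.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, d=0):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # Step 1: similarity value for every pair of residues.
    S = [[match if x == y else mismatch for y in b] for x in a]
    # Step 2: F[i][j] = best score for aligning a[:i] with b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -d * i
    for j in range(1, m + 1):
        F[0][j] = -d * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + S[i - 1][j - 1],  # pair a[i-1] with b[j-1]
                          F[i - 1][j] - d,                    # a[i-1] against a gap
                          F[i][j - 1] - d)                    # b[j-1] against a gap
    # Step 3: trace a pathway back from the last cell.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + S[i - 1][j - 1]:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("MPRCLCQRJNCBA", "PBRCKCRNJCJA")
print(score)  # 8
```

Several alignments can share the best score, so the pathway returned depends on the tie-breaking order in step 3.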
Similarity values
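 Numerical value is assigned to every cell (depending on the similarity/dissimilarity of the two residues)
 Simple scores, or more complicated ones (chemical similarities or frequency of observed substitutions)
 The example shown here has
 match = +1
 mismatch = 0

    M P R C L C Q R J N C B A
P   0 1 0 0 0 0 0 0 0 0 0 0 0
B   0 0 0 0 0 0 0 0 0 0 0 1 0
R   0 0 1 0 0 0 0 1 0 0 0 0 0
C   0 0 0 1 0 1 0 0 0 0 1 0 0
K   0 0 0 0 0 0 0 0 0 0 0 0 0
C   0 0 0 1 0 1 0 0 0 0 1 0 0
R   0 0 1 0 0 0 0 1 0 0 0 0 0
N   0 0 0 0 0 0 0 0 0 1 0 0 0
J   0 0 0 0 0 0 0 0 1 0 0 0 0
C   0 0 0 1 0 1 0 0 0 0 1 0 0
J   0 0 0 0 0 0 0 0 1 0 0 0 0
A   0 0 0 0 0 0 0 0 0 0 0 0 1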
Score pathways through array
 To know the maximum possible score for an alignment
 Searches sub rows and sub columns, for the highest score
 Adds this to the score for the current cell
 Proceeds row by row through the array
 Gap penalty for the introduction of gaps in the alignment = 0

    M P R C L C Q R J N C B A
P   0 1 0 0 0 0 0 0 0 0 0 0 0
B   0 0 1 1 1 1 1 1 1 1 1 2 1
R   0 0 2 1 1 1 1 2 1 1 1 1 2
C   0 0 1 3 2 3 2 2 2 2 3 2 2
K   0 0 1 2 3 3 3 3 3 3 3 3 3
C   0 0 1 3 3 4 3 3 3 3 4 3 3
R   0 0 2 2 3 3 4 ?
N
J
C
J
A
\[
H_{i,j} = \max\Bigl\{\, H_{i-1,j-1} + s(a_i,b_j),\;\; \max_{k}\bigl\{ H_{i-k,j-1} - W_k + s(a_i,b_j) \bigr\},\;\; \max_{l}\bigl\{ H_{i-1,j-l} - W_l + s(a_i,b_j) \bigr\} \Bigr\}
\]
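A small Python sketch of this recurrence, under the assumption that the sub-column means every cell above the diagonal neighbour in the previous column and the sub-row means every cell to its left in the previous row. The function name `score_pathways` and the zero gap penalty default are illustrative. Cells in the first row or column have no predecessor and keep their similarity value.

```python
def score_pathways(rows, cols, match=1, mismatch=0, gap_penalty=lambda k: 0):
    """Fill H row by row: similarity of the cell plus the best reachable earlier cell."""
    H = [[0] * len(cols) for _ in rows]
    for i, x in enumerate(rows):
        for j, y in enumerate(cols):
            s = match if x == y else mismatch
            best = 0  # first row / first column: nothing to add
            if i > 0 and j > 0:
                candidates = [H[i - 1][j - 1]]  # direct diagonal
                # sub-column: cells above the diagonal neighbour in column j-1
                candidates += [H[i - k][j - 1] - gap_penalty(k) for k in range(2, i + 1)]
                # sub-row: cells left of the diagonal neighbour in row i-1
                candidates += [H[i - 1][j - l] - gap_penalty(l) for l in range(2, j + 1)]
                best = max(candidates)
            H[i][j] = s + best
    return H


H = score_pathways("PBRCKCRNJCJA", "MPRCLCQRJNCBA")
print(H[1])       # [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1]
print(H[-1][-1])  # 8
```

With match = +1, mismatch = 0 and no gap penalty, the second row printed here is row B of the example array, and the last cell holds the maximum match of 8.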
Construct alignment
 The alignment score is cumulative by adding along a path through the array
 The best alignment has the highest score i.e. the maximum match
 Maximum match = largest number resulting from summing the cell values of every pathway
 The maximum match will ALWAYS be somewhere in the outer row or column shown
 The alignment is constructed by working backwards from the maximum match

    M P R C L C Q R J N C B A
P   0 1 0 0 0 0 0 0 0 0 0 0 0
B   0 0 1 1 1 1 1 1 1 1 1 2 1
R   0 0 2 1 1 1 1 2 1 1 1 1 2
C   0 0 1 3 2 3 2 2 2 2 3 2 2
K   0 0 1 2 3 3 3 3 3 3 3 3 3
C   0 0 1 3 3 4 3 3 3 3 4 3 3
R   0 0 2 2 3 3 4 5 4 4 4 4 4
N   0 0 1 2 3 3 4 4 5 6 5 5 5
J   0 0 1 2 3 3 4 4 6 5 6 6 6
C   0 0 1 3 3 4 4 4 5 6 7 6 6
J   0 0 1 2 3 3 4 4 6 6 6 7 7
A   0 0 1 2 3 3 4 4 5 6 6 7 8

MP-RCLCQR-JNCBA
 | || | | | | |
-PBRCKC-RNJ-CJA
Statistical Significance

 Maximum match is a function of sequence relationship and composition
 Useful to know the probability of obtaining the result (maximum match) from a pair of random sequences
 Estimate this experimentally
 Random sequences are generated that have the same composition as the real proteins
 If the value for the random sequences is significantly different from that for the real proteins, then the difference is a function of the sequences alone and not of their composition
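One way to run this experiment is sketched below in Python, assuming that the random sequences are obtained by shuffling each real sequence (which keeps its composition exactly) and that the comparison is summarised as a z-score. The function names, the number of trials and the z-score summary are all assumptions made for illustration; the slides do not prescribe a particular statistic.

```python
import random
import statistics


def max_match(a, b):
    """Maximum match with match = 1, mismatch = 0 and no gap penalty."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(max(prev[j - 1] + (x == y), prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def shuffled(seq, rng):
    """Random sequence with exactly the same composition as seq."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def significance(a, b, trials=200, seed=0):
    """Compare the real maximum match with scores from shuffled sequence pairs."""
    rng = random.Random(seed)
    real = max_match(a, b)
    random_scores = [max_match(shuffled(a, rng), shuffled(b, rng))
                     for _ in range(trials)]
    mean = statistics.mean(random_scores)
    sd = statistics.pstdev(random_scores)
    z = (real - mean) / sd if sd > 0 else None  # undefined if all random scores agree
    return real, mean, sd, z


real, mean, sd, z = significance("MPRCLCQRJNCBA", "PBRCKCRNJCJA")
print(real, mean, sd, z)
```

A real score far above the shuffled mean suggests that the match reflects the order of the residues and not just how many of each residue the two sequences contain.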
 Proposed by Temple Smith and Michael
Waterman in 1981
 Smith-Waterman algorithm is useful for
performing local sequence alignment
 Determining similar regions between two
nucleotide or protein sequences
 Instead of looking at entire sequence, it
compares segments of all possible lengths
and optimizes the similarity measure.
 For every cell the algorithm calculates ALL
possible paths that can be of any length and
contain insertions, deletions and gaps
 Works effectively only when gap penalties are used
 In example shown
 match = +1
 mismatch = -1/3
 gap = -(1 + k/3) (k = extent of gap)
 Start with all cell values = 0
 Looks in sub column and sub row shown and in direct diagonal for a score that is the highest when you take alignment score or gap penalty into account

     C   A   G   C   C   U   C   G   C   U   U   A   G
A   0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
A   0.0 1.0 0.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.7
U   0.0 0.0 0.8 0.3 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.7
G   0.0 0.0 1.0 0.3 0.0 0.0 0.7 1.0 0.0 0.0 0.7 0.7 1.0
C   1.0 0.0 0.0 2.0 1.3 0.3 1.0 0.3 2.0 0.7 0.3 0.3 0.3
C   1.0 0.7 0.0 1.0 3.0 1.7  ?
A
U
U
G
A
C
G
G

\[
H_{i,j} = \max\Bigl\{\, H_{i-1,j-1} + s(a_i,b_j),\;\; \max_{k}\bigl\{ H_{i-k,j} - W_k \bigr\},\;\; \max_{l}\bigl\{ H_{i,j-l} - W_l \bigr\},\;\; 0 \Bigr\}
\]


 Four possible ways of forming a path

For every residue in the query sequence:

1. To align with next residue,
score = previous score + similarity score
2. Deletion (i.e. match residue of query with a gap),
score = previous score - gap penalty dependent on size of the gap
3. Insertion (i.e. match residue of db sequence with a gap),
score = previous score - gap penalty dependent on size of the gap
4. Stop when the score is zero

 Choose whichever of these has the highest score
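The four options can be written out as a short Python sketch that fills the whole array. It assumes the example's scoring (match = +1, mismatch = -1/3, gap penalty W_k = 1 + k/3) and uses exact fractions so that thirds are not rounded. The function names are illustrative, and the query sequence is taken to be the one written down the rows.

```python
from fractions import Fraction


def gap_penalty(k):
    """W_k = 1 + k/3 for a gap of length k (the scheme assumed for the example)."""
    return 1 + Fraction(k, 3)


def smith_waterman_matrix(a, b, match=Fraction(1), mismatch=Fraction(-1, 3), W=gap_penalty):
    """Return H with a zero border; H[i][j] is the best local score ending at a[i-1], b[j-1]."""
    H = [[Fraction(0)] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = H[i - 1][j - 1] + s                               # 1. align the two residues
            deletion = max(H[i - k][j] - W(k) for k in range(1, i + 1))   # 2. gap of length k opposite a
            insertion = max(H[i][j - l] - W(l) for l in range(1, j + 1))  # 3. gap of length l opposite b
            H[i][j] = max(diagonal, deletion, insertion, Fraction(0))     # 4. never drop below zero
    return H


H = smith_waterman_matrix("AAUGCCAUUGACGG", "CAGCCUCGCUUAG")
best = max(max(row) for row in H)
print(best, float(best))  # 10/3 3.3333333333333335
```

The highest cell, 10/3, sits in row 10 (G) and column 8 (G) of the example, which is the 3.3 shown in the array.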


Construct Alignment
 The score in each cell is the maximum possible score for an alignment of ANY LENGTH ending at those coordinates
 Trace pathway back from highest scoring cell
 This cell can be anywhere in the array
 Align highest scoring segment

     C   A   G   C   C   U   C   G   C   U   U   A   G
A   0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
A   0.0 1.0 0.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.7
U   0.0 0.0 0.8 0.3 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.7
G   0.0 0.0 1.0 0.3 0.0 0.0 0.7 1.0 0.0 0.0 0.7 0.7 1.0
C   1.0 0.0 0.0 2.0 1.3 0.3 1.0 0.3 2.0 0.7 0.3 0.3 0.3
C   1.0 0.7 0.0 1.0 3.0 1.7 1.3 1.0 1.3 1.7 0.3 0.0 0.0
A   0.0 2.0 0.7 0.3 1.7 2.7 1.3 1.0 0.7 1.0 1.3 1.3 0.0
U   0.0 0.7 1.7 0.3 1.3 2.7 2.3 1.0 0.7 1.7 2.0 1.0 1.0
U   0.0 0.3 0.3 1.3 1.0 2.3 2.3 2.0 0.7 1.7 2.7 1.7 1.0
G   0.0 0.0 1.3 0.0 1.0 1.0 2.0 3.3 2.0 1.7 1.3 2.3 2.7
A   0.0 1.0 0.0 1.0 0.3 0.7 0.7 2.0 3.0 1.7 1.3 2.3 2.0
C   1.0 0.0 0.7 1.0 2.0 0.7 1.7 1.7 3.0 2.7 1.3 1.0 2.0
G   0.0 0.7 1.0 0.3 0.7 1.7 0.3 2.7 1.7 2.7 2.3 1.0 2.0
G   0.0 0.0 1.7 0.7 0.3 0.3 1.3 1.3 2.3 1.3 2.3 2.0 2.0

GCC-UCG
GCCAUUG
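The traceback can be sketched in Python as below. The sketch assumes the same scoring as the example (match = +1, mismatch = -1/3, W_k = 1 + k/3), records for each cell which earlier cell produced its score, and walks back from the highest cell until it reaches a cell with no predecessor. The function name and the `back` table are illustrative choices, and ties are broken in favour of the diagonal.

```python
from fractions import Fraction


def smith_waterman(a, b, match=Fraction(1), mismatch=Fraction(-1, 3)):
    """Local alignment; returns (score, aligned segment of a, aligned segment of b)."""
    def W(k):
        return 1 + Fraction(k, 3)  # assumed gap penalty for a gap of length k

    n, m = len(a), len(b)
    H = [[Fraction(0)] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]  # predecessor cell; None = path starts here
    best, best_cell = Fraction(0), (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            options = [(H[i - 1][j - 1] + s, (i - 1, j - 1))]
            options += [(H[i - k][j] - W(k), (i - k, j)) for k in range(1, i + 1)]
            options += [(H[i][j - l] - W(l), (i, j - l)) for l in range(1, j + 1)]
            score, prev = max(options, key=lambda o: o[0])
            if score > 0:
                H[i][j], back[i][j] = score, prev
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)
    # Trace the pathway back from the highest scoring cell.
    out_a, out_b = [], []
    i, j = best_cell
    while back[i][j] is not None:
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:   # residues paired
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
        elif pj == j:                     # residues of a opposite a gap
            out_a.append(a[pi:i][::-1]); out_b.append("-" * (i - pi))
        else:                             # residues of b opposite a gap
            out_a.append("-" * (j - pj)); out_b.append(b[pj:j][::-1])
        i, j = pi, pj
    return best, "".join(out_a)[::-1], "".join(out_b)[::-1]


score, seg_a, seg_b = smith_waterman("AAUGCCAUUGACGG", "CAGCCUCGCUUAG")
print(seg_b)  # GCC-UCG
print(seg_a)  # GCCAUUG
```

The segment scores 10/3: five matches, one mismatch (-1/3) and one gap of length 1 (-4/3).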
 Needleman-Wunsch
1. Global alignments
2. Requires alignment score for a pair of residues to be >= 0
3. No gap penalty required
4. Score cannot decrease between two cells of a pathway
5. Trace back is mostly from the last cell that has the highest score

 Smith-Waterman
1. Local alignments
2. Residue alignment score may be positive or negative
3. Requires a gap penalty to work effectively
4. Score can increase, decrease or stay level between two cells of a pathway
5. Trace back is from the cell that has the highest score
CONCLUSION

From repeatedly working through these algorithms on sequences from different organisms, it is found that the NW and SW algorithms are excellent methods for finding the similarity and dissimilarity between the sequences of different organisms.
