
Pattern Matching 5/29/2002 11:27 AM

Pattern Matching

  [Figure: the pattern "abacab" at successive shifts under the text "abacaab", with comparisons numbered 1 to 4.]

Outline and Reading

Strings (§9.1.1)
Pattern matching algorithms
  - Brute-force algorithm (§9.1.2)
  - Boyer-Moore algorithm (§9.1.3)
  - Knuth-Morris-Pratt algorithm (§9.1.4)

Strings

A string is a sequence of characters.
Examples of strings:
  - Java program
  - HTML document
  - DNA sequence
  - Digitized image
An alphabet Σ is the set of possible characters for a family of strings.
Examples of alphabets:
  - ASCII
  - Unicode
  - {0, 1}
  - {A, C, G, T}
Let P be a string of size m.
  - A substring P[i .. j] of P is the subsequence of P consisting of the characters with ranks between i and j
  - A prefix of P is a substring of the type P[0 .. i]
  - A suffix of P is a substring of the type P[i .. m − 1]
Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P.
Applications:
  - Text editors
  - Search engines
  - Biological research

Brute-Force Algorithm

The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either
  - a match is found, or
  - all placements of the pattern have been tried
Brute-force pattern matching runs in time O(nm).
Example of worst case:
  - T = aaa … ah
  - P = aaah
  - may occur in images and DNA sequences
  - unlikely in English text

Algorithm BruteForceMatch(T, P)
    Input text T of size n and pattern P of size m
    Output starting index of a substring of T equal to P or −1 if no such substring exists
    for i ← 0 to n − m
        { test shift i of the pattern }
        j ← 0
        while j < m ∧ T[i + j] = P[j]
            j ← j + 1
        if j = m
            return i { match at i }
        else
            { mismatch: continue with the next shift }
    return −1 { no match anywhere }
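The BruteForceMatch pseudocode translates almost line for line into Python. The following is a minimal sketch, not a reference implementation; the function name `brute_force_match` and the test strings are illustrative choices, not part of the slides.

```python
def brute_force_match(text: str, pattern: str) -> int:
    """Return the first index at which pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):              # try every shift i = 0 .. n - m
        j = 0
        # Compare left to right until a mismatch or the end of the pattern.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:                          # all m characters matched
            return i
        # Otherwise: mismatch, fall through to the next shift.
    return -1                               # no shift produced a match


print(brute_force_match("abacaabadcabacabaabb", "abacab"))  # 10
print(brute_force_match("aaaah", "aaah"))                   # 1
```

The worst-case input from the slide (T = aaa … ah, P = aaah) forces nearly m comparisons at every shift, which is where the O(nm) bound comes from.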

Boyer-Moore Heuristics

The Boyer-Moore pattern matching algorithm is based on two heuristics.
Looking-glass heuristic: Compare P with a subsequence of T moving backwards
Character-jump heuristic: When a mismatch occurs at T[i] = c
  - If P contains c, shift P to align the last occurrence of c in P with T[i]
  - Else, shift P to align P[0] with T[i + 1]
Example
  [Figure: the pattern "rithm" slid along the text "a pattern matching algorithm"; comparisons numbered 1 to 11.]

Last-Occurrence Function

Boyer-Moore’s algorithm preprocesses the pattern P and the alphabet Σ to build the last-occurrence function L mapping Σ to integers, where L(c) is defined as
  - the largest index i such that P[i] = c or
  - −1 if no such index exists
Example:
  - Σ = {a, b, c, d}
  - P = abacab

    c      a   b   c   d
    L(c)   4   5   3  −1

The last-occurrence function can be represented by an array indexed by the numeric codes of the characters.
The last-occurrence function can be computed in time O(m + s), where m is the size of P and s is the size of Σ.
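The last-occurrence function can be built in one pass over the alphabet and one pass over the pattern. The sketch below uses a Python dict rather than the array indexed by character codes mentioned on the slide; that substitution, and the name `last_occurrence`, are assumptions made for illustration.

```python
def last_occurrence(pattern: str, alphabet: str) -> dict:
    """Map each character c of the alphabet to the largest i with pattern[i] == c, or -1."""
    last = {c: -1 for c in alphabet}    # O(s): every character starts as "absent"
    for i, c in enumerate(pattern):     # O(m): later indices overwrite earlier ones
        last[c] = i
    return last


print(last_occurrence("abacab", "abcd"))  # {'a': 4, 'b': 5, 'c': 3, 'd': -1}
```

The two loops account for the O(m + s) running time stated above.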


The Boyer-Moore Algorithm

Algorithm BoyerMooreMatch(T, P, Σ)
    L ← lastOccurrenceFunction(P, Σ)
    i ← m − 1
    j ← m − 1
    repeat
        if T[i] = P[j]
            if j = 0
                return i { match at i }
            else
                i ← i − 1
                j ← j − 1
        else
            { character-jump }
            l ← L[T[i]]
            i ← i + m − min(j, 1 + l)
            j ← m − 1
    until i > n − 1
    return −1 { no match }

Case 1: j ≤ 1 + l
  [Figure: mismatch at T[i] = a against P[j] = b, where l is the last occurrence of a in P; i advances by m − j.]
Case 2: 1 + l ≤ j
  [Figure: mismatch at T[i] = a against P[j] = b, where l is the last occurrence of a in P; i advances by m − (1 + l).]

Example

  [Figure: the pattern "abacab" slid along the text "abacaabadcabacabaabb"; comparisons numbered 1 to 13.]
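A possible Python rendering of BoyerMooreMatch follows, as a sketch under a few assumptions: the last-occurrence table is a dict built from the pattern alone (absent characters default to −1), the `repeat … until` loop is written as a `while` so that a pattern longer than the text is handled, and the name `boyer_moore_match` is illustrative.

```python
def boyer_moore_match(text: str, pattern: str) -> int:
    """Simplified Boyer-Moore (looking-glass + character-jump); returns index or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0                                  # assumption: empty pattern matches at 0
    last = {c: k for k, c in enumerate(pattern)}  # last occurrence of each pattern char
    i = j = m - 1                                 # start at the right end of the pattern
    while i <= n - 1:
        if text[i] == pattern[j]:
            if j == 0:
                return i                          # whole pattern matched, starting at i
            i -= 1                                # looking-glass: move backwards
            j -= 1
        else:
            l = last.get(text[i], -1)             # -1 if text[i] is not in the pattern
            i += m - min(j, 1 + l)                # character-jump (cases 1 and 2)
            j = m - 1
    return -1


print(boyer_moore_match("abacaabadcabacabaabb", "abacab"))     # 10
print(boyer_moore_match("a pattern matching algorithm", "rithm"))  # 23
```

Both calls use the texts and patterns from the slide examples.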

Analysis

Boyer-Moore’s algorithm runs in time O(nm + s).
Example of worst case:
  - T = aaa … a
  - P = baaa
The worst case may occur in images and DNA sequences but is unlikely in English text.
Boyer-Moore’s algorithm is significantly faster than the brute-force algorithm on English text.
  [Figure: the pattern "baaaaa" at four successive shifts under the text "aaaaaaaaa"; comparisons numbered 1 to 24, six per alignment.]

The KMP Algorithm - Motivation

Knuth-Morris-Pratt’s algorithm compares the pattern to the text from left to right, but shifts the pattern more intelligently than the brute-force algorithm.
When a mismatch occurs, what is the most we can shift the pattern so as to avoid redundant comparisons?
Answer: the largest prefix of P[0..j] that is a suffix of P[1..j]
  [Figure: the pattern "abaaba" under the text ". . a b a a b x . . . . ." with a mismatch at x (pattern index j). After the shift there is no need to repeat the comparisons for the already-matched prefix; comparing resumes at x.]
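The Boyer-Moore worst case from the Analysis slide can be checked by counting character comparisons. The sketch below instruments the simplified Boyer-Moore loop with a counter; the function name `boyer_moore_count` and the returned `(index, comparisons)` pair are assumptions made for this illustration.

```python
def boyer_moore_count(text: str, pattern: str):
    """Run simplified Boyer-Moore; return (match index or -1, number of comparisons)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0, 0                               # assumption: empty pattern, no work
    last = {c: k for k, c in enumerate(pattern)}  # last-occurrence table
    comparisons = 0
    i = j = m - 1
    while i <= n - 1:
        comparisons += 1                          # one text/pattern character comparison
        if text[i] == pattern[j]:
            if j == 0:
                return i, comparisons
            i -= 1
            j -= 1
        else:
            i += m - min(j, 1 + last.get(text[i], -1))
            j = m - 1
    return -1, comparisons


# Worst-case figure: T = 9 a's, P = baaaaa.
print(boyer_moore_count("a" * 9, "baaaaa"))                       # (-1, 24)
# English-like text: far fewer comparisons than the text length.
print(boyer_moore_count("a pattern matching algorithm", "rithm")) # (23, 11)
```

In the worst case every alignment scans the whole pattern before failing at P[0] and then shifts by only one position, which gives the O(nm) term.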

KMP Failure Function

Knuth-Morris-Pratt’s algorithm preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.
The failure function F(j) is defined as the size of the largest prefix of P[0..j] that is also a suffix of P[1..j].
Knuth-Morris-Pratt’s algorithm modifies the brute-force algorithm so that if a mismatch occurs at P[j] ≠ T[i] we set j ← F(j − 1).

    j      0   1   2   3   4   5
    P[j]   a   b   a   a   b   a
    F(j)   0   0   1   1   2   3

  [Figure: the pattern "abaaba" under the text ". . a b a a b x . . . . ." with a mismatch at pattern index j; after the shift, the pattern index is F(j − 1).]

The KMP Algorithm

The failure function can be represented by an array and can be computed in O(m) time.
At each iteration of the while-loop, either
  - i increases by one, or
  - the shift amount i − j increases by at least one (observe that F(j − 1) < j)
Hence, there are no more than 2n iterations of the while-loop.
Thus, KMP’s algorithm runs in optimal time O(m + n).

Algorithm KMPMatch(T, P)
    F ← failureFunction(P)
    i ← 0
    j ← 0
    while i < n
        if T[i] = P[j]
            if j = m − 1
                return i − j { match }
            else
                i ← i + 1
                j ← j + 1
        else
            if j > 0
                j ← F[j − 1]
            else
                i ← i + 1
    return −1 { no match }
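KMPMatch can be sketched in Python as follows. The helper `failure_function` is included only so that the block runs on its own; both function names are illustrative, and returning 0 for an empty pattern is an assumption not covered by the slides.

```python
def failure_function(pattern: str) -> list:
    """F[j] = size of the largest prefix of P[0..j] that is also a suffix of P[1..j]."""
    m = len(pattern)
    f = [0] * m
    i, j = 1, 0
    while i < m:
        if pattern[i] == pattern[j]:
            f[i] = j + 1            # matched j + 1 characters
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]            # fall back to a shorter prefix
        else:
            f[i] = 0                # no prefix matches here
            i += 1
    return f


def kmp_match(text: str, pattern: str) -> int:
    """Return the first index at which pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0                    # assumption: empty pattern matches at 0
    f = failure_function(pattern)
    i = j = 0
    while i < n:
        if text[i] == pattern[j]:
            if j == m - 1:
                return i - j        # match ends at i, so it starts at i - j
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]            # shift the pattern; i does not move back
        else:
            i += 1
    return -1


print(failure_function("abaaba"))                       # [0, 0, 1, 1, 2, 3]
print(kmp_match("abacaabaccabacabaabb", "abacab"))      # 10
```

Note that `i` never decreases, which is the property behind the 2n bound on while-loop iterations.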


Computing the Failure Function

The failure function can be represented by an array and can be computed in O(m) time.
The construction is similar to the KMP algorithm itself.
At each iteration of the while-loop, either
  - i increases by one, or
  - the shift amount i − j increases by at least one (observe that F(j − 1) < j)
Hence, there are no more than 2m iterations of the while-loop.

Algorithm failureFunction(P)
    F[0] ← 0
    i ← 1
    j ← 0
    while i < m
        if P[i] = P[j]
            { we have matched j + 1 chars }
            F[i] ← j + 1
            i ← i + 1
            j ← j + 1
        else if j > 0 then
            { use failure function to shift P }
            j ← F[j − 1]
        else
            F[i] ← 0 { no match }
            i ← i + 1

Example

  [Figure: the pattern "abacab" slid along the text "abacaabaccabacabaabb" by the KMP algorithm; comparisons numbered 1 to 19.]

    j      0   1   2   3   4   5
    P[j]   a   b   a   c   a   b
    F(j)   0   0   1   0   1   2
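One way to sanity-check the failure-function table is to compare the linear-time construction against a direct, slower evaluation of the definition. The sketch below does that for P = abacab; the names `failure_function` and `failure_by_definition` are hypothetical, and the second function is a checking aid, not part of the KMP algorithm.

```python
def failure_function(pattern: str) -> list:
    """Linear-time construction, following the failureFunction pseudocode."""
    m = len(pattern)
    f = [0] * m
    i, j = 1, 0
    while i < m:
        if pattern[i] == pattern[j]:
            f[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]
        else:
            f[i] = 0
            i += 1
    return f


def failure_by_definition(pattern: str) -> list:
    """For each j, try prefix lengths k = j, j-1, ..., 0 until P[0..k-1] ends P[1..j]."""
    result = []
    for j in range(len(pattern)):
        k = j                                   # a suffix of P[1..j] has length at most j
        while k > 0 and pattern[:k] != pattern[j - k + 1 : j + 1]:
            k -= 1
        result.append(k)
    return result


print(failure_function("abacab"))       # [0, 0, 1, 0, 1, 2]
print(failure_by_definition("abacab"))  # [0, 0, 1, 0, 1, 2]
```

Both functions reproduce the F(j) row of the example table.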
