
Jim Anderson Comp 750, Fall 2009 String Matching - 1

Chapter 32: String Matching


Given: Two strings T[1..n] and P[1..m] over alphabet Σ.

Want to find all occurrences of P[1..m] "the pattern" in T[1..n] "the text".

Example: Σ = {a, b, c}
text T:    a b c a b a a b c a a b a c
pattern P:       a b a a
s = 3
Terminology:
- P occurs with shift s.
- P occurs beginning at position s+1.
- s is a valid shift.

Goal: Find all valid shifts.
Applications: Text editors, search for patterns in DNA sequences
(actually, this is stretching the truth a little),
Notation and Terminology
w pre x -- w is a prefix of x.

Example: aba pre abaabc.

w suf x -- w is a suffix of x.

Example: abc suf abaabc.

Note: In the book the symbol ⊏ is used instead of pre,
and the symbol ⊐ is used instead of suf.

I couldn't figure out an easy way to reproduce these
symbols in Powerpoint.
Lemma 32.1
Lemma 32.1: Suppose x suf z and y suf z. If |x| ≤ |y| then
x suf y. If |x| ≥ |y| then y suf x. If |x| = |y| then x = y.
[Figure: the three cases, with x and y drawn as suffixes of z.]
More Notation
P_k = P[1..k] where k ≤ m.

Thus, P_0 = ε, P_m = P[1..m] = P.

Similarly T_k = T[1..k], where k ≤ n.

Our Problem: Find all s, where 0 ≤ s ≤ n − m, such that P suf T_{s+m}.

Assumption: We assume the test "x = y" takes O(t + 1) time,
where t is the length of the longest string z such that z pre x
and z pre y.
Naïve Brute-Force Algorithm
Naïve(T, P)
n := length[T];
m := length[P];
for s := 0 to n − m do
    if P[1..m] = T[s+1..s+m] then
        print "pattern occurs with shift s"
    fi
od
Running time is O((n − m + 1)m).

Bound is tight. Consider: T = a^n, P = a^m.
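The brute-force loop above can be sketched in Python as follows. This is a minimal illustration rather than part of the original slides; the function name naive_match is an assumption, and shifts are reported 0-based, which coincides with the slide's definition of a shift s.

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Return every valid shift s of pattern in text, trying all n - m + 1 shifts."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):
        # Python slices are 0-based, so text[s:s + m] plays the role of T[s+1..s+m].
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


print(naive_match("abcabaabcaabac", "abaa"))  # [3]
print(naive_match("acaabc", "aab"))           # [2]
```

With T = a^n and P = a^m every shift is valid and every comparison runs to length m, which is the tight case mentioned above.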
Example
T = a c a a b c, P = a a b
s = 0: P[1] = T[1], but P[2] ≠ T[2]
s = 1: P[1] ≠ T[2]
s = 2: P[1..3] = T[3..5], match!
s = 3: P[1] = T[4], but P[2] ≠ T[5]
Rabin-Karp Algorithm
Suppose Σ = {0, 1, 2, …, 9}.

Let us view P as a decimal number.

Example: View P = "31415" as 31,415.

Can also view substrings of T as decimal numbers.

Let t_s = the decimal number corresponding to T[s+1..s+m].
Let p = the decimal number corresponding to P.

We want to know all s such that t_s = p.

We can compute p in O(m) time using Horner's Rule:

p = P[m] + 10(P[m−1] + 10(P[m−2] + … + 10(P[2] + 10 P[1]) … ))

Can similarly compute t_0 in O(m) time.
RK Algorithm (Continued)
Can compute t_1, t_2, …, t_{n−m} in O(n − m) time as follows:

t_{s+1} = 10(t_s − 10^{m−1} T[s+1]) + T[s+m+1].

Example: T = 314152
t_{s+1} = 10(31415 − 10000·3) + 2
        = 14152
Time Complexity:
O(n+m) + O(n−m) = O(n+m)
O(n+m): to compute p and t_0, …, t_{n−m}
O(n−m): to perform n−m+1 comparisons
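As a rough sketch of the two steps above, Horner's rule for the first window and the constant-time update for each later window, here is a small Python example. The function names horner_value and window_values are illustrative assumptions, and the code assumes m ≥ 1 and a text made of decimal digits.

```python
def horner_value(digits: str) -> int:
    """Evaluate a digit string as a decimal number using Horner's rule."""
    value = 0
    for ch in digits:
        value = 10 * value + int(ch)
    return value


def window_values(text: str, m: int) -> list[int]:
    """Return t_0, ..., t_{n-m}, using one O(1) rolling update per shift (assumes m >= 1)."""
    n = len(text)
    if m > n:
        return []
    high = 10 ** (m - 1)          # weight of the leading digit, 10^(m-1)
    t = horner_value(text[:m])    # t_0
    values = [t]
    for s in range(n - m):
        # Remove the leading digit T[s+1], shift left one place, append T[s+m+1].
        # In 0-based terms, text[s] leaves the window and text[s + m] enters it.
        t = 10 * (t - high * int(text[s])) + int(text[s + m])
        values.append(t)
    return values


print(horner_value("31415"))       # 31415
print(window_values("314152", 5))  # [31415, 14152]
```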
Two Problems
Might have |Σ| = d ≠ 10.
Solution: Use radix-d arithmetic.

Numbers may be very large.
Solution: Perform computations modulo-q for some q.
What is q?
Select q to be a large prime such that dq fits in one memory word.

Then all computations can be performed using single-precision
arithmetic.

To summarize, p is computed using

p = (P[m] + d(P[m−1] + d(P[m−2] + … + d(P[2] + d P[1]) … ))) mod q

t_0 is computed similarly.

Other t_i's are computed using

t_{s+1} = (d(t_s − T[s+1]h) + T[s+m+1]) mod q, where h ≡ d^{m−1} (mod q)

Unfortunately, we have a new problem: spurious hits.
Example
pattern P: 3 1 4 1 5 (7 mod 13)
text T: 2 3 5 9 0 2 3 1 4 1 5 2 6 7 3 9 9 2 1
T[7..11] = 3 1 4 1 5: 7 mod 13, valid match
T[13..17] = 6 7 3 9 9: 7 mod 13, spurious hit
Algorithm
We deal with spurious hits by performing an explicit check whenever
there is a potential match.
RK(T, P, d, q)
n := length[T];
m := length[P];
h := d^{m−1} mod q;
p := 0;
t_0 := 0;
for i := 1 to m do
    p := (d·p + P[i]) mod q;
    t_0 := (d·t_0 + T[i]) mod q
od;
for s := 0 to n − m do
    if p = t_s then
        if P[1..m] = T[s+1..s+m] then
            print "pattern occurs with shift s"
        fi
    fi;
    if s < n − m then
        t_{s+1} := (d(t_s − T[s+1]h) + T[s+m+1]) mod q
    fi
od
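The RK procedure can be sketched in Python roughly as follows. This is an illustrative version under stated assumptions: the function name rabin_karp is hypothetical, the strings are assumed to consist of digits so that int() maps each character to its value, and the defaults d = 10, q = 13 are taken from the slide example. Counting spurious hits is an addition for illustration only.

```python
def rabin_karp(text: str, pattern: str, d: int = 10, q: int = 13):
    """Return (valid_shifts, spurious_hits) for digit strings in radix d, modulo q."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return [], 0
    h = pow(d, m - 1, q)                  # h = d^(m-1) mod q
    p = t = 0
    for i in range(m):                    # Horner's rule, reduced mod q at each step
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    shifts, spurious = [], 0
    for s in range(n - m + 1):
        if p == t:
            if text[s:s + m] == pattern:  # explicit check on a potential match
                shifts.append(s)
            else:
                spurious += 1
        if s < n - m:
            # Rolling update; Python's % always yields a value in [0, q).
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return shifts, spurious


print(rabin_karp("2359023141526739921", "31415"))  # ([6], 1)
```

On the slide example this reports the valid shift 6 (the window 31415) and one spurious hit (the window 67399, which is also 7 mod 13).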
Running Time
Worst-Case: O((n − m + 1)m). (Again, consider P = a^m, T = a^n.)

Average-Case:

Some assumptions

Assume O(1) valid shifts.

Think of 0, 1, …, q − 1 like hash buckets.

Assume each bucket is equally likely.

We expect O(n/q) spurious hits.

Expected running time is:
O(n) + O(m(number of valid shifts + n/q))
= O(n+m) choosing q ≥ m
Finite Automata Algorithm
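[Figure: state-transition diagram for P = ababaca, with states 0 through 7.]

Transition table (rows: state; columns a, b, c: input; column P: next pattern character):

state   a  b  c   P
  0     1  0  0   a
  1     1  2  0   b
  2     3  0  0   a
  3     1  4  0   b
  4     5  0  0   a
  5     1  4  6   c
  6     7  0  0   a
  7     1  2  0

i             --  1  2  3  4  5  6  7  8  9 10 11
T[i]          --  a  b  a  b  a  b  a  c  a  b  a
state φ(T_i)  0   1  2  3  4  5  4  5  6  7  2  3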
Processing time takes O(n).
But have to first construct FA.
Main Issue: How to construct FA?
Need some Notation
φ(w) = state FA ends up in after processing w.

Example: φ(abab) = 4.

σ(x) = max{k: P_k suf x}. Called the suffix function.

Examples: Let P = ab.
σ(ε) = 0
σ(ccaca) = 1
σ(ccab) = 2

Note: If |P| = m, then σ(x) = m indicates a match.
T:      a b a b b a b b a c …
States: 0 1 … m … m …   (each state m marks a match)

Note Also: x suf y ⇒ σ(x) ≤ σ(y).
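The suffix function can be computed straight from its definition. The sketch below is a deliberately simple illustration, not an efficient method; the name suffix_function is an assumption, and the empty Python string stands in for ε.

```python
def suffix_function(pattern: str, x: str) -> int:
    """Return sigma(x): the largest k such that pattern[:k] (that is, P_k) is a suffix of x."""
    for k in range(min(len(pattern), len(x)), -1, -1):
        # Try the longest candidate prefix first; k = 0 always succeeds
        # because the empty prefix is a suffix of every string.
        if x.endswith(pattern[:k]):
            return k
    return 0  # not reached; kept so every path returns an int


print(suffix_function("ab", ""))       # 0
print(suffix_function("ab", "ccaca"))  # 1
print(suffix_function("ab", "ccab"))   # 2
```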
FA Construction
Given: P[1..m]

Let Q = states = {0, 1, …, m}, where 0 is the initial state and m is the final state.

Define transition function δ as follows:

δ(q, a) = σ(P_q a) for each q and a.

Example: δ(5, b) = σ(P_5 b)
                 = σ(ababab)
                 = 4

Intuition: Encountering a b in state 5 means the current substring
doesn't match. But, you know this substring ends with "abab" -- and
this is the longest suffix that matches the beginning of P. Thus, we
go to state 4 and continue processing abab…
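One way to make the definition δ(q, a) = σ(P_q a) concrete is to build the table directly from it and then run the automaton over the text. The sketch below does this naively (it is not an O(m|Σ|) construction); the names build_transitions and fa_match, and passing the alphabet as a string, are assumptions made for illustration.

```python
def build_transitions(pattern: str, alphabet: str) -> list[dict[str, int]]:
    """delta[q][a] = sigma(P_q a), computed directly from the definition."""
    m = len(pattern)
    delta = []
    for q in range(m + 1):
        row = {}
        for a in alphabet:
            candidate = pattern[:q] + a
            k = min(m, q + 1)
            # Shrink k until P_k is a suffix of P_q a (k = 0 always works).
            while not candidate.endswith(pattern[:k]):
                k -= 1
            row[a] = k
        delta.append(row)
    return delta


def fa_match(text: str, pattern: str, alphabet: str) -> list[int]:
    """Scan text once; report shift i - m whenever the accepting state m is reached."""
    delta = build_transitions(pattern, alphabet)
    m, q, shifts = len(pattern), 0, []
    for i, ch in enumerate(text, start=1):  # i is the 1-based position in T
        q = delta[q][ch]
        if q == m:
            shifts.append(i - m)
    return shifts


delta = build_transitions("ababaca", "abc")
print(delta[5])                                    # {'a': 1, 'b': 4, 'c': 6}
print(fa_match("abababacaba", "ababaca", "abc"))   # [2]
```

The scan itself does one table lookup per text character, which is where the O(n) processing time comes from; all the remaining cost is in building the table.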
Time Complexity
FA takes O(m|Σ|) time to construct.

(Book only gives a O(m^3|Σ|) algorithm.)

Total time is O(n + m|Σ|).
Correctness
Lemma 32.2: σ(xa) ≤ σ(x) + 1.
Proof:

Let r = σ(xa).

Case: r = 0. Clearly σ(xa) ≤ σ(x) + 1.

Case: r > 0.
[Figure: P_r aligned as a suffix of xa; dropping the final a leaves P_{r−1} aligned as a suffix of x.]

We have:
P_r suf xa.
⇒ P_{r−1} suf x
⇒ r − 1 ≤ σ(x).
Another Lemma
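Lemma 32.3: q = σ(x) ⇒ σ(xa) = σ(P_q a).
Proof:
Let q = σ(x).
⇒ P_q suf x
⇒ P_q a suf xa

Let r = σ(xa). By Lemma 32.2, r ≤ q + 1. We have:
[Figure: P_q a and P_r both aligned as suffixes of xa, with |P_r| ≤ |P_q a|.]
P_r suf xa and P_q a suf xa with |P_r| ≤ |P_q a|, so by Lemma 32.1, P_r suf P_q a. Hence r ≤ σ(P_q a).
Conversely, P_q a suf xa gives σ(P_q a) ≤ σ(xa) = r.
⇒ σ(P_q a) = r.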
Main Theorem
Theorem 32.4: φ(T_i) = σ(T_i) for all i = 0, 1, …, n.
Implies: in accepting state
if and only if
string processed so far has a match at position (length of string) − m

Proof:

Induction on i.

Basis: i = 0. T_0 = ε.
⇒ φ(T_0) = σ(T_0) = 0.
Step: Assume φ(T_i) = σ(T_i).

Let q = φ(T_i), a = T[i+1].

Then, φ(T_i) = σ(T_i) = q, which by
Lemma 32.3, implies
σ(T_i a) = σ(P_q a). (**)
Proof Continued
φ(T_{i+1}) = φ(T_i a)       , T_{i+1} = T_i a
           = δ(φ(T_i), a)   , φ(wa) = δ(φ(w), a)
           = δ(q, a)        , q = φ(T_i)
           = σ(P_q a)       , δ(q, a) = σ(P_q a)
           = σ(T_i a)       , by (**)
           = σ(T_{i+1})     , T_{i+1} = T_i a
Knuth-Morris-Pratt Algorithm
Achieves O(n + m) by avoiding precomputation
of δ.

Instead, we precompute π[1..m] in O(m) time.

As T is scanned, π[1..m] is used to deduce
information given by δ in FA algorithm.
Motivating Example
T: b a c b a b a b a a b c b a b
P:         a b a b a c a          (aligned at shift s; the first q = 5 characters match)

Shift s is discovered to be invalid because of mismatch
of 6th character of P.

By definition of P, we also know s + 1 is an invalid shift.

However, s + 2 may be a valid shift.
Motivating Example
T: b a c b a b a b a a b c b a b
P:             a b a b a c a      (aligned at shift s + 2; the first k = 3 characters are already known to match)

The shift s + 2. Note that the first 3 characters of T starting
at s + 2 don't have to be checked again -- we already know
what they are.
Motivating Example
P_q = a b a b a
P_k =     a b a

The longest prefix of P that is also a proper suffix of P_5 is P_3.
We will define π[5] = 3.

In general, if q characters have matched successfully at shift
s, the next potentially valid shift is s' = s + (q − π[q]).
The Prefix Function
π is called the prefix function for P.

π: {1, 2, …, m} → {0, 1, …, m−1}

π[q] = length of the longest prefix
of P that is a proper suffix
of P_q, i.e.,

π[q] = max{k: k < q and P_k suf P_q}.
Compute-π(P)
1  m := length[P];
2  π[1] := 0;
3  k := 0;
4  for q := 2 to m do
5      while k > 0 and P[k+1] ≠ P[q] do
6          k := π[k]
       od;
7      if P[k+1] = P[q] then
8          k := k + 1
       fi;
9      π[q] := k
   od;
10 return π
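A Python rendering of the prefix-function computation might look like the sketch below. The name compute_prefix is an assumption, and the returned list carries an unused entry at index 0 so that pi[q] lines up with the 1-based π[q] of the slides.

```python
def compute_prefix(pattern: str) -> list[int]:
    """Return pi with pi[q] = π[q] for q = 1..m; pi[0] is unused padding."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        # pattern[k] is P[k+1] and pattern[q - 1] is P[q] in the 1-based notation.
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]          # fall back to the next shorter candidate prefix
        if pattern[k] == pattern[q - 1]:
            k += 1
        pi[q] = k
    return pi


print(compute_prefix("ababaca")[1:])     # [0, 0, 1, 2, 3, 0, 1]
print(compute_prefix("abbabaabba")[1:])  # [0, 0, 0, 1, 2, 1, 1, 2, 3, 4]
```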
Example
i     1 2 3 4 5 6 7
P[i]  a b a b a c a
π[i]  0 0 1 2 3 0 1
Same as our FA example.

Longest prefix of P that is a proper suffix of each P_q:
P_7 = a b a b a c a : a = P_1
P_6 = a b a b a c   : ε = P_0
P_5 = a b a b a     : a b a = P_3
P_4 = a b a b       : a b = P_2
P_3 = a b a         : a = P_1
P_2 = a b           : ε = P_0
P_1 = a             : ε = P_0

Another Explanation
[Figure: states 0 through 7 joined by a spine labeled a b a b a c a, with ε transitions leading back to earlier states.]
Essentially KMP is computing a FA with epsilon moves. The "spine"
of the FA is implicit and doesn't have to be computed -- it's just the
pattern P. π gives the ε transitions. There are O(m) such transitions.
Recall from Comp 455 that a FA with epsilon moves is
conceptually able to be in several states at the same time (in
parallel). That's what's happening here -- we're exploring
pieces of the pattern in parallel.
Another Example
i     1 2 3 4 5 6 7 8 9 10
P[i]  a b b a b a a b b a
π[i]  0 0 0 1 2 1 1 2 3 4

Longest prefix of P that is a proper suffix of each P_q:
P_10 = a b b a b a a b b a : a b b a = P_4
P_9  = a b b a b a a b b   : a b b = P_3
P_8  = a b b a b a a b     : a b = P_2
P_7  = a b b a b a a       : a = P_1
P_6  = a b b a b a         : a = P_1
P_5  = a b b a b           : a b = P_2
P_4  = a b b a             : a = P_1
P_3  = a b b               : ε = P_0
P_2  = a b                 : ε = P_0
P_1  = a                   : ε = P_0
Time Complexity
Amortized Analysis --

Φ_0
    loop q = 2 (1st iteration)
Φ_1
    loop q = 3 (2nd iteration)
Φ_2
    loop q = 4 (3rd iteration)
    …
    loop q = m ((m − 1)st iteration)
Φ_{m−1}

Φ = potential function = value of k

Amortized cost: ĉ_i = c_i + Φ_i − Φ_{i−1}
where i is the iteration and c_i is the actual loop cost.
Time Complexity (Continued)
Total amortized cost:

Σ_{i=1}^{m−1} ĉ_i = Σ_{i=1}^{m−1} (c_i + Φ_i − Φ_{i−1})
                  = Σ_{i=1}^{m−1} c_i + Φ_{m−1} − Φ_0

If Φ_{m−1} ≥ Φ_0, then amortized cost upper bounds real cost.

We have Φ_0 = 0 (initial value of k)
        Φ_{m−1} ≥ 0 (final value of k).

We show ĉ_i = O(1).
Time Complexity (Continued)
The value of ĉ_i obviously depends on how many times statement
6 is executed.

Note that k > π[k]. Thus, each execution of statement 6 decreases
k by at least 1.

So, suppose that statements 5..6 iterate several times, decreasing
the value of k.

We have: number of iterations ≤ k_old − k_new. Thus,

ĉ_i ≤ O(1) + 2(k_old − k_new) + Φ_i − Φ_{i−1}

where the O(1) term is for statements other than 5 & 6, Φ_i = k_new, and Φ_{i−1} = k_old.

Hence, ĉ_i = O(1). Total cost is therefore O(m).
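The claim that statement 6 cannot run often can be checked empirically. The sketch below is an instrumented copy of the prefix-function loop that counts executions of k := π[k]. The name count_statement6 and the test patterns are assumptions chosen for illustration. Since k only grows by 1 per outer iteration and each execution of statement 6 lowers it by at least 1, the count stays below m.

```python
def count_statement6(pattern: str) -> int:
    """Run the prefix-function loop and count executions of `k := π[k]` (statement 6)."""
    m = len(pattern)
    pi = [0] * (m + 1)      # pi[q] mirrors π[q]; index 0 is padding
    k = 0
    executions = 0
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]
            executions += 1
        if pattern[k] == pattern[q - 1]:
            k += 1
        pi[q] = k
    return executions


for p in ["ababaca", "abbabaabba", "aaaaaaab"]:
    # Each printed count is strictly smaller than the pattern length m.
    print(p, count_statement6(p), len(p))
```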

Rest of the Algorithm
KMP(T, P)
n := length[T];
m := length[P];
π := Compute-π(P);
q := 0;
for i := 1 to n do
    while q > 0 and P[q+1] ≠ T[i] do
        q := π[q]
    od;
    if P[q+1] = T[i] then
        q := q + 1
    fi;
    if q = m then
        print "pattern occurs with shift i − m";
        q := π[q]
    fi
od
Time complexity of loop is O(n) (similar to the analysis of Compute-π).

Total time is O(m + n).
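Putting the two phases together, a self-contained Python sketch of the matcher could look like this. The function name kmp_match is an assumption; the prefix table is built inline so the block runs on its own, and an empty pattern is treated as having no matches, which is a choice made here rather than something the slides specify.

```python
def kmp_match(text: str, pattern: str) -> list[int]:
    """Return all valid shifts of pattern in text in O(m + n) time."""
    m = len(pattern)
    if m == 0:
        return []
    # Phase 1: prefix function, pi[q] = π[q] for q = 1..m (pi[0] unused).
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        pi[q] = k
    # Phase 2: scan the text; q is the number of pattern characters matched so far.
    shifts = []
    q = 0
    for i, ch in enumerate(text, start=1):   # i is the 1-based position in T
        while q > 0 and pattern[q] != ch:    # pattern[q] is P[q+1]
            q = pi[q]
        if pattern[q] == ch:
            q += 1
        if q == m:
            shifts.append(i - m)
            q = pi[q]                        # keep going to find overlapping matches
    return shifts


print(kmp_match("abbabababc", "ababc"))  # [5]
print(kmp_match("aaaa", "aa"))           # [0, 1, 2]
```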
Example
i     1 2 3 4 5
P[i]  a b a b c
π[i]  0 0 1 2 0
P = a b a b c
i = 1 2 3 4 5 6 7 8 9 10
T = a b b a b a b a b c
Start of 1st loop: q = 0, i = 1 [a]
2nd loop: q = 1, i = 2 [b]
3rd loop: q = 2, i = 3 [b]   mismatch detected
4th loop: q = 0, i = 4 [a]
5th loop: q = 1, i = 5 [b]
6th loop: q = 2, i = 6 [a]
7th loop: q = 3, i = 7 [b]
8th loop: q = 4, i = 8 [a]   mismatch detected
9th loop: q = 3, i = 9 [b]
10th loop: q = 4, i = 10 [c]
Termination: q = 5   match detected
Please see the book for formal correctness proofs.
(They're very tedious.)
