Text Searching Algorithms
Presented by: Aldrian Obaja
Rabin-Karp algorithm
Knuth-Morris-Pratt algorithm
Boyer-Moore(-Horspool) algorithm
Approximate matching
Regular expression
Example
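Pattern p = "001", text t = "010001" (i = current shift in t, j = index in p)
i = 0, j = 0: t[0] = p[0], continue
i = 0, j = 1: t[1] != p[1], shift
i = 1, j = 0: t[1] != p[0], shift
i = 2, j = 0: t[2] = p[0], continue
i = 2, j = 1: t[3] = p[1], continue
i = 2, j = 2: t[4] != p[2], shift
i = 3, j = 0: t[3] = p[0], continue
i = 3, j = 1: t[4] = p[1], continue
i = 3, j = 2: t[5] = p[2], match found at i = 3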
Too slow!
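The character-by-character scan traced in the example can be sketched in Python as follows. This is a minimal illustration, not code from the slides; the function name `naive_search` is an assumption, while `p`, `t`, `i` and `j` follow the notation used elsewhere in the deck.

```python
def naive_search(p, t):
    """Return the first index i where p occurs in t, or -1 if there is none."""
    m, n = len(p), len(t)
    for i in range(n - m + 1):          # try every shift of the pattern
        j = 0
        while j < m and t[i + j] == p[j]:
            j += 1                      # extend the current partial match
        if j == m:                      # all m characters matched
            return i
    return -1


print(naive_search("001", "010001"))
```

Every shift may re-compare up to m characters, so the worst case costs on the order of m(n - m + 1) comparisons, which is what the remaining algorithms try to avoid.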
Rabin-Karp Algorithm
Example
p = "0011", t = "0001001000"
f[i] = parity of the string t[i..i+3]
i     0  1  2  3  4  5  6  7  8  9
t[i]  0  0  0  1  0  0  1  0  0  0
f[i]  1  1  1  0  1  1  1
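A small Python sketch of the idea behind this example: compute the parity fingerprint of every window with a rolling update, and compare characters only where the fingerprint equals that of the pattern. The names `fingerprints` and `rabin_karp_parity` are assumptions made for illustration, and the sketch assumes a non-empty pattern over the characters "0" and "1". Parity is used because the example uses it; practical implementations usually use a modular hash instead.

```python
def fingerprints(t, m):
    """Parity of every length-m window of the bit string t (rolling update)."""
    if m == 0 or m > len(t):
        return []
    f = t[:m].count("1") % 2            # fingerprint of the first window
    out = [f]
    for i in range(len(t) - m):
        # Slide the window: drop t[i], take in t[i + m].
        f ^= int(t[i]) ^ int(t[i + m])
        out.append(f)
    return out


def rabin_karp_parity(p, t):
    """Return the first index where p occurs in t, or -1."""
    m = len(p)
    target = p.count("1") % 2           # fingerprint of the pattern
    for i, f in enumerate(fingerprints(t, m)):
        # Only verify character by character when the fingerprints agree.
        if f == target and t[i:i + m] == p:
            return i
    return -1


print(fingerprints("0001001000", 4))
print(rabin_karp_parity("0011", "0001001000"))
```

For the example values, the pattern "0011" has parity 0, so only the window at i = 3 needs to be verified; it reads "1001", so the search reports no match.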
Complexity Analysis
Knuth-Morris-Pratt Algorithm
Example
Searching for "Tweedledum" in "Tweedledee and Tweedledum"
Tweedledum
Tweedledee and Tweedledum
Shift 8 positions: the matched prefix "Tweedled" contains no "T" after its first character, so no smaller shift can align the leading "T" of the pattern
        Tweedledum
Tweedledee and Tweedledum
Example
Searching for "pappar" in "pappappapparrassan"
pappar
pappappapparrassan
Shift 3 positions
   pappar
pappappapparrassan
Shift 3 positions
      pappar
pappappapparrassan
Shift Table
Algorithm
knuth_morris_pratt_search(p,t){
  m = p.length
  n = t.length
  knuth_morris_pratt_shift(p,shift)
  i=0, j=0
  while(i+m<=n){
    while(t[i+j]==p[j]){
      j=j+1
      if(j>=m) return i
    }
    i = i+shift[j-1]
    j = max(j-shift[j-1],0)
  }
  return -1
}
Algorithm
knuth_morris_pratt_shift(p,shift){
  m = p.length
  shift[-1] = 1, shift[0] = 1
  i=1, j=0
  while(i+j<m){
    if(p[i+j]==p[j]){
      shift[i+j] = i
      j=j+1
    } else {
      if(j==0) shift[i] = i+1
      i = i+shift[j-1]
      j = max(j-shift[j-1],0)
    }
  }
}
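The two routines above can be transcribed into runnable Python roughly as follows. This is a sketch under one stated assumption: because a Python list treats index -1 as its last element, the table is stored with an offset of one, so `s[k + 1]` holds the slide's `shift[k]` for k = -1 .. m-1. The names `kmp_shift` and `kmp_search` are illustrative.

```python
def kmp_shift(p):
    """Shift table with offset one: s[k + 1] is shift[k] for k = -1 .. m-1."""
    m = len(p)
    s = [1] * (m + 1)                   # s[0] is shift[-1] = 1, s[1] is shift[0] = 1
    i, j = 1, 0
    while i + j < m:
        if p[i + j] == p[j]:
            s[i + j + 1] = i            # shift[i+j] = i
            j += 1
        else:
            if j == 0:
                s[i + 1] = i + 1        # shift[i] = i+1
            step = s[j]                 # shift[j-1]
            i += step
            j = max(j - step, 0)
    return s


def kmp_search(p, t):
    """Return the first index where p occurs in t, or -1."""
    m, n = len(p), len(t)
    if m == 0:
        return 0                        # assumption: the empty pattern matches at 0
    s = kmp_shift(p)
    i = j = 0
    while i + m <= n:
        while t[i + j] == p[j]:
            j += 1
            if j >= m:
                return i
        step = s[j]                     # shift[j-1] for the mismatch at position j
        i += step
        j = max(j - step, 0)            # characters already known to match
    return -1


print(kmp_shift("pappar")[1:])
print(kmp_search("pappar", "pappappapparrassan"))
print(kmp_search("Tweedledum", "Tweedledee and Tweedledum"))
```

For "pappar" the table entry used after a mismatch at j = 5 is 3, and for "Tweedledum" the entry used after a mismatch at j = 8 is 8, which agrees with the shifts shown in the two examples.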
Complexity Analysis
Boyer-Moore
Boyer-Moore-Horspool
Examples
Searching for "kettle" in "tea kettle"
[Figure: three alignment steps; "|" marks the characters being compared]
Searching for "date" in "detective"
[Figure: three alignment steps; "|" marks the characters being compared]
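The slides do not list code for this algorithm, so the following is only a sketch of the standard Horspool simplification of Boyer-Moore, written to reproduce the two examples. The name `horspool_search` and the dictionary-based shift table are assumptions made for illustration.

```python
def horspool_search(p, t):
    """Return the first index where p occurs in t, or -1."""
    m, n = len(p), len(t)
    if m == 0:
        return 0                        # assumption: the empty pattern matches at 0
    # For each character of p except the last: distance from its last
    # occurrence to the end of the pattern. Later occurrences overwrite
    # earlier ones, so the smallest safe shift is kept.
    shift = {c: m - 1 - k for k, c in enumerate(p[:-1])}
    i = 0
    while i + m <= n:
        j = m - 1
        while j >= 0 and t[i + j] == p[j]:   # compare right to left
            j -= 1
        if j < 0:
            return i
        # Shift by the table entry of the text character under the
        # last pattern position; unknown characters skip the whole pattern.
        i += shift.get(t[i + m - 1], m)
    return -1


print(horspool_search("kettle", "tea kettle"))
print(horspool_search("date", "detective"))
```

In the first example the mismatch at the first alignment is followed by a shift of 4, because the text character under the last pattern position is "e", and the match is found at index 4. In the second example the pattern is tried at indices 0 and 4 and then runs off the end of the text without a match.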
Examples
Regular Expression
Notations
Regex Tree
The regex tree for the query pattern 0(0|1)*0
[Figure: regex tree for 0(0|1)*0, with nodes labelled by the flags cand and eps]
Regex Matching
Requires four methods: epsilon, start, next and match_letter
next Algorithm
next(t,mark){
  if(t.value=="."){
    next(t.left,mark)
    // the right child may start if the left child just matched,
    // or if the left child can match the empty string and mark is set
    if(t.left.matched || (t.left.eps && mark)){
      next(t.right,true)
    } else {
      next(t.right,false)
    }
  } else if (t.value=="|"){
    next(t.left,mark), next(t.right,mark)
  } else if (t.value=="*"){
    if(t.matched) next(t.left,true)
    else next(t.left,mark)
  } else {
    t.cand = mark
  }
}
match_letter Algorithm
match_letter(t,a){
  if(t.value=="."){
    match_letter(t.left,a)
    t.matched = match_letter(t.right,a)
  } else if (t.value=="|"){
    t.matched = match_letter(t.left,a) || match_letter(t.right,a)
  } else if (t.value=="*"){
    t.matched = match_letter(t.left,a)
  } else {
    t.matched = t.cand && (a==t.value)
    t.cand = false
  }
  return t.matched
}
Matching Algorithm
match(w,t){
  n = w.length
  epsilon(t)
  start(t)
  i=0
  while(i<n){
    match_letter(t,w[i])
    if(t.matched)
      return true
    next(t,false)
    i=i+1
  }
  return false
}
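The tree-based matcher can be sketched in Python as below. Several parts are assumptions rather than content from the slides: the `Node` class, the bodies of `epsilon` and `start` (which the slides name but do not list), the name `next_cand` (used because `next` is a Python builtin), and the helper `build_example`. The sketch also makes three deliberate choices: the concatenation case of `next_cand` tests the flags of the left child, both operands of "|" are always evaluated because Python's `or` short-circuits, and a concatenation also counts as matched when its left child matched and its right child can match the empty string.

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right
        self.eps = self.cand = self.matched = False


def epsilon(t):
    """Assumed behaviour: set t.eps when the subtree can match the empty string."""
    if t.value == ".":
        l, r = epsilon(t.left), epsilon(t.right)
        t.eps = l and r
    elif t.value == "|":
        l, r = epsilon(t.left), epsilon(t.right)
        t.eps = l or r
    elif t.value == "*":
        epsilon(t.left)
        t.eps = True
    else:
        t.eps = False
    return t.eps


def next_cand(t, mark):
    """Mark the leaves that may match the next letter."""
    if t.value == ".":
        next_cand(t.left, mark)
        next_cand(t.right, t.left.matched or (t.left.eps and mark))
    elif t.value == "|":
        next_cand(t.left, mark)
        next_cand(t.right, mark)
    elif t.value == "*":
        next_cand(t.left, t.matched or mark)
    else:
        t.cand = mark


def match_letter(t, a):
    """Consume letter a and update the matched flags bottom-up."""
    if t.value == ".":
        l, r = match_letter(t.left, a), match_letter(t.right, a)
        t.matched = r or (l and t.right.eps)
    elif t.value == "|":
        l, r = match_letter(t.left, a), match_letter(t.right, a)  # evaluate both
        t.matched = l or r
    elif t.value == "*":
        t.matched = match_letter(t.left, a)
    else:
        t.matched = t.cand and a == t.value
        t.cand = False
    return t.matched


def start(t):
    """Assumed behaviour: clear old flags, then mark the initial candidates."""
    def clear(u):
        if u is not None:
            u.matched = u.cand = False
            clear(u.left)
            clear(u.right)
    clear(t)
    next_cand(t, True)


def match(w, t):
    """True when some non-empty prefix of w matches the regex tree t."""
    epsilon(t)
    start(t)
    for a in w:
        match_letter(t, a)
        if t.matched:
            return True
        next_cand(t, False)
    return False


def build_example():
    """Hand-built tree for 0(0|1)*0, grouped as 0 . ((0|1)* . 0)."""
    star = Node("*", Node("|", Node("0"), Node("1")))
    return Node(".", Node("0"), Node(".", star, Node("0")))


print(match("010", build_example()))
print(match("011", build_example()))
```

As in the pseudocode, leaves are recognised by exclusion, so the characters ".", "|" and "*" cannot appear as literals in this sketch.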
Example
Try to match the text "010"
[Figure: regex tree with cand, eps and matched flags, after matching 0 and call to `next`]
[Figure: regex tree with cand, eps and matched flags, after matching 1 and call to `next`]
[Figure: regex tree with cand, eps and matched flags, after matching 0 and call to `next`]
Regex Search
References
Thank You