
CS 124 Course Notes 1 Spring 2002

An algorithm is a recipe or a well-defined procedure for performing a calculation, or in general, for transforming some input into a desired output. Perhaps the most familiar algorithms are those for adding and multiplying integers. Here is a multiplication algorithm that is different from the standard algorithm you learned in school: write the multiplier and multiplicand side by side. Repeat the following operations until the first number is 1: divide the first number by 2 (throwing out any fraction) and multiply the second by 2. This results in two columns of numbers. Now cross out all rows in which the first entry is even, and add all entries of the second column that haven't been crossed out. The result is the product of the two numbers.

    75      29                            29
    37      58                     x 1001011
    18     116   (crossed out)     ---------
     9     232                            29
     4     464   (crossed out)            58
     2     928   (crossed out)           232
     1    1856                          1856
          ----                          ----
          2175                          2175
Figure 1.1: A different multiplication algorithm.
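For concreteness, here is one way the halving-and-doubling algorithm might be written in Python (a sketch; the function name peasant_multiply is our own, not part of these notes):

def peasant_multiply(a, b):
    """Multiply two positive integers by repeated halving and doubling."""
    total = 0
    while a >= 1:
        if a % 2 == 1:      # row not crossed out: first entry is odd
            total += b
        a //= 2             # halve the first number, discarding any fraction
        b *= 2              # double the second number
    return total

assert peasant_multiply(75, 29) == 2175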


In this course we will ask a number of basic questions about algorithms:

• Does it halt?

The answer for the algorithm given above is clearly yes, provided we are multiplying positive integers. The
reason is that for any integer greater than 1, when we divide it by 2 and throw out the fractional part, we always
get a smaller integer which is greater than or equal to 1. Hence our first number is eventually reduced to 1 and
the process halts.

• Is it correct?

To see that the algorithm correctly computes the product of the integers, observe that if we write a 0 for each
crossed out row, and 1 for each row that is not crossed out, then reading from bottom to top just gives us
the first number in binary. Therefore, the algorithm is just doing standard multiplication, with the multiplier
written in binary.

• Is it fast?

It turns out that the above algorithm is about as fast as the standard algorithm you learned in school. Later in
the course, we will study a faster algorithm for multiplying integers.

• How much memory does it use?

The memory used by this algorithm is also about the same as that of the standard algorithm.

The history of algorithms for simple arithmetic is quite fascinating. Although we take these algorithms for
granted, their widespread use is surprisingly recent. The key to good algorithms for arithmetic was the positional
number system (such as the decimal system). Roman numerals (I, II, III, IV, V, VI, etc) are just the wrong data
structure for performing arithmetic efficiently. The positional number system was first invented by the Mayan
Indians in Central America about 2000 years ago. They used a base 20 system, and it is unknown whether they had
invented algorithms for performing arithmetic, since the Spanish conquerors destroyed most of the Mayan books on
science and astronomy.

The decimal system that we use today was invented in India in roughly 600 AD. This positional number system,
together with algorithms for performing arithmetic, were transmitted to Persia around 750 AD, when several impor-
tant Indian works were translated into Arabic. Around this time the Persian mathematician Al-Khwarizmi wrote his
Arabic textbook on the subject. The word “algorithm” comes from Al-Khwarizmi’s name. Al-Khwarizmi’s work
was translated into Latin around 1200 AD, and the positional number system was propagated throughout Europe
from 1200 to 1600 AD.

The decimal point was not invented until the 10th century AD, by the Syrian mathematician al-Uqlidisi of
Damascus. His work was soon forgotten, and five centuries passed before decimal fractions were re-invented by the
Persian mathematician al-Kashi.

With the invention of computers in this century, the field of algorithms has seen explosive growth. There are a
number of major successes in this field:

• Parsing algorithms - these form the basis of the field of programming languages

• Fast Fourier transform - the field of digital signal processing is built upon this algorithm.

• Linear programming - this algorithm is extensively used in resource scheduling.

• Sorting algorithms - until recently, sorting used up the bulk of computer cycles.

• String matching algorithms - these are extensively used in computational biology.

• Number theoretic algorithms - these algorithms make it possible to implement cryptosystems such as the RSA
public key cryptosystem.

• Compression algorithms - these algorithms allow us to transmit data more efficiently over, for example, phone
lines.

• Geometric algorithms - displaying images quickly on a screen often makes use of sophisticated algorithmic
techniques.

In designing an algorithm, it is often easier and more productive to think of a computer in abstract terms. Of
course, we must carefully choose at what level of abstraction to think. For example, we could think of computer
operations in terms of a high level computer language such as C or Java, or in terms of an assembly language. We
could dip further down, and think of the computer at the level of AND and NOT gates.

For most algorithm design we undertake in this course, it is generally convenient to work at a fairly high level.
We will usually abstract away even the details of the high level programming language, and write our algorithms in
“pseudo-code”, without worrying about implementation details. (Unless, of course, we are dealing with a programming assignment!) Sometimes we have to be careful that we do not abstract away essential features of the problem.
To illustrate this, let us consider a simple but enlightening example.

1.1 Computing the nth Fibonacci number

Remember the famous sequence of numbers invented in the 13th century by the Italian mathematician Leonardo Fibonacci? The sequence is represented as F0, F1, F2, . . ., where F0 = 0, F1 = 1, and for all n ≥ 2, Fn is defined as Fn−1 + Fn−2. The first few Fibonacci numbers are 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, . . . The value of F31 is greater than a million! It is easy to see that the Fibonacci numbers grow exponentially. As an exercise, try to show that Fn ≥ 2^(n/2) for sufficiently large n by a simple induction.

Here is a simple program to compute Fibonacci numbers that slavishly follows the definition.

function F(n: integer): integer


if n = 0 then return 0
else if n = 1 then return 1
else return F(n − 1) + F(n − 2)

The program is obviously correct. However, it is woefully slow. As it is a recursive algorithm, we can naturally
express its running time on input n with a recurrence equation. In fact, we will simply count the number of addition
operations the program uses, which we denote by T (n). To develop a recurrence equation, we express T (n) in terms
of smaller values of T . We shall see several such recurrence relations in this class.

It is clear that T (0) = 0 and T (1) = 0. Otherwise, for n ≥ 2, we have

T (n) = T (n − 1) + T (n − 2) + 1,

because to compute F(n) we compute F(n − 1) and F(n − 2) and do one other addition besides. This is (almost) the Fibonacci equation! Hence we can see that the number of addition operations is growing very large; it is at least 2^(n/2) for n ≥ 4.

Can we do better? This is the question we shall always ask of our algorithms. The trouble with the naive algorithm is the wasteful recursion: the function F is called with the same argument over and over again, exponentially many times (try to see how many times F(1) is called in the computation of F(5)). A simple trick for improving
performance is to avoid repeated calculations. In this case, this can be easily done by avoiding recursion and just
calculating successive values:

function F(n: integer): integer
array A[0 . . . n] of integer


A[0] = 0; A[1] = 1
for i = 2 to n do:
A[i] = A[i − 1] + A[i − 2]
return A[n]

This algorithm is of course correct. Now, however, we only do n − 1 additions.
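As a concrete illustration, here are possible Python renderings of both versions (a sketch; the function names are ours):

def fib_naive(n):
    # Slavishly recursive version: exponentially many repeated calls.
    if n <= 1:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)

def fib_iterative(n):
    # Calculates successive values; performs n - 1 additions.
    if n <= 1:
        return n
    a, b = 0, 1                  # F(0) and F(1)
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b

Note that the iterative version keeps only the last two values rather than the whole array A[0 . . . n]; this does not change the number of additions, only the memory used.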



It seems that we have come so far, from exponential to polynomially many operations, that we can stop here. But in the back of our heads, we should be wondering: can we do even better? Surprisingly, we can. We rewrite our equations in matrix notation. Then

[ F1 ]   [ 0 1 ] [ F0 ]
[ F2 ] = [ 1 1 ] [ F1 ].

Similarly,

[ F2 ]   [ 0 1 ] [ F1 ]   [ 0 1 ]^2 [ F0 ]
[ F3 ] = [ 1 1 ] [ F2 ] = [ 1 1 ]   [ F1 ],

and in general,

[ Fn   ]   [ 0 1 ]^n [ F0 ]
[ Fn+1 ] = [ 1 1 ]   [ F1 ].

So, in order to compute Fn , it suffices to raise this 2 by 2 matrix to the nth power. Each matrix multiplication
takes 12 arithmetic operations, so the question boils down to the following: how many multiplications does it take
to raise a base (matrix, number, anything) to the nth power? The answer is O(log n). To see why, consider the case
where n > 1 is a power of 2. To raise X to the nth power, we compute X n/2 and then square it. Hence the number of
multiplications T (n) satisfies
T (n) = T (n/2) + 1,

from which we find T (n) = log n. As an exercise, consider what you have to do when n is not a power of 2.
(Hint: consider the connection with the multiplication algorithm of the first section; there too we repeatedly halved
a number...)
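The repeated squaring idea can be sketched in Python as follows (our own code, including one resolution of the exercise for n that is not a power of 2):

def mat_mult(X, Y):
    # Multiply two 2 x 2 matrices: 8 multiplications and 4 additions,
    # i.e., 12 arithmetic operations.
    return [[X[0][0]*Y[0][0] + X[0][1]*Y[1][0], X[0][0]*Y[0][1] + X[0][1]*Y[1][1]],
            [X[1][0]*Y[0][0] + X[1][1]*Y[1][0], X[1][0]*Y[0][1] + X[1][1]*Y[1][1]]]

def mat_power(X, n):
    # Raise X to the nth power with O(log n) matrix multiplications.
    if n == 1:
        return X
    half = mat_power(X, n // 2)
    result = mat_mult(half, half)
    if n % 2 == 1:              # odd n: one extra multiplication by X
        result = mat_mult(result, X)
    return result

def fib_matrix(n):
    if n == 0:
        return 0
    return mat_power([[0, 1], [1, 1]], n)[1][0]   # this entry of the power is F(n)

Keep in mind the caveat discussed below: in Python the matrix entries become huge integers, so each "multiplication" here is not really a unit-time operation.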

So we have reduced the computation time exponentially again, from n − 1 arithmetic operations to O(log n),
a great achievement. Well, not really. We got a little too abstract in our model. In our accounting of the time
requirements for all three methods, we have made a grave and common error: we have been too liberal about what
constitutes an elementary step. In general, we often assume that each arithmetic step takes unit time, because the
numbers involved will be typically small enough that we can reasonably expect them to fit within a computer’s
word. Remember, the number n is only log n bits in length. But in the present case, we are doing arithmetic on huge
numbers, with about n bits, where n is pretty large. When dealing with such huge numbers, if exact computation
is required we have to use sophisticated long integer packages. Such algorithms take O(n) time to add two n-bit
numbers. Hence the complexity of the first two methods was larger than we actually thought: not really O(Fn) and O(n), but instead O(nFn) and O(n^2), respectively. The second algorithm is still exponentially faster. What is worse,
the third algorithm involves multiplications of O(n)-bit integers. Let M(n) be the time required to multiply two n-bit
numbers. Then the running time of the third algorithm is in fact O(M(n)).

The comparison between the running times of the second and third algorithms boils down to a most important and ancient issue: can we multiply two n-bit integers faster than Ω(n^2)? This would be faster than the method we learn in elementary school or the clever halving method explained in the opening of these notes.

As a final consideration, we might consider the mathematicians’ solution to computing the Fibonacci numbers.
A mathematician would quickly determine that
" √ !n √ !n #
1 1+ 5 1− 5
Fn = √ − .
5 2 2

Using this, how many operations does it take to compute Fn ? Note that this calculation would require floating point
arithmetic. Whether in practice that would lead to a faster or slower algorithm than one using just integer arithmetic
might depend on the computer system on which you run the algorithm.
CS 124 Lecture 2

In order to discuss algorithms effectively, we need to start with a basic set of tools. Here, we explain these tools
and provide a few examples. Rather than spend time honing our use of these tools, we will learn how to use them by
applying them in our studies of actual algorithms.

Induction
The standard form of the induction principle is the following:

If a statement P(n) holds for n = 1, and if for every n ≥ 1, P(n) implies P(n + 1), then P holds for all n.

Let us see an example of this:

Claim 2.1 Let S(n) = ∑_{i=1}^n i. Then S(n) = n(n+1)/2.

Proof: The proof is by induction.

Base Case: We show the statement is true for n = 1. As S(1) = 1 = 1(2)/2, the statement holds.

Induction Hypothesis: We assume S(n) = n(n+1)/2.

Reduction Step: We show S(n + 1) = (n+1)(n+2)/2. Note that S(n + 1) = S(n) + n + 1. Hence

S(n + 1) = S(n) + n + 1
         = n(n + 1)/2 + (n + 1)
         = (n + 1)(n/2 + 1)
         = (n + 1)(n + 2)/2.


The proof style is somewhat pedantic, but instructional and easy to read. We break things down to the base case
– showing that the statement holds when n = 1; the induction hypothesis – the statement that P(n) is true; and the
reduction step – showing that P(n) implies P(n + 1).

Induction is one of the most fundamental proof techniques. The idea behind induction is simple: take a large problem (P(n + 1)), and somehow reduce its proof to a proof of smaller problems (such as P(n); P(n) is smaller in the sense that n < n + 1). If every problem can thereby be broken down to a small number of instances (we keep reducing down to P(1)), these can be checked easily. We will see this idea of reduction, whereby we reduce solving a problem to solving an easier problem, over and over again throughout the course.

As one might imagine, there are other forms of induction besides the specific standard form we gave above.
Here’s a different form of induction, called strong induction:

If a statement P(n) holds for n = 1, and if for every n ≥ 1 the truth of P(i) for all i ≤ n implies P(n + 1), then P holds
for all n.

Exercise: show that every number has a unique prime factorization using strong induction.

O Notation
When measuring, for example, the number of steps an algorithm takes in the worst case, our result will generally be some function T(n) of the input size, n. One might imagine that this function may have some complex form, such as T(n) = 4n^2 − 3n log n + n^(2/3) + log^3 n − 4. In very rare cases, one might wish to have such an exact form for the running time, but in general, we are more interested in the rate of growth of T(n) than in its exact form.

The O notation was developed with this in mind. With the O notation, only the fastest growing term is important,
and constant factors may be ignored. More formally:

Definition 2.2 We say for non-negative functions f (n) and g(n) that f (n) is O(g(n)) if there exist positive constants
c and N such that for all n ≥ N,
f (n) ≤ cg(n).

Let us try some examples. We claim that 2n^3 + 4n^2 is O(n^3). It suffices to show that 2n^3 + 4n^2 ≤ 6n^3 for n ≥ 1, by definition. But this is clearly true as 4n^3 ≥ 4n^2 for n ≥ 1. (Exercise: show that 2n^3 + 4n^2 is O(n^4).)

We claim 10 log_2 n is O(ln n). This follows from the fact that 10 log_2 n ≤ (10 log_2 e) ln n.

If T(n) is as above, then T(n) is O(n^2). This is a bit harder to prove, because of all the extraneous terms. It is, however, easy to see; 4n^2 is clearly the fastest growing term, and we can remove the constant with O notation. Note, though, that T(n) is O(n^3) as well! The O notation is not tight, but more like a ≤ comparison.

Similarly, there is notation for ≥ and = comparisons.

Definition 2.3 We say for non-negative functions f(n) and g(n) that f(n) is Ω(g(n)) if there exist positive constants c and N such that for all n ≥ N,
f(n) ≥ cg(n).

We say that f (n) is Θ(g(n)) if both f (n) is O(g(n)) and f (n) is Ω(g(n)).

The O notation has several useful properties that are easy to prove.

Lemma 2.4 If f1(n) is O(g1(n)) and f2(n) is O(g2(n)) then f1(n) + f2(n) is O(g1(n) + g2(n)).

Proof: There exist positive constants c1, c2, N1, and N2 such that f1(n) ≤ c1 g1(n) for n ≥ N1 and f2(n) ≤ c2 g2(n) for n ≥ N2. Hence f1(n) + f2(n) ≤ max{c1, c2}(g1(n) + g2(n)) for n ≥ max{N1, N2}.

Exercise: Prove similar lemmata for f1(n) f2(n). Prove the lemmata when O is replaced by Ω or Θ.

Finally, there is a bit more notation, corresponding to ≪, for when one function is (in some sense) much less than another.

Definition 2.5 We say for non-negative functions f(n) and g(n) that f(n) is o(g(n)) if

lim_{n→∞} f(n)/g(n) = 0.

Also, f (n) is ω(g(n)) if g(n) is o( f (n)).

We emphasize that the O notation is a tool to help us analyze algorithms. It does not always accurately tell us
how fast an algorithm will run in practice. For example, constant factors make a huge difference in practice (imagine
increasing your bank account by a factor of 10), and they are ignored in the O notation. Like any other tool, the O
notation is only useful if used properly and wisely. Use it as a guide, not as the last word, to judging an algorithm.

Recurrence Relations
A recurrence relation defines a function using an expression that includes the function itself. For example, the
Fibonacci numbers are defined by:

F(n) = F(n − 1) + F(n − 2), F(1) = F(2) = 1.

This function is well-defined, since we can compute a unique value of F(n) for every positive integer n.

Note that recurrence relations are similar in spirit to the idea of induction. The relation defines a function value F(n) in terms of the function values at smaller arguments (in this case, n − 1 and n − 2), effectively reducing the
problem of computing F(n) to that of computing F at smaller values. Base cases (the values of F(1) and F(2)) need
to be provided.

Finding exact solutions for recurrence relations is not an extremely difficult process; however, we will not
focus on solution methods for them here. Often a natural thing to do is to try to guess a solution, and then prove it
by induction. Alternatively, one can use a symbolic computation program (such as Maple or Mathematica); these
programs can often generate solutions.

We will occasionally use recurrence relations to describe the running times of algorithms. For our purposes, we
often do not need to have an exact solution for the running time, but merely an idea of its asymptotic rate of growth.
For example, the relation

T(n) = 2T(n/2) + 2n, T(1) = 1

has the exact solution (for n a power of 2) of T(n) = 2n log_2 n + n. (Exercise: Prove this by induction.) But for our
purposes, it is generally enough to know that the solution is Θ(n log n).

The following theorem is extremely useful for such recurrence relations:

Theorem 2.6 The solution to the recurrence relation T(n) = aT(n/b) + cn^k, where a ≥ 1 and b ≥ 2 are integers and c and k are positive constants, satisfies:

             O(n^(log_b a))   if a > b^k
T(n) is      O(n^k log n)     if a = b^k
             O(n^k)           if a < b^k.
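For instance, the relation T(n) = 2T(n/2) + 2n considered earlier has a = 2, b = 2, and k = 1, so a = b^k, and the theorem gives T(n) = O(n^k log n) = O(n log n), matching the exact solution found above.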

Data Structures
We shall regard integers, real numbers, and bits, as well as more complicated objects such as lists and sets, as
primitive data structures. Recall that a list is just an ordered sequence of arbitrary elements.

List q := [x1, x2, . . . , xn].

x1 is called the head of the list.

xn is called the tail of the list.

n = |q| is the size of the list.

We denote by ◦ the concatenation operation. Thus q ◦ r is the list that results from concatenating the list q with
the list r.

The operations on lists that are especially important for our purposes are:

head(q) return(x1 )
push(q, x) q := [x] ◦ q
pop(q) q := [x2 , . . . , xn ], return(x1 )
inject(q, x) q := q ◦ [x]
eject(q) q := [x1 , x2 , . . . , xn−1 ], return(xn )
size(q) return(n)

The head, pop, and eject operations are not defined for empty lists. Appropriate return values (either an error,
or an empty symbol) can be designed depending on the implementation.

A stack is a list that supports operations head, push, pop.

A queue is a list that supports operations head, inject and pop.

A deque supports all these operations.

Note that we can implement lists either by arrays or using pointers as the usual linked lists. Arrays are often
faster in practice, but they are often more complicated to program (especially if there is no implicit limit on the
number of items). In either case, each of the above operations can be implemented in a constant number of steps.
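In Python, for instance, all of these operations are available in constant (amortized) time from collections.deque; a possible correspondence (our own naming) is:

from collections import deque

q = deque([1, 2, 3])    # the list [x1, x2, x3]

head = q[0]             # head(q): return x1
q.appendleft(0)         # push(q, x): q becomes [x] ◦ q
first = q.popleft()     # pop(q): remove and return the head
q.append(4)             # inject(q, x): q becomes q ◦ [x]
last = q.pop()          # eject(q): remove and return the tail
n = len(q)              # size(q): return the length of q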

Application: Mergesort
For the rest of the lecture, we will review the procedure mergesort. The input is a list of n numbers, and the
output is a list of the given numbers sorted in increasing order. The main data structure used by the algorithm will be
a queue. We will assume that each queue operation takes 1 step, and that each comparison (is x > y?) takes 1 step.
We will show that mergesort takes O(n log n) steps to sort a sequence of n numbers.

The procedure mergesort relies on a function merge which takes as input two sorted (in increasing order) lists
of numbers and outputs a single sorted list containing all the given numbers (with repetition).

function merge (s,t)


list s,t
if s = [ ] then return t
else if t = [ ] then return s
else if s(1) ≤ t(1) then u:= pop(s)
else u:= pop(t)
return push(u, merge(s,t))
end merge

function mergesort (s)


list s, q
q=[]
for x ∈ s
inject(q, [x])
rof
while size(q) ≥ 2
u := pop(q)
v := pop(q)
inject(q, merge(u, v))
end
if q = [ ] return [ ]
else return q(1)
end mergesort
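A possible Python transcription of this queue-of-lists version (a sketch; here merge is written iteratively, anticipating the question below):

from collections import deque

def merge(s, t):
    # Merge two sorted lists into one sorted list.
    out = []
    i = j = 0
    while i < len(s) and j < len(t):
        if s[i] <= t[j]:
            out.append(s[i])
            i += 1
        else:
            out.append(t[j])
            j += 1
    return out + s[i:] + t[j:]    # append whichever list is unexhausted

def mergesort(s):
    # q is a queue of sorted lists; repeatedly merge the two front lists.
    q = deque([x] for x in s)
    if not q:
        return []
    while len(q) >= 2:
        u = q.popleft()
        v = q.popleft()
        q.append(merge(u, v))
    return q[0]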

The correctness of the function merge follows from the following fact: the smallest number in the input is either
s(1) or t(1), and must be the first number in the output list. The rest of the output list is just the list obtained by
merging s and t after deleting that smallest number.

The number of steps for each invocation of function merge is O(1) steps. Since each recursive invocation of
merge removes an element from either s or t, it follows that function merge halts in O(|s| + |t|) steps.

Question: Can you design an iterative (rather than recursive) version of merge? How much time does it take? Which version would be faster in practice, the recursive or the iterative?

Q : [ [7, 9], [1, 4], [6, 16], [2, 10] ∗ [3, 11, 12, 14], [5, 8, 13, 15] ]
Q : [ [6, 16], [2, 10] ∗ [3, 11, 12, 14], [5, 8, 13, 15], [1, 4, 7, 9] ]

Figure 2.1: One step of the mergesort algorithm.

The iterative algorithm mergesort uses q as a queue of lists. (Note that it is perfectly acceptable to have lists of
lists!) It repeatedly merges together the two lists at the front of the queue, and puts the resulting list at the tail of the
queue.

The correctness of the algorithm follows easily from the fact that we start with sorted lists (of length 1 each),
and merge them in pairs to get longer and longer sorted lists, until only one list remains. To analyze the running
time of this algorithm, let us place a special marker ∗ initially at the end of q. Whenever the marker ∗ reaches the front of q, and is either the first or the second element of q, we move it back to the end of q. Thus the presence of the marker ∗ makes no difference to the actual execution of the algorithm. Its only purpose is to partition the execution of the algorithm into phases, where a phase is the time between two successive visits of the marker ∗ to the end of q. Then we claim that the total time per phase is O(n). This is because each phase just consists of pairwise
merges of disjoint lists in the queue. Each such merge takes time proportional to the sum of the lengths of the lists,
and the sum of the lengths of all the lists in q is n. On the other hand, the number of lists is halved in each phase,
and therefore the number of phases is at most log n. Therefore the total running time of mergesort is O(n log n).

An alternative analysis of mergesort depends on a recursive, rather than iterative, description. Suppose we have
an operation that takes a list and splits it into two equal-size parts. (We will assume our list size is a power of 2, so
that all sublists we ever obtain have even size or are of length 1.) Then a recursive version of mergesort would do
the following:

function mergesort (s)


list s, s1 , s2
if size(s) = 1 then return(s)
split(s, s1 , s2 )
s1 = mergesort(s1 )
s2 = mergesort(s2 )
return(merge(s1 , s2 ))
end mergesort

Here split splits the list s into two parts s1 and s2 of equal length. The correctness follows easily from induction.

Let T (n) be the number of comparisons mergesort performs on lists of length n. Then T (n) satisfies the
recurrence relation T (n) ≤ 2T (n/2) + n − 1. This follows from the fact that to sort lists of length n we sort two
sublists of length n/2 and then merge them using (at most) n − 1 comparisons. Using our general theorem on
solutions of recurrence relations, we find that T (n) = O(n log n).

Question: The iterative version of mergesort uses a queue. Implicitly, the recursive version is using a stack. Explain
the implicit stack in the recursive version of mergesort.

Question: Solve the recurrence relation T (n) = 2T (n/2) + n − 1 exactly to obtain an upper bound on the number of
comparisons performed by the recursive mergesort variation.
CS124 Lecture 3 Spring 2002

Graphs and modeling

Formulating a simple, precise specification of a computational problem is often a prerequisite to writing a computer program for solving the problem. Many computational problems are best stated in terms of graphs. A
directed graph G(V, E) consists of a finite set of vertices V and a set of (directed) edges or arcs E. An arc is an
ordered pair of vertices (v, w) and is usually indicated by drawing a line between v and w, with an arrow pointing
towards w. Stated in mathematical terms, a directed graph G(V, E) is just a binary relation E ⊆ V ×V on a finite set
V . Undirected graphs may be regarded as special kinds of directed graphs, such that (u, v) ∈ E ↔ (v, u) ∈ E. Thus,
since the directions of the edges are unimportant, an undirected graph G(V, E) consists of a finite set of vertices V ,
and a set of edges E, each of which is an unordered pair of vertices {u, v}.
Graphs model many situations. For example, the vertices of a graph can represent cities, with edges representing
highways that connect them. In this case, each edge might also have an associated length. Alternatively, an edge
might represent a flight from one city to another, and each edge might have a weight which represents the cost of the
flight. A typical problem in this context is to compute shortest paths: given that you wish to travel from city X to city Y, what is the shortest path (or the cheapest flight schedule)? We will find very efficient algorithms for solving
these problems.
A seemingly similar problem is the traveling salesman problem. Supposing that a traveling salesman wishes to
visit each city exactly once and return to his starting point, in what order should he visit the cities to minimize the
total distance traveled? Unlike the shortest paths problem, however, this problem has no known efficient algorithm.
This is an example of an NP-complete problem, and one we will study towards the end of this course.


A different context in which graphs play a critical modeling role is in networks of pipes or communication
links. These can, in general, be modeled by directed graphs with capacities on the edges. A directed edge from u
to v with capacity c might represent a cable that can carry a flow of at most c calls per unit time from u to v. A
typical problem in this context is the max-flow problem: given a communications network modeled by a directed
graph with capacities on the edges, and two special vertices — a source s and a sink t — what is the maximum rate
at which calls from s to t can be made? There are ingenious techniques for solving these types of flow problems.
In all the cases mentioned above, the vertices and edges of the graph represented something quite concrete such
as cities and highways. Often, graphs will be used to represent more abstract relationships. For example, the vertices
of a graph might represent tasks, and the edges might represent precedence constraints: a directed edge from u to v
says that task u must be completed before v can be started. An important problem in this context is scheduling: in what order should the tasks be scheduled so that all the precedence constraints are satisfied? There are extremely fast
algorithms for this problem that we will see shortly.

Representing graphs on the computer

One common representation for a graph G(V, E) is the adjacency matrix. Suppose V = {1, · · · , n}. The adjacency matrix for G(V, E) is an n × n matrix A, where a_{i,j} = 1 if (i, j) ∈ E and a_{i,j} = 0 otherwise.^1 The advantage of the adjacency matrix representation is that it takes constant time (just one memory access) to determine whether or not there is an edge between any two given vertices. In the case that each edge has an associated length or a weight, the adjacency matrix representation can be appropriately modified so entry a_{i,j} contains that length or weight instead of just a 1. The disadvantage of the adjacency matrix representation is that it requires Ω(n^2) storage, even if the graph has as few as O(n) edges. Moreover, just examining all the entries of the matrix would require Ω(n^2) steps, thus precluding the possibility of linear time algorithms for graphs with o(n^2) edges (at least in cases where all the matrix entries must be examined).
An alternative representation for a graph G(V, E) is the adjacency list representation. We say that a vertex j is
adjacent to a vertex i if (i, j) ∈ E. The adjacency list for a vertex i is a list of all the vertices adjacent to i (in any
order). To represent the graph, we use an array of size n to represent the vertices of the graph, and the ith element of
the array points to the adjacency list of the ith vertex. The total storage used by an adjacency list representation of a
graph with n vertices and m edges is O(n + m). The adjacency list representation hence avoids the disadvantage of
using more space than necessary. We will use this representation for all our graph algorithms that take linear or near
linear time. A disadvantage of adjacency lists, however, is that determining whether there is an edge from vertex i to
vertex j may take as many as n steps, since there is no systematic shortcut to scanning the adjacency list of vertex i.
For applications where determining if there is an edge between two vertices is the bottleneck, the adjacency matrix
is thus preferable.
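As an illustration, both representations are easy to build in Python (a sketch, using a made-up four-vertex graph):

# A directed graph on vertices 0..3 with edges (0,1), (0,2), (1,2), (2,3).
n = 4
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

# Adjacency matrix: Omega(n^2) storage, constant-time edge queries.
adj_matrix = [[0] * n for _ in range(n)]
for i, j in edges:
    adj_matrix[i][j] = 1

# Adjacency lists: O(n + m) storage; an edge query may scan a whole list.
adj_list = [[] for _ in range(n)]
for i, j in edges:
    adj_list[i].append(j)

assert adj_matrix[0][2] == 1    # one memory access
assert 2 in adj_list[0]         # may take up to n steps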

^1 Generally, we use either n or |V| for the number of nodes in a graph, and m or |E| for the number of edges.

Depth first search

There are two fundamental algorithms for searching a graph: depth first search and breadth first search. To
better understand the need for these procedures, let us imagine the computer’s view of a graph that has been input
into it, in the adjacency list representation. The computer’s view is fundamentally local to a specific vertex: it can
examine each of the edges adjacent to a vertex in turn, by traversing its adjacency list; it can also mark vertices as
visited. One way to think of these operations is to imagine exploring a dark maze with a flashlight and a piece of
chalk. You are allowed to illuminate any corridor of the maze emanating from your current position, and you are
also allowed to use the chalk to mark your current location in the maze as having been visited. The question is how
to find your way around the maze.
We now show how the depth first search allows the computer to find its way around the input graph using just
these primitives. (We will examine breadth first search shortly.)
Depth first search is a technique for exploring a graph using a stack as the basic data structure. We start by
defining a recursive procedure search (the stack is implicit in the recursive calls of search): search is invoked on a
vertex v, and explores all previously unexplored vertices reachable from v.

Procedure search(v)
vertex v
explored(v) := 1
previsit(v)
for (v, w) ∈ E
if explored(w) = 0 then search(w)
rof
postvisit(v)
end search

Procedure DFS (G(V, E))


graph G(V, E)
for each v ∈ V do
explored(v) := 0
rof
for each v ∈ V do
if explored(v) = 0 then search(v)
rof
end DFS
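In Python, the same procedure might look like this (a sketch; here previsit and postvisit are specialized to assign the preorder and postorder numbers discussed below):

def dfs(graph):
    # graph: dict mapping each vertex to the list of its neighbors.
    explored = set()
    preorder, postorder = {}, {}
    clock = [1]                       # shared counter, our notion of time

    def search(v):
        explored.add(v)
        preorder[v] = clock[0]        # previsit
        clock[0] += 1
        for w in graph[v]:
            if w not in explored:
                search(w)
        postorder[v] = clock[0]       # postvisit
        clock[0] += 1

    for v in graph:
        if v not in explored:
            search(v)
    return preorder, postorder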

By modifying the procedures previsit and postvisit, we can use DFS to solve a number of important problems,
as we shall see. It is easy to see that depth first search takes O(|V | + |E|) steps (assuming previsit and postvisit take
O(1) time), since it explores from each vertex once, and the exploration involves a constant number of steps per
outgoing edge.
The procedure search defines a tree in a natural way: each time that search discovers a new vertex, say w, we
can incorporate w into the tree by connecting w to the vertex v it was discovered from via the edge (v, w). The
remaining edges of the graph can be classified into three types:

• Forward edges - these go from a vertex to a descendant (other than child) in the DFS tree.

• Back edges - these go from a vertex to an ancestor in the DFS tree.

• Cross edges - these go from “right to left”– there is no ancestral relation.

Question: Explain why if the graph is undirected, there can be no cross edges.
One natural use of previsit and postvisit is to have each keep a counter that is increased each time one of these routines is accessed; this corresponds naturally to a notion of time. Each routine could assign to each vertex a preorder number (time) and a postorder number (time) based on the counter. If we think of depth first search as
using an explicit stack, then the previsit number is assigned when the vertex is first placed on the stack, and the
postvisit number is assigned when the vertex is removed from the stack. Note that this implies that the intervals
[preorder(u), postorder(u)] and [preorder(v), postorder(v)] are either disjoint, or one contains the other.

An important property of depth-first search is that the contents of the stack at any time yield a path from the root
to some vertex in the depth first search tree. (Why?) This allows us to prove the following property of the postorder
numbering:

Claim 3.1 If (u, v) ∈ E then postorder(u) < postorder(v) ⇐⇒ (u, v) is a back edge.

Proof: If postorder(u) < postorder(v) then v must be pushed on the stack before u. Otherwise, the existence
of edge (u, v) ensures that v must be pushed onto the stack before u can be popped, resulting in postorder(v) <
postorder(u) — contradiction. Furthermore, since v cannot be popped before u, it must still be on the stack when u
is pushed on to it. It follows that v is on the path from the root to u in the depth first search tree, and therefore (u, v)
is a back edge.
The other direction is trivial.
Exercise: What conditions do the preorder and postorder numbers have to satisfy if (u, v) is a forward edge? A cross edge?

Claim 3.2 G(V, E) has a cycle iff the DFS of G(V, E) yields a back edge.

Proof: If (u, v) is a back edge, then (u, v) together with the path from v to u in the depth first tree form a cycle.
Conversely, for any cycle in G(V, E), consider the vertex assigned the smallest postorder number. Then the
edge leaving this vertex in the cycle must be a back edge by Claim 3.1, since it goes from a lower postorder number
to a higher postorder number.

[Drawing not reproduced: a directed graph on vertices A through F, shown alongside its DFS tree.]

Graph is explored in preorder ABCDEF. Postorder is DCBAFE. DB is a back edge. AD is a forward edge. EC is a cross edge.

Figure 3.1: A sample depth-first search.

Application of DFS: Topological sort

We now suggest an algorithm for the scheduling problem described previously. We are given a directed graph G(V, E), whose vertices V = {v1, . . . , vn} represent tasks, and whose edges represent precedence constraints: a directed edge from u to v says that task u must be completed before v can be started. The problem of topological sorting asks: in what order should the tasks be scheduled so that all the precedence constraints are satisfied?
Note: The graph must be acyclic for this to be possible. (Why?) Directed acyclic graphs appear so frequently
they are commonly referred to as DAGs.

Claim 3.3 If the tasks are scheduled by decreasing postorder number, then all precedence constraints are satisfied.

Proof: If G is acyclic then the DFS of G produces no back edges by Claim 3.2. Therefore by Claim 3.1,
(u, v) ∈ G implies postorder(u) > postorder(v). So, if we process the tasks in decreasing order by postorder number,
when task v is processed, all tasks with precedence constraints into v (and therefore higher postorder numbers) must
already have been processed.
There’s another way to think about topologically sorting a DAG. Each DAG has a source, which is a vertex
with no incoming edges. Similarly, each DAG has a sink, which is a vertex with no outgoing edges. (Proving this
is an exercise.) Another way to topologically order the vertices of a DAG is to repeatedly output a source, remove
it from the graph, and repeat until the graph is empty. Why does this work? Similarly, one could repeatedly output
sinks, and this gives the reverse of a valid topological order. Again, why?
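A sketch of Claim 3.3 in Python (our own code; it assumes the input really is a DAG):

def topological_sort(graph):
    # graph: dict mapping each task to the list of tasks it must precede.
    explored = set()
    order = []                   # vertices in increasing postorder number

    def search(v):
        explored.add(v)
        for w in graph[v]:
            if w not in explored:
                search(w)
        order.append(v)          # postvisit: v receives the next postorder number

    for v in graph:
        if v not in explored:
            search(v)
    return order[::-1]           # schedule by decreasing postorder number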

Strongly Connected Components

Connectivity in undirected graphs is rather straightforward. A graph that is not connected can naturally be
decomposed into several connected components (Figure 3.2). DFS does this handily: each restart of DFS marks a
new connected component.

[Drawing not reproduced: an undirected graph on vertices 1 through 14, with several connected components.]

Figure 3.2: An undirected graph



In directed graphs, what connectivity means is more subtle. In some primitive sense, the directed graph in
Figure 3.3 appears connected, since if it were an undirected graph, it would be connected. But there is no path from
vertex 12 to 6, or from 6 to 1, so saying the graph is connected would be misleading.
We must begin with a meaningful definition of connectivity in directed graphs. Call two vertices u and v of
a directed graph G = (V, E) connected if there is a path from u to v, and one from v to u. This relation between
vertices is reflexive, symmetric, and transitive (check!), so it is an equivalence relation on the vertices. As such, it
partitions V into disjoint sets, called the strongly connected components (SCC’s) of the graph (in Figure 3.3 there
are four SCC’s). Within a strongly connected component, every pair of vertices are connected.

[Drawing not reproduced: a directed graph on vertices 1 through 12, shown alongside the DAG of its SCC's: {1}, {2, 4, 5}, {3, 6}, and {7, 8, 9, 10, 11, 12}.]

Figure 3.3: A directed graph and its SCC's



We now imagine shrinking each SCC into a vertex (a supervertex), and draw an edge (a superedge) from SCC
X to SCC Y if there is at least one edge from a vertex in X to a vertex in Y . The resulting directed graph has to be a
directed acyclic graph (DAG) – that is to say, it can have no cycles (see Figure 3.3). The reason is simple: a cycle
containing several SCC’s would merge to a single SCC, since there would be a path between every pair of vertices
in the SCC’s of the cycle. Hence, every directed graph is a DAG of its SCC’s.
This important decomposition theorem allows one to think of connectivity information of a directed graph
in two levels. At the top level we have a DAG, which has a useful, simple structure. For example, as we have
mentioned before, a DAG is guaranteed to have at least one source (a vertex without incoming edges) and a sink
(a vertex without outgoing edges). If we want more details, we could look inside a vertex of the DAG to see the
full-fledged SCC —a completely connected graph— that lies there.
This decomposition is extremely useful and informative; it is thus very fortunate that we have a very efficient
algorithm, based on DFS, that finds the strongly connected components in linear time! We motivate this algorithm
next. It is based on several interesting and slightly subtle properties of DFS:

Property 1: If DFS is started at a vertex v, then it will get stuck and restarted precisely when all vertices in the SCC
of v, and in all the SCC’s that are reachable from the SCC of v, are visited. Consequently, if DFS is started at a
vertex of a sink SCC (a SCC that has no edges leaving it in the DAG of SCC’s), then it will get stuck after it visits
precisely the vertices of this SCC.

For example, if DFS is started at vertex 11 in Figure 3.3 (a vertex in the only sink SCC in this graph), then it will visit
the six vertices in the sink SCC before getting stuck: vertices 12, 10, 9, 7, 8. Property 1 suggests a way of starting a
decomposition algorithm, by finding the first SCC: start DFS from a vertex in a sink SCC, and, when stuck, output
the vertices that have been visited. They form an SCC!
Of course, this leaves us with two problems: (A) How to guess a vertex in a sink SCC, and (B) how to continue
our algorithm by outputting the next SCC, and so on.

Let us first face Problem (A). It turns out that it will be easier not to look for vertices in a sink SCC, but instead
look for vertices in a source SCC. In particular:

Property 2: The vertex with the highest postorder number in DFS (that is, the vertex where the DFS ends) belongs
to a source SCC.
The proof is by contradiction. If Property 2 were not true, and v is the vertex with the highest post-order number, then there would be an incoming edge (u, w) with u not in the SCC of v and w in the SCC of v. If u were searched before v, then u clearly has a higher postorder number. If u were searched after v, then since u does not lie in v's SCC, it must not be searched until v is popped from the search stack, so again u must have a higher postorder number than v.

The reason behind Property 2 is thus not hard to see: if there is an SCC “above” the SCC of the vertex where the
DFS ends, then the DFS should have ended in that SCC (reaching it either by restarting or by backtracking).
Property 2 provides an indirect solution to Problem (A). Consider a graph G and the reverse graph G^R —G with the directions of all edges reversed. G^R has precisely the same SCC's as G (why?). So, if we make a DFS in G^R, then the vertex where we end (the one with the highest post-order) belongs to a source SCC of G^R —that is to say, a sink SCC of G. We have solved Problem (A).

Onwards to Problem (B). How does the algorithm continue after the first sink component is output? The solution is clear: delete the SCC just output from G^R, and make another DFS in the remaining graph. The only problem is, this would be a quadratic, not linear, algorithm, since we would run an O(m) DFS algorithm for each of up to O(n) vertices. How can we avoid this extra work? The key observation here is that we do not have to make a new DFS in the remaining graph:

Property 3: If we make a DFS in a directed graph, and then delete a source SCC of this graph, what remains is a
DFS in the remaining graph (the pre-order and post-order numbers may now not be consecutive, but they will be of
the right relative magnitude).

This is also easy to justify. We just imagine two runs of the DFS algorithm, one with and one without the source SCC. Consider a transcript recording the steps of the DFS algorithm. It is easy to see that the transcript of both runs would be the same (assuming they both made the same choices of what edges to follow at what points), except where the first went through the source SCC.

Property 3 allows us to use induction to continue our SCC algorithm. After we output the first SCC, we can use the same DFS information from G^R to output the second SCC, the third SCC, and so on. The full algorithm can thus be described as follows:

Step 1: Perform DFS on G^R.

Step 2: Perform DFS on G, processing unsearched vertices in the order of decreasing postorder numbers from the DFS of Step 1. At the beginning and at every restart, print "New SCC:". When visiting a vertex v, print v.

This algorithm is linear-time, since the total work is really just two depth-first searches, each of which is linear time.
Question: How does one construct G^R from G? If we run this algorithm on Figure 3.3, Step 1 yields the following order on the vertices (decreasing postorder in G^R's DFS): 7, 9, 10, 12, 11, 8, 3, 6, 2, 5, 4, 1. Step 2 now produces the following output: New SCC: 7, 8, 10, 9, 11, 12, New SCC: 3, 6, New SCC: 2, 4, 5, New SCC: 1.
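The two steps translate directly into Python (a sketch; the vertex and edge handling is our own):

def strongly_connected_components(graph):
    # graph: dict mapping each vertex to the list of its neighbors.
    # Build the reverse graph G^R.
    reverse = {v: [] for v in graph}
    for v in graph:
        for w in graph[v]:
            reverse[w].append(v)

    def search(g, v, out):
        explored.add(v)
        for w in g[v]:
            if w not in explored:
                search(g, w, out)
        out.append(v)            # postvisit: record v in postorder

    # Step 1: DFS on G^R, recording vertices in increasing postorder.
    explored, order = set(), []
    for v in reverse:
        if v not in explored:
            search(reverse, v, order)

    # Step 2: DFS on G, restarting in decreasing postorder from Step 1;
    # each restart collects exactly one SCC.
    explored, sccs = set(), []
    for v in reversed(order):
        if v not in explored:
            component = []
            search(graph, v, component)
            sccs.append(component)
    return sccs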

Incidentally, there is more sophisticated connectivity information that one can derive from undirected graphs.
An articulation point is a vertex whose deletion increases the number of connected components in the undirected
graph. In Figure 3.2 there are 4 articulation points: 3, 6, 8, and 13. Articulation points divide the graph into bicon-
nected components (the pieces of the graph between articulation points) and bridge edges. Biconnected components
are maximal edge sets (of at least 2 edges) such that any two edges in the set lie on a common cycle. For example,
the large connected component of the graph in Figure 3.2 contains the biconnected components on edges between
vertices 1-2-3-4-5-7-8 and 6-9-10. The remaining edges, 3-6 and 8-11, are bridge edges; they disconnect the
graph. Not coincidentally, this more sophisticated and subtle connectivity information can also be captured by DFS.

Putting it Into Practice

Suppose you are debugging your latest huge software program for a major industrial client. The program has
hundreds of procedures, each of which must be carefully tested for bugs.
You realize that, to save yourself some work, it would be best to analyze the procedures in a particular order. For instance, if procedure Write_Check() calls Get_Check_Number(), you would probably want to test Get_Check_Number() first. That way, when you look for the bugs in Write_Check(), you do not have to worry about checking (or re-checking) Get_Check_Number(). (Let's ignore the specious argument that if there are no bugs, you might avoid testing and debugging Get_Check_Number() altogether by starting with Write_Check().)
You can easily generate a list of what procedures each procedure calls with a single pass through the code. So
here’s the problem: given your program, determine what schedule you should give your testing and debugging team,
so that a procedure will be debugged only after anything it calls will be debugged.
Go through the program, creating one vertex for each procedure. Introduce a directed edge from vertex A to vertex B if the procedure B calls A. This directed edge represents the fact that A must be debugged before B. We call
this graph the procedure graph. If this graph is acyclic, then the topological sort will give you a valid ordering for
the debugging.
What if the graph is not acyclic? Then your program uses mutual recursion; that is, there is some chain of
procedures through which a procedure might end up calling itself. For example, this would be the case if procedure
A calls procedure B, procedure B calls procedure C, and procedure C calls procedure A. A topological sort will
detect these cycles, but what we really want is a list of them, since instances of mutual recursion are harder to test
and debug.
In this case, we should use the strongly connected components algorithm on the procedure graph. The SCC
algorithm will find all the cycles, showing all instances of mutual recursion. Moreover, if we collapse the cycles in
the graph, so that instances of mutual recursion are treated as one large super-procedure, then the SCC algorithm
will provide a valid debugging ordering for all the procedures in this modified graph. That is, the SCC algorithm
will topologically sort the underlying SCC DAG.
CS124 Lecture 4 Spring 2002

Breadth-First Search
A searching technique with different properties than DFS is Breadth-First Search (BFS). While DFS used an
implicit stack, BFS uses an explicit queue structure in determining the order in which vertices are searched. Also,
generally one does not restart BFS, because BFS only makes sense in the context of exploring the part of the graph
that is reachable from a particular vertex (s in the algorithm below).

Procedure BFS (G(V, E), s ∈ V )

graph G(V, E)
array[|V |] of integers dist
queue q
for each v ∈ V do
explored(v) := 0
rof
explored(s) := 1
dist[s] := 0
inject(q, s)
while size(q) > 0
v := pop(q)
previsit(v)
for (v, w) ∈ E
if explored(w) = 0 then
explored(w) := 1
inject(q, w)
dist[w] := dist[v] + 1
fi
rof
end while
end BFS
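In Python, with dist itself serving as the explored marker, BFS might be written as follows (our own sketch):

from collections import deque

def bfs(graph, s):
    # graph: dict mapping each vertex to the list of its neighbors.
    # Returns dist, the number of edges on a shortest path from s
    # to each vertex reachable from s.
    dist = {s: 0}
    q = deque([s])
    while q:
        v = q.popleft()
        for w in graph[v]:
            if w not in dist:        # w not yet explored
                dist[w] = dist[v] + 1
                q.append(w)
    return dist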

BFS runs, of course, in linear time O(|E|), under the assumption that |E| ≥ |V |. The reason is that BFS visits
each edge exactly once, and does a constant amount of work per edge.


[Drawing not reproduced: a directed graph searched by BFS from a vertex S, with each vertex labeled by its distance (0 through 3) from S.]

Figure 4.1: BFS of a directed graph

Although BFS does not have the same subtle properties of DFS, it does provide useful information. BFS visits
vertices in order of increasing distance from s. In fact, our BFS algorithm above labels each vertex with the distance
from s, or the number of edges in the shortest path from s to the vertex. For example, applied to the graph in
Figure 4.1, this algorithm labels the vertices (by the array dist) as shown.

Why are we sure that the array dist is the shortest-path distance from s? A simple induction proof suffices. It
is certainly true if the distance is zero (this happens only at s). And, if it is true for dist(v) = d, then it can be easily
shown to be true for values of dist equal to d + 1 —any vertex that receives this value has an edge from a vertex with
dist d, and from no vertex with lower value of dist. Notice that vertices not reachable from s will not be visited or
labeled.

Single-Source Shortest Paths —Nonnegative Lengths


What if each edge (v, w) of our graph has a length, a positive integer denoted length(v, w), and we wish to find
the shortest paths from s to all vertices reachable from it? (What if we are interested only in the shortest path from s
to a specific node t? As it turns out, all algorithms known for this problem have to compute the shortest path from s
to all vertices reachable from it.) BFS offers a possible solution. We can subdivide each edge (u, v) into length(u, v)
edges, by inserting length(u, v) − 1 “dummy” nodes, and then apply DFS to the new graph. This algorithm solves
the shortest-path problem in time O( ∑(u,v)∈E length(u, v)). Unfortunately, this can be very large —lengths could be
in the thousands or millions. So we need to find a better way.

The problem is that this BFS-based algorithm will spend most of its time visiting “dummy” vertices; only
occasionally will it do something truly interesting, like visit a vertex of the original graph. What we would like to
do is run this algorithm, but only do work for the “interesting” steps.

To do this, we need to generalize BFS. Instead of using a queue, we will instead use a heap or priority queue of vertices. A heap is a data structure that keeps a set of objects, where each object has an associated value. The operations a heap H implements include the following:

deletemin(H)      return the object with the smallest value
insert(x, y, H)   insert a new object x with value y into the structure
change(x, y, H)   if y is smaller than x's current value, change the value of object x to y

We will not distinguish between insert and change, since for our purposes, they are essentially equivalent;
changing the value of a vertex will be like re-inserting it. (In all heap implementations we assume that we have an
array of pointers that gives, for each vertex, its position in the heap, if any. This allows us to always have at most
one copy of each vertex in the heap. Furthermore, it makes changes and inserts essentially equivalent operations.)

Each entry in the heap will stand for a projected future “interesting event” of our extended BFS. Each entry will
correspond to a vertex, and its value will be the current projected time at which we will reach the vertex. Another
way to think of this is to imagine that, each time we reach a new vertex, we can send an explorer down each adjacent
edge, and this explorer moves at a rate of 1 unit distance per second. With our heap, we will keep track of when each
vertex is due to be reached for the first time by some explorer. Note that the projected time until we reach a vertex
can decrease, because the new explorers that arise when we reach a newly explored vertex could reach a vertex first
(see node b in Figure 4.2). But one thing is certain: the most imminent future scheduled arrival of an explorer must
happen, because there is no other explorer who can reach any vertex faster. The heap conveniently delivers this most
imminent event to us.

As in all shortest path algorithms we shall see, we maintain two arrays indexed by V . The first array, dist[v],
will eventually contain the true distance of v from s. The other array, prev[v], will contain the last node before v in
the shortest path from s to v. Our algorithm maintains a useful invariant property: at all times dist[v] will contain a
conservative over-estimate of the true shortest distance of v from s. Of course dist[s] is initialized to its true value 0,
and all other dist's are initialized to ∞, which is a remarkably conservative overestimate. The algorithm is known as Dijkstra's algorithm, named after its inventor, Edsger W. Dijkstra.

Algorithm Dijkstra (G = (V, E, length); s ∈ V )


v, w: vertices
dist: array[V ] of integer
prev: array[V ] of vertices
H: priority heap of V
H := {s : 0}
for v ∈ V do
dist[v] := ∞, prev[v] :=nil
rof
dist[s] := 0
while H ≠ ∅
v := deletemin(H)
for (v, w) ∈ E
if dist[w] > dist[v]+ length(v, w)
dist[w] := dist[v] + length(v, w), prev[w] := v, insert(w,dist[w], H)
fi
rof
end while
end Dijkstra
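Python's heapq module provides a usable heap; since it has no change operation, the sketch below simply re-inserts a vertex when its value decreases and skips stale entries when they surface, in the spirit of the remark above that inserts and changes are essentially equivalent (our own code):

import heapq

def dijkstra(graph, s):
    # graph: dict mapping v to a list of (w, length) pairs, lengths >= 0.
    dist = {v: float('inf') for v in graph}
    prev = {v: None for v in graph}
    dist[s] = 0
    heap = [(0, s)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist[v]:
            continue                 # stale entry; a smaller value was already processed
        for w, length in graph[v]:
            if dist[w] > dist[v] + length:
                dist[w] = dist[v] + length
                prev[w] = v
                heapq.heappush(heap, (dist[w], w))
    return dist, prev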

[Drawing not reproduced: a weighted directed graph with source s and vertices a through f, each labeled with its final distance from s (s: 0, a: 2, b: 4, c: 3, d: 6, e: 6, f: 5), together with the edge lengths.]

Figure 4.2: Shortest paths

The algorithm, run on the graph in Figure 4.2, will yield the following heap contents (node: dist/priority pairs) at the beginning of the while loop: {s : 0}, {a : 2, b : 6}, {b : 5, c : 3}, {b : 4, e : 7, f : 5}, {e : 7, f : 5, d : 6}, {e : 6, d : 6}, {e : 6}, {}. The distances from s are shown in Figure 4.2, together with the shortest path tree from s, the rooted tree defined by the pointers prev.

What is the running time of this algorithm? The algorithm involves |E| insert operations and |V | deletemin
operations on H, and so the running time depends on the implementation of the heap H. There are many ways to
implement a heap. Even an unsophisticated implementation as a linked list of node/priority pairs yields an interesting
time bound, O(|V |2 ) (see first line of the table below). A binary heap would give O(|E| log |V |).

Which of the two should we prefer? The answer depends on how dense or sparse our graphs are. In all graphs, |E| is between |V| and |V|^2. If it is Ω(|V|^2), then we should use the linked list version. If it is anywhere below |V|^2/log |V|, we should use binary heaps.

heap implementation   deletemin             insert              |V|×deletemin + |E|×insert

linked list           O(|V|)                O(1)                O(|V|^2)
binary heap           O(log |V|)            O(log |V|)          O(|E| log |V|)
d-ary heap            O((d log |V|)/log d)  O(log |V|/log d)    O((|V| · d + |E|) log |V|/log d)
Fibonacci heap        O(log |V|)            O(1) amortized      O(|V| log |V| + |E|)

A more sophisticated data structure, the d-ary heap, performs even better. A d-ary heap is just like a binary heap, except that the fan-out of the tree is d, instead of 2. (Here d should be at least 2, however!) Since the depth of any such tree with |V| nodes is log |V|/log d, it is easy to see that inserts take this amount of time. Deletemins take d times that, because deletemins go down the tree, and must look at the children of all vertices visited.

The complexity of this algorithm is a function of d. We must choose d to minimize it. A natural choice is d = |E|/|V|, which is the average degree! (Note that this is the natural choice because it equalizes the two terms of |E| + |V| · d. Alternatively, the "exact" value can be found using calculus.) This yields an algorithm that is good for both sparse and dense graphs. For dense graphs, its running time is O(|V|^2). For graphs with |E| = O(|V|), it is O(|V| log |V|). Finally, for graphs with intermediate density, such as |E| = |V|^(1+δ), where δ is the density of the graph, the algorithm is linear!

The fastest known implementation of Dijkstra's algorithm uses a data structure known as a Fibonacci heap,
which we will not cover here. Note that the bounds for the insert operation for Fibonacci heaps are amortized
bounds: certain operations may be expensive, but the average cost over a sequence of operations is constant.

Single-Source Shortest Paths: General Lengths


Our argument of correctness of our shortest path algorithm was based on the “time metaphor:” the most im-
minent prospective event (arrival of an explorer) must take place, exactly because it is the most imminent. This
however would not work if we had negative edges. (Imagine explorers being able to arrive before they left!) If the
length of edge (a, b) in Figure 4.2 were −1, the shortest path from s to b would have value 1, not 4, and our simple
algorithm fails. Obviously, with negative lengths we need more involved algorithms, which repeatedly update the
values of dist.

We can describe a general paradigm for constructing shortest path algorithms with arbitrary edge weights. The
algorithms use arrays dist and prev, and again we maintain the invariant that dist is always a conservative overestimate
of the true distance from s. (Again, dist is initialized to ∞ for all nodes, except for s for which it is 0).

The algorithms maintain dist so that it is always a conservative overestimate; they will only update a value
when a suitable path is discovered to show that the overestimate can be lowered. That is, suppose we find a neighbor
w of v, with dist[v] > dist[w] + length(w, v). Then we have found an actual path that shows the distance estimate is
too conservative. We therefore repeatedly apply the following update rule.

procedure update((w, v))
    edge (w, v)
    if dist[v] > dist[w] + length(w, v) then
        dist[v] := dist[w] + length(w, v), prev[v] := w

A crucial observation is that this procedure is safe, in that it never invalidates our “invariant” that dist is a
conservative overestimate.

The key idea is to consider how these updates along edges should occur. In Djikstra’s algorithm, the edges are
updated according to the time order of the imaginary explorers. But this only works with positive edge lengths.

A second crucial observation concerns how many updates we have to do. Let a ≠ s be a node, and consider the shortest path from s to a, say s, v1, v2, . . . , vk = a for some k between 1 and n − 1. If we perform update first on (s, v1), later on (v1, v2), and so on, and finally on (vk−1, a), then we are sure that dist(a) contains the true distance from s to a, and that the true shortest path is encoded in prev. (Exercise: Prove this, by induction.) We must thus find a sequence of updates that guarantees that these edges are updated in this order. We don't care if these or other edges are updated several times in between, since all we need is to have a sequence of updates that contains this particular subsequence. There is a very easy way to guarantee this: update all edges |V| − 1 times in a row!

Algorithm Shortest Paths 2 (G = (V, E, length); s ∈ V)

    v, w: vertices
    dist: array[V] of integer
    prev: array[V] of vertices
    i: integer
    for v ∈ V do
        dist[v] := ∞, prev[v] := nil
    rof
    dist[s] := 0
    for i = 1 . . . n − 1 do
        for (w, v) ∈ E do update((w, v))
end shortest paths 2

This algorithm solves the general single-source shortest path problem in O(|V | · |E|) time.
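
Algorithm Shortest Paths 2 is the classical Bellman–Ford algorithm. Here is a minimal Python sketch; the representation of the graph as a list of (w, v, length) triples is an assumption for illustration.

def shortest_paths_2(vertices, edges, s):
    # vertices: list of vertex names; edges: list of (w, v, length) triples
    dist = {v: float('inf') for v in vertices}
    prev = {v: None for v in vertices}
    dist[s] = 0
    for _ in range(len(vertices) - 1):       # |V| - 1 rounds
        for w, v, length in edges:           # update every edge
            if dist[v] > dist[w] + length:
                dist[v] = dist[w] + length
                prev[v] = w
    return dist, prev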

Negative Cycles
In fact, there is a further problem that negative edges can cause. Suppose the length of edge (b, a) in Figure 2 were changed to −5. Then the graph would have a negative cycle (from a to b and back). On such graphs, it does not make sense to even ask the shortest path question. What is the shortest path from s to c in the modified graph? The one that goes directly from s to a to c (cost: 3), or the one that goes from s to a to b to a to c (cost: 1), or the one that takes the cycle twice (cost: −1)? And so on.

The shortest path problem is ill-posed in graphs with negative cycles. It makes no sense and deserves no
answer. Our algorithm in the previous section works only in the absence of negative cycles. (Where did we assume
no negative cycles in our correctness argument? Answer: When we asserted that a shortest path from s to a exists!)
But it would be useful if our algorithm were able to detect whether there is a negative cycle in the graph, and thus to
report reliably on the meaningfulness of the shortest path answers it provides.

This is easily done. After the |V | − 1 rounds of updates of all edges, do a last update. If any changes occur
during this last round of updates, there is a negative cycle. This must be true, because if there were no negative
cycles, |V | − 1 rounds of updates would have been sufficient to find the shortest paths.
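
Continuing the Python sketch above (same assumed edge-list format), the extra round of updates is one more pass over the edges:

def has_negative_cycle(edges, dist):
    # dist: the array computed after |V| - 1 rounds of updates
    for w, v, length in edges:
        if dist[w] + length < dist[v]:       # any further improvement exposes a negative cycle
            return True
    return False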

Shortest Paths on DAG’s


There are two subclasses of weighted graphs that automatically exclude the possibility of negative cycles:
graphs with non-negative weights and DAG’s. We have already seen that there is a fast algorithm when the weights
are non-negative. Here we will give a linear algorithm for single-source shortest paths in DAG’s.

Our algorithm is based on the same principle as our algorithm for negative weights. We are trying to find a
sequence of updates, such that all shortest paths are its subsequences. But in a DAG we know that all shortest paths
from s must go in the topological order of the DAG. All we have to do then is first topologically sort the DAG using
a DFS, and then visit all edges coming out of nodes in the topological order. This algorithm solves the general
single-source shortest path problem for DAG’s in O(m) time.
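
A minimal Python sketch of the DAG algorithm, again assuming an adjacency-list dictionary; the topological order is obtained by reversing the DFS finishing order.

def dag_shortest_paths(graph, s):
    # graph: dict vertex -> list of (neighbor, length) pairs; assumed acyclic
    order, visited = [], set()
    def dfs(v):
        visited.add(v)
        for w, _ in graph[v]:
            if w not in visited:
                dfs(w)
        order.append(v)                      # record the DFS finishing time
    for v in graph:
        if v not in visited:
            dfs(v)
    order.reverse()                          # reverse finishing order = topological order
    dist = {v: float('inf') for v in graph}
    dist[s] = 0
    for v in order:                          # update all edges out of v, in topological order
        if dist[v] != float('inf'):
            for w, length in graph[v]:
                if dist[w] > dist[v] + length:
                    dist[w] = dist[v] + length
    return dist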
CS124 Lecture 5 Spring 2002

Minimum Spanning Trees


A tree is an undirected graph which is connected and acyclic. It is easy to show that a graph G(V, E) that satisfies any two of the following properties also satisfies the third, and is therefore a tree:

• G(V, E) is connected

• G(V, E) is acyclic

• |E| = |V | − 1

(Exercise: Show that any two of the above properties imply the third (use induction).)

A spanning tree in an undirected graph G(V, E) is a subset of edges T ⊆ E that are acyclic and connect all the
vertices in V . It follows from the above conditions that a spanning tree must consist of exactly n − 1 edges. Now
suppose that each edge has a weight associated with it: w : E → Z. Say that the weight of a tree T is the sum of the
weights of its edges; w(T ) = ∑e∈T w(e). The minimum spanning tree in a weighted graph G(V, E) is one which has
the smallest weight among all spanning trees in G(V, E).

As an example of why one might want to find a minimum spanning tree, consider someone who has to install
the wiring to network together a large computer system. The requirement is that all machines be able to reach each
other via some sequence of intermediate connections. By representing each machine as a vertex and the cost of
wiring two machines together by a weighted edge, the problem of finding the minimum cost wiring scheme reduces
to the minimum spanning tree problem.

In general, the number of spanning trees in G(V, E) grows exponentially in the number of vertices in G(V, E).
(Exercise: Try to determine the number of different spanning trees for a complete graph on n vertices.) Therefore
it is infeasible to search through all possible spanning trees to find the lightest one. Luckily it is not necessary
to examine all possible spanning trees; minimum spanning trees satisfy a very important property which makes it
possible to efficiently zoom in on the answer.


We shall construct the minimum spanning tree by successively selecting edges to include in the tree. We will
guarantee after the inclusion of each new edge that the selected edges, X, form a subset of some minimum spanning
tree, T . How can we guarantee this if we don’t yet know any minimum spanning tree in the graph? The following
property provides this guarantee:

Cut property: Let X ⊆ T where T is a MST in G(V, E). Let S ⊂ V such that no edge in X crosses between S and V − S; i.e. no edge in X has one endpoint in S and one endpoint in V − S. Among edges crossing between S and V − S, let e be an edge of minimum weight. Then X ∪ {e} ⊆ T′ where T′ is a MST in G(V, E).

The cut property says that we can construct our tree greedily. Our greedy algorithms can simply take the
minimum weight edge across two regions not yet connected. Eventually, if we keep acting in this greedy manner,
we will arrive at the point where we have a minimum spanning tree. Although the idea of acting greedily at each
point may seem quite intuitive, it is very unusual for such a strategy to actually lead to an optimal solution, as we
will see when we examine other problems!

Proof: Suppose e ∉ T. Adding e into T creates a unique cycle. We will remove a single edge e′ from this unique cycle, thus getting T′ = T ∪ {e} − {e′}. It is easy to see that T′ must be a tree — it is connected and has n − 1 edges. Furthermore, as we shall show below, it is always possible to select an edge e′ in the cycle such that it crosses between S and V − S. Now, since e is a minimum weight edge crossing between S and V − S, w(e′) ≥ w(e). Therefore w(T′) = w(T) + w(e) − w(e′) ≤ w(T). However since T is a MST, it follows that T′ is also a MST and w(e) = w(e′). Furthermore, since X has no edge crossing between S and V − S, it follows that X ⊆ T′ and thus X ∪ {e} ⊆ T′.

How do we know that there is an edge e′ ≠ e in the unique cycle created by adding e into T, such that e′ crosses between S and V − S? This is easy to see, because as we trace the cycle, e crosses between S and V − S, and we must cross back along some other edge to return to the starting point.

In light of this, the basic outline of our minimum spanning tree algorithms is going to be the following:

X := { }.
Repeat until |X| = n − 1.
Pick a set S ⊆ V such that no edge in X crosses between S and V − S.
Let e be a lightest edge in G(V, E) that crosses between S and V − S.
X := X ∪ {e}.

The difference between minimum spanning tree algorithms lies in how we pick the set S at each step.

Prim’s algorithm:
In the case of Prim’s algorithm, X consists of a single tree, and the set S is the set of vertices of that tree. One
way to think of the algorithm is that it grows a single tree, adding a new vertex at each step, until it has the minimum
spanning tree. In order to find the lightest edge crossing between S and V − S, Prim’s algorithm maintains a heap
containing all those vertices in V − S which are adjacent to some vertex in S. The priority of a vertex v, according
to which the heap is ordered, is the weight of its lightest edge to a vertex in S. This is reminiscent of Dijkstra’s
algorithm (where distance was used for the heap instead of the edge weight). As in Dijkstra’s algorithm, each vertex
v will also have a parent pointer prev(v) which is the other endpoint of the lightest edge from v to a vertex in S. The
pseudocode for Prim’s algorithm is almost identical to that for Dijkstra’s algorithm:

Procedure Prim(G(V, E), s)

    v, w: vertices
    dist: array[V] of integer
    prev: array[V] of vertices
    S: set of vertices, initially empty
    H: priority heap of V
    H := {s : 0}
    for v ∈ V do
        dist[v] := ∞, prev[v] := nil
    rof
    dist[s] := 0
    while H ≠ ∅
        v := deletemin(H)
        S := S ∪ {v}
        for (v, w) ∈ E and w ∈ V − S do
            if dist[w] > length(v, w)
                dist[w] := length(v, w), prev[w] := v, insert(w, dist[w], H)
            fi
        rof
    end while
end Prim

Note that each vertex is “inserted” on the heap at most once; other insert operations simply change the value on
the heap. The vertices that are removed from the heap form the set S for the cut property. The set X of edges chosen
to be included in the MST is given by the parent pointers of the vertices in the set S. Since the smallest key in the
heap at any time gives the lightest edge crossing between S and V − S, Prim’s algorithm follows the generic outline
for a MST algorithm presented above, and therefore its correctness follows from the cut property.

The running time of Prim's algorithm is clearly the same as Dijkstra's algorithm, since the only change is how we prioritize nodes in the heap. Thus, if we use d-ary heaps, the running time of Prim's algorithm is O(m log_{m/n} n).
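
A minimal Python sketch of Prim's algorithm with a binary heap (heapq), under the same assumed adjacency-list format as in the Dijkstra sketch; each undirected edge is assumed to appear in both endpoints' lists, and duplicate heap entries again stand in for decrease-key.

import heapq

def prim(graph, s):
    dist = {v: float('inf') for v in graph}  # priority: lightest edge from v into S
    prev = {v: None for v in graph}
    S = set()
    dist[s] = 0
    heap = [(0, s)]
    while heap:
        _, v = heapq.heappop(heap)           # deletemin
        if v in S:
            continue                         # stale entry
        S.add(v)
        for w, weight in graph[v]:
            if w not in S and dist[w] > weight:
                dist[w] = weight
                prev[w] = v
                heapq.heappush(heap, (weight, w))
    return prev                              # MST edges: (prev[v], v) for v != s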

Kruskal’s algorithm:
Kruskal’s algorithm uses a different strategy from Prim’s algorithm. Instead of growing a single tree, Kruskal’s
algorithm attempts to put the lightest edge possible in the tree at each step. Kruskal’s algorithm starts with the edges
sorted in increasing order by weight. Initially X = { }, and each vertex in the graph is regarded as a trivial tree (with
no edges). Each edge in the sorted list is examined in order, and if its endpoints are in the same tree, then the edge is
discarded; otherwise it is included in X and this causes the two trees containing the endpoints of this edge to merge
into a single tree. Note that, by this process, we are implicitly choosing a set S ⊆ V with no edge in X crossing
between S and V − S, so this fits in our basic outline of a minimum spanning tree algorithm.

To implement Kruskal’s algorithm, given a forest of trees, we must decide given two vertices whether they
belong to the same tree. For the purposes of this test, each tree in the forest can be represented by a set consisting of
the vertices in that tree. We also need to be able to update our data structure to reflect the merging of two trees into a
single tree. Thus our data structure will maintain a collection of disjoint sets (disjoint since each vertex is in exactly
one tree), and support the following three operations:

• MAKESET(x): Create a new set containing only the element x.

• FIND(x): Given an element x, which set does it belong to?

• UNION(x,y): replace the set containing x and the set containing y by their union.

The pseudocode for Kruskal’s algorithm follows:

Function Kruskal(graph G(V, E))

    set X
    X := { }
    E := sort E by weight
    for u ∈ V
        MAKESET(u)
    rof
    for (u, v) ∈ E (in increasing order) do
        if FIND(u) ≠ FIND(v) then
            X := X ∪ {(u, v)}
            UNION(u, v)
        fi
    rof
    return(X)
end Kruskal

The correctness of Kruskal’s algorithm follows from the following argument: Kruskal’s algorithm adds an edge
e into X only if it connects two trees; let S be the set of vertices in one of these two trees. Then e must be the first
edge in the sorted edge list that has one endpoint in S and the other endpoint in V − S, and is therefore the lightest
edge that crosses between S and V − S. Thus the cut property of MST implies the correctness of the algorithm.

The running time of the algorithm, assuming the edges are given in sorted order, is dominated by the set operations: UNION and FIND. There are n − 1 UNION operations (one corresponding to each edge in the spanning tree), and 2m FIND operations (2 for each edge). Thus the total time of Kruskal's algorithm is O(m × FIND + n × UNION). We will soon show that this is O(m log∗ n). Note that, if the edges are not initially given in sorted order, then to sort them in the obvious way takes O(m log m) time, and this would be the dominant part of the running time of the algorithm.
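
A minimal Python sketch of Kruskal's algorithm. The FIND/UNION structure here is deliberately naive (no union by rank or path compression); the next lecture develops the efficient version.

def kruskal(vertices, edges):
    # edges: list of (weight, u, v) triples (assumed format)
    parent = {u: u for u in vertices}        # MAKESET for every vertex
    def find(x):                             # walk up to the root naming x's tree
        while parent[x] != x:
            x = parent[x]
        return x
    X = []
    for weight, u, v in sorted(edges):       # examine edges in increasing weight order
        ru, rv = find(u), find(v)
        if ru != rv:                         # endpoints lie in different trees
            X.append((u, v))
            parent[ru] = rv                  # UNION: hang one root under the other
    return X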

Exchange Property
Actually spanning trees satisfy an even stronger property than the cut property — the exchange property. The
exchange property is quite remarkable since it implies that we can “walk” from any spanning tree T to a minimum
spanning tree T̂ by a sequence of exchange moves — each such move consists of throwing an edge out of the current
tree that is not in T̂ , and adding a new edge into the current tree that is in T̂ . Moreover, each successive tree in the
“walk” is guaranteed to weigh no more than its predecessor.

Exchange property: Let T and T′ be spanning trees in G(V, E). Given any e′ ∈ T′ − T, there exists an edge e ∈ T − T′ such that (T − {e}) ∪ {e′} is also a spanning tree.

The proof is quite similar to that of the cut property. Adding e′ into T results in a unique cycle. There must be some edge in this cycle that is not in T′ (since otherwise T′ must have a cycle). Call this edge e. Then deleting e restores a spanning tree, since connectivity is not affected, and the number of edges is restored to n − 1.

To see how one may use this exchange property to "walk" from any spanning tree to a MST: let T be any spanning tree and let T̂ be a MST in G(V, E). Let e′ be the lightest edge that is not in both trees. Perform an exchange using this edge. Since the exchange was done with the lightest such edge, the new tree must be lighter than the old one. Since T̂ is already a MST, it follows that the exchange must have been performed upon T and results in a lighter spanning tree which has more edges in common with T̂ (if there are several edges of the same weight, then the new tree might not be lighter, but it still has more edges in common with T̂).

[Figure omitted: a small weighted example graph with two runs of the algorithms.]

Figure 5.1: An example of Prim's algorithm and Kruskal's algorithm. Which is which?
CS124 Lecture 6 Spring 2002

Disjoint set (Union-Find)


For Kruskal’s algorithm for the minimum spanning tree problem, we found that we needed a data structure for
maintaining a collection of disjoint sets. That is, we need a data structure that can handle the following operations:

• MAKESET(x) - create a new set containing the single element x

• UNION(x, y) - replace two sets containing x and y by their union.

• FIND(x) - return the name of the set containing the element x

Naturally, this data structure is useful in other situations, so we shall consider its implementation in some detail.

Within our data structure, each set is represented by a tree, so that each element points to a parent in the tree.
The root of each tree will point to itself. In fact, we shall use the root of the tree as the name of the set itself; hence
the name of each set is given by a canonical element, namely the root of the associated tree.

It is convenient to add a fourth operation LINK(x, y) to the above, where we require for LINK that x and y are
two roots. LINK changes the parent pointer of one of the roots, say x, and makes it point to y. It returns the root
of the now composite tree y. With this addition, we have UNION(x, y) = LINK(FIND(x),FIND(y)), so the main
problem is to arrange our data structure so that FIND operations are very efficient.


Notice that the time to do a FIND operation on an element corresponds to its depth in the tree. Hence our goal is
to keep the trees short. Two well-known heuristics for keeping trees short in this setting are UNION BY RANK and
PATH COMPRESSION. We start with the UNION BY RANK heuristic. The idea of UNION BY RANK is to ensure
that when we combine two trees, we try to keep the overall depth of the resulting tree small. This is implemented as
follows: the rank of an element x is initialized to 0 by MAKESET. An element’s rank is only updated by the LINK
operation. If x and y have the same rank r, then invoking LINK(x, y) causes the parent pointer of x to be updated to
point to y, and the rank of y is then updated to r + 1. On the other hand, if x and y have different rank, then when
invoking LINK(x, y) the parent point of the element with smaller rank is updated to point to the element with larger
rank. The idea is that the rank of the root is associated with the depth of the tree, so this process keeps the depth
small. (Exercise: Try some examples by hand with and without using the UNION BY RANK heuristic.)

The idea of PATH COMPRESSION is that, once we perform a FIND on some element, we should adjust its
parent pointer so that it points directly to the root; that way, if we ever do another FIND on it, we start out much
closer to the root. Note that, until we do a FIND on an element, it might not be worth the effort to update its parent
pointer, since we may never access it at all. Once we access an item, however, we must walk through every pointer
to the root, so modifying the pointers only changes the cost of this walk by a constant factor.

procedure MAKESET(x)
    p(x) := x
    rank(x) := 0
end

function FIND(x)
    if x ≠ p(x) then
        p(x) := FIND(p(x))
    return(p(x))
end

function LINK(x, y)
    if rank(x) > rank(y) then x ↔ y
    if rank(x) = rank(y) then rank(y) := rank(y) + 1
    p(x) := y
    return(y)
end

procedure UNION(x, y)
    LINK(FIND(x), FIND(y))
end
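
A direct Python translation of the above, as a sketch; it keeps the parent and rank maps inside a small class.

class DisjointSets:
    def __init__(self):
        self.p = {}
        self.rank = {}

    def makeset(self, x):
        self.p[x] = x
        self.rank[x] = 0

    def find(self, x):
        if x != self.p[x]:
            self.p[x] = self.find(self.p[x])   # PATH COMPRESSION
        return self.p[x]

    def link(self, x, y):                      # x and y must be roots
        if self.rank[x] > self.rank[y]:
            x, y = y, x                        # UNION BY RANK: larger rank stays the root
        if self.rank[x] == self.rank[y]:
            self.rank[y] += 1
        self.p[x] = y
        return y

    def union(self, x, y):
        return self.link(self.find(x), self.find(y))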

In our analysis, we show that any sequence of m UNION and FIND operations on n elements takes at most O((m + n) log∗ n) steps, where log∗ n is the number of times you must iterate the log₂ function on n before getting a number less than or equal to 1. (So log∗ 4 = 2, log∗ 16 = 3, log∗ 65536 = 4.) We should note that this is not the tightest analysis possible; however, this analysis is already somewhat complex!

Note that we are going to do an amortized analysis here. That is, we are going to consider the cost of the algorithm over a sequence of steps, instead of considering the cost of a single operation. In fact a single UNION or FIND operation could require O(log n) operations. (Exercise: Prove this!) Only by considering an entire sequence of operations at once can we obtain the above bound. Our argument will require some interesting accounting to total the cost of a sequence of steps.

We first make a few observations about rank.

• if v ≠ p(v) then rank(p(v)) > rank(v)

• whenever p(v) is updated, rank(p(v)) increases

• the number of elements with rank k is at most n/2^k

• the number of elements with rank at least k is at most n/2^{k−1}

The first two assertions are immediate from the description of the algorithm. The third assertion follows from the fact that the rank of an element v changes only if LINK(v, w) is executed, rank(v) = rank(w), and v remains the root of the combined tree; in this case v's rank is incremented by 1. A simple induction then yields that when rank(v) is incremented to k, the resulting tree has at least 2^k elements. The last assertion then follows from the third assertion, as ∑_{j≥k} n/2^j = n/2^{k−1}.

Exercise: Show that the maximum rank an item can have is log n.

As soon as an element becomes a non-root, its rank is fixed. Let us divide the (non-root) elements into groups according to their ranks. Group i contains all elements whose rank r satisfies log∗ r = i. For example, elements in group 3 have ranks in the range (4, 16]; in general, if the largest rank in group i − 1 is k, then group i contains the ranks in (k, 2^k]. For convenience we shall write group (k, 2^k] to mean the group with these ranks.

It is easy to establish the following assertions about these groups:

• The number of distinct groups is at most log∗ n. (Use the fact that the maximum rank is log n.)

• The number of elements in the group (k, 2^k] is at most n/2^k.

Let us assign 2^k tokens to each element in group (k, 2^k]. The total number of tokens assigned to all elements from that group is then at most 2^k · n/2^k = n, and the total number of groups is at most log∗ n, so the total number of tokens given out is at most n log∗ n. We use these tokens to account for the work done by FIND operations.

Recall that the number of steps for a FIND operation is proportional to the number of pointers that the FIND
operation must follow up the tree. We separate the pointers into two groups, depending on the groups of u and
p(u) = v, as follows:

• Type 1: a pointer is of Type 1 if u and v belong to different groups, or v is the root.

• Type 2: a pointer is of Type 2 if u and v belong to the same group.

We account for the two Types of pointers in two different ways. Type 1 links are “charged” directly to the FIND
operation; Type 2 links are “charged” to u, who “pays” for the operation using one of the tokens. Let us consider
these charges more carefully.

The number of Type 1 links each FIND operation goes through is at most log∗ n, since there are only log∗ n groups, and the group number increases as we move up the tree.

What about Type 2 links? We charge these links directly back to u, who is supposed to pay for them with a token. Does u have enough tokens? The point here is that each time a FIND operation goes through an element u, its parent pointer is changed to the current root of the tree (by PATH COMPRESSION), so the rank of its parent increases by at least 1. If u is in the group (k, 2^k], then the rank of u's parent can increase fewer than 2^k times before it moves to a higher group. Therefore the 2^k tokens we assign to u are sufficient to pay for all FIND operations that go through u to a parent in the same group.

We now count the total number of steps for m UNION and FIND operations. Clearly LINK requires just O(1) steps, and since a UNION operation is just a LINK and 2 FIND operations, it suffices to bound the time for at most 2m FIND operations. Each FIND operation is charged at most log∗ n for a total of O(m log∗ n). The total number of tokens used is at most n log∗ n, and each token pays for a constant number of steps. Therefore the total number of steps is O((m + n) log∗ n).

Let us give a more equation-oriented explanation. The total time spent over the course of m UNION and FIND operations is just

    ∑_{all FIND ops} (# links passed through).

We split this sum up into two parts:

    ∑_{all FIND ops} (# links in same group) + ∑_{all FIND ops} (# links in different groups).

(Technically, the case where a link goes to the root should be handled explicitly; however, this is just O(m) links in total, so we don't need to worry!) The second term is clearly O(m log∗ n). The first term can be upper bounded by

    ∑_{all elements u} (# ranks in the group of u),

because each element u can be charged only once for each rank in its group. (Note here that this is because the links to the root count in the second sum!) This last sum is bounded above by

    ∑_{all groups} (# items in group) · (# ranks in group) ≤ ∑_{k} (n/2^k) · 2^k ≤ n log∗ n,

where the sum over k ranges over the at most log∗ n groups (k, 2^k].

This completes the proof.



[Figure omitted: a UNION(x, y) step linking the root x under the root y, and a FIND(d) step redirecting the parent pointers of d and its ancestors b, c directly to the root a.]

Figure 6.1: Examples of UNION BY RANK and PATH COMPRESSION.


CS124 Lecture 7

In today’s lecture we will be looking a bit more closely at the Greedy approach to designing algorithms. As we
will see, sometimes it works, and sometimes even when it doesn’t, it can provide a useful result.

Horn Formulae
A simple application of the greedy paradigm solves an important special case of the SAT problem. We have
already seen that 2SAT can be solved in linear time. Now consider SAT instances where in each clause, there is at
most one positive literal. Such formulae are called Horn formulae; for example, this is an instance:

(x ∨ ¬y ∨ ¬z ∨ ¬w) ∧ (¬x ∨ ¬y ∨ ¬w) ∧ (¬x ∨ ¬z ∨ w) ∧ (¬x ∨ y) ∧ (x) ∧ (¬z) ∧ (¬x ∨ ¬y ∨ w).

Given a Horn formula, we can separate its clauses into two parts: the pure negative clauses (those without a positive literal) and the implications (those with a positive literal). We call clauses with a positive literal implications because they can be rewritten suggestively as implications; (x ∨ ¬y ∨ ¬z ∨ ¬w) is equivalent to (y ∧ z ∧ w) → x. Note the trivial clause (x) can be thought of as a trivial implication → x. Hence, in the example above, we have the implications

(y ∧ z ∧ w → x), (x ∧ z → w), (x → y), (→ x), (x ∧ y → w)

and these two pure negative clauses

(¬x ∨ ¬y ∨ ¬w), (¬z).

We can now develop a greedy algorithm. The idea behind the algorithm is that we start with all variables set to
false, and we only set variables to T if an implication forces us to. Recall that an implication is not satisfied if all
variables to the left of the arrow are true and the one to the right is false. This algorithm is greedy, in the sense that
it (greedily) tries to ensure the pure negative clauses are satisfied, and only changes a variable if absolutely forced.


Algorithm Greedy-Horn(φ: CNF formula with at most one positive literal per clause)
    Start with the truth assignment t := FFF· · ·F
    while there is an implication that is not satisfied do
        make the implied variable T in t
    if all pure negatives are satisfied then return t
    else return "φ is unsatisfiable"

Once we have the proposed truth assignment, we look at the pure negatives. If there is a pure negative clause that is not satisfied by the proposed truth assignment, the formula cannot be satisfied. This follows from the fact that a pure negative clause is satisfied as long as any one of its variables is set to F. If such a clause is unsatisfied, all of its variables must be set to T. But we only set a variable to T if we are forced to by the implications. If all the pure negative clauses are satisfied, then we have found a satisfying truth assignment.

On the example above, Greedy-Horn first flips x to true, forced by the implication → x. Then y gets forced to
true (from x → y), and similarly w is forced to true. (Why?) Looking at the pure negative clauses, we find that the
first is not satisfied, and hence we conclude the original formula had no truth assignment.

Exercise: Show that the Horn-greedy algorithm can be implemented in linear time in the length of the formula (i.e.,
the total number of appearances of all literals).
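
A minimal Python sketch of Greedy-Horn. For simplicity it rescans the implications until none is violated, which is quadratic rather than linear; the linear-time version of the exercise requires indexing clauses by variable.

def greedy_horn(implications, negatives):
    # implications: list of (body, head) pairs, body a set of variables, head a variable
    # negatives: list of sets of variables (each clause is the OR of their negations)
    true_vars = set()                        # start with all variables false
    changed = True
    while changed:
        changed = False
        for body, head in implications:
            if body <= true_vars and head not in true_vars:
                true_vars.add(head)          # forced: body all true, head still false
                changed = True
    for clause in negatives:                 # satisfied iff some variable is false
        if clause <= true_vars:
            return None                      # formula is unsatisfiable
    return true_vars

On the running example, implications = [({'y','z','w'},'x'), ({'x','z'},'w'), ({'x'},'y'), (set(),'x'), ({'x','y'},'w')] forces x, y, and w to true, and the pure negative clause on x, y, w then fails, as in the text.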

Huffman Coding
Suppose that you must store the map of a chromosome which consists of a sequence of 130 million symbols of
the form A, C, G, or T. To store the sequence efficiently, you represent each character with just 2 bits: A as 00, C as
01, G as 10, and T as 11. Such a representation is called an encoding. With this encoding, the sequence requires 260
megabits to store.

Suppose, however, that you know that some symbols appear more frequently than others. For example, suppose
A appears 70 million times, C 3 million times, G 20 million times, and T 37 million times. In this case it seems
wasteful to use two bits to represent each A. Perhaps a more elaborate encoding assigning a shorter string to A could
save space.

We restrict ourselves to encodings that satisfy the prefix property: no assigned string is the prefix of another.
This property allows us to avoid backtracking while decoding. For an example without the prefix property, suppose
we represented A as 1 and C as 101. Then when we read a 1, we would not know whether it was an A or the
beginning of a C! Clearly we would like to avoid such problems, so the prefix property is important.

You can picture an encoding with the prefix property as a binary tree. For example, the binary tree below corresponds to an optimal encoding in the above situation. (There can be more than one optimal encoding! Just flip the left and right hand sides of the tree.) Here a branch to the left represents a 0, and a branch to the right represents a 1. Therefore A is represented by 1, C by 001, G by 000, and T by 01. This encoding requires only 213 million bits – an 18% improvement over the balanced tree (the encoding 00,01,10,11). (This does not include the bits that might be necessary to store the form of the encoding!)

[Figure omitted: the root's right child is the leaf A (70); its left child (60) splits into an internal node (23) and the leaf T (37), and the node (23) splits into the leaves G (20) and C (3).]

Figure 7.1: A Huffman tree.



Let us note some properties of the binary trees that represent encodings. The symbols must correspond to leaves; an internal node that represents a character would violate the prefix property. The code words are thus given by all root-to-leaf paths. All internal nodes must have exactly two children, as an internal node with only one child could be deleted to yield a better code. Hence if there are n leaves there are n − 1 internal nodes. Also, if we assign frequencies to the internal nodes, so that the frequency of an internal node is the sum of the frequencies of its children, then the total length produced by the encoding is the sum of the frequencies of all nodes except the root. (A one line proof: each edge corresponds to a bit that is written as many times as the frequency of the node to which it leads.)

One final property allows us to determine how to build the tree: the two symbols with the smallest frequencies
are together at the lowest level of the tree. Otherwise, we could improve the encoding by swapping a more frequently
used character at the lowest level up. (This is not a full proof; feel free to complete one.)

This tells us how to construct the optimum tree greedily. Take the two symbols with the lowest frequency,
delete them from the list of symbols, and replace them with a new meta-character; this new meta-character will lie
directly above the two deleted symbols in the tree. Repeat this process until the whole tree is constructed.

We can prove by induction that this gives an optimal tree. It works for 2 symbols (base case). We also show that if it works for n symbols, it must also work for n + 1 symbols. After deleting the two least frequent symbols and replacing them with a meta-character, it is as though we have just n symbols. This process yields an optimal tree for these n symbols (by the inductive hypothesis). Expanding the meta-character back into the two deleted nodes must now yield an optimal tree, since otherwise we could have found a better tree for the n symbols.
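
A minimal Python sketch of the greedy construction using heapq; a counter breaks frequency ties so that trees are never compared directly.

import heapq
from itertools import count

def huffman(freqs):
    # freqs: dict symbol -> frequency; returns dict symbol -> codeword string
    tiebreak = count()
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)      # the two lowest frequencies...
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (t1, t2)))  # ...become a meta-character
    codes = {}
    def walk(tree, code):
        if isinstance(tree, tuple):          # internal node: 0 for left, 1 for right
            walk(tree[0], code + '0')
            walk(tree[1], code + '1')
        else:
            codes[tree] = code or '0'        # degenerate one-symbol alphabet
    walk(heap[0][2], '')
    return codes

On the chromosome example, huffman({'A': 70, 'C': 3, 'G': 20, 'T': 37}) produces codeword lengths 1, 3, 3, and 2, matching the tree in Figure 7.1.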

[Figure omitted: starting from the frequencies A 60, E 70, I 40, O 50, U 20, Y 30, the algorithm first merges U and Y into [UY] (50), then I and [UY] into [I[UY]] (90), then O and A into [OA] (110).]

Figure 7.2: The first few steps of building a Huffman tree.



It is important to realize that when we say that a Huffman tree is optimal, this does not mean that it gives the
best way to compress a string. It only says that we cannot do better by encoding one symbol at a time. By encoding
frequently used blocks of letters (such as, in this section, the block “frequen”) we can obtain much better encodings.
(Note that finding the right blocks of letters can be quite complicated.) Given this, one might expect that Huffman
coding is rarely used. In fact, many compression schemes use Huffman codes at some point as a basic building
block. For example, image and video transmission often use Huffman encoding somewhere along the line.

Exercise: Find a Huffman compressor and another compressor, such as a gzip compressor. Test them on some files.
Which compresses better?

It is straightforward to write code to generate the appropriate tree, and then use this tree to encode and decode messages. For encoding, we simply build a table with a codeword for each symbol. To decode, we could read bits in one at a time, and walk down the tree in the appropriate manner. When we reach a leaf, we output the appropriate symbol and return to the top of the tree.

In practice, however, if we want to use Huffman coding, there are much faster ways to decode than to explicitly
walk down the tree one bit at a time. Using an explicit tree is slow, for a variety of reasons. Exercise: Think about
this.

One approach is to design a system that performs several steps at a time by reading several bits of input and
determining what actions to take according to a big lookup table. For example, we could have a table that represents
the information, “If you are currently at this point in the tree, and the next 8 bits are 00110101, then output AC and
move to this point in the tree.” This lookup table, which might be huge, encapsulates the information needed to
handle eight bits at once. Since computers naturally handle eight bit blocks more easily than single bits, and because
table lookups are faster than following pointers down a Huffman tree, substantial speed gains are possible. Notice
that this gain in speed comes at the expense of the space required for the lookup table.

There are other solutions that work particularly well for very large dictionaries. For example, if you were using Huffman codes on a library of newspaper articles, you might treat each word as a symbol that can be encoded. In this case, you would have a lot of symbols! We will not go over these other methods here; a useful paper on the subject is "On the Implementation of Minimum-Redundancy Prefix Codes," by Moffat and Turpin. The key to keep in mind is that while thinking of decoding on the Huffman tree as happening one bit at a time is useful conceptually, good engineering would use more sophisticated methods to increase efficiency.

The Set Cover Problem


The inputs to the set cover problem are a finite set X = {x_1, . . . , x_n}, and a collection S of subsets of X such that ⋃_{S∈S} S = X. The problem is to find a smallest subcollection T ⊆ S such that the sets of T cover X, that is

    ⋃_{T∈T} T = X.

Notice that such a cover exists, since S is itself a cover.

The greedy heuristic suggests that we build a cover by repeatedly including the set in S that will cover the maximum number of as yet uncovered elements. In this case, the greedy heuristic does not always yield an optimal solution. Interestingly, however, we can prove that the greedy solution is a good solution, in the sense that it is not too far from the optimal.

This is an example of an approximation algorithm. Loosely speaking, with an approximation algorithm, we settle for a result that may not be the optimal answer. Instead, however, we try to prove a guarantee on how close the algorithm is to the right answer. As we will see later in the course, sometimes this is the best we can hope to do.

Claim 7.1 Let k be the size of the smallest set cover for the instance (X, S). Then the greedy heuristic finds a set cover of size at most k ln n.

Proof: Let Y_i ⊆ X be the set of elements that are still not covered after i sets have been chosen with the greedy heuristic. Clearly Y_0 = X. We claim that there must be a set A ∈ S such that |A ∩ Y_i| ≥ |Y_i|/k. To see this, consider the sets in the optimal set cover of X. These sets cover Y_i, and there are only k of them, so one of these sets must cover at least a 1/k fraction of Y_i. Hence

    |Y_{i+1}| ≤ |Y_i| − |Y_i|/k = (1 − 1/k)|Y_i|,

and by induction,

    |Y_i| ≤ (1 − 1/k)^i |Y_0| = n(1 − 1/k)^i < n e^{−i/k},

where the last inequality uses the fact that 1 + x ≤ e^x with equality iff x = 0. Hence when i ≥ k ln n we have |Y_i| < 1, meaning there are no uncovered elements, and hence the greedy algorithm finds a set cover of size at most k ln n.

Exercise: Show that this bound is tight, up to constant factors. That is, give a family of examples where the set
cover has size k and the greedy algorithm finds a cover of size Ω(k ln n).
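
A minimal Python sketch of the greedy heuristic; it assumes the sets are given as Python sets whose union is X.

def greedy_set_cover(X, sets):
    uncovered = set(X)
    cover = []
    while uncovered:
        best = max(sets, key=lambda S: len(S & uncovered))  # covers the most uncovered elements
        cover.append(best)
        uncovered -= best
    return cover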
CS124 Lecture 8 Spring 2000

Divide and Conquer


We have seen one general paradigm for finding algorithms: the greedy approach. We now consider another
general paradigm, known as divide and conquer.

We have already seen an example of divide and conquer algorithms: mergesort. The idea behind mergesort is to
take a list, divide it into two smaller sublists, conquer each sublist by sorting it, and then combine the two solutions
for the subproblems into a single solution. These three basic steps – divide, conquer, and combine – lie behind most
divide and conquer algorithms.

With mergesort, we kept dividing the list into halves until there was just one element left. In general, we may
divide the problem into smaller problems in any convenient fashion. Also, in practice it may not be best to keep
dividing until the instances are completely trivial. Instead, it may be wise to divide until the instances are reasonably
small, and then apply an algorithm that is fast on small instances. For example, with mergesort, it might be best to
divide lists until there are only four elements, and then sort these small lists quickly by insertion sort.


Maximum/minimum
Suppose we wish to find the minimum and maximum items in a list of numbers. How many comparisons does
it take?

A natural approach is to try a divide and conquer algorithm. Split the list into two sublists of equal size. (Assume
that the initial list size is a power of two.) Find the maxima and minima of the sublists. Two more comparisons then
suffice to find the maximum and minimum of the list.

Hence, if T (n) is the number of comparisons, then T (n) = 2T (n/2) + 2. (The 2T (n/2) term comes from
conquering the two problems into which we divide the original; the 2 term comes from combining these solutions.)
Also, clearly T (2) = 1. By induction we find T (n) = (3n/2) − 2, for n a power of 2.
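
A minimal Python sketch of this divide and conquer scheme, assuming the list length is a power of two (so the base case of two elements is always reached):

def min_max(a):
    if len(a) == 2:
        return (a[0], a[1]) if a[0] < a[1] else (a[1], a[0])  # one comparison
    mid = len(a) // 2
    lo1, hi1 = min_max(a[:mid])              # conquer the two halves
    lo2, hi2 = min_max(a[mid:])
    return (min(lo1, lo2), max(hi1, hi2))    # two comparisons to combine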

Integer Multiplication
The standard multiplication algorithm takes time Θ(n^2) to multiply together two n digit numbers. This algorithm is so natural that we may think that no algorithm could be better. Here, we will show that better algorithms exist (at least in terms of asymptotic behavior).

Imagine splitting each number x and y into two parts: x = 10^{n/2} a + b, y = 10^{n/2} c + d. Then

    xy = 10^n ac + 10^{n/2} (ad + bc) + bd.

The additions and the multiplications by powers of 10 (which are just shifts!) can all be done in linear time. We have therefore reduced our multiplication problem into four smaller multiplication problems, so the recurrence for the time T(n) to multiply two n-digit numbers becomes

    T(n) = 4T(n/2) + O(n).

The 4T(n/2) term arises from conquering the smaller problems; the O(n) is the time to combine these problems into the final solution (using additions and shifts). Unfortunately, when we solve this recurrence, the running time is still Θ(n^2), so it seems that we have not gained anything.

The key thing to notice here is that four multiplications is too many. Can we somehow reduce it to three? It may not look like it is possible, but it is, using a simple trick. The trick is that we do not need to compute ad and bc separately; we only need their sum ad + bc. Now note that

    (a + b)(c + d) = (ad + bc) + (ac + bd).

So if we calculate ac, bd, and (a + b)(c + d), we can compute ad + bc by subtracting the first two terms from the third! Of course, we have to do a bit more addition, but since the bottleneck to speeding up this multiplication algorithm is the number of smaller multiplications required, that does not matter. The recurrence for T(n) is now

    T(n) = 3T(n/2) + O(n),

and we find that T(n) = O(n^{log₂ 3}) ≈ O(n^{1.59}), improving on the quadratic algorithm.
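
A minimal Python sketch of the three-multiplication scheme (Karatsuba's algorithm), on non-negative integers:

def karatsuba(x, y):
    if x < 10 or y < 10:
        return x * y                          # small enough: one ordinary multiplication
    half = max(len(str(x)), len(str(y))) // 2
    p = 10 ** half
    a, b = divmod(x, p)                       # x = a * 10^half + b
    c, d = divmod(y, p)                       # y = c * 10^half + d
    ac = karatsuba(a, c)
    bd = karatsuba(b, d)
    ad_plus_bc = karatsuba(a + b, c + d) - ac - bd   # the trick: one product, two subtractions
    return ac * 10 ** (2 * half) + ad_plus_bc * p + bd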

If one were to implement this algorithm, it would probably be best not to divide the numbers down to one
digit. The conventional algorithm, because it uses fewer additions, is probably more efficient for small values of
n. Moreover, on a computer, there would be no reason to continue dividing once the length n is so small that the
multiplication can be done in one standard machine multiplication operation!

It also turns out that using a more complicated algorithm (based on a similar idea) the asymptotic time for multiplication can be made arbitrarily close to linear – that is, for any ε > 0 there is an algorithm that runs in time O(n^{1+ε}).

Strassen’s algorithm
Divide and conquer algorithms can similarly improve the speed of matrix multiplication. Recall that when multiplying two matrices, A = (a_{ij}) and B = (b_{jk}), the resulting matrix C = (c_{ik}) is given by

    c_{ik} = ∑_j a_{ij} b_{jk}.

In the case of multiplying together two n by n matrices, this gives us a Θ(n^3) algorithm; computing each c_{ik} takes Θ(n) time, and there are n^2 entries to compute.

Let us again try to divide up the problem. We can break each matrix into four submatrices, each of size n/2 by n/2. Multiplying the original matrices can be broken down into eight multiplications of the submatrices, with some additions.

    [ A  B ] [ E  F ]   [ AE + BG   AF + BH ]
    [ C  D ] [ G  H ] = [ CE + DG   CF + DH ]

Letting T(n) be the time to multiply together two n by n matrices by this algorithm, we have T(n) = 8T(n/2) + Θ(n^2). Unfortunately, this does not improve the running time; it is still Θ(n^3).

As in the case of multiplying integers, we have to be a little tricky to speed up matrix multiplication. (Strassen
deserves a great deal of credit for coming up with this trick!) We compute the following seven products:

• P1 = A(F − H)

• P2 = (A + B)H

• P3 = (C + D)E

• P4 = D(G − E)

• P5 = (A + D)(E + H)

• P6 = (B − D)(G + H)

• P7 = (A −C)(E + F)

Then we can find the appropriate terms of the product by addition:

• AE + BG = P5 + P4 − P2 + P6

• AF + BH = P1 + P2

• CE + DG = P3 + P4

• CF + DH = P5 + P1 − P3 − P7

Now we have T(n) = 7T(n/2) + Θ(n^2), which gives a running time of T(n) = Θ(n^{log 7}) ≈ Θ(n^{2.81}).

Faster algorithms requiring more complex splits exist; however, they are generally too slow to be useful in
practice. Strassen’s algorithm, however, can improve the standard matrix multiplication algorithm for reasonably
sized matrices, as we will see in our second programming assignment.
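
A minimal sketch of Strassen's algorithm in Python; it assumes numpy and square matrices whose dimension is a power of two, and falls back to the conventional product on small blocks (the cutoff here is an arbitrary illustrative choice).

import numpy as np

def strassen(A, B, cutoff=64):
    n = A.shape[0]
    if n <= cutoff:
        return A @ B                          # conventional multiply on small blocks
    h = n // 2
    a, b, c, d = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    e, f, g, hh = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    p1 = strassen(a, f - hh, cutoff)          # the seven products from the text
    p2 = strassen(a + b, hh, cutoff)
    p3 = strassen(c + d, e, cutoff)
    p4 = strassen(d, g - e, cutoff)
    p5 = strassen(a + d, e + hh, cutoff)
    p6 = strassen(b - d, g + hh, cutoff)
    p7 = strassen(a - c, e + f, cutoff)
    top = np.hstack((p5 + p4 - p2 + p6, p1 + p2))
    bottom = np.hstack((p3 + p4, p5 + p1 - p3 - p7))
    return np.vstack((top, bottom))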
CS124 Lecture 9 Spring 2000

9.1 The String reconstruction problem

The greedy approach doesn’t always work, as we have seen. It lacks flexibility; if at some point, it makes a wrong
choice, it becomes stuck.

For example, consider the problem of string reconstruction. Suppose that all the blank spaces and punctuation
marks have inadvertently been removed from a text file. You would like to reconstruct the file, using a dictionary.
(We will assume that all words in the file are standard English.)

For example, the string might begin “thesearethereasons”. A greedy algorithm would spot that the first two
words were “the” and “sea”, but then it would run into trouble. We could backtrack; we have found that sea is
a mistake, so looking more closely, we might find the first three words “the”,“sear”, and “ether”. Again there is
trouble. In general, we might end up spending exponential time traveling down false trails. (In practice, since
English text strings are so well behaved, we might be able to make this work– but probably not in other contexts,
such as reconstructing DNA sequences!)


This problem has a nice structure, however, that we can take advantage of. The problem can be broken down
into entirely similar subproblems. For example, we can ask whether the strings “theseare” and “thereasons” both
can be reconstructed with a dictionary. If they can, then we can glue the reconstructions together. Notice, however,
that this is not a good problem for divide and conquer. The reason is that we do not know where the right dividing
point is. In the worst case, we could have to try every possible break! The recurrence would be
    T(n) = ∑_{i=1}^{n−1} (T(i) + T(n − i)).

You can check that the solution to this recurrence grows exponentially.

Although divide and conquer directly fails, we still want to make use of the subproblems. The attack we now
develop is called dynamic programming. The way to understand dynamic programming is to see that divide and
conquer fails because we might recalculate the same thing over and over again. (Much like we saw very early on
with the Fibonacci numbers!) If we try divide and conquer, we will repeatedly solve the same subproblems (the case
of small substrings) over and over again. The key will be to avoid the recalculations. To avoid recalculations, we use
a lookup table.

In order for this approach to be effective, we have to think of subproblems as being ordered by size. We solve
the subproblems bottom-up, from the smallest to the largest, until we reach the original problem.

For this dictionary problem, think of the string as being an array s[1 . . . n]. Then there is a natural subprob-
lem for each substring s[i . . . j]. Consider a two dimensional array D(i, j) that will denote whether s[i . . . j] is the
concatenation of words from the dictionary. The size of a subproblem is naturally d = j − i.

So now we write a simple set of loops which solves the subproblems in order of increasing size:

for d := 0 to n − 1 do
    for i := 1 to n − d do
        j := i + d;
        if indict(s[i . . . j]) then D(i, j) := true
        else
            for k := i to j − 1 do
                if D(i, k) and D(k + 1, j) then D(i, j) := true;

This algorithm runs in time O(n3 ); the three loops each run over at most n values. Pictorially, we can think of the
algorithm as filling in the upper diagonal triangle of a two-dimensional array, starting along the main diagonal and
moving up, diagonal by diagonal.

We need to add a bit to actually find the words. Let F(i, j) be the position of the end of the first word in s[i . . . j]
when this string is a proper concatenation of dictionary words. Initially all F(i, j) should be set to nil. The value for
F(i, j) can be set whenever D(i, j) is set to true. Given the F(i, j), we can reconstruct the words simply by finding
the words that make up the string in order. Note also that we can use this to improve the running time; as soon as we
find a match for the entire string, we can exit the loop and return success! Further optimizations are possible.

Let us highlight the aspects of the dynamic programming approach we used. First, we used a recursive descrip-
tion based on subproblems: D(i, j) is true if D(i, k) and D(k, j) for some k. Second, we built up a table containing
the answers of the problems, in some natural bottom-up order. Third, we used this table to find a way to determine
the actual solution. Dynamic programming generally involves these three steps.
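
A minimal Python sketch of the table-filling loop; it uses 0-indexed, half-open slices s[i:j], so the split into s[i:k] and s[k:j] needs no index adjustment. The dictionary test indict is assumed to be supplied, e.g. indict = lambda word: word in word_set.

def reconstructible(s, indict):
    n = len(s)
    D = [[False] * (n + 1) for _ in range(n + 1)]   # D[i][j]: can s[i:j] be split into words?
    for d in range(1, n + 1):                       # subproblem size = substring length
        for i in range(0, n - d + 1):
            j = i + d
            if indict(s[i:j]):
                D[i][j] = True
            else:
                D[i][j] = any(D[i][k] and D[k][j] for k in range(i + 1, j))
    return D[0][n]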

9.2 Edit distance

A problem that arises in biology is to measure the distance between two strings (of DNA). We will examine the
problem in English; the ideas are the same. There are many possible meanings for the distance between two strings;
here we focus on one natural measure, the edit distance. The edit distance measures the number of editing operations
it would be necessary to perform to transform the first string into the second. The possible operations are as follows:

• Insert: Insert a character into the first string.

• Delete: Delete a character from the first string.

• Replace: Replace a character from the first string with another character.

Another possibility is to not edit a character, when there is a Match. For example, a transformation from activate to caveat can be represented by

    D M R D M I M M D
    a c t i v - a t e
    - c a v - e a t -

The top line represents the operation performed (a - marks a position where a string has no character). So the first a in activate is deleted, and the t is replaced. The e in caveat is explicitly inserted.

The edit distance is the minimal number of edit operations – that is, the number of Inserts, Deletes, or Replaces – necessary to transform one string to the other. Note that Matches do not count. Also, it is possible to have a weighted edit distance, if the different edit operations have different costs. We currently assume all operations have weight 1.

We will show how to compute the edit distance using dynamic programming. Our first step is to define appropriate subproblems. Let us represent our strings by A[1 . . . n] and B[1 . . . m]. Suppose we want to consider what we do with the last character of A. To determine that, we need to know how we might have transformed the first n − 1 characters of A. These n − 1 characters might have transformed into any number of symbols of B, up to m. Similarly, to compute how we might have transformed the first n − 1 characters of A into some part of B, it makes sense to consider how we transformed the first n − 2 characters, and so on.

This suggests the following subproblems: we will let D(i, j) represent the edit distance between A[1 . . . i] and B[1 . . . j]. We now need a recursive description of the subproblems in order to use dynamic programming. Here the recurrence is:

    D(i, j) = min[D(i − 1, j) + 1, D(i, j − 1) + 1, D(i − 1, j − 1) + I(A[i] ≠ B[j])].

In the above, I(A[i] ≠ B[j]) represents the value 1 if A[i] ≠ B[j] and 0 if A[i] = B[j]. We obtain the above expression by considering the possible edit operations available. Suppose our last operation is a Delete, so that we deleted the ith character of A to transform A[1 . . . i] to B[1 . . . j]. Then we must have transformed A[1 . . . i − 1] to B[1 . . . j], and hence the edit distance would be D(i − 1, j) + 1, or the cost of the transformation from A[1 . . . i − 1] to B[1 . . . j] plus one for the cost of the final Delete. Similarly, if the last operation is an Insert, the cost would be D(i, j − 1) + 1.

The other possibility is that the last operation is a Replace of the ith character of A with the jth character of B, or a Match between these two characters. If there is a Match, then the two characters must be the same, and the cost is D(i − 1, j − 1). If there is a Replace, then the two characters should be different, and the cost is D(i − 1, j − 1) + 1. We combine these two cases in our formula, using D(i − 1, j − 1) + I(A[i] ≠ B[j]).

Our recurrence takes the minimum of all these possibilities, expressing the fact that we want the best possible
choice for the final operation!

It is worth noticing that our recursive description does not work when i or j is 0. However, these cases are
trivial. We have
D(i, 0) = i,

since the only way to transform the first i characters of A into nothing is to delete them all. Similarly,

D(0, j) = j.

Again, it is helpful to think of the computation of the D(i, j) as filling up a two-dimensional array. Here, we
begin with the first column and first row filled. We can then fill up the rest of the array in various ways: row by row,
column by column, or diagonal by diagonal!

Besides computing the distance, we may want to compute the actual transformation. To do this, when we fill
the array, we may also picture filling the array with pointers. For example, if the minimal distance for D(i, j) was
obtained by a final Delete operation, then the cell (i, j) in the table should have a pointer to (i − 1, j). Note that a
cell can have multiple pointers, if the minimum distance could have been achieved in multiple ways. Now any path
back from (n, m) to (0, 0) corresponds to a sequence of operations that yields the minimum distance D(n, m), so the
transformation can be found by following pointers.

The total computation time and space required for this algorithm is O(nm).
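
A minimal Python sketch of the edit distance computation, filling the table row by row:

def edit_distance(A, B):
    n, m = len(A), len(B)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                                  # delete all i characters
    for j in range(m + 1):
        D[0][j] = j                                  # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            differ = 0 if A[i - 1] == B[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,           # Delete
                          D[i][j - 1] + 1,           # Insert
                          D[i - 1][j - 1] + differ)  # Match or Replace
    return D[n][m]

For example, edit_distance("activate", "caveat") returns 5, matching the three Deletes, one Replace, and one Insert in the transformation shown earlier.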

9.3 All pairs shortest paths

Let G be a graph with positive edge weights. We want to calculate the shortest paths between every pair of nodes.
One way to do this is to run Dijkstra’s algorithm several times, once for each node. Here we develop a different
dynamic programming solution.

Our subproblems will be shortest paths using only nodes 1 . . . k as intermediate nodes. Of course when k equals
the number of nodes in the graph, n, we will have solved the original problem.

We let the matrix D_k[i, j] represent the length of the shortest path between i and j using intermediate nodes 1 . . . k. Initially, we set a matrix D_0 with the direct distances between nodes, given by d_{ij}. Then D_k is easily computed from the subproblems D_{k−1} as follows:

    D_k[i, j] = min(D_{k−1}[i, j], D_{k−1}[i, k] + D_{k−1}[k, j]).

The idea is that the shortest path using intermediate nodes 1 . . . k either completely avoids node k, in which case it has the same length as D_{k−1}[i, j]; or it goes through k, in which case we can glue together the shortest paths found from i to k and k to j using only intermediate nodes 1 . . . k − 1 to find it.

It might seem that we need at least two matrices to code this, but in fact it can all be done in one loop. (Exercise:
think about it!)

D = (d_{ij}): distance array, with weights from all i to all j

for k = 1 to n do
    for i = 1 to n do
        for j = 1 to n do
            D[i, j] = min(D[i, j], D[i, k] + D[k, j])

Note that again we can keep an auxiliary array to recall the actual paths. We simply keep track of the last intermediate node found on the path from i to j. We reconstruct the path by successively reconstructing intermediate nodes, until we reach the ends.
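
A minimal Python sketch of the single-array version; D is an n × n list of lists of direct distances, with float('inf') where there is no edge.

def all_pairs_shortest_paths(D):
    n = len(D)
    for k in range(n):                       # allow one more intermediate node per pass
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
    return D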

9.4 Traveling salesman problem

Suppose that you are given n cities and the distances d_{ij} between them. The traveling salesman problem (TSP) is to find the shortest tour that takes you from your home city to all the other cities and back again. As there are (n − 1)! possible tours, this can clearly be done in O(n!) time by trying all possible tours. Of course this is not very efficient.

Since the TSP is NP-complete, we cannot really hope to find a polynomial time algorithm. But dynamic
programming gives us a much better algorithm than trying all the paths.

The key is to define the appropriate subproblem. Suppose that we label our home city by the symbol 1, and the other cities are labeled 2, . . . , n. In this case, we use the following: for a subset S of vertices including 1 and at least one other city, let C(S, j) be the length of the shortest path that starts at 1, visits all other nodes in S, and ends at j. Note that our subproblems here look slightly different: instead of finding tours, we are simply finding paths. The important point is that the shortest path from 1 to j through all the vertices in S consists of some shortest path from 1 to a vertex x, where x ∈ S − {j}, and the additional edge from x to j.

for all j ≠ 1 do C({1, j}, j) := d_{1j}

for s = 3 to n do                         % s is the size of the subset
    for all subsets S of {1, . . . , n} of size s containing 1 do
        for all j ∈ S, j ≠ 1 do
            C(S, j) := min_{i ∈ S, i ≠ 1, i ≠ j} [C(S − {j}, i) + d_{ij}]
opt := min_{j ≠ 1} [C({1, . . . , n}, j) + d_{j1}]

The idea is to build up paths one node at a time, not worrying (at least temporarily) where they will end up. Once we have paths that go through all the vertices, it is easy to check the tours, since they consist of a shortest path through all the vertices plus an additional edge. The algorithm takes time O(n^2 2^n), as there are O(n 2^n) entries in the table (one for each pair of set and city), and each takes O(n) time to fill. Of course we can add in structures so that we can actually find the tour as well. Exercise: Consider how memory-efficient you can make this algorithm.
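
A minimal Python sketch of the dynamic program, with the home city renumbered 0 and subsets represented as frozensets:

from itertools import combinations

def tsp(d):
    # d: n x n distance matrix; city 0 is the home city
    n = len(d)
    C = {}                                   # C[(S, j)]: shortest path from 0 through S, ending at j
    for j in range(1, n):
        C[(frozenset([0, j]), j)] = d[0][j]
    for s in range(3, n + 1):                # s is the size of the subset
        for rest in combinations(range(1, n), s - 1):
            S = frozenset(rest) | {0}
            for j in rest:
                C[(S, j)] = min(C[(S - {j}, i)] + d[i][j]
                                for i in rest if i != j)
    full = frozenset(range(n))
    return min(C[(full, j)] + d[j][0] for j in range(1, n))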
CS124 Lecture 10 Spring 1999

10.1 The Birthday Paradox

How many people do there need to be in a room before with probability greater than 1/2 some two of them have the
same birthday? (Assume birthdays are distributed uniformly at random.)
Surprisingly, only 23. This is easily determined as follows: the probability the first two people have different birthdays is (1 − 1/365). The probability that the third person in the room then has a birthday different from the first two, given the first two people have different birthdays, is (1 − 2/365), and so on. So the probability that all of the first k people have different birthdays is the product of these terms, or

    (1 − 1/365) · (1 − 2/365) · (1 − 3/365) · · · (1 − (k − 1)/365).

Determining the right value of k is now a simple exercise.
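
The exercise is easily carried out numerically; a short Python sketch:

def people_needed(days=365):
    p_all_different = 1.0
    k = 1                                    # people in the room so far
    while p_all_different > 0.5:
        p_all_different *= 1 - k / days      # person k+1 misses the first k birthdays
        k += 1
    return k                                 # returns 23 for days = 365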


10.2 Balls into Bins

The birthday paradox is an example of a more general mathematical question, often formulated in
terms of balls and bins. Some number n of balls are thrown into some number m of bins. What does the distribution
of balls among the bins look like?
The birthday paradox is focused on the first time a ball lands in a bin with another ball. One might also ask how
many of the bins are empty, how many balls are in the most full bin, and other sorts of questions.
Let us consider the question of how many bins are empty. Look at the first bin. For it to be empty, it has to
be missed by all n balls. Since each ball hits the first bin with probability 1/m, the probability the first bin remains
empty is
(1 − 1/m)^n ≈ e^{−n/m}.

Since the same argument holds for all bins, on average a fraction e^{−n/m} of the bins will remain empty.
Exercise: How many bins have exactly 1 ball? 2?
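
The e^{−n/m} estimate is easy to check empirically; here is a small simulation sketch in Python (ours, not from the notes):

import random

def empty_fraction(n, m, trials=100):
    # throw n balls into m bins, and average the fraction of empty bins over many trials
    total = 0.0
    for _ in range(trials):
        counts = [0] * m
        for _ in range(n):
            counts[random.randrange(m)] += 1
        total += counts.count(0) / m
    return total / trials

# empty_fraction(1000, 1000) should come out near e^{-1}, about 0.368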

10.3 Hash functions

A hash function is a deterministic mapping from one set into another that appears random. For example, mapping
people into their birthdays can be thought of as a hash function.
In general, a hash function is a mapping f : {0, . . . , n − 1} → {0, . . . , m − 1}. Generally n ≫ m; for example,
the number of people in the world is much bigger than the number of possible birthdays. There is a great deal of
theory behind designing hash functions that “appear random.” We will not go into that theory here, and instead
assume that the hash functions we have available are in fact completely random. In other words, we assume that for
each i (0 ≤ i ≤ n − 1), the probability that f (i) = j is 1/m (for 0 ≤ j ≤ m − 1). Notice that this does not mean that every
time we look at f (i), we get a different random answer! The value of f (i) is fixed for all time; it is just equally likely
to take on any value in the range.
While such completely random hash functions are unavailable in practice, they generally provide a good rough
idea of how hashing schemes perform.
(An aside: in reality, birthdays are not completely random either. Seasonal distributions skew the calculation.
How might this affect the birthday paradox?)

10.4 Applications: A Password-checker

We now consider a hashing application. Suppose you are administering a computer system, and you would like to
make sure that nobody uses a common password. This protects against hackers, who can often determine if someone
is using a common password (such as a name, or a common dictionary word) by gaining access to the encrypted
password file and using an exhaustive search. When the user attempts to change their password, you would like to
check their password against a dictionary of common passwords as quickly as possible.
One way to do this would be to use a standard search technique, such as binary search over a sorted version of the dictionary. This
approach has two negative features. First, one must store the entire dictionary, which takes memory. Second, on
a large dictionary, this approach might be slow. Instead we present a quick and space-efficient scheme based on
hashing. The data structure we consider is commonly called a Bloom filter, after the originator.
Choose a table size m. Create a table consisting of m bits, initially all set to 0. Use a hash function on each of
the n words in the dictionary, where the range of the hash function is [0, m). If the word hashes to value k, set the kth
bit of the table to 1.
When a user attempts to change the password, hash the user’s desired password and check the appropriate
entry in the table. If there is a 1 there, reject the password; it could be a common one. Otherwise, accept it. A
common password from the dictionary is always rejected. Assuming other strings are hashed to a random location,
the probability of rejecting a password that should be accepted is 1 − e^{−n/m}.
It would seem one would need to choose m to be fairly large in order to make the probability of rejecting a
potentially good password small. Space can be used more efficiently by making multiple tables, using a different
hash function to set the bits for each table. To check a proposed password now requires more time, since several
hash functions must be checked. However, as soon as a single 0 entry is found, the password can be accepted. The
probability of rejecting a password that should be accepted when using h tables, each of size m, is then
(1 − e^{−n/m})^h.

The total space used is merely hm bits. Notice that the Bloom filter sometimes returns the wrong answer – we may
reject a proposed password, even though it is not a common password. This sort of error is probably acceptable, as
long as it doesn’t happen so frequently as to bother users. Fortunately this error is one-sided; a common password is
never accepted. One must set the parameters m and h appropriately to trade off this error probability against space
and time requirements.
For example, consider a dictionary of 100,000 common passwords, each of which is on average 7 characters
long. Uncompressed this would be 700,000 bytes. Compression might reduce it substantially, to around 300,000
bytes. Of course, then one has the problems of searching efficiently on a compressed list.
Instead, one could keep a 100,000 byte Bloom filter, consisting of 5 tables of 160,000 bits. The probability of
rejecting a reasonable password is just over 2%. The cost for checking a password is at most 5 hashes and 5 lookups
into the table.
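
For illustration, here is a sketch of such a Bloom filter in Python (ours, not from the notes). Truly random hash functions are unavailable, so we substitute salted SHA-256 hashes, a stand-in for the idealized assumption above:

import hashlib

class BloomFilter:
    def __init__(self, m, h):
        self.m, self.h = m, h
        self.tables = [[0] * m for _ in range(h)]   # h tables of m bits each

    def _hash(self, i, word):
        # the i-th 'random' hash function, simulated with a salted hash
        digest = hashlib.sha256((str(i) + ':' + word).encode()).hexdigest()
        return int(digest, 16) % self.m

    def add(self, word):
        for i in range(self.h):
            self.tables[i][self._hash(i, word)] = 1

    def probably_contains(self, word):
        # a single 0 entry means the word is definitely not in the dictionary
        return all(self.tables[i][self._hash(i, word)] for i in range(self.h))

To build the password-checker, add each dictionary word to the filter and reject any proposed password for which probably_contains returns true.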
CS 124 Lecture 11

11.1 Applications: Fingerprinting for pattern matching

Suppose we are trying to find a pattern string P in a long document D. How can we do it quickly and efficiently?
Hash the pattern P into say a 16 bit value. Now, run through the file, hashing each set of |P| consecutive
characters into a 16 bit value. If we ever get a match for a pattern, we can check to see if it corresponds to an actual
pattern match. (In this case, we want to double-check and not report any false matches!) Otherwise we can just move
on. We can use more than 16 bits, too; we would like to use enough bits so that we will obtain few false matches.
This scheme is efficient, as long as hashing is efficient. Of course hashing can be a very expensive operation, so
in order for this approach to work, we need to be able to hash quickly on average. In fact, a simple hashing technique
allows us to do so in constant time per operation!
The easiest way to picture the process is to think of the file as a sequence of digits, and the pattern as a number.
Then we move a pointer in the file one character at a time, seeing if the next |P| digits gives us a number equal to
the number corresponding to the pattern. Each time we read a character in the file, the number we are looking at
changes in a natural way: the leftmost digit a is removed, and a new rightmost digit b is inserted. Hence, we update
an old number N and obtain a new number N′ by computing

N′ = 10 · (N − 10^{|P|−1} · a) + b.

When dealing with a string, we will be reading characters (bytes) instead of numbers. Also, we will not want
to keep the whole pattern as a number. If the pattern is large, then the corresponding number may be too large
to do effective comparisons! Instead, we hash all numbers down into say 16 bits, by reducing them modulo some
appropriate prime p. We then do all the mathematics (multiplication, addition) modulo p, i.e.

N′ = [10 · (N − 10^{|P|−1} · a) + b] mod p.

All operations mod p can be made quite efficient, so each new hash value takes only constant time to compute!
This pattern matching technique is often called fingerprinting. The idea is that the hash of the pattern creates
an almost unique identifier for the pattern– like a fingerprint. If we ever find two fingerprints that match, we have
good reason to expect that they come from the same pattern. Of course, unlike real fingerprints, our hashing-based
fingerprints do not actually uniquely identify a pattern, so we still need to check for false matches. But since false
matches should be rare, the algorithm is very efficient!
See Figure 11.1 for an example of fingerprinting.


P = 17935
p = 251
P mod p = 114

6386179357342...

63861 mod p = 107


38617 mod p = 214
86179 mod p = 86
61793 mod p = 47
17935 mod p = 114
79357 mod p = 41
93573 mod p = 201
35734 mod p = 92
57342 mod p = 114

Figure 11.1: A fingerprinting example. The pattern P is a 5 digit number. Note successive calculations take constant
time: 38617 mod p = ((63861 mod p) − (60000 mod p)) · 10 + 7 mod p. Also note that false matches are possible
(but unlikely); 57342 mod p = 17935 mod p.
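
The whole fingerprinting loop might look like this in Python for digit strings (a sketch of ours; the verification step rules out the false matches discussed above):

def find_pattern(pattern, document, p=251):
    # report positions where pattern occurs in document (both strings of digits)
    L = len(pattern)
    if len(document) < L:
        return []
    target = int(pattern) % p
    N = int(document[:L]) % p
    shift = pow(10, L - 1, p)             # 10^{L-1} mod p, used to drop the leftmost digit
    hits = []
    for i in range(len(document) - L + 1):
        if N == target and document[i:i + L] == pattern:   # verify: hashes can collide
            hits.append(i)
        if i + L < len(document):
            a, b = int(document[i]), int(document[i + L])
            N = (10 * (N - shift * a) + b) % p             # the update rule from above
    return hits

On the example of Figure 11.1, find_pattern('17935', '6386179357342') reports the single true match and silently discards the false match at the last position.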

One question remains. How should we choose the prime p? We would like the prime we choose to work well,
in that it should have few false matches. The problem is that for every prime, there are certainly some bad patterns
and documents. If we choose a prime in advance, then someone can try to set up a document and pattern that will
cause a lot of false matches, making our fingerprinting algorithm go very slowly.
A natural approach is to choose the prime p randomly. This way, nobody can set up a bad pattern and document
in advance, since they are not sure what prime we will choose.
Let us make this a bit more rigorous. Let π(x) represent the number of primes that are less than or equal to x. It
will be helpful to use the following fact:
Fact: x/ln x ≤ π(x) ≤ 1.26 · x/ln x.
Consider any point in the algorithm, where the pattern and document do not match. If our pattern has length
|P|, then at that point we are comparing two numbers that are each less than 10^{|P|}. In particular, their difference (in
absolute value) is less than 10^{|P|}. What is the probability that a random prime divides this difference? That is, what
is the probability that for the random prime we choose, the two numbers corresponding to the pattern and the current
|P| digits in the document are equal modulo p?
First, note that there are at most log_2 10^{|P|} distinct primes that divide the difference, since the difference is at
most 10^{|P|} (in absolute value), and each distinct prime divisor is at least 2. Hence, if we choose our prime randomly

from all primes up to Z, the probability we have a false match is at most

(log_2 10^{|P|}) / π(Z).

Now the probability that we have a false match anywhere is at most |D| times the probability that we have a false
match in any single location, by the union bound. Hence the probability that we have a false match anywhere is at
most
(|D| · log_2 10^{|P|}) / π(Z).

Exercise: How big should we make Z in order to make the probability of a false match anywhere in the
algorithm less than 1/100?

How could we further reduce the probability of a false match? One way is to choose from a larger set of primes.
Another way is to choose not just one random prime, but several random primes from Z. This is like choosing
several hash functions in the Bloom filter problem. There is a false match only if there is a false match at every
random prime we choose. If we choose k primes (with replacement) from the primes up to Z, the probability of a
false match at a specific point is at most
((log_2 10^{|P|}) / π(Z))^k.
CS124 Lecture 12

12.1 Near duplicate documents¹

Suppose we are designing a major search engine. We would like to avoid answering user queries with multiple
copies of the same page. That is, there may be several pages with exactly the same text. These duplicates occur
for a variety of reasons. Some are mirror sites, some are copies of common pages (such as Unix man pages), some
are multiple spam advertisements, etc. Returning just one of the duplicates should be sufficient for the end user;
returning all of them will clutter the response page, wasting valuable real estate and frustrating the user. How can we
cope with duplicate pages?
Determining exact duplicates has a simple solution, based on hashing. Use the text of each page and an ap-
propriate hash function to hash the text into a 64 bit signature. If two documents have the same signature, it is
reasonable to assume that they share the same text. (Why? How often is this assumption wrong? Is it a terrible thing
if the assumption turns out to be false?) By comparing signatures on the fly, we can avoid returning duplicates.
This solution works extremely well if we want to catch exact duplicates. What if, however, we want to capture
the idea of “near duplicate” documents, or similar documents. For example, consider two mirror sites on the Web.
It may be that the documents share the same text, except that the text corresponding to the links on the page are
different, with each referring to the correct mirror site. In this case, the two pages will not yield the same signature,
although again, we would not want to return both pages to the end user, because they are so similar. As another
example, consider two copies of a newspaper article, one with a proper copyright notice added, and one without. We
do not need to return both pages to the user. Again, hashing the document appears to be of no help. Finally, consider
the case of advertisers who submit slightly modified versions of their ads over and over again, trying to get more or
better spots on the response pages sent back to users. We want to stop their nefarious plans!
We will describe a hashing-based scheme used to detect similar documents efficiently. Like the
Bloom filter solution for password dictionaries, our solution is highly efficient in terms of space and time. The cost
for this efficiency is accuracy; our algorithm will sometimes make mistakes, because it uses randomness.

12.2 Set resemblance

We describe a more general problem that will relate to our document similarity problem.
Consider two sets of numbers, A and B. For concreteness, we will assume that A and B are subsets of 64 bit
numbers. We may define the resemblance of A and B as

resemblance(A, B) = R(A, B) = |A ∩ B| / |A ∪ B|.

The resemblance is a real number between 0 and 1. Intuitively, the resemblance accurately captures how close
the two sets are. Sets and documents will be related, as we will see later.
¹This lecture is based on the work of Andrei Broder, who developed these ideas, and convinced Altavista to use them! (The second feat
may have been even more difficult than the first.)


How quickly can we determine the resemblance of two sets? If the sets are each of size n, the natural approach
(compare each element in A to each element in B) is O(n^2). We can do better by sorting the sets. Still, these
approaches are all rather slow, when we consider that we will have many sets to deal with and hence many pairs of
sets to consider.
Instead we should consider relaxing the problem. Suppose that we do not need an exact calculation of the
resemblance R(A, B). A reasonable estimate or approximation of the resemblance will suffice. Also, since we will
be answering a variety of queries over a long period of time, it makes sense to consider algorithms that first do
a preprocessing phase, in order to handle the queries much more quickly. That is, we will first do some work,
preparing the appropriate data structures and data in a preprocessing phase. The advantage of doing all this work in
advance will be that queries regarding resemblance can then be quickly answered.
Our estimation process will require a black box that does the following: it produces an effective random per-
mutation on the set of 64 bit numbers. What do we mean by a random permutation? Let us consider just the case of
four bit numbers, of which there are 16. Suppose we write each number on a card. Generating a random permutation
is like shuffling this deck of 16 cards and looking at the order in which the numbers appear after the shuffling. For
example, if we find the number 0011 on the first card, then our permutation maps the number 3 to the number 1. We
write this as π(3) = 1, where π is a function that represents the permutation.
Suppose we have an efficient implementation of random permutations, which we think of as a black box proce-
dure. That is, when we invoke the black box procedure BB(1, x) on a 64 bit number x, we get out y = π1 (x) for some
fixed, completely random permutation π1 . Similarly, if we invoke the black box BB(2, x), we get out π2 (x) for some
different random permutation π2 . (In fact in practice we cannot achieve this black box, but we can get close enough
that it is useful to think in these terms for analysis.)
Let us use the notation π1 (A) to denote the set of elements obtained by computing BB(1, x) for every x in A.
Consider the following procedure: we compute the sets π1(A) and π1(B), and record the minimum of each set. When
does min{π1(A)} = min{π1(B)}? This happens only when there is some element x satisfying π1(x) = min{π1(A)} =
min{π1(B)}. In other words, the element x that maps to the minimum element of π1(A ∪ B) has to lie in the
intersection A ∩ B.
If π1 is a random permutation, then every element in A ∪ B has equal probability of mapping to the minimum
element after the permutation is applied. That is, for all x and y in A ∪ B,

Pr[π1 (x) = min{π1 (A ∪ B)}] = Pr[π1 (y) = min{π1 (A ∪ B)}].

Thus, for the minimum of π1 (A) and π1 (B) to be the same, the minimum element must lie in π1 (A ∩ B) (see Fig-
ure 12.1). Hence
Pr[min{π1(A)} = min{π1(B)}] = |A ∩ B| / |A ∪ B|.
But this is just the resemblance R(A, B)!
This gives us a way to estimate the resemblance. Instead of taking just one permutation, we take many– say
100. For each set A, we preprocess by computing min{πj (A)} for j = 1 to 100, and store these values. To estimate
the resemblance of two sets A and B, we count how often the minima are the same, and divide by 100. It is like each
permutation gives us a coin flip, where the probability of a heads (a match) is exactly the resemblance R(A, B) of the
two sets.
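
Here is this estimation procedure as a Python sketch (ours). In place of truly random permutations, which we cannot implement, we use salted hashes as the black boxes BB(j, x), in the spirit of the discussion above:

import hashlib

def bb(j, x):
    # stand-in for pi_j(x): the j-th 'random permutation' applied to x
    return int(hashlib.sha256(('%d:%d' % (j, x)).encode()).hexdigest(), 16)

def sketch(A, num_perms=100):
    return [min(bb(j, x) for x in A) for j in range(num_perms)]

def estimated_resemblance(sketch_a, sketch_b):
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)

The sketches are computed once in the preprocessing phase; after that, estimating the resemblance of any pair of sets takes only 100 comparisons.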

Figure 12.1: If the minimum elements of π1(A) and π1(B) are the same, that minimum element must lie in π1(A ∩ B).

Four score and seven years ago, our founding


Four score and seven
score and seven years
and seven years ago
seven years ago our
years ago our founding

Figure 12.2: Shingling: the document is broken up into all segments of k consecutive words; each segment leads to
a 64 bit hash value.

12.3 Turning Document Similarity into a Set Resemblance Problem

We now return to the original application. How do we turn document similarity into a set resemblance problem? The
key idea is to hash pieces of the document– say every four consecutive words– into 64 bit numbers. This process has
been called shingling, and each set of consecutive words is called a shingle. (See Figure 12.2.) Using hashing, the
shingles give rise to the resulting numbers for the set resemblance problem, so that for each document D there is a
set SD. There are many possible variations and improvements. For example, one could modify the number
of bits in a shingle or the method for shingling. Similarly, one could throw out all shingles that are not 0 mod 16,
say, in order to reduce the number of shingles per document.
This approach obscures some important information in the document– such as the order paragraphs appear
in, say. However, it seems reasonable to say that if the resulting sets have high resemblance, the documents are
reasonably similar.
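
A minimal shingling sketch in Python (ours; the four-word shingles and the truncation of a cryptographic hash to 64 bits are illustrative choices):

import hashlib

def shingle_set(text, k=4):
    # hash every k consecutive words of the text into a 64 bit number
    words = text.split()
    shingles = set()
    for i in range(len(words) - k + 1):
        piece = ' '.join(words[i:i + k])
        digest = hashlib.sha256(piece.encode()).digest()
        shingles.add(int.from_bytes(digest[:8], 'big'))   # keep 64 bits
    return shingles

The resulting sets can be fed directly to the set resemblance machinery of the previous section.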
Once we have the shingles for the document, we associate a document sketch with each document. The sketch
of a document SD is a list of say 100 numbers: (min{π1 (SD )}, min{π2 (SD )}, min{π3 (SD )}, . . . , min{π100 (SD )}).
Now we choose a threshold– for example, we might say that two documents are similar if 90 out of the 100
entries in the sketch match. Now whenever a user queries the search engine, we check the sketches of the documents
we wish to return. If two sketches share 90 entries, we only send one of them. (Alternatively, we could catch the
duplicates on the crawling side– we check all the documents as we crawl the Web, and whenever two sketches share
more than 90 entries, we assume the associated documents are similar, so that we only need to store one of them!)
Recall that our scheme uses random permutations. So, if we set our sketch threshold to 90 out of 100 entries,

this does not guarantee that any pair of documents with high resemblance are caught. Also, some pairs of documents
that do not have high resemblance may get marked as having high resemblance. How well does this scheme do?
We analyze how well the scheme does with the following argument. For each permutation πi , the probability
that two documents A and B have the same value in the ith position of the sketch is just the resemblance of the two
documents R(A, B) = r. (Here the resemblance R(A, B) of course refers to the resemblance of the sets of numbers
obtained by shingling A and B.) Hence, the probability p(r) that at least 90 out of the 100 entries in the sketch match
is
p(r) = ∑_{k=90}^{100} C(100, k) r^k (1 − r)^{100−k},

where C(100, k) denotes the binomial coefficient.

What does p(r) look like as a function of r? The graph is shown in Figure 12.3. Notice that p(r) stays very
small until r approaches 0.9, and then quickly grows towards 1. This is exactly the property we want our scheme to
have– if two documents are not similar, we will rarely mistake them for being similar, and if they are similar, we are
likely to catch them!
For example, even if the resemblance is 0.8, we will get 90 or more matches with probability less than 0.006!
When the resemblance is only 0.5, the probability of having 90 entries in the sketch match falls to almost 10^{−18}! If
documents are not alike, we will rarely mistake them as being similar.
If documents are alike, we will most likely catch them. If the resemblance is 0.95, the documents will have 90
or more entries in common in the sketch with probability greater than .988; if the resemblance is 0.96, the probability
jumps to over .997.
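
The sum is easy to evaluate directly; a short Python sketch (ours) reproduces these probabilities:

from math import comb

def p(r, threshold=90, total=100):
    # probability that at least `threshold` of `total` sketch entries match
    return sum(comb(total, k) * r**k * (1 - r)**(total - k)
               for k in range(threshold, total + 1))

# p(0.8) < 0.006 and p(0.95) > 0.988, matching the figures quoted above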
We are dealing with a very large number of documents– most search engines currently index twenty-five to over
one hundred million Web pages! So even though the probability of making a mistake is small, it will happen. The
worst that happens, though, is that the search engine fails to index a few pages that it should, and it fails to catch a
few duplicates that it should. These problems are not a big deal.

[Figure: a plot of the probability of 90 or more matches (y-axis, from 0 to 1) against the resemblance (x-axis, from 0 to 1).]

Figure 12.3: Making the threshold for document similarity 90 out of 100 matches in the sketch leads to the following
graph relating resemblance to the probability two documents are considered similar. Notice the sharp change in
behavior near where the resemblance is 0.90. Essentially, the procedure behaves like a low pass filter.
CS124 Lecture 13

Hopefully the ideas we saw in our hashing problems have convinced you that randomness is a useful tool in
the design and analysis of algorithms. Just to make sure, we will consider several more examples of how to use
randomness to design algorithms.

13.1 Primality testing

A great deal of modern cryptography is based on the fact that factoring is apparently hard. At least nobody has
published a fast way to factor yet. (It is rumored the NSA knows how to factor, and is keeping it a secret. Some
of you might well have worked or will work for the NSA, at which point you will be required to keep this secret.
Shame on you.) Of course, certain numbers are easy to factor– numbers with small prime factors, for example. So
often, for cryptographic purposes, we may want to generate very large prime numbers and multiply them together.
How can we find large prime numbers?
We are fortunate to find that prime numbers are pretty dense. That is, there’s an awful lot of them. Let π(x) be
the number of primes less than or equal to x. Then
π(x) ≈ x / ln x,

or more exactly,

lim_{x→∞} π(x) / (x / ln x) = 1.
This means that on average about one out of every ln x numbers is prime, if we are looking for primes about the size
of x. So if we want to find prime numbers of say 250 digits, we would have to check about ln 10^{250} ≈ 576 numbers
on average before finding a prime. (We can search smarter, too, throwing out multiples of 2,3,5, etc. in order to
check fewer numbers.) Hence, all we need is a good method for testing if a number is prime. With such a test, we
can generate large primes easily– just keep generating random large numbers, and test them for primality until we
find a suitable prime number.
How can we test if a number n is prime? The pedantic way is to try dividing n by all smaller numbers.
Alternatively, we can try to divide n by all primes up to √n. Of course, both of these approaches are quite slow;
when n is about 10^{250}, the value of √n is still huge! The point is that 10^{250} has only 250 (or more generally O(log n))
digits, so we'd like the running time of the algorithm to be based on the size 250, not 10^{250}!
How can we quickly test if a number is prime? Let’s start by looking at some ways that work pretty well, but
have a few problems. We will use the following result from number theory:

Theorem 13.1 If p is a prime and 1 ≤ a < p, then


a^{p−1} = 1 mod p.

Proof: There are two nice proofs of this fact. One uses a simple induction to prove the equivalent statement
that a^p = a mod p. This is clearly true when a = 1. Now

(a + 1)^p = ∑_{i=0}^{p} C(p, i) a^{p−i}.


The coefficient C(p, i) is divisible by p, unless i = 0 or i = p. Hence

(a + 1)^p = a^p + 1 mod p = a + 1 mod p,

where the last step follows by the induction hypothesis.


An alternative proof uses the following idea. Consider the numbers 1, 2, . . . , p − 1. Multiply them all by a, so
now we have a, 2a, . . . , (p − 1)a. Each of these numbers is distinct mod p, and there are p − 1 such numbers, so in
fact the sequence a, 2a, . . . , (p − 1)a is the same as the sequence 1, 2, . . . , p − 1 when considered modulo p, except
for the order. Hence

1 · 2 · . . . · (p − 1) = a · 2a · . . . · (p − 1)a mod p = a^{p−1} · 1 · 2 · . . . · (p − 1) mod p.

Dividing both sides by 1 · 2 · . . . · (p − 1), which is invertible modulo p, we have a^{p−1} = 1 mod p.


This immediately suggests one way to check if a number n is prime. Compute 2^{n−1} mod n. If it is not 1, then
n is certainly not prime! Note that we can compute 2^{n−1} mod n quite efficiently, using our previously discussed
methods for exponentiation, which require only O(log n) multiplications! Thus this test is efficient.
But so far this test is just one-way; if n is composite, we may still have 2^{n−1} = 1 mod n, so we cannot assume
that n is prime just because it passes the test. For example, 2^{340} = 1 mod 341, and 341 is not prime. Such a number
is called a 2-pseudoprime, and unfortunately there are infinitely many of them. (Of course, even though there are
infinitely many 2-pseudoprimes, they are not as dense as the primes– that is, there are relatively very few of them.
So if we generate a large number n randomly, and see if 2^{n−1} = 1 mod n, we will most likely be right if we then say
n is prime if it passes this test. In practice, this might be good enough! This is not a good primality test, however, if
an NSA official you know gives you a number to test for primality, and you think they might be trying to fool you.
The NSA might be purposely giving you a 2-pseudoprime. They can be tricky that way.)
You might think to try a different base, other than 2. For example, you might choose 3, or a random value of
a. Unfortunately, there are infinitely many 3-pseudoprimes. In fact, there are infinitely many composite numbers n
such that a^{n−1} = 1 mod n for all a that do not share a factor with n. (That is, for all a such that the greatest common
divisor of a and n is 1.) Such n are called Carmichael numbers– the smallest such number is 561. So a test based on
this approach is destined to fail for some numbers.
There is a way around this problem, due to Rabin. Let n − 1 = 2^t u with u odd. Suppose we choose a random base a and
compute a^{n−1} by first computing a^u and then repeatedly squaring. Along the way, we check whether the values
a^u, a^{2u}, a^{4u}, . . . ever have the following property:

a^{2^{i−1} u} ≠ ±1 mod n,   a^{2^i u} = 1 mod n.

That is, suppose we find a non-trivial square root of 1 modulo n. It turns out that only composite numbers have
non-trivial square roots – prime numbers don’t. In fact, if we choose a randomly, and n is composite, for at least 3/4
of the values of a, one of two things will happen: we will either find a non-trivial square root of 1 using this process,
or we will find that a^{n−1} ≠ 1 mod n. In either case, we know that n is composite!
A value of a for which either a^{n−1} ≠ 1 mod n or the computation of a^{n−1} yields a non-trivial square root is
called a witness to the compositeness of n. We have said that 3/4 of the possible values of a are witnesses (we will
not prove this here!). So if we pick a single value of a randomly, and n is composite, we will determine that n is
composite with probability at least 3/4. How can we improve the probability of catching when n is composite?
The simplest way is just to repeat the test several times, each time choosing a value of a randomly. (Note that
we do not even have to go to the trouble of making sure we try different values of a each time; we can choose values
with replacement!) Each time we try this we have a probability of at least 3/4 of catching that n is composite, so if

we try the test k times, we will return the wrong answer in the case where n is composite with probability (1/4)^k. For
k = 25, the probability of the algorithm itself making an error is thus (1/2)^{50}; the probability that a random cosmic
ray affected your arithmetic unit is probably higher!
This trick comes up again and again with randomized algorithms. If the probability of catching an error on a
single trial is p, the probability of failing to catch an error after t trials is (1 − p)^t, assuming each trial is independent.
By making t sufficiently large, the probability of error can be reduced. Since the probability shrinks exponentially
in t, few trials can produce a great deal of security in the answer.
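
Putting the pieces together, here is a sketch of the resulting randomized primality test in Python (our rendering of the test described above, not code from the notes):

import random

def is_probably_prime(n, k=25):
    # composites survive all k trials with probability at most (1/4)^k
    if n in (2, 3):
        return True
    if n < 2 or n % 2 == 0:
        return False
    u, t = n - 1, 0
    while u % 2 == 0:                # write n - 1 = 2^t * u with u odd
        u //= 2
        t += 1
    for _ in range(k):
        a = random.randrange(2, n - 1)
        x = pow(a, u, n)             # a^u mod n, by repeated squaring
        if x == 1 or x == n - 1:
            continue                 # this base is not a witness
        for _ in range(t - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break                # reached -1: no non-trivial square root seen
        else:
            return False             # a is a witness, so n is certainly composite
    return True                      # n is probably prime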
CS 124 Lecture 14

14.1 Cryptography Fundamentals

Cryptography is concerned with the following scenario: two people, Alice and Bob, wish to communicate privately
in the presence of an eavesdropper, Eve. In particular, suppose Alice wants to send Bob a message x. (For conve-
nience, we will always assume our message has been converted into a bit string.) Using cryptography, Alice would
compute a function e(x), the encoding of x, using some secret key, and transmit e(x) to Bob. Bob receives e(x),
and using his own secret key, would compute a function d(e(x)) = x. The function d provides the decoding of the
encoding e(x). Eve is presumably unable to recover x from e(x) because she does not have the key – without the
key, computing x is either impossible or computationally difficult.

14.1.1 One-Time Pad

A classical cryptographic method is the one-time pad. A one-time pad is a random string of bits r, equal in length to
the message x, that Alice and Bob share and keep secret. By random, here we mean that r is equally likely to be any bit
string of the right length, |r|. Alice computes e(x) = x ⊕ r; Bob computes d(e(x)) = e(x) ⊕ r = x ⊕ r ⊕ r = x.

The claim is that Eve gets absolutely no information about the message by seeing e(x). More concretely, we
claim
Pr(message is x | e(x)) = Pr(message is x);

that is, knowing e(x) gives no more information to Eve than she already had. This is a nice exercise in conditional
probabilities.

Since e(x) provides no information, the one-time pad is completely secure. (Notice that this does not rely
on notions of computational difficulty; Eve really obtains no additional information!) There are, however, crucial
drawbacks.

• The key r has to be as long as x.

• The key r can only be used once. (To see this, suppose we use the same key r to encode x and y. Then Eve can
compute e(x) ⊕ e(y) = x ⊕ y, which might yield useful information!)


• The key r has to be exchanged, by some other means. (Private courier?)

14.1.2 DES

The Data Encryption Standard, or DES, is a U.S. government sponsored cryptographic method proposed in 1976. It
uses a 56 bit key, again shared by Alice and Bob, and it encodes blocks of 64 bits using a complicated sequence of
bit operations.

Many have suspected that the government engineered the DES standard, so that they could break it easily, but
nobody has shown a simpler method for breaking DES other than trying the 2^{56} possible keys. These days, however,
trying even this large number of keys can be accomplished in just a few days with specialized hardware. Hence DES
is widely considered no longer secure.

14.1.3 RSA

RSA (named after its inventors, Ron Rivest, Adi Shamir, and Len Adleman) was developed around the same time as
DES. RSA is an example of public key cryptography. In public key cryptography, Bob has two keys: a public key,
ke , known to everyone, and a private key, kd , known only to Bob. If Alice (or anyone else) wants to send a message x
to Bob, she encrypts it as e(x) using the public key; Bob then decrypts it using his private key. For this to be secure,
the private key must be hard to compute from the public key, and similarly e(x) must be hard to compute from x.

The RSA algorithm depends on some number theory and simple algorithms, which we will consider before
describing RSA. We will then describe how RSA is efficient and secure.

14.2 Tools for RSA

14.2.1 Primality

For the time being, we will assume that it is possible to generate large prime numbers. In fact, there are simple and
efficient randomized algorithms for generating large primes, that we will consider later in the course.

14.2.2 Euclid’s Greatest Common Divisor Algorithm

Definition: The greatest common divisor (or gcd) of integers a, b ≥ 0 is the largest integer d ≥ 0 such that d|a and
d|b, where d|a denotes that d divides a.

Example: gcd(360,84) = 12.

One way of computing the gcd is to factor the two numbers, and find the common prime factors (with the right
multiplicity). Factoring, however, is a problem for which we do not have general efficient algorithms.

The following algorithm, due to Euclid, avoids factoring. Assume a ≥ b ≥ 0.

function Euclid(a, b)
if b = 0 return(a)
return(Euclid(b, a mod b))
end Euclid

Euclid’s algorithm relies on the fact that gcd(a, b) = gcd(b, a mod b). You should prove this as an exercise.

We need to check that this algorithm is efficient. We will assume that mod operations are efficient (in fact they
can be done in O(log^2 a) bit operations). How many mod operations must be performed?

To analyze this, we notice that in the recursive calls of Euclid’s algorithms, the numbers always get smaller.
For the algorithm to be efficient, we’d like to have only about O(log a) recursive calls. This will require the numbers
to shrink by a constant factor after a constant number of rounds. In fact, we can show that the larger number shrinks
by a factor of 2 every 2 rounds.

Claim 1: a mod b ≤ a/2.

Proof: The claim is trivially true if b ≤ a/2. If b > a/2, then a mod b = a − b ≤ a/2.

Claim 2: On calling Euclid(a, b), the second recursive call is Euclid(a′, b′) with a′ ≤ a/2.

Proof: For the second recursive call, we will have a′ = a mod b, which is at most a/2 by Claim 1.

14.2.3 Extended Euclid’s Algorithm

Euclid’s algorithm can be extended to give not just the greatest common divisor d = gcd(a, b), but also two integers
x and y such that ax + by = d. This will prove useful to us subsequently, as we will explain.

Extended-Euclid(a, b)
if b = 0 return(a, 1, 0)
Compute k such that a = bk + (a mod b)
(d, x, y) = Extended-Euclid(b, a mod b)
return((d, y, x − ky))
end Extended-Euclid

Claim 3: The Extended Euclid’s algorithm returns the correct answer.

Proof: By induction on a + b. It clearly works if b = 0. (Note the understanding that all numbers divide
0!) If b ≠ 0, then we may assume the recursive call provides the correct answer by induction, as a mod b < a.
Hence we have x and y such that bx + (a mod b)y = d. But (a mod b) = a − bk, and hence by substitution we get
bx + (a − bk)y = d, or ay + b(x − ky) = d. This shows the algorithm provides the correct output.

Note that the Extended Euclid’s algorithm is clearly efficient, as it requires only a few extra arithmetic opera-
tions per recursive call over Euclid’s algorithm.

The Extended Euclid's algorithm is useful if we wish to compute the inverse of a number. That is, suppose we
wish to find a^{−1} mod n. The number a has a multiplicative inverse modulo n if and only if the gcd of a and n is 1.
Moreover, the Extended Euclid's algorithm gives us that inverse. Since in this case computing gcd(a, n) gives x, y
such that ax + ny = 1, we have that x = a^{−1} mod n.
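
Both routines are short in Python (a direct transcription of the pseudocode above; the inverse_mod wrapper is our addition):

def extended_euclid(a, b):
    # returns (d, x, y) with d = gcd(a, b) and a*x + b*y = d
    if b == 0:
        return a, 1, 0
    k = a // b
    d, x, y = extended_euclid(b, a % b)
    return d, y, x - k * y

def inverse_mod(a, n):
    # returns a^{-1} mod n, which exists exactly when gcd(a, n) = 1
    d, x, _ = extended_euclid(a, n)
    if d != 1:
        raise ValueError('a has no inverse modulo n')
    return x % n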

14.2.4 Exponentiation

Suppose we have to compute x^y mod z, for integers x, y, z. Multiplying x by itself y times is one possibility, but
it is too slow. A more efficient approach is to repeatedly square, computing x^2 mod z, x^4 mod z, x^8 mod z, . . . ,
x^{2^⌊log y⌋} mod z. Now x^y can be computed by multiplying together modulo z the powers that correspond to ones in the
binary representation of y.
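
In Python, the idea looks like this (a sketch of ours; the built-in pow(x, y, z) computes the same thing):

def power_mod(x, y, z):
    # compute x^y mod z with O(log y) multiplications by repeated squaring
    result = 1
    square = x % z
    while y > 0:
        if y & 1:                        # this bit of y is a one
            result = (result * square) % z
        square = (square * square) % z   # next power: x^2, x^4, x^8, ...
        y >>= 1
    return result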

14.3 The RSA Protocol

To create a public key, Bob finds two large primes, p and q, of roughly the same size. (Large should be a few hundred
decimal digits. Recently, with a lot of work, 512-bit RSA has been broken; this corresponds to n = pq being 512
bits long.) Bob computes n = pq, and also computes a random integer e, such that gcd((p − 1)(q − 1), e) = 1. (An
alternative to choosing e randomly often used in practice is to choose e = 3, in which case p and q cannot equal 1
modulo 3.)

The pair (n, e) is Bob's public key, which he announces to the world. Bob's private key is d = e^{−1} mod (p −
1)(q − 1), which can be computed by the extended Euclid's algorithm. More specifically, (p, q, d) is Bob's private key.

Suppose Alice wants to send a message to Bob. We think of the message as being a number x from the range
[1, n]. (If the message is too big to be represented by a number this small, it must be broken up into pieces; for
example, the message could be broken into bit strings of length ⌊log n⌋.) To encode the message, Alice computes
and sends to Bob
and sends to Bob
e(x) = x^e mod n.

Upon receipt, Bob computes


d(e(x)) = (e(x))^d mod n.
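
Before proving that decoding works, here is the whole protocol as a toy Python sketch (ours; real keys use primes hundreds of digits long, and the tiny primes below are purely illustrative):

def make_keys(p, q, e):
    # given primes p, q and e with gcd(e, (p-1)(q-1)) = 1, return ((n, e), d)
    n = p * q
    d = pow(e, -1, (p - 1) * (q - 1))    # modular inverse, as computed by extended Euclid
    return (n, e), d

def encode(x, n, e):
    return pow(x, e, n)                  # e(x) = x^e mod n

def decode(c, n, d):
    return pow(c, d, n)                  # d(e(x)) = (e(x))^d mod n

(n, e), d = make_keys(61, 53, 17)        # toy example: n = 3233
assert decode(encode(42, n, e), n, d) == 42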

To show that this operation decodes correctly, we must prove:

Claim 4: d(e(x)) = x.

Proof: We use the steps:


e(x)^d = x^{de} = x^{1+k(p−1)(q−1)} = x mod n.

The first equation recalls the definition of e(x). The second uses the fact that d = e^{−1} mod (p − 1)(q − 1), and hence
de = 1 + k(p − 1)(q − 1) for some integer k. The last equality is much less trivial. It will help us to have the following
lemma:

Claim 5: (Fermat's Little Theorem) If p is prime, then for a ≠ 0 mod p, we have a^{p−1} = 1 mod p.

Proof: Look at the numbers 1, 2, . . . , p − 1. Suppose we multiply them all by a modulo p, to get a · 1 mod p, a ·
2 mod p, . . . , a · (p − 1) mod p. We claim that the two sets of numbers are the same! This is because every pair of
numbers in the second group is different; this follows since if a · i = a · j mod p, then by multiplying by a^{−1}, we
must have i = j mod p. But if all the numbers in the second group are different modulo p, since none of them are 0,
they must just be 1, 2, . . . , p − 1. (To get a feel for this, take an example: when p = 7 and a = 5, multiplying a by the
numbers {1, 2, 3, 4, 5, 6} yields {5, 3, 1, 6, 4, 2}.)

From the above equality of sets of numbers, we conclude

1 · 2 · · · (p − 1) = (a · 1) · (a · 2) · · · (a · (p − 1)) mod p.

Multiplying both sides by 1^{−1} · 2^{−1} · . . . · (p − 1)^{−1} we have

1 = a^{p−1} mod p.

This proves Claim 5.

We now return to the end of Claim 4, where we must prove

x^{1+k(p−1)(q−1)} = x mod n.

We first claim that x^{1+k(p−1)(q−1)} = x mod p. This is clearly true if x = 0 mod p. If x ≠ 0 mod p, then by Fermat's
Little Theorem, x^{p−1} = 1 mod p, and hence x^{k(p−1)(q−1)} = 1 mod p, from which we have x^{1+k(p−1)(q−1)} = x mod p.
By the same argument we also have x^{1+k(p−1)(q−1)} = x mod q. But if a number is equal to x both modulo p and
modulo q, it is equal to x modulo n = p · q. Hence x^{1+k(p−1)(q−1)} = x mod n, and Claim 4 is proven.

We have shown that the RSA protocol allows for correct encoding and decoding. We also should be convinced
it is efficient, since it requires only operations that we know to be efficient, such as Euclid’s algorithm and modular
exponentiation. One thing we have not yet asked is why the scheme is secure. That is, why can’t the eavesdropper
Eve recover the message x also?

The answer, unfortunately, is that there is no proof that Eve cannot compute x efficiently from e(x). There
is simply a belief that this is a hard problem. It is an unproven assumption that there is no efficient algorithm for
computing x from e(x). There is the real but unlikely possibility that someone out there can read all messages sent
using RSA!

Let us seek some idea of why RSA is believed to be secure. If Eve obtains e(x) = x^e mod n, what can she do?
She could try all possible values of x to try to find the correct one; this clearly takes too long. Or she could try to
factor n and compute d. Factoring, however, is a widely known and well studied problem, and nobody has come up
with a polynomial time algorithm for the problem. In fact, it is widely believed that no such algorithm exists.

It would be nice if we could make some sort of guarantee. For example, suppose that breaking RSA allowed
us to factor n. Then we could say that RSA is as hard as factoring. Unfortunately, this is not the case either. It
is possible that RSA could be broken without providing a general factoring algorithm, although it seems that any
natural approach for breaking RSA would also provide a way to factor n.
CS124 Lecture 15

15.1 2SAT

We begin by showing yet another possible way to solve the 2SAT problem. Recall that the input to 2SAT is a logical
expression that is the conjunction (AND) of a set of clauses, where each clause is the disjunction (OR) of two literals.
(A literal is either a Boolean variable or the negation of a Boolean variable.) For example, the following expression
is an instance of 2SAT:
(x1 ∨ ¬x2) ∧ (¬x1 ∨ ¬x3) ∧ (x1 ∨ x2) ∧ (x4 ∨ ¬x3) ∧ (x4 ∨ ¬x1).

A solution to an instance of a 2SAT formula is an assignment of the variables to the values T (true) and F
(false) so that all the clauses are satisfied– that is, there is at least one true literal in each clause. For example, the
assignment x1 = T, x2 = F, x3 = F, x4 = T satisfies the 2SAT formula above.
Here is a simple randomized solution to the 2SAT problem. Start with some truth assignment, say by setting all
the variables to false. Find some clause that is not yet satisfied. Randomly choose one of the variables in that clause,
say by flipping a coin, and change its value. Continue this process, until either all clauses are satisfied or you get
tired of flipping coins.
In the example above, when we begin with all variables set to F, the clause (x1 ∨ x2) is not satisfied. So we
might randomly choose to set x1 to be T. In this case this would leave the clause (x4 ∨ ¬x1) unsatisfied, so we would
have to flip a variable in that clause, and so on.
Why would this algorithm tend to lead to a solution? Let us suppose that there is a solution, call it S. Suppose
we keep track of the number of variables in our current assignment A that match S. Call this number k. We would
like to get to the point where k = n, the number of variables in the formula, for then A would match the solution S.
How does k evolve over time?
At each step, we choose a clause that is unsatisfied. Hence we know that A and S disagree on the value of at
least one of the variables in this clause– if they agreed, the clause would have to be satisfied! If they disagree on both,
then clearly changing either one of the values will increase k. If they disagree on the value of just one of the two variables,
then with probability 1/2 we choose that variable and increase k by 1; with probability 1/2 we choose the other
variable and decrease k by 1.
Hence, in the worst case, k behaves like a random walk– it either goes up or down by 1, randomly. This leaves
us with the following question: if we start k at 0, how many steps does it take (on average, or with high probability)
for k to stumble all the way up to n, the number of variables?
We can check that the average number of steps to walk (randomly) from 0 to n is just n^2. In fact, the average
number of steps to walk from i to n is n^2 − i^2. Note that the average time T(i) to walk from i to n is given by:

T(n) = 0
T(i) = T(i − 1)/2 + T(i + 1)/2 + 1,   for 1 ≤ i ≤ n − 1
T(0) = T(1) + 1.


These equations completely determine T (i), and our solution satisfies these equations!
Hence, on average, we will find a solution in at most n^2 steps. (We might do better– we might not start with all
of our variables wrong, or we might have some moves where we are guaranteed to increase the number of matches!)
We can run our algorithm for say 100n^2 steps, and report that no solution exists if none was found. This
algorithm might return the wrong answer– there may be a satisfying truth assignment, and we have just been unlucky. But
most of the time it will be right.
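
Here is the random walk written out in Python (a sketch of ours; encoding a literal as a signed integer, +i for x_i and -i for its negation, is our convention):

import random

def two_sat(clauses, n):
    # clauses: list of pairs of nonzero integers; variables are numbered 1..n
    values = [False] * (n + 1)               # start with all variables false
    for _ in range(100 * n * n):             # give up after 100 n^2 steps
        unsatisfied = [c for c in clauses
                       if not any(values[abs(l)] == (l > 0) for l in c)]
        if not unsatisfied:
            return values[1:]                # a satisfying assignment
        lit = random.choice(random.choice(unsatisfied))
        values[abs(lit)] = not values[abs(lit)]   # flip a random variable in the clause
    return None                              # no solution found; probably unsatisfiable

# the example formula above:
# two_sat([(1, -2), (-1, -3), (1, 2), (4, -3), (4, -1)], 4)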
CS124 Lecture 16

An introductory example
Suppose that a company that produces three products wishes to decide the level of production of each so as to
maximize profits. Let x1 be the amount of Product 1 produced in a month, x2 that of Product 2, and x3 that of Product
3. Each unit of Product 1 yields a profit of 100, each unit of Product 2 a profit of 600, and each unit of Product 3 a
profit of 1400. There are limitations on x1 , x2 , and x3 (besides the obvious one, that x1 , x2 , x3 ≥ 0). First, x1 cannot
be more than 200, and x2 cannot be more than 300, presumably because of supply limitations. Also, the sum of the
three must be, because of labor constraints, at most 400. Finally, it turns out that Products 2 and 3 use the same
piece of equipment, with Product 3 using three times as much, and hence we have another constraint x2 + 3x3 ≤ 600.
What are the best levels of production?

We represent the situation by a linear program, as follows:

max 100x1 + 600x2 + 1400x3

x1 ≤ 200

x2 ≤ 300

x1 + x2 + x3 ≤ 400

x2 + 3x3 ≤ 600

x1 , x2 , x3 ≥ 0

The set of all feasible solutions of this linear program (that is, all vectors in 3-d space that satisfy all constraints)
is precisely the polyhedron shown in Figure 16.1.

We wish to maximize the linear function 100x1 + 600x2 + 1400x3 over all points of this polyhedron. Geometri-
cally, the linear equation 100x1 + 600x2 + 1400x3 = c can be represented by a plane parallel to the one determined
by the equation 100x1 + 600x2 + 1400x3 = 0. This means that we want to find the plane of this type that touches the
polyhedron and is as far towards the positive orthant as possible. Obviously, the optimum solution will be a vertex
(or the optimum solution will not be unique, but a vertex will do). Of course, two other possibilities with linear
programming are that (a) the optimum solution may be infinity, or (b) that there may be no feasible solution at all.
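
For illustration, here is how this linear program could be handed to an LP solver (our sketch, assuming the scipy library is available; linprog minimizes, so the profits are negated):

from scipy.optimize import linprog

c = [-100, -600, -1400]        # maximize profit = minimize its negation
A_ub = [[1, 0, 0],             # x1            <= 200
        [0, 1, 0],             # x2            <= 300
        [1, 1, 1],             # x1 + x2 + x3  <= 400
        [0, 1, 3]]             # x2 + 3 x3     <= 600
b_ub = [200, 300, 400, 600]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
print(res.x, -res.fun)         # should report x = (0, 300, 100), profit 320000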


[Figure: the feasible region, a polyhedron in (x1, x2, x3) space; the optimum lies at one of its vertices.]

Figure 16.1: The feasible region

In this case, an optimal solution exists, and moreover we shall show that it is easy to find.

Linear programs
Linear programs, in general, have the following form: there is an objective function that one seeks to optimize,
along with constraints on the variables. The objective function and the constraints are all linear in the variables;
that is, all equations have no powers of the variables, nor are the variables multiplied together. As we shall see,
almost all problems can be represented by linear programs, and for many problems it is an extremely convenient
representation. So once we explain how to solve linear programs, the question then becomes how to reduce other
problems to linear programming (LP).

There are polynomial time algorithms for solving linear programs. In practice, however, such problems are
solved by the simplex method devised by George Dantzig in 1947. The simplex method starts from a vertex (in this
case the vertex (0, 0, 0)) and repeatedly looks for a vertex that is adjacent, and has better objective value. That is, it
is a kind of hill-climbing in the vertices of the polytope. When a vertex is found that has no better neighbor, simplex
stops and declares this vertex to be the optimum. For example, in the figure one of the possible paths followed by
simplex is shown. No known variant of the simplex algorithm has been proven to take polynomial time, and most of
the variations used in practice have been shown to take exponential time on some examples. Fortunately, in practice,
bad cases rarely arise, and the simplex algorithm runs extremely quickly. There are now implementations of simplex
that routinely solve linear programs with many thousands of variables and constraints.

Of course, given a linear program, it is possible either that (a) the optimum solution may be infinity, or (b) that
there may be no feasible solution at all. If this is the case, the simplex algorithm will discover it.

Reductions between versions of simplex


A general linear programming problem may involve constraints that are equalities or inequalities in either
direction. Its variables may be nonnegative, or could be unrestricted in sign. And we may be either minimizing
or maximizing a linear function. It turns out that we can easily translate any such version to any other. One
such translation that is particularly useful is from the general form to the one required by simplex: minimization,
nonnegative variables, and equality constraints.

To turn an inequality ∑ ai xi ≤ b into an equality constraint, we introduce a new variable s (the slack variable for
this inequality), and rewrite this inequality as ∑ ai xi + s = b, s ≥ 0. Similarly, any inequality ∑ ai xi ≥ b is rewritten
as ∑ ai xi − s = b, s ≥ 0; s is now called a surplus variable.

We handle an unrestricted variable x as follows: we introduce two nonnegative variables, x+ and x− , and
replace x by x+ − x− everywhere. The idea is that we let x = x+ − x− , where we may restrict both x+ and x− to be
nonnegative. This way, x can take on any value, but there are only nonnegative variables.

Finally, to turn a maximization problem into a minimization one, we just multiply the objective function by −1.

A production scheduling example


We have the demand estimates for our product for all months of 1997, di : i = 1, . . . , 12, and they are very
uneven, ranging from 440 to 920. We currently have 30 employees, each of which produce 20 units of the product
each month at a salary of 2,000; we have no stock of the product. How can we handle such fluctuations in demand?
Three ways:

• overtime —but this is expensive since it costs 80% more than regular production, and has limitations, as
workers can only work 30% overtime.

• hire and fire workers —but hiring costs 320, and firing costs 400.

• store the surplus production —but this costs 8 per item per month

This rather involved problem can be formulated and solved as a linear program. As in all such reductions, the
crucial first step is defining the variables:

• Let wi be the number of workers we have during the ith month —we have w0 = 30.

• Let xi be the production for month i.

• oi is the number of items produced by overtime in month i.

• hi and fi are the number of workers hired/fired in the beginning of month i.

• si is the amount of product stored after the end of month i.

We now must write the constraints:

• xi = 20wi + oi —the amount produced is the one produced by regular production, plus overtime.

• wi = wi−1 + hi − fi , wi ≥ 0 —the changing number of workers.

• si = si−1 + xi − di ≥ 0 —the amount stored in the end of this month is what we started with, plus the production,
minus the demand.

• oi ≤ 6wi —only 30% overtime.

Finally, what is the objective function? It is

min 2000 ∑ wi + 400 ∑ fi + 320 ∑ hi + 8 ∑ si + 180 ∑ oi ,

where the summations are from i = 1 to 12.

A Communication Network Problem



We have a network whose lines have the bandwidth shown in Figure 16.2. We wish to establish three calls: one
between A and B (call 1), one between B and C (call 2), and one between A and C (call 3). We must give each call
at least 2 units of bandwidth, but possibly more. The link from A to B pays 3 per unit of bandwidth, from B to C
pays 2, and from A to C pays 4. Notice that each call can be routed in two ways (the long and the short path), or by a
combination (for example, two units of bandwidth via the short route, and three via the long route). Suppose we are
a shady network administrator, and our goal is to maximize the network's income (rather than minimize the overall
cost). How do we route these calls to maximize the network's income?

[Figure: the network on end nodes A, B, and C; its six links have capacities 10, 13, 6, 8, 11, and 12.]

Figure 16.2: A communication network

This is also a linear program. We have variables for each call and each path (long or short): xi is the bandwidth
routed along the short path for call i, and xi′ along the long path. We demand that (1) no edge bandwidth is exceeded, and (2)
each call gets a bandwidth of 2.

max 3x1 + 3x1′ + 2x2 + 2x2′ + 4x3 + 4x3′

x1 + x1′ + x2 + x2′ ≤ 10

x1 + x1′ + x3 + x3′ ≤ 12

x2 + x2′ + x3 + x3′ ≤ 8

x1 + x2′ + x3′ ≤ 6

x1′ + x2 + x3′ ≤ 13

x1′ + x2′ + x3 ≤ 11

x1 + x1′ ≥ 2

x2 + x2′ ≥ 2

x3 + x3′ ≥ 2

x1, x1′, . . . , x3, x3′ ≥ 0

The solution, obtained via simplex in a few milliseconds, is the following: x1 = 0, x1′ = 7, x2 = x2′ = 1.5,
x3 = 0.5, x3′ = 4.5.

Question: Suppose that we removed the constraints stating that each call should receive at least two units.
Would the optimum change?

Approximate Separation
An interesting last application: Suppose that we have two sets of points in the plane, the black points (xi, yi) :
i = 1, . . . , m and the white points (xi, yi) : i = m + 1, . . . , m + n. We wish to separate them by a straight line ax + by = c,
so that for all black points ax + by ≤ c, and for all white points ax + by ≥ c. In general, this would be impossible.
Still, we may want to separate them by a line that minimizes the sum of the “displacement errors” (distance from the
boundary) over all misclassified points. Here is the LP that achieves this:

min e1 + e2 + . . . + em + em+1 + . . . + em+n

e1 ≥ ax1 + by1 − c
e2 ≥ ax2 + by2 − c
...
em ≥ axm + bym − c
em+1 ≥ c − axm+1 − bym+1
...
em+n ≥ c − axm+n − bym+n
ei ≥ 0

Network Flows
Suppose that we are given the network in top of Figure 16.3, where the numbers indicate capacities, that is, the
amount of flow that can go through the edge in unit time. We wish to find the maximum amount of flow that can go
through this network, from S to T .

[Figure: four snapshots of the network on nodes S, A, B, C, D, T, showing the flow after each augmenting path; the final snapshot marks a minimum cut of capacity 6.]

Figure 16.3: Max flow



This problem can also be reduced to linear programming. We have a nonnegative variable for each edge, rep-
resenting the flow through this edge. These variables are denoted fSA , fSB , . . . We have two kinds of constraints:
capacity constraints such as fSA ≤ 5 (a total of 9 such constraints, one for each edge), and flow conservation con-
straints (one for each node except S and T ), such as fAD + fBD = fDC + fDT (a total of 4 such constraints). We wish
to maximize fSA + fSB , the amount of flow that leaves S, subject to these constraints. It is easy to see that this linear
program is equivalent to the max-flow problem. The simplex method would correctly solve it.

In the case of max-flow, it is very instructive to “simulate” the simplex method, to see what effect its various
iterations would have on the given network. Simplex would start with the all-zero flow, and would try to improve it.
How can it find a small improvement in the flow? Answer: it finds a path from S to T (say, by depth-first search),
and moves flow along this path of total value equal to the minimum capacity of an edge on the path (it can obviously
do no better). This is the first iteration of simplex (see Figure 16.3).

How would simplex continue? It would look for another path from S to T . Since this time we already partially
(or totally) use some of the edges, we should do depth-first search on the edges that have some residual capacity,
above and beyond the flow they already carry. Thus, the edge CT would be ignored, as if it were not there. The
depth-first search would now find the path S − A − D − T , and augment the flow by two more units, as shown in
Figure 16.3.

Next, simplex would again try to find a path from S to T . The path is now S − A − B − D − T (the edges C − T
and A − D are full and are therefore ignored), and we augment the flow as shown in the bottom of Figure 16.3.

Next simplex would again try to find a path. But since edges A − D, C − T , and S − B are full, they must be
ignored, and therefore depth-first search would fail to find a path, after marking the nodes S, A, C as reachable from
S. Simplex then returns the flow shown, of value 6, as maximum.

How can we be sure that it is the maximum? Notice that these reachable nodes define a cut (a set of nodes
containing S but not T ), and the capacity of this cut (the sum of the capacities of the edges going out of this set) is
6, the same as the max-flow value. (It must be the same, since this flow passes through this cut.) The existence of
this cut establishes that the flow is optimum!

There is a complication that we have swept under the rug so far: when we do depth-first search looking for a
path, we use not only the edges that are not completely full, but we must also traverse in the opposite direction all
edges that already have some non-zero flow. This would have the effect of canceling some flow; canceling may be
necessary to achieve optimality, see Figure 16.4. In this figure the only way to augment the current flow is via the
path S − B − A − T , which traverses the edge A − B in the reverse direction (a legal traversal, since A − B is carrying
non-zero flow).

Figure 16.4: Flows may have to be canceled — a four-node network on S, A, B, T in which every edge has capacity 1 and one unit of flow is already routed along S − A − B − T .

In general, a path from the source to the sink along which we can increase the flow is called an augmenting
path. We can look for an augmenting path by doing for example a depth first search along the residual network,
which we now describe. For an edge (u, v), let c(u, v) be its capacity, and let f (u, v) be the flow across the edge.
Note that we adopt the following convention: if 4 units flow from u to v, then f (u, v) = 4, and f (v, u) = −4. That is,
we interpret the fact that we could reverse the flow across an edge as being equivalent to a “negative flow”. Then the
residual capacity of an edge (u, v) is just
c(u, v) − f (u, v).

The residual network has the same vertices as the original graph; the edges of the residual network consist of all
weighted edges with strictly positive residual capacity. The idea is then if we find a path from the source to the sink
in the residual network, we have an augmenting path to increase the flow in the original network. As an exercise,
you may want to consider the residual network at each step in Figure 16.3.
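Here is a minimal sketch of this augmenting-path method in Python (not part of the notes); it keeps the flow antisymmetric, exactly as in the convention above, and runs depth-first search along edges of strictly positive residual capacity. The example network at the bottom is a small stand-in, not the network of Figure 16.3.

from collections import defaultdict

def max_flow(capacity, s, t):
    f = defaultdict(lambda: defaultdict(int))   # flow, kept antisymmetric
    nodes = set(capacity) | {v for u in capacity for v in capacity[u]}

    def dfs(u, pushed, visited):
        if u == t:
            return pushed
        visited.add(u)
        for v in nodes:
            residual = capacity.get(u, {}).get(v, 0) - f[u][v]
            if residual > 0 and v not in visited:
                d = dfs(v, min(pushed, residual), visited)
                if d > 0:
                    f[u][v] += d
                    f[v][u] -= d                # the "negative flow" convention
                    return d
        return 0

    total = 0
    while True:
        d = dfs(s, float('inf'), set())
        if d == 0:
            return total                        # no augmenting path: done
        total += d

# A small example network (not the one in Figure 16.3):
cap = {'S': {'A': 3, 'B': 2}, 'A': {'B': 1, 'T': 1}, 'B': {'T': 3}}
print(max_flow(cap, 'S', 'T'))                  # 4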

Suppose we look for a path in the residual network using depth first search. In the case where the capacities
are integers, we will always be able to push an integral amount of flow along an augmenting path. Hence, if the
maximum flow is f ∗ , the total time to find the maximum flow is O(E f ∗ ), since we may have to do an O(E) depth
first search up to f ∗ times. This is not so great.

Note that we do not have to do a depth-first search to find an augmenting path in the residual network. In fact,
using a breadth-first search each time yields an algorithm that provably runs in O(V E²) time, regardless of whether
or not the capacities are integers. We will not prove this here. There are also other algorithms and approaches to the
max-flow problem that improve on this running time.

To summarize: the max-flow problem can be easily reduced to linear programming and solved by simplex. But
it is easier to understand what simplex would do by following its iterations directly on the network. It repeatedly
finds a path from S to T along edges that are not yet full (have non-zero residual capacity), and also along any reverse
edges with non-zero flow. If an S − T path is found, we augment the flow along this path, and repeat. When a path
cannot be found, the set of nodes reachable from S defines a cut of capacity equal to the max-flow. Thus, the value
of the maximum flow is always equal to the capacity of the minimum cut. This is the important max-flow min-cut
theorem. One direction (that max-flow ≤ min-cut) is easy (think about it: any cut is larger than any flow); the other
direction is proved by the algorithm just described.
CS124 Lecture 17

Duality
As it turns out, the max-flow min-cut theorem is a special case of a more general phenomenon called duality.
Basically, duality means that for each maximization problem there is a corresponding minimization problem with
the property that any feasible solution of the min problem is greater than or equal to any feasible solution of the max
problem. Furthermore, and more importantly, they have the same optimum.

Consider the network shown in Figure 17.3, and the corresponding max-flow problem. We know that it can be
written as a linear program as follows:

Figure 17.3: A simple max-flow problem — nodes S, A, B, T with edge capacities S−A: 3, S−B: 2, A−B: 1, A−T: 1, B−T: 3.

max fSA + fSB

fSA ≤ 3
fSB ≤ 2
fAB ≤ 1
fAT ≤ 1                  (P)
fBT ≤ 3
fSA − fAB − fAT = 0
fSB + fAB − fBT = 0
f ≥ 0
Consider now the following linear program:

min 3ySA + 2ySB + yAB + yAT + 3yBT

ySA + uA ≥ 1
ySB + uB ≥ 1
yAB − uA + uB ≥ 0        (D)
yAT − uA ≥ 0
yBT − uB ≥ 0
y ≥ 0

This LP describes the min-cut problem! To see why, suppose that the uA variable is meant to be 1 if A is in the
cut with S, and 0 otherwise, and similarly for B (naturally, by the definition of a cut, S will always be with S in the
cut, and T will never be with S). Each of the y variables is to be 1 if the corresponding edge contributes to the cut
capacity, and 0 otherwise. Then the constraints make sure that these variables behave exactly as they should. For
example, the first constraint states that if A is not with S, then SA must be added to the cut. The third one states
that if A is with S and B is not (this is the only case in which the sum −uA + uB becomes −1), then AB must contribute
to the cut. And so on. Although the y’s and u’s are free to take values larger than one, they will be “slammed” by the
minimization down to 1 or 0.
Let us now make a remarkable observation: these two programs have strikingly symmetric, dual, structure.
This structure is most easily seen by putting the linear programs in matrix form. The first program, which we call
the primal (P), we write as:

max   1  1  0  0  0

      1  0  0  0  0    ≤  3
      0  1  0  0  0    ≤  2
      0  0  1  0  0    ≤  1
      0  0  0  1  0    ≤  1
      0  0  0  0  1    ≤  3
      1  0 −1 −1  0    =  0
      0  1  1  0 −1    =  0
     ≥0 ≥0 ≥0 ≥0 ≥0

Here we have removed the actual variable names, and we have included an additional row at the bottom denoting
that all the variables are non-negative. (An unrestricted variable will be denoted by unr.)

The second program, which we call the dual (D), we write as:

min   3  2  1  1  3  0  0

      1  0  0  0  0  1  0    ≥  1
      0  1  0  0  0  0  1    ≥  1
      0  0  1  0  0 −1  1    ≥  0
      0  0  0  1  0 −1  0    ≥  0
      0  0  0  0  1  0 −1    ≥  0
     ≥0 ≥0 ≥0 ≥0 ≥0 unr unr

Each variable of P corresponds to a constraint of D, and vice-versa. Equality constraints correspond to unre-
stricted variables (the u’s), and inequality constraints to restricted variables. Minimization becomes maximization.
The matrices are transposes of one another, and the roles of the right-hand side and the objective function are interchanged.
Such LP’s are called dual to each other. It is mechanical, given an LP, to form its dual. Suppose we start with
a maximization problem. Change all inequality constraints into ≤ constraints, negating both sides of an equation if
necessary. Then

• transpose the coefficient matrix,

• invert maximization to minimization,

• interchange the roles of the right-hand side and the objective function,

• introduce a nonnegative variable for each inequality, and an unrestricted one for each equality, and

• for each nonnegative variable introduce a ≥ constraint, and for each unrestricted variable introduce an equality
constraint.

If we start with a minimization problem, we instead begin by turning all inequality constraints into ≥ constraints,
we make the dual a maximization, and we change the last step so that each nonnegative variable corresponds
to a ≤ constraint. Note that it is easy to show from this description that the dual of the dual is the original primal
problem!

By the max-flow min-cut theorem, the two LP’s P and D above have the same optimum. In fact, this is true
for general dual LP’s! This is the duality theorem, which can be stated as follows (we shall not prove it; the best
proof comes from the simplex algorithm, very much as the max-flow min-cut theorem comes from the max-flow
algorithm):

If an LP has a bounded optimum, then so does its dual, and the two optimal values coincide.
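As a sanity check on the duality theorem, this sketch (scipy assumed; not part of the notes) solves both P and D from above and prints the two optimal values, which coincide at 4, the max-flow value of the network in Figure 17.3.

from scipy.optimize import linprog

# Primal P: variables [fSA, fSB, fAB, fAT, fBT]; maximize fSA + fSB.
p = linprog([-1, -1, 0, 0, 0],
            A_eq=[[1, 0, -1, -1, 0],          # conservation at A
                  [0, 1, 1, 0, -1]],          # conservation at B
            b_eq=[0, 0],
            bounds=[(0, 3), (0, 2), (0, 1), (0, 1), (0, 3)])

# Dual D: variables [ySA, ySB, yAB, yAT, yBT, uA, uB]; >= rows negated to <=.
d = linprog([3, 2, 1, 1, 3, 0, 0],
            A_ub=[[-1, 0, 0, 0, 0, -1, 0],    # ySA + uA >= 1
                  [0, -1, 0, 0, 0, 0, -1],    # ySB + uB >= 1
                  [0, 0, -1, 0, 0, 1, -1],    # yAB - uA + uB >= 0
                  [0, 0, 0, -1, 0, 1, 0],     # yAT - uA >= 0
                  [0, 0, 0, 0, -1, 0, 1]],    # yBT - uB >= 0
            b_ub=[-1, -1, 0, 0, 0],
            bounds=[(0, None)] * 5 + [(None, None)] * 2)

print(-p.fun, d.fun)                          # 4.0 4.0 — the optima coincide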
Matching
It is often useful to compose reductions. That is, we can reduce a problem A to B, and B to C, and since we
know how to solve C, we end up solving A. A good example is the matching problem.

Suppose that the bipartite graph shown in Figure 17.4 records the compatibility relation between four boys and
four girls. We seek a maximum matching, that is, a set of edges that is as large as possible, and in which no two
edges share a node. For example, in Figure 17.4 there is a complete matching (a matching that involves all nodes).

Figure 17.4: Reduction from matching to max-flow (all capacities are 1) — a source S connected to the boys Al, Bob, Charlie, and Dave; compatibility edges from the boys to the girls Eve, Fay, Grace, and Helen; and edges from the girls to a sink T.


To reduce this problem to max-flow, we create a new source and a new sink, connect the source with all boys
and all girls with the sink, and direct all edges of the original bipartite graph from the boys to the girls. All edges
have capacity one. It is easy to see that the maximum flow in this network corresponds to the maximum matching.

Well, the situation is slightly more complicated than was stated above: what is easy to see is that the optimum
integer-valued flow corresponds to the optimum matching. We would be at a loss interpreting as a matching a flow
that ships .7 units along the edge Al-Eve! Fortunately, what the algorithm in the previous section establishes is that if
the capacities are integers, then the maximum flow is integer. This is because we only deal with integers throughout
the algorithm. Hence integrality comes for free in the max-flow problem.
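A sketch of the reduction (the compatibility edges below are made up for illustration): build the unit-capacity network and run any max-flow routine, such as the max_flow sketch given earlier.

# assumes the max_flow sketch from the max-flow discussion is in scope
boys = ['Al', 'Bob', 'Charlie', 'Dave']
girls = ['Eve', 'Fay', 'Grace', 'Helen']
compatible = [('Al', 'Eve'), ('Al', 'Fay'), ('Bob', 'Fay'),
              ('Charlie', 'Eve'), ('Charlie', 'Grace'),
              ('Dave', 'Grace'), ('Dave', 'Helen')]   # hypothetical edges

cap = {'S': {b: 1 for b in boys}}                     # source -> boys
for b, g in compatible:
    cap.setdefault(b, {})[g] = 1                      # boys -> girls
for g in girls:
    cap[g] = {'T': 1}                                 # girls -> sink
print(max_flow(cap, 'S', 'T'))                        # 4: a complete matching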

Unfortunately, max-flow is about the only problem for which integrality comes for free. It is a very difficult
problem to find the optimum solution (or any solution) of a general linear program with the additional constraint that
(some or all of) the variables be integers. We will see why in forthcoming lectures.

Games
We can represent various situations of conflict in life in terms of matrix games. For example, the game shown
below is the rock-paper-scissors game. The Row player chooses a row strategy, the Column player chooses a column
strategy, and then Column pays to Row the value at the intersection (if it is negative, Row ends up paying Column).

        r    p    s

   r    0   −1    1
   p    1    0   −1
   s   −1    1    0

Games do not necessarily have to be symmetric (that is, Row and Column have the same strategies, or, in terms of
matrices, A = −A^T). For example, in the following fictitious Clinton-Dole game the strategies may be the issues on
which a candidate for office may focus (the initials stand for “economy,” “society,” “morality,” and “tax-cut”) and
the entries are the number of voters lost by Column.

        m    t

   e    3   −1
   s   −2    1

We want to explore how the two players may play “optimally” these games. It is not clear what this means. For
example, in the first game there is no such thing as an optimal “pure” strategy (it very much depends on what your
opponent does; similarly in the second game). But suppose that you play this game repeatedly. Then it makes sense
to randomize. That is, consider a game given by an m × n matrix G = (Gij); define a mixed strategy for the row player
to be a vector (x1, . . . , xm), such that xi ≥ 0, and ∑_{i=1}^m xi = 1. Intuitively, xi is the probability with which Row plays
strategy i. Similarly, a mixed strategy for Column is a vector (y1, . . . , yn), such that yj ≥ 0, and ∑_{j=1}^n yj = 1.

Suppose that, in the Clinton-Dole game, Row decides to play the mixed strategy (0.5, 0.5). What should Column
do? The answer is easy: If the xi’s are given, there is a pure strategy (that is, a mixed strategy with all yj’s zero except
for one) that is optimal. It is found by comparing the n numbers ∑_{i=1}^m Gij xi, for j = 1, . . . , n (in the Clinton-Dole
game, Column would compare 0.5 with 0, and of course choose the smallest — remember, the entries denote what
Column pays). That is, if Column knew Row’s mixed strategy, s/he would end up paying the smallest among the
n outcomes ∑_{i=1}^m Gij xi, for j = 1, . . . , n. On the other hand, Row will seek the mixed strategy that maximizes this
minimum; that is,

max_x min_j ∑_{i=1}^m Gij xi.

This maximum would be the best possible guarantee about an expected outcome that Row can have by choosing a
mixed strategy. Let us call this guarantee z; what Row is trying to do is solve the following LP:

max z

z − 3x1 + 2x2 ≤ 0
z + x1 − x2 ≤ 0
x1 + x2 = 1

Symmetrically, it is easy to see that Column would solve the following LP:

min w

w − 3y1 + y2 ≥ 0
w + 2y1 − y2 ≥ 0
y1 + y2 = 1

The crucial observation now is that these LP’s are dual to each other, and hence have the same optimum, call it V .
Let us summarize: By solving an LP, Row can guarantee an expected income of at least V , and by solving the
dual LP, Column can guarantee an expected loss of at most the same value. It follows that this is the uniquely defined
optimal play (it was not a priori certain that such a play exists). V is called the value of the game. In this case, the
optimum mixed strategy for Row is (3/7, 4/7), and for Column (2/7, 5/7), with a value of 1/7 for the Row player.
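Row’s LP above is small enough to hand to a solver directly; this sketch (scipy assumed, not part of the notes) recovers the value 1/7 and the strategy (3/7, 4/7).

from scipy.optimize import linprog

# Row's LP for the Clinton-Dole game: variables [z, x1, x2], maximize z.
res = linprog([-1, 0, 0],
              A_ub=[[1, -3, 2],      # z <= 3x1 - 2x2  (Column plays m)
                    [1, 1, -1]],     # z <= -x1 + x2   (Column plays t)
              b_ub=[0, 0],
              A_eq=[[0, 1, 1]], b_eq=[1],
              bounds=[(None, None), (0, None), (0, None)])
z, x1, x2 = res.x
print(z, x1, x2)                     # 1/7, 3/7, 4/7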

The existence of mixed strategies that are optimal for both players and achieve the same value is a fundamental
result in Game Theory called the min-max theorem. It can be written in equations as follows:

max_x min_y ∑_{i,j} xi yj Gij = min_y max_x ∑_{i,j} xi yj Gij.

It is surprising, because the left-hand side, in which Column optimizes last, and therefore has presumably an ad-
vantage, should be intuitively smaller than the right-hand side, in which Column decides first. Duality equalizes the
two, as it does in max-flow min-cut.

Circuit Evaluation

Figure 17.5: A Boolean circuit — input gates with values T, F, F, T at the bottom feed AND, OR, and NOT gates, with an OR gate at the output.



We have seen many interesting and diverse applications of linear programming. In some sense, the next one is
the ultimate application. Suppose that we are given a Boolean circuit, that is, a DAG of gates, each of which is either
an input gate (indegree zero, and has a value T or F), or an OR gate (indegree two), or an AND gate (indegree two),
or a NOT gate (indegree one). One of them is designated as the output gate. We wish to tell if this circuit evaluates
(following the laws of Boolean values bottom-up) to T. This is known as the circuit value problem.

There is a very simple and automatic way of translating the circuit value problem into an LP: for each gate g
we have a variable xg. For all gates we have 0 ≤ xg ≤ 1. If g is a T input gate, we have the equation xg = 1; if it is F,
xg = 0. If it is an OR gate, say of the gates h and h', then we have the inequality xg ≤ xh + xh'. If it is an AND gate
of h and h', we have the inequalities xg ≤ xh, xg ≤ xh' (notice the difference). For a NOT gate we say xg = 1 − xh.
Finally, we want to max xo, where o is the output gate. It is easy to see that the optimum value of xo will be 1 if the
circuit value is T, and 0 if it is F.
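The translation is mechanical enough to code up; here is a sketch (the gate encoding and names are made up) that builds the LP for a circuit and reports the optimal xo.

from scipy.optimize import linprog

def circuit_value_lp(gates, output):
    # gates: name -> ('T',) | ('F',) | ('OR', h, hp) | ('AND', h, hp) | ('NOT', h)
    idx = {g: i for i, g in enumerate(gates)}
    n = len(gates)
    c = [0.0] * n
    c[idx[output]] = -1.0                      # maximize x_o (negated)
    A_ub, b_ub, A_eq, b_eq = [], [], [], []
    for g, spec in gates.items():
        i, kind = idx[g], spec[0]
        if kind in ('T', 'F'):                 # x_g = 1 or x_g = 0
            row = [0.0] * n; row[i] = 1.0
            A_eq.append(row); b_eq.append(1.0 if kind == 'T' else 0.0)
        elif kind == 'OR':                     # x_g <= x_h + x_h'
            row = [0.0] * n; row[i] = 1.0
            row[idx[spec[1]]] -= 1.0; row[idx[spec[2]]] -= 1.0
            A_ub.append(row); b_ub.append(0.0)
        elif kind == 'AND':                    # x_g <= x_h and x_g <= x_h'
            for h in spec[1:]:
                row = [0.0] * n; row[i] = 1.0; row[idx[h]] -= 1.0
                A_ub.append(row); b_ub.append(0.0)
        elif kind == 'NOT':                    # x_g = 1 - x_h
            row = [0.0] * n; row[i] = 1.0; row[idx[spec[1]]] = 1.0
            A_eq.append(row); b_eq.append(1.0)
    res = linprog(c, A_ub=A_ub or None, b_ub=b_ub or None,
                  A_eq=A_eq or None, b_eq=b_eq or None, bounds=[(0, 1)] * n)
    return -res.fun

gates = {'g1': ('T',), 'g2': ('F',), 'g3': ('NOT', 'g2'),
         'g4': ('AND', 'g1', 'g3'), 'g5': ('OR', 'g2', 'g4')}
print(circuit_value_lp(gates, 'g5'))           # 1.0: the circuit evaluates to T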

This is a rather straight-forward reduction to LP, from a problem that may not seem very interesting or hard at
first. However, the circuit value problem is in some sense the most general problem solvable in polynomial time!
Here is a justification of this statement: after all, a polynomial time algorithm runs on a computer, and the computer
is ultimately a Boolean combinational circuit implemented on a chip. Since the algorithm runs in polynomial time,
it can be rendered as a circuit consisting of polynomially many superpositions of the computer’s circuit. Hence, the
fact that the circuit value problem reduces to LP means that all polynomially solvable problems do!

In our next topic, Complexity and NP-completeness, we shall see that a class that contains many hard problems
reduces, much the same way, to integer programming.
CS124 NP-Completeness Review

Where We Are Headed


Up to this point, we have generally assumed that if we were given a problem, we could find a way to solve
it. Unfortunately, as most of you know, there are many fundamental problems for which we have no efficient
algorithms. In fact, by classifying these hard problems, we can show that there is a large class of simple problems
for which there is (probably) no efficient algorithm– the NP-complete problems. Moreover, if you could design an
efficient algorithm for any one of these problems, you could design an algorithm for all of them! It’s an all or none
proposition, so if you could solve just one of them, you would become rich and famous overnight. These notes will
review the main concepts behind the theory of NP-complete problems.

One might ask why it is important to study what problems we cannot solve, instead of focusing on problems
we can solve. Especially for an algorithms course. There are several possible responses, but perhaps the best is that
if you do not know what is impossible, you might waste a great deal of time trying to solve it, instead of coming to
terms with its impossibility and finding suitable alternatives (such as, for example, approximations instead of exact
answers).


Polynomial Running Times


The faster the running time, the better. Linear is great, quadratic is all right, cubic is perhaps a bit slow. But
how exactly should we classify which problems have efficient algorithms? Where is the cut off point?

The choice computer scientists have made is to group together all problems that are solvable in polynomial
time. That is, we define a class of problems P as follows:

Definition: P is the set of all problems Z with a yes-no answer such that there is an algorithm A and a positive
integer k such that A solves Z in O(n^k) steps (on inputs of size n).

Let us clarify some points in the definition. The restriction to problems with a yes-no answer is really just a
technical convenience. For example, the problem of finding the minimum spanning tree (on a graph with integer
weights) can be recast as the problem of answering the following question: is the size of the minimum spanning tree
at least j? If you can answer one question, you can answer the other; considering only yes-no problems proves more
convenient.

From the definition, all problems with linear, quadratic, or cubic time algorithms are all in P. But so are
problems with algorithms that require time Θ(n^100). This may seem a little strange; for example, would a problem
with an algorithm that runs in time Θ(n^100) really be said to have an efficient solution? But the main point of defining
the class P is to separate these problems from those that require exponential time, or Ω(2^(n^ε)) steps (for some ε > 0).
Problems that require this much time to solve are clearly asymptotically inefficient, compared with polynomial time
algorithms. The class P is also useful because, as we shall see below, it is closed under polynomial time reductions.

Reductions
Let A and B be two problems whose instances require a “yes” or “no” answer. (For example, 2SAT is such a
problem, as is the question of whether a bipartite graph has a perfect matching.) A (polynomial time) reduction from
A to B is a polynomial time algorithm R which transforms an input of problem A into an input for problem B. That
is, given an input x to problem A, R will produce an input R(x) to problem B, such that the answer to x is yes for
problem A if and only if the answer for R(x) is yes for problem B.

This idea of reduction should not seem unfamiliar; all along we have seen the idea of reducing one problem
to another. (For example, we recently saw how to reduce the matching problem to the max-flow problem, which
could be reduced to linear programming.) The only difference is, right now, for convenience we are only considering
yes-no type problems.

A reduction from A to B, together with a polynomial time algorithm for B, yields a polynomial time algorithm
for A. (See Figure 18.1.) Let us explain this in more detail. For any input x of A of size n, the reduction R takes time
p(n), where p is a polynomial, to produce an input R(x) for B. This input R(x) can have size at most p(n), since
this is the largest input R could possibly construct in p(n) time! We now submit R(x) as an input to the algorithm
for B, which we assume runs in time q(m) on inputs of size m, where q is another polynomial. The algorithm for B
gives us the right answer for B on R(x), and hence also the right answer for A on x. The total time taken was at most
p(n) + q(p(n)), which is itself just a polynomial in n!

This idea of reduction explains why the class P is so useful. If we have a problem A in P, and some other
problem B reduces to it, then B is in P as well. Hence we say that P is closed under polynomial time reductions.

If we can reduce A to B, we are essentially establishing that, give or take a polynomial, A is no harder than B. We
can write this as

A ≤ B,

where the inequality represents a fact about the complexities of the two problems. If we know that B is easy,
then A ≤ B establishes that A is easy.

We can also look at this inequality the other way. If we know that A is hard, then the inequality establishes that
B is hard. It is this implication that we will now use, to show that problems are hard. This way of using reductions is
very different from the way we have used reductions so far; it is also much more sophisticated and counter-intuitive.
Figure 18.1: Reductions lead to algorithms — the input x for A is transformed by the reduction R into an input R(x) for B; the algorithm for B answers yes/no on R(x), and that answer is the output for A on x.



Short Certificates and the Class NP


We will now begin to examine a class of problems that includes several “hard” problems. What we mean by
“hard” in this setting is that although nobody has yet shown that there are no polynomial time algorithms to solve
these problems, there is overwhelming evidence that this is the case.

Recall that the class P is the class of yes-no problems that can be solved in polynomial time. The new class we
define, NP, consists of yes-no problems with a different property: if the answer to the problem is yes, then there is a
short certificate that can be checked to show that the answer is correct. A bit more formally, a short certificate must
have the following properties:

• It must be short: the length of the certificate is no more than polynomial in the length of the input.

• It must certify: there is a polynomial time checker (an algorithm!) that takes the input and the short certificate
and checks that the certificate is valid.

The idea of the short certificate is the following: a problem is in NP if someone else can convince you in polynomial
time that the answer is yes when the answer is yes, and they cannot fool you into thinking the answer is yes when
the answer is no.

Let us move from the abstract to some specific problems.

Compositeness: Testing whether a number is composite is in NP, since if somebody wanted to convince you
a number is composite, they could give you its factorization (the short certificate). You could then check that the
factorization was correct by doing the multiplication, in polynomial time. (Notice you can’t be fooled!)

3SAT: 3SAT is like the 2SAT problem we have seen in the homework, except that there can be up to three
literals in each clause. 3SAT is in NP, since if somebody wanted to convince you that a formula is satisfiable,
they could give you a satisfying truth assignment (the short certificate). You could then check the proposed truth
assignment in polynomial time by plugging it in and checking each clause. (Again, notice you can’t be fooled!)

Finally, note that P is a subset of NP. To see why, note that if a problem is in P, we don’t even need a short
certificate; someone can convince themselves of the correct answer just by running the polynomial time algorithm!

Now, let us see an example of a problem which does not appear to have short certificates:

not-satisfiable-3SAT: This is like 3SAT, but now the answer is yes if there is no satisfying assignment for the
formula. Given a formula with no solution, how can we convince people there is no solution? The obvious way is to
list all possible truth assignments, and show that they do not work, but this would not yield a short certificate.

NP-completeness
The “hard” problems we will be looking at will be the hardest problems in NP; we call these problems NP-
complete. An NP-complete problem will have two properties:

• it is in NP

• all other problems in NP reduce to it

Thus, our concept of “being the hardest” is based on reductions. If all other problems in NP reduce to a
problem, it must be at least as hard as any of them! It may seem surprising that there are problems in NP that have
this property.

We will start by proving (well, sketching a proof) that an easily stated problem, circuit SAT, is NP-
complete. Once we have a first problem done, it will turn out to be much easier to prove that other problems
are NP-complete. This is because once we have one NP-complete problem, it is much easier to prove others:

Claim 18.1 Suppose problem A is NP-complete, problem B is in NP, and problem A reduces to problem B. Then
problem B is NP-complete.

Intuitively, this must be true because if A reduces to B, then B is at least as hard as A. So as long as B is in NP,
and the hardest problems in NP are the NP-complete ones, then B must also be NP-complete.

Slightly more formally, we have to show that every problem in NP reduces to B. But we already know that
every problem reduces to A, and A reduces to B. By combining reductions, as in the picture below, we have that
every problem in NP reduces to B. So once we have one problem, we can start building up “chains” of NP-complete
problems easily.
Figure 18.2: If C reduces to A, and A reduces to B, then C reduces to B. (Transitivity!)



Cook’s Theorem
The problem circuit SAT is defined as follows: given a Boolean circuit and the values of some of its inputs, is
there a way to set the rest of its inputs so that the output is T? It is easy to show that circuit SAT is in NP.

Claim 18.2 A problem is in NP if and only if it can be reduced to circuit SAT.

This statement is known as Cook’s theorem, and it is one of the most important results in Computer Science.

One direction is easy. If a problem A can be reduced to circuit SAT, it can easily be shown to be in NP. A
short certificate for an input to problem A consists of the short certificate for the circuit that results from running the
reduction from A to circuit SAT on the input. Given this short certificate, a polynomial time algorithm could run
the reduction on the input to A to get the appropriate circuit, and then use the short certificate to check the circuit.

The other direction is more complicated, so we offer a somewhat informal explanation. Suppose that we have
a problem A in NP. We need to show that it reduces to circuit SAT. Since A is in NP, there is a polynomial time
algorithm that checks the validity of inputs of A together with the appropriate certificates. But we could program this
algorithm on a computer, and this program would really be just a huge Boolean circuit. (After all, computers are just
big Boolean circuits themselves!) The input to this circuit is the input to problem A along with a short certificate.
Now suppose we are given a specific instance x of A. The question of whether x is a yes instance or no instance
is exactly the question of whether there is an appropriate short certificate, which is exactly the same question as
asking if there is some way of setting the rest of the inputs to the Boolean circuit so that the answer is T. Hence, the
construction of the circuit we described is the sought reduction from A to circuit SAT!

More NP-complete problems


Now that we have proved that circuit SAT is NP-complete, we will build on this to find other NP-complete
problems. For example, we will now show that circuit SAT reduces to 3SAT, and since 3SAT is clearly in NP, this
shows that 3SAT is NP-complete.

Suppose we are given a circuit C with some input gates unset. We must (quickly, in polynomial time) construct
from this circuit a 3SAT formula R(C) which is satisfiable if and only if there is a satisfying assignment of the circuit
inputs. In essence, we want to mimic the actions of the circuit with a suitable formula.

The formula R(C) will have one variable for each gate (that is, each input, and each output of an AND, OR, or
NOT), and each gate will also lead to certain clauses, as described below:

1. If x is a T input gate, then add the clause (x).

2. If x is an F input gate, then add the clause (¬x).

3. If x is an unknown input gate, then no clauses are added for it.

4. If x is the OR of gates y and z, then add the clauses (¬y ∨ x), (¬z ∨ x), and (¬x ∨ y ∨ z). (It is easy to see that the
conjunction of these clauses is equivalent to x ↔ (y ∨ z).)

5. If x is the AND of gates y and z, then add the clauses (¬x ∨ y), (¬x ∨ z), and (¬y ∨ ¬z ∨ x). (It is easy to see that the
conjunction of these clauses is equivalent to x ↔ (y ∧ z).)

6. If x is the NOT of gate y, then add the clauses (x ∨ y) and (¬x ∨ ¬y). (It is easy to see that the conjunction of these
clauses is equivalent to x ↔ ¬y.)

7. Finally, if gate x is the output gate, add the clause (x), expressing the condition that the output gate should be
T.

The conjunction of all of these clauses yields the formula R(C). It should be apparent that this reduction R can
be accomplished in polynomial (in fact, in linear) time. To verify it is a valid reduction, we must now show that C
has a setting of the unknown input gates that makes the output T if and only if R(C) is satisfiable.

Suppose C has a valid setting. Then we claim R(C) can be satisfied by the truth assignment that gives each
variable the same value as the appropriate gate when C is run on this valid setting. This truth assignment must
satisfy all the clauses of R(C), since we constructed R(C) to compute the same values as the circuit. Note that the
output gate is T for C, and hence the final clause listed above is also satisfied.
Conversely (and this is more subtle!), if there is a valid truth assignment for R(C), then there is a valid setting
for the inputs of C that makes the output T. Just set the unknown input gates in the manner prescribed by the truth
assignment for R(C). Since R(C) effectively mimics the computation of the circuit, we know the output gate must
be T when these inputs are applied.
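Here is a sketch of the clause construction R(C) (the gate encoding is made up; a literal is a pair of a variable and a sign):

def circuit_to_3sat(gates, output):
    # gates: name -> ('T',) | ('F',) | ('?',) | ('OR', y, z) | ('AND', y, z) | ('NOT', y)
    # a literal is (gate, True) for the gate variable, (gate, False) for its negation
    clauses = []
    for x, spec in gates.items():
        kind = spec[0]
        if kind == 'T':                                   # rule 1
            clauses.append([(x, True)])
        elif kind == 'F':                                 # rule 2
            clauses.append([(x, False)])
        elif kind == 'OR':                                # rule 4
            y, z = spec[1], spec[2]
            clauses += [[(y, False), (x, True)], [(z, False), (x, True)],
                        [(x, False), (y, True), (z, True)]]
        elif kind == 'AND':                               # rule 5
            y, z = spec[1], spec[2]
            clauses += [[(x, False), (y, True)], [(x, False), (z, True)],
                        [(y, False), (z, False), (x, True)]]
        elif kind == 'NOT':                               # rule 6
            y = spec[1]
            clauses += [[(x, True), (y, True)], [(x, False), (y, False)]]
        # unknown input gates ('?') contribute no clauses (rule 3)
    clauses.append([(output, True)])                      # rule 7
    return clauses

gates = {'u': ('?',), 'v': ('?',), 'g': ('AND', 'u', 'v')}
print(circuit_to_3sat(gates, 'g'))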

From 3SAT to Integer Linear Programming

We must take a 3SAT formula and convert it to an integer linear program. This reduction is easy. Restrict all
variables so that they are either 0 or 1 by including the constraint 0 ≤ x ≤ 1. Now a clause such as (x ∨ ¬y ∨ z) can
be turned into a linear constraint by replacing ∨ by +, a literal x by x, and a literal ¬x by (1 − x), and then forcing the
whole thing to be at least 1. For example, the above clause becomes x + (1 − y) + z ≥ 1. The appropriate clause is
clearly satisfied if and only if this constraint is; all terms on the left of the equation are either 0 or 1, and there is at
least one 1 if and only if one of the literals of the clause is true.
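The per-clause translation is a one-liner in spirit; a sketch (the literal encoding is made up):

def clause_to_constraint(clause):
    # clause: list of (variable, is_positive); returns (coeffs, bound), meaning
    # sum(coeff * x) >= bound, after rewriting each negated literal as (1 - x)
    coeffs, bound = {}, 1
    for var, positive in clause:
        if positive:
            coeffs[var] = coeffs.get(var, 0) + 1
        else:
            coeffs[var] = coeffs.get(var, 0) - 1
            bound -= 1                       # the constant 1 from (1 - x)
    return coeffs, bound

print(clause_to_constraint([('x', True), ('y', False), ('z', True)]))
# ({'x': 1, 'y': -1, 'z': 1}, 0), i.e. x + (1 - y) + z >= 1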

It is somewhat strange that linear programming can be solved in polynomial time, but when we try to restrict the
solutions to be integers, then the problem appears not to be solvable in polynomial time (since it is NP-complete).

From 3SAT to Independent Set

In an input to Independent Set we are given a graph G = (V, E) and an integer K. We are asked if there is a set
I ⊆ V with |I| ≥ K such that if u, v ∈ I then (u, v) ∉ E. That is, we are asked to find a set of vertices of size at least
K such that no two are connected by an edge. The problem is clearly in NP. (Why?)

We reduce 3SAT to Independent Set. That is, given a Boolean formula φ with at most 3 literals in each clause,
we must (in polynomial time) come up with a graph G = (V, E) and an integer K so that G has an independent set of
size K or more if and only if the formula φ is satisfiable.

The reduction is illustrated in Figure 18.3. For each clause, we have a group of vertices, one for each literal in
the clause, connected by all possible edges. Between groups of vertices, we connect two vertices if they correspond
to opposite literals (like x and ¬x). We let K be the number of clauses. This completes the reduction, and it is clear
that it can be accomplished in polynomial time. We now show there is a satisfying truth assignment for φ if and only
if there is an independent set of size at least K.
Figure 18.3: Turning formulae into graphs — one group of literal vertices per clause (a triangle for a three-literal clause, an edge for a two-literal clause), with additional edges joining each literal to its opposites in other clauses.



If there is a truth assignment for φ, then there is at least one true literal in each clause. Pick just one for each
clause in any way. The set I of corresponding vertices must give an independent set of size K. This is because we
use only one vertex per clause, so the only way I could not be independent is if it included two opposite literals,
which is impossible, because the satisfying assignment cannot set two opposite literals to T.

Now suppose G has an independent set I of size K. Since there are K groups, and each group is completely
interconnected, there must be one vertex from each group in I. Consider the assignment that sets all literals
corresponding to vertices in I to T, their opposites to F, and any unused variables arbitrarily. It is clear that this is a valid truth
assignment (since if a variable is set to T, its opposite must be set to F).
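A sketch of the construction (clause and literal encodings made up): one vertex per literal occurrence, all edges within a clause group, edges between opposite literals, and K equal to the number of clauses.

def sat_to_independent_set(clauses):
    # a literal is (variable, is_positive); a vertex is (clause index, literal)
    vertices = [(i, lit) for i, cl in enumerate(clauses) for lit in cl]
    edges = set()
    for u in vertices:
        for v in vertices:
            if u < v:
                (i, (a, pa)), (j, (b, pb)) = u, v
                if i == j:                        # same clause: fully connect
                    edges.add((u, v))
                elif a == b and pa != pb:         # opposite literals
                    edges.add((u, v))
    return vertices, edges, len(clauses)          # K = number of clauses

clauses = [[('x', True), ('y', True), ('z', True)],
           [('x', False), ('y', False)]]
V, E, K = sat_to_independent_set(clauses)
print(len(V), len(E), K)                          # 5 vertices, 6 edges, K = 2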

From Independent Set to Vertex Cover and Clique

Let G = (V, E) be a graph. A vertex cover of G is a set C ⊆ V such that all edges in E have at least one endpoint
in C. That is, each edge is adjacent to at least one vertex in the vertex cover. The Vertex Cover problem is, given a
graph G and a number K, to determine if G has a vertex cover of size at most K.

The reduction from Independent Set to Vertex Cover is immediate from the following observation: C is a
vertex cover of G = (V, E) if and only if V − C is an independent set! (For example, suppose I is an independent set,
and consider some edge (u, v). Both u and v can’t be in the independent set, so V − I contains either u or v or both,
and the edge is covered.) So the reduction is trivial; given an instance (G, K) of Independent Set, we produce the
instance (G, |V| − K) of Vertex Cover.

A clique in a graph is a set of fully connected nodes – every possible edge between every pair of the nodes is
there. The Clique problem asks whether there is a clique of size K or larger in the graph. Again, the reduction from
Independent Set is immediate from a simple observation. Let Ḡ be the complement of G, which is the graph with
the same nodes as G, but the edges of Ḡ are precisely those edges that are missing from G. Then C is a clique in
Ḡ if and only if C is an independent set in G = (V, E). (See Figure 18.4.)

Figure 18.4: Independent sets become cliques in the complement.


CS124 Lecture 19

We have defined the class of NP-complete problems, which have the property that if there is a polynomial time
algorithm for any one of these problems, there is a polynomial time algorithm for all of them. Unfortunately, nobody
has found a polynomial time algorithm for any NP-complete problem, and it is widely believed that it is impossible to do so.

This might seem like a big hint that we should just give up, and go back to solving problems like MAX-FLOW,
where we can find a polynomial time solution. Unfortunately, NP-complete problems show up all the time in the real world,
and people want solutions to these problems. What can we do?


What Can We Do?


Actually, there is a great deal we can do. Here are just a few possibilities:

• Restrict the inputs.

NP-completeness refers to the worst case inputs for a problem. Often, inputs are not as bad as those that arise
in NP-completeness proofs. For example, although the general SAT problem is hard, we have seen that the
cases of 2SAT and Horn formulae have simple polynomial time algorithms.

• Provide approximations.

NP-completeness results often arise because we want an exact answer. If we relax the problem so that we only
have to return a good answer, then we might be able to develop a polynomial time algorithm. For example,
we have seen that a greedy algorithm provides an approximate answer for the SET COVER problem.

• Develop heuristics.

Sometimes we might not be able to make absolute guarantees, but we can develop algorithms that seem to
work well in practice, and have arguments suggesting why they should work well. For example, the simplex
algorithm for linear programming is exponential in the worst case, but in practice it’s generally the right tool
for solving linear programming problems.

• Use randomness.

So far, all our algorithms have been deterministic; they always run the same way on the same input. Perhaps
if we let our algorithm do some things randomly, we can avoid the NP-completeness problem?

Actually, the question of whether one can use randomness to solve an NP-complete problem is still open,
though it appears unlikely. (As is, of course, the problem of whether one can solve an NP-complete problem
in polynomial time!) However, randomness proves a useful tool when we try to come up with approximation
algorithms and heuristics. Also, if one can assume the input comes from a suitable “random distribution”,
then often one can develop an algorithm that works well on average.

To begin, we will look at heuristic methods. The amount we can prove about these methods is (as yet) very
limited. However, these techniques have had some success in practice, and there are arguments in favor of why they
are a reasonable thing to try for some problems.

Local Search
“Local search” is meant to represent a large class of similar techniques that can be used to find a good solution
for a problem. The idea is to think of the solution space as being represented by an undirected graph. That is, each
possible solution is a node in the graph. An edge in the graph represents a possible move we can make between
solutions.

For example, consider the Number Partition problem from the homework assignment. Each possible solution,
or division of the set of numbers into two groups, would be a vertex in the graph of all possible solutions. For our
possible moves, we could move between solutions by changing the sign associated with a number. So in this case,
in our graph of all possible solutions, we have an edge between any two possible solutions that differ in only one sign.
Of course this graph of all possible solutions is huge; there are 2^n possible solutions when there are n numbers in
the original problem! We could never hope to even write this graph down. The idea of local search is that we never
actually try to write the whole graph down; we just move from one possible solution to a “nearby” possible solution,
either for as long as we like, or until we happen to find an optimal solution.

To set up a local search algorithm, we need to have the following:

1. A set of possible solutions, which will be the vertices in our local search graph.

2. A notion of what the neighbors of each vertex in the graph are. For each vertex x, we will call the set of
adjacent vertices N(x). The neighbors must satisfy several properties: N(x) must be easy to compute from x
(since if we try to move from x we will need to compute the neighbors), if y ∈ N(x) then x ∈ N(y) (so it makes
sense to represent neighbors as undirected edges), and N(x) cannot be too big, or more than polynomial in the
input size (so that the neighbors of a node are easy to search through).

3. A cost function, from possible solutions to the real numbers.

The most basic local search algorithm (say to minimize the cost function) is easily described:

1. Pick a starting point x.

2. While there is a neighbor y of x with f(y) < f(x), move to it; that is, set x to y and continue.

3. Return the final solution.
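For the Number Partition neighborhood described earlier, the basic algorithm looks like this (a sketch; it checks a random neighbor each step rather than scanning all of them, and the iteration budget is an arbitrary choice):

import random

def local_search(nums, iters=100000):
    signs = [random.choice([-1, 1]) for _ in nums]    # a random starting point
    total = sum(s * x for s, x in zip(signs, nums))
    for _ in range(iters):
        i = random.randrange(len(nums))               # a random neighbor
        flipped = total - 2 * signs[i] * nums[i]      # effect of flipping sign i
        if abs(flipped) < abs(total):                 # lower residue: move there
            signs[i], total = -signs[i], flipped
    return signs, abs(total)

nums = [random.randrange(1, 10**6) for _ in range(50)]
print(local_search(nums)[1])                          # the residue reached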



The Reasoning Behind Local Search

The idea behind local search is clear; if we keep getting better and better solutions, we should end up with a good
one. Pictorially, if we “project” the state space down to a two dimensional curve, we are hoping that the picture has
a sink, or global optimum, and that we will quickly move toward it. See Figure 19.1.

Figure 19.1: A very nice state space — the cost f(x) slopes down to a single sink, the global optimum x∗.

There are two possible problems with this line of thinking. First, even if the space does look this way, we
might not move quickly enough toward the right solution. For example, for the number partition problem from the
homework, it might be that each move improves our solution, but only by improving the residue by 1 each time. If
we start with a bad solution, it will take a lot of moves to reach the minimum. Generally, however, this is not much
of a problem, as long as the cost function is reasonably simple.
The more important problem is that the solution space might not look this way at all. For example, our cost
function might not change smoothly when we move from a state to its neighbor. Also, it may be that there are several
local optima, in which case our local search algorithm will home in on a local optimum and get stuck. See Figure 19.2.

Figure 19.2: A state space with many local optima; it will be hard to find the best solution.

This second problem, that the solution space might not “look nice”, is crucial, and it underscores the importance
of setting up the problem. When we choose the possible moves between solutions – that is, when we construct the
mapping that gives us the neighborhood of each node – we are setting up how local search will behave, including how
the cost function will change between neighbors, and how many local optima there are. How well local search will
work depends tremendously on how smart one is in setting up the right neighborhoods, so that the solution space
really does look the way we would like it to.

Examples of Neighborhoods

We have already seen an example of a neighborhood for the homework problem. Here are possible neighbor-
hoods for other problems:

• MAX3SAT: A possible neighborhood structure is that two truth assignments are neighbors if they differ in only
one variable. A more extensive neighborhood could make two truth assignments neighbors if they differ in
at most two variables; this trades increased flexibility for increased size in the neighborhood.

• Travelling Salesperson: The k-opt neighborhood of x is given by all tours that differ in at most k edges from
x. In practice, using the 3-opt neighborhood seems to perform better than the 2-opt neighborhood, and using
4-opt or larger increases the neighborhood size to a point where it is inefficient.

Lots of Room to Experiment

There are several aspects of local search algorithms that we can vary, and all can have an impact on performance.
For example:


1. What are the nieghborhoods N x  ?

2. How do we choose an inital starting point?

3. How do we choose a neighbor y to move to? (Do we take the first one we find, a random neighbor that
improves f , the neighbor that improves f the most, or do we use other criteria?)

4. What if there are ties?

There are other practical considerations to keep in mind. Can we re-run the algorithm several times? Can we try
several of the algorithms on different machines? Issues like these can have a big impact on actual performance.
However, perhaps the most important issue is to think of the right neighborhood structure to begin with; if this is
right, then other issues are generally secondary, and if this is wrong, you are likely to fail no matter what you do.

Local Search Variations


There are many variations on the local search technique (below, assume the goal is to minimize the cost func-
tion):

• Hill-climbing – this is the name for the basic variation, where one moves to a vertex of lower (or possibly
equal) cost.

• Metropolis rule – pick a random neighbor, and if the cost is lower, move there. If the cost is higher, move there
with some probability (that is usually set to depend on the cost differential). The idea is that possibly moving
to a worse state helps avoid getting trapped at local minima. (A small sketch of this rule appears after this list.)

• Simulated annealing – this method is similar to the Metropolis rule, except that the probability of going to a
higher cost neighbor varies with time. This is analogous to a physical system (such as a chemical polymer)
being cooled down over time.

• Tabu search – this adds some memory to hill climbing. Like with the Metropolis rule and simulated annealing,
you can go to worse solutions. A penalty function is added to the cost function to try to prevent cycling and
promote searching new areas of the search space.

• Parallel search (“go with the winners”) – do multiple searches in parallel, occasionally killing off searches that
appear less successful and replacing them with copies of searches that appear to be doing better.

• Genetic algorithms – this trendy area is actually quite related to local search. An important difference is
that instead of keeping one solution at a time, a group of them (called a population) is kept, and the
population changes at each step.
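As promised above, here is a sketch of the Metropolis rule on the Number Partition neighborhood (T is a made-up parameter; letting T decrease over time turns this into simulated annealing):

import math, random

def metropolis(nums, T=1000.0, iters=100000):
    signs = [random.choice([-1, 1]) for _ in nums]
    total = sum(s * x for s, x in zip(signs, nums))
    best = abs(total)
    for _ in range(iters):
        i = random.randrange(len(nums))
        flipped = total - 2 * signs[i] * nums[i]
        delta = abs(flipped) - abs(total)
        # always accept improvements; accept a worse move with prob e^(-delta/T)
        if delta <= 0 or random.random() < math.exp(-delta / T):
            signs[i], total = -signs[i], flipped
            best = min(best, abs(total))
    return best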

It is still quite unclear what exactly each of these techniques adds to the pot. For example, some people
swear that genetic algorithms lead to better solutions more quickly than other methods, while others claim that by
choosing the right neighborhood function one can do as well with hill climbing. In the years to come, hopefully more
will become understood about all of these methods.

If you’re interested, you might try looking for genetic algorithms and simulated annealing in Yahoo. They’re
both there.
CS124 Lecture 20

Heuristics can be useful in practice, but sometimes we would like to have guarantees. Approximation algorithms
give guarantees. It is worth keeping in mind that approximation algorithms do not always perform as well
as heuristic-based algorithms in practice. On the other hand, they can provide insight into the problem, and so help
determine good heuristics.

Often when we talk about an approximation algorithm, we give an approximation ratio. The approximation
ratio gives the ratio between the value of our solution and the value of the optimal solution. The goal is to obtain an
approximation ratio as close to 1 as possible. If the problem involves a minimization, the approximation ratio will
be greater than 1; if it involves a maximization, the approximation ratio will be less than 1.


Vertex Cover Approximations


In the Vertex Cover problem, we wish to find a set of vertices of minimal size such that every edge is adjacent
to some vertex in the cover. That is, given an undirected graph G = (V, E), we wish to find U ⊆ V such that every
edge e ∈ E has an endpoint in U. We have seen that Vertex Cover is NP-complete.

A natural greedy algorithm for Vertex Cover is to repeatedly choose a vertex with the highest degree, and put it
into the cover. When we put the vertex in the cover, we remove the vertex and all its adjacent edges from the graph,
and continue. Unfortunately, in this case the greedy algorithm gives us a rather poor approximation, as can be seen
with the following example:

Figure 20.1: A bad greedy example, contrasting the vertices chosen by greedy with the vertices in the minimum cover.

In the example, all edges are connected to the base level; there are m/2 vertices at the next level, m/3 vertices
at the next level, and so on. Each vertex at the base level is connected to one vertex at each other level, and the
connections are spread as evenly as possible at each level. A greedy algorithm could always choose a rightmost
vertex, whereas the optimal cover consists of the leftmost vertices. This example shows that, in general, the greedy
approach could be off by a factor of Ω(log n), where n is the number of vertices.
A better algorithm for vertex cover is the following: repeatedly choose an edge, and throw both of its endpoints
into the cover. Throw these vertices and their adjacent edges out of the graph, and continue.

It is easy to show that this second algorithm uses at most twice as many vertices as the optimal vertex cover.
This is because each edge that gets chosen during the course of the algorithm must have one of its endpoints in the
cover; hence we have merely always thrown two vertices in where we might have gotten away with throwing in one.

Somewhat surprisingly, this simple algorithm is still the best known approximation algorithm for the vertex
cover problem. That is, no algorithm has been proven to do better than within a factor of 2.
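The 2-approximation is a few lines of code; a sketch:

def vertex_cover(edges):
    # greedy pairing: each chosen edge forces at least one endpoint into any
    # cover, so this uses at most twice the optimal number of vertices
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update([u, v])
    return cover

edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
print(vertex_cover(edges))        # {1, 2, 3, 4}; an optimal cover is {1, 4}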

Maximum Cut Approximation


We will provide both a randomized and a deterministic approximation algorithm for the MAX CUT problem.
The MAX CUT problem is to divide the vertices in a graph into two disjoint sets so that the number of edges
between vertices in different sets is maximized. This problem is NP-hard. Notice that the MIN CUT problem can
be solved in polynomial time by repeatedly using the min-cut max-flow algorithm. (Exercise: Prove this!)

The randomized version of the algorithm is as follows: we divide the vertices into two sets, HEADS and TAILS.
We decide where each vertex goes by flipping a (fair) coin.

What is the probability that an edge crosses between the two sides of the cut? This happens exactly when its two endpoints
lie on different sides, which occurs 1/2 of the time. (There are 4 possibilities for the two endpoints – HH, HT, TT, TH
– and two of these put the vertices on different sides.) So, on average, we expect 1/2 of the edges in the graph to cross
the cut. Since the most we could hope for is that all the edges cross the cut, this random assignment will, on average,
be within a factor of 2 of optimal.
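As code, the algorithm is nothing more than one coin flip per vertex (a sketch; the vertex- and edge-container formats are our choice):

    import random

    def random_cut(vertices, edges):
        # Each vertex lands on the HEADS side with probability 1/2.
        heads = {v for v in vertices if random.random() < 0.5}
        # The expected value of this count is |E|/2.
        crossing = sum(1 for u, v in edges if (u in heads) != (v in heads))
        return heads, crossing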

We now examine a deterministic algorithm with the same “approximation ratio”. (In fact, the two algorithms
are intrinsically related, but this is not so easy to see!) The algorithm implements the hill climbing
heuristic. We will split the vertices into sets S1 and S2. Start with all vertices on one side of the cut. Now, if you can
switch a vertex to the other side so that doing so increases the number of edges across the cut, do so. Repeat this action
until the cut can no longer be improved by this simple switch.
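A sketch of the hill climbing algorithm in Python (the adjacency-list construction is an implementation choice of this sketch):

    def local_search_cut(vertices, edges):
        side = {v: 0 for v in vertices}          # start with everyone on one side
        neighbors = {v: [] for v in vertices}
        for u, v in edges:
            neighbors[u].append(v)
            neighbors[v].append(u)
        improved = True
        while improved:                          # at most |E| improving switches
            improved = False
            for v in vertices:
                same = sum(1 for w in neighbors[v] if side[w] == side[v])
                # Switching v gains (same - other) crossing edges.
                if same > len(neighbors[v]) - same:
                    side[v] = 1 - side[v]
                    improved = True
        return side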

We switch vertices at most |E| times (since each time, the number of edges across the cut increases). Moreover,
when the process finishes we are within a factor of 2 of the optimal, as we shall now show. In fact, when the process
finishes, at least |E|/2 edges lie in the cut.

We can count the edges in the cut in the following way: consider any vertex v ∈ S1. For every vertex w in S2
that it is connected to by an edge, we add 1/2 to a running sum. We do the same for each vertex in S2. Note that
each edge crossing the cut contributes 1 to the sum: 1/2 for each endpoint of the edge.

Hence the cut C satisfies


\[
C = \frac{1}{2}\left( \sum_{v\in S_1} |\{w : (v,w) \in E,\ w \in S_2\}| + \sum_{v\in S_2} |\{w : (v,w) \in E,\ w \in S_1\}| \right).
\]

Since we are using the local search algorithm, at least half the edges adjacent to any vertex v must lie in the set opposite
from v; otherwise, we could switch which side vertex v is on, and improve the cut! Hence, if vertex v has degree δ(v),
then
\[
C = \frac{1}{2}\left( \sum_{v\in S_1} |\{w : (v,w) \in E,\ w \in S_2\}| + \sum_{v\in S_2} |\{w : (v,w) \in E,\ w \in S_1\}| \right)
\;\geq\; \frac{1}{2}\left( \sum_{v\in S_1} \frac{\delta(v)}{2} + \sum_{v\in S_2} \frac{\delta(v)}{2} \right)
\;=\; \frac{1}{4} \sum_{v\in V} \delta(v)
\;=\; \frac{1}{2}|E|,
\]

where the last equality follows from the fact that if we sum the degree of all vertices, we obtain twice the number of
edges, since we have counted each edge twice.

In practice, we might expect that the hill climbing algorithm would do better than just getting a cut within a factor
of 2.

Euclidean Travelling Salesperson Problem


In the Euclidean Travelling Salesperson Problem, we are given n points (cities) in the x-y plane, and we seek
the tour (cycle) of minimum length that travels through all the cities. This problem is NP-complete (showing this is
somewhat difficult).

Our approximation algorithm involves the following steps:

1. Find a minimum spanning tree T for the points.

2. Create a pseudo tour by walking around the tree. The pseudo tour may visit some vertices twice.

3. Remove repeats from the tour by short-cutting through the repeated vertices. (See Figure 20.2.)

[Figure: three panels showing the minimum spanning tree, the constructed pseudo tour, and the constructed tour.]

Figure 20.2: Building an approximate tour. Start at X, move in the direction shown, short-cutting repeated vertices.
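A compact sketch of the three steps, assuming the input is a list of (x, y) points and using Prim's algorithm for the minimum spanning tree (both choices are ours; any MST algorithm works):

    from math import dist  # Python 3.8+

    def approx_tsp_tour(points):
        n = len(points)
        # Prim's algorithm; children[] records the tree structure.
        in_tree = [False] * n
        best = [float('inf')] * n
        parent = [-1] * n
        best[0] = 0.0
        children = [[] for _ in range(n)]
        for _ in range(n):
            u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
            in_tree[u] = True
            if parent[u] >= 0:
                children[parent[u]].append(u)
            for v in range(n):
                if not in_tree[v] and dist(points[u], points[v]) < best[v]:
                    best[v] = dist(points[u], points[v])
                    parent[v] = u
        # A preorder walk of the tree is exactly the pseudo tour with the
        # repeated vertices short-cut away.
        tour, stack = [], [0]
        while stack:
            u = stack.pop()
            tour.append(u)
            stack.extend(reversed(children[u]))
        return tour  # returning from tour[-1] to tour[0] closes the cycle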

We now show the following inequalities:

length of tour ≤ length of pseudo tour ≤ 2(size of T) ≤ 2(length of optimal tour)

Short-cutting edges can only decrease the length of the tour, so the tour given by the algorithm is at most the
length of the pseudo tour. The length of our pseudo tour is at most twice the size (total edge length) of the spanning
tree, since the pseudo tour consists of walking through each edge of the tree at most twice. Finally, the length of the
optimal tour is at least the size of the minimum spanning tree, since any tour contains a spanning tree (plus an edge!).

Using a similar idea, one can come up with an approximation algorithm that returns a tour that is within a factor
of 3/2 of the optimal. Also, note that this algorithm will work in any setting where short-cutting is effective. More
specifically, it will work for any instance of the travelling salesperson problem that satisfies the triangle inequality
for distances: that is, if d(x, y) represents the distance between vertices x and y, then d(x, z) ≤ d(x, y) + d(y, z) for all
x, y, and z.

MAX-SAT: Applying Randomness


Consider the MAX-SAT problem. What happens if we do the simplest random thing we can think of: deciding
whether each variable should be TRUE or FALSE by flipping a coin?

Theorem 20.1 On average, at least half the clauses will be satisfied if we just flip a coin to decide the value of each
variable. Moreover, if each clause has k literals, then on average a 1 − 2^{−k} fraction of the clauses will be satisfied.

The proof is simple. Look at each clause. If it has k literals in it, then each literal independently fails to make the
clause TRUE with probability 1/2. So the probability the clause is not satisfied is 2^{−k}, where k is the number of
literals in the clause, and hence the probability it is satisfied is 1 − 2^{−k}, which is at least 1/2 whenever k ≥ 1.
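A sketch in Python, using a DIMACS-style clause encoding (literal i stands for xi, and -i for its negation); this encoding is an assumption of the sketch:

    import random

    def coin_flip_maxsat(num_vars, clauses):
        assignment = {i: random.random() < 0.5 for i in range(1, num_vars + 1)}
        satisfied = sum(1 for clause in clauses
                        if any(assignment[abs(lit)] == (lit > 0) for lit in clause))
        return assignment, satisfied   # E[satisfied] is at least m/2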

Linear Programming Relaxation


The next approach we describe, linear programming relaxation, can often be used as a good heuristic, and
in some cases it leads to approximation algorithms with provable guarantees. Again, we will use the MAX-SAT
problem as an example of how to use this technique.

The idea is simple. Most NP-complete problems can be easily described by a natural Integer Programming
problem. (Of course, all NP-complete problems can be transformed into some Integer Programming problem, since
Integer Programming is NP-complete; but what we mean here is that in many cases the transformation is quite natural.)
Even though we cannot solve the related Integer Program, if we pretend it is a linear program, then we can solve it,
using (for example) the simplex method. This idea is known as relaxation, since we are relaxing the constraints on
the solution; we are no longer requiring that we get a solution where the variables take on integer values.

If we are extremely lucky, we might find a solution of the linear program where all the variables are integers,
in which case we will have solved our original problem. Usually, we will not. In this case we will have to try to
somehow take the linear programming solution, and modify it into a solution where all the variables take on integer
values. Randomized rounding is one technique for doing this.

MAX-SAT

We may formulate MAX-SAT as an integer programming problem in a straightforward way (in fact, we have
seen a similar reduction before, back when we examined reductions; it is repeated here). Suppose the formula
contains variables x1, x2, . . . , xn which must be set to TRUE or FALSE, and clauses C1, C2, . . . , Cm. For each variable
xi we associate a variable yi which should be 1 if the variable is TRUE, and 0 if it is FALSE. For each clause Cj we
have a variable zj which should be 1 if the clause is satisfied and 0 otherwise.

We wish to maximize the number of satisfied clauses s, or


\[
\sum_{j=1}^{m} z_j.
\]

The constraints include that 0 ≤ yi, zj ≤ 1; since this is an integer program, this forces all these variables
to be either 0 or 1. Finally, we need a constraint for each clause saying that its associated variable zj can be 1 only
if the clause is actually satisfied. If the clause Cj is (x2 ∨ x̄4 ∨ x6 ∨ x̄8), for example, then we need the restriction:

y2 + y6 + (1 − y4) + (1 − y8) ≥ zj.

This forces zj to be 0 unless the clause is actually satisfied. In general, we replace xi by yi, x̄i by 1 − yi, ∨ by +, and set
the whole thing ≥ zj to get the appropriate constraint.

When we solve the linear program, we will get a solution that might have y1 = 0.7 and z1 = 0.6, for instance.
This initially appears to make no sense, since a variable cannot be 0.7 TRUE. But we can still use these values in a
reasonable way. If y1 = 0.7, it suggests that we would prefer to set the variable x1 to TRUE (1). In fact, we could
try just rounding each variable up or down to 0 or 1, and use that as a solution! This would be one way to turn
the non-integer solution into an integer solution. Unfortunately, there are problems with this method. For example,
suppose we have the clause C1 = (x1 ∨ x2 ∨ x3), and y1 = y2 = y3 = 0.4. Then by simple rounding, this clause will not
be TRUE, even though it “seems satisfied” to our linear program (that is, z1 = 1 is feasible, since y1 + y2 + y3 = 1.2 ≥ 1).
If we have a lot of these clauses, regular rounding might perform very poorly.

It turns out that there is an interpretation of 0.7 that suggests a better way than simple rounding. We think of the
0.7 as a probability. That is, we interpret y1 = 0.7 as meaning that x1 would like to be TRUE with probability 0.7.
So we take each variable xi, and independently we set it to 1 with the probability given by yi (and with probability
1 − yi we set xi to 0). This process is known as randomized rounding. One reason randomized rounding is useful is
that it allows us to prove that the expected number of clauses we satisfy using this rounding is within a constant factor
of the true optimum.
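The rounding step itself is essentially one line of code (a sketch; y holds the fractional LP values for the variables):

    import random

    def randomized_rounding(y):
        # Set each variable TRUE independently with probability y_i.
        return [random.random() < yi for yi in y]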

First, note that whatever the maximum number of clauses s we can satisfy is, the value found by the linear
program, ∑_{j=1}^m zj, is at least as big as s. This is because the linear program could achieve a value of at least s
simply by using as the values for the yi the truth assignment that makes satisfying s clauses possible.

Now consider a clause with k variables; for convenience, suppose the clause is just C1 = (x1 ∨ x2 ∨ · · · ∨ xk).
Suppose that when we solve the linear program, we find z1 = β. Then we claim that the probability that this clause
is satisfied after the rounding is at least (1 − 1/e)β. This can be checked (using a bit of sophisticated math); it
follows by noting that the worst possibility is y1 = y2 = · · · = yk = β/k. In this case, each xi
is FALSE with probability 1 − β/k, and so C1 ends up being unsatisfied with probability (1 − β/k)^k. Hence the
probability it is satisfied is at least 1 − (1 − β/k)^k ≥ (1 − 1/e)β.
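For the curious, here is one standard way to verify the last bound (a sketch, not spelled out in these notes):

\[
1 - \left(1 - \frac{\beta}{k}\right)^{k} \;\ge\; 1 - e^{-\beta} \;\ge\; \left(1 - \frac{1}{e}\right)\beta \qquad \text{for } 0 \le \beta \le 1.
\]

The first inequality uses 1 − t ≤ e^{−t}; the second holds because 1 − e^{−β} is concave in β and agrees with (1 − 1/e)β at β = 0 and β = 1.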

Hence clause j is satisfied with probability at least (1 − 1/e)zj, so the expected number of satisfied clauses
after randomized rounding is at least (1 − 1/e) ∑_{j=1}^m zj. This is within a factor of (1 − 1/e) of our upper bound on the
maximum number of satisfiable clauses, ∑_{j=1}^m zj. Hence we expect to get within a constant factor of the maximum.

Combining the Two


Surprisingly, by combining the simple coin flipping algorithm with the randomized rounding algorithm, we can
get an even better algorithm. The idea is that the coin flipping algorithm does best on long clauses, since each additional
literal in the clause makes it more likely the clause gets set to TRUE. On the other hand, randomized rounding does best
on short clauses; its guarantee 1 − (1 − β/k)^k decreases with k. It turns out that if we try
both algorithms, and take the better result, on average we will satisfy 3/4 of the clauses.
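Here, briefly, is one standard way to see where the 3/4 comes from (a sketch): for a clause of length k with LP value zj, coin flipping satisfies it with probability 1 − 2^{−k} ≥ (1 − 2^{−k})zj, while randomized rounding satisfies it with probability at least (1 − (1 − 1/k)^k)zj. Averaging the two bounds gives

\[
\frac{1}{2}\left[ \left(1 - 2^{-k}\right) + \left(1 - \left(1 - \frac{1}{k}\right)^{k}\right) \right] z_j \;\ge\; \frac{3}{4}\, z_j ,
\]

which holds with equality at k = 1 and k = 2 and strictly for larger k; summing over clauses shows that the better of the two solutions satisfies, in expectation, at least 3/4 of the LP bound ∑_{j=1}^m zj.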

We also point out that there are even more sophisticated approximation algorithms for MAX-SAT, with better
approximation ratios. The algorithms above, however, already illustrate some very interesting and useful general techniques.
CS 124 Lecture 21

We now consider a natural problem that arises in many applications, particularly in conjunction with suffix
trees, which we will study later. Suppose we have a rooted tree T with n nodes. We would like to be able to answer
questions of the following form: what is the least common ancestor of nodes u and v; that is, what is the common
ancestor of u and v that is farthest from the root?

In this setting, we will not be answering a single question, but many questions on the same fixed tree T. If
we are given the tree T in advance, we can design an appropriate data structure for answering future queries. Our
algorithm will therefore be measured on several criteria. Of course one important criterion is the query time, or the
time to answer a specific query. However, a second consideration is how much preprocessing time, or time to set up
the data structure, is required. A third related aspect to study is the memory required to store
the data structure.

For example, a trivial algorithm for the problem is to consider each pair of vertices, and compute their least
common ancestor by following both paths toward the root until the first shared vertex is found. Then all the
answers can be stored in a table. There are Θ(n²) pairs of vertices, so our table will require Θ(n²) space. Queries can
be answered by a table lookup, which takes constant time. Preprocessing, however, can require Θ(n³) time.

The problem of designing an appropriate data structure for this is called the Least Common Ancestor (LCA)
Problem. We will show that there is an algorithm for LCA that requires only linear preprocessing time and memory,
but still answers any query in constant time! This result is as efficient as we could hope for.

We will reduce the LCA problem to a seemingly different but in fact quite related problem, called the Range
Minimum Query (RMQ) Problem. The RMQ problem applies to an array A of n numbers. We would like
to be able to answer questions of the following form: given two indices i and j, what is the index of the smallest
element in the subarray A[i . . . j]? Again, we may preprocess the array A to derive some alternative data structure to
answer the questions quickly. There is a trivial solution for the RMQ problem completely analogous to the one above
for the LCA problem.


21.1 Reduction: From LCA to RMQ

How do we convert an LCA problem to an RMQ problem? Note that we must do the conversion in linear time if we
are going to complete the total preprocessing for the LCA problem in linear time.

Linear time suggests that we want to do a tree traversal. In fact, the observation we will use is that the LCA of
nodes u and v is just the shallowest node encountered between visiting u and v during a depth first search of the tree
starting at the root. So let us do a DFS on the tree, and we can record in an array V the nodes we visit. An example
is shown in Figure 1. Notice each node can appear multiple times, but the total length of the array is 2n − 1, where n
is the number of nodes in the tree. Each of the n − 1 edges yields two values in the array, one when we go down the
edge and one when we go up the edge. The first value is the root. Also, from now on we will refer to each node by
its number on the DFS search.

We will also require two further arrays. The Level Array L is derived from V: L[i] is the distance from the root
of V[i]. Adjacent elements in L can only differ by +1 or −1, since adjacent steps in the DFS are connected by an
edge. Finally, R is the representative array: R[i] contains the first index of V that contains the value i. (Actually,
any occurrence of i could be stored in R[i], but we might as well choose a specific one.)

Clearly, to compute LCA(u, v) it suffices to compute RMQ(R[u], R[v]) over the array L. This gives us the index
of the shallowest node between u and v, and the array V can be used to determine the actual node from the index.

[Figure: the example tree, with nodes numbered in DFS order; the root 0 has children 1, 4, and 5; node 1 has children 2 and 3; node 5 has children 6, 7, and 9; node 7 has child 8.]

V: 0 1 2 1 3 1 0 4 0 5 6 5 7 8 7 5 9 5 0
L: 0 1 2 1 2 1 0 1 0 1 2 1 2 3 2 1 2 1 0
R: 0 1 2 4 7 9 10 12 13 16

Figure 1: Changing an LCA problem into an RMQ problem.
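A sketch of the conversion in Python, assuming the tree is given as a dictionary mapping each node to its list of children (our representation, chosen for brevity):

    def euler_tour(tree, root):
        V, L, R = [], [], {}
        def dfs(u, depth):
            if u not in R:
                R[u] = len(V)          # first index of V containing u
            V.append(u)
            L.append(depth)
            for c in tree.get(u, []):
                dfs(c, depth + 1)
                V.append(u)            # record u again after each child
                L.append(depth)
        dfs(root, 0)
        return V, L, R

Running this on the tree of Figure 1, as tree = {0: [1, 4, 5], 1: [2, 3], 5: [6, 7, 9], 7: [8]}, reproduces the arrays V, L, and R shown there; LCA(u, v) is then V[RMQ(R[u], R[v])], where the RMQ is taken over L.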

21.2 Solutions for RMQ

We first note that we can do better than the naive Θ(n³) preprocessing time for RMQ on an array A by doing a trivial
dynamic programming, using the recurrence

\[
\mathrm{RMQ}(i, j) = A^{-1}\big[\min\big(A[\mathrm{RMQ}(i, j - 1)],\, A[j]\big)\big].
\]

Here we are using convenient notation. Clearly min(A[RMQ(i, j − 1)], A[j]) gives the value A[k], where A[k] is the
smallest value in the subarray A[i . . . j]. However, we want the index of this value, not the value itself. We use the
notation A^{−1} to represent that we want the index of this value; note that if multiple indices have this value, we
do not particularly care which index we obtain. Each table entry can be calculated in constant time by building the
table in order of ranges [i, j] of increasing size, leading to preprocessing time Θ(n²).
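A sketch of the quadratic dynamic program:

    def preprocess_rmq_quadratic(A):
        # table[i][j] = an index of the minimum of A[i..j], built in order
        # of increasing range size; Theta(n^2) time and space.
        n = len(A)
        table = [[0] * n for _ in range(n)]
        for i in range(n):
            table[i][i] = i
            for j in range(i + 1, n):
                prev = table[i][j - 1]
                table[i][j] = j if A[j] < A[prev] else prev
        return table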

In fact, we can reduce our table size and memory using a different dynamic program, at the cost of a few
additional operations per query. Let us create a table M(i, j) such that M(i, j) = A^{−1}[min_{k∈[i, i+2^j)} A[k]]. That is,
M(i, j) contains the location of the minimum value over the 2^j positions starting from i. This table has size O(n log n),
and it can easily be filled in O(n log n) steps by using dynamic programming, based on the fact that M(i, j) can be
determined from M(i, j − 1) and M(i + 2^{j−1}, j − 1).


How do we use the M(i, j) to compute RMQ(i, j) if the range length j − i + 1 is not a power of 2? We may use two overlapping
intervals that cover the range [i, j] as follows. Let k = ⌊log(j − i + 1)⌋, so that 2^k is the largest power of 2 such that
i + 2^k ≤ j + 1. Then RMQ(i, j) = A^{−1}[min(A[M(i, k)], A[M(j − 2^k + 1, k)])], and this can be computed in constant time
from the table M. (Note the second interval starts at j − 2^k + 1, so that it ends exactly at position j.)
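A sketch of this scheme (the bit_length trick for the floor of the logarithm is a convenience of this sketch):

    def preprocess_sparse(A):
        # M[j][i] = an index of the minimum of A[i .. i + 2^j - 1].
        n = len(A)
        M = [list(range(n))]                 # ranges of length 2^0 = 1
        j = 1
        while (1 << j) <= n:
            prev, row = M[j - 1], []
            for i in range(n - (1 << j) + 1):
                left, right = prev[i], prev[i + (1 << (j - 1))]
                row.append(left if A[left] <= A[right] else right)
            M.append(row)
            j += 1
        return M

    def rmq(A, M, i, j):
        # Index of the minimum of A[i..j] via two overlapping intervals.
        k = (j - i + 1).bit_length() - 1     # floor(log2(j - i + 1))
        left, right = M[k][i], M[k][j - (1 << k) + 1]
        return left if A[left] <= A[right] else right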

We have shown that we can achieve preprocessing time and memory size Θ(n log n) while maintaining con-
stant query time. Interestingly, this method can be enhanced so as to require preprocessing time and memory size
Θ(n log log n) through a recursive construction. (This will be an exercise.) In practice, such a result would probably
be good enough – log log n is quite small for reasonable values of n. By continuing the recursive construction for
further levels, we could even achieve Θ(n log log log n) preprocessing time and memory size, and so on for any fixed
number of logs, while maintaining constant query time. However, this recursive construction would add significant
complexity to an actual program, and it still would not lead us to a linear preprocessing time solution.

21.3 ±1 RMQ

In order to achieve linear preprocessing, we will use an additional fact about the RMQ problem we obtain from
the reduction from LCA. Recall that our RMQ problem is on the Level Array obtained from the LCA problem.
The Level Array has one additional property that we are not yet taking advantage of: each entry differs from the
previous entry by +1 or −1. We can take advantage of this fact to split the RMQ problem into a different set of
small subproblems in such a way that we can avoid some work by doing table look-ups.

The split works as follows: partition A into blocks of size (log n)/2. Let X[1, . . . , 2n/ log n] and Y [1, . . . , 2n/ log n] be
arrays such that X[i] stores the minimum element in the ith block of A, and Y [i] stores the position in the ith block
where the element X[i] occurs. Now to answer an RMQ query for indices i and j with i < j on the array A, we can
do the following:

1. If i and j are in the same block, we can perform an RMQ on this block. Notice that this requires that each
block be preprocessed.

2. If i and j are in different blocks, we have to compute the following values, and take the minimum of them:

(a) The minimum from position i to the end of i’s block.

(b) The minimum from the beginning of j’s block to position j.

(c) The minimum of all blocks between i’s block and j’s block.

Steps 2a and 2b require that we preprocess for RMQ queries within each block. Step 2c requires that we perform
an RMQ over the array X. Assuming we have done all this preprocessing, the total query time is still constant.
However, if we preprocess each block separately in order to do RMQs, we have not saved on the running time. We
need a faster way to deal with preprocessing the blocks.

How can we possibly avoid preprocessing each block separately? We use the following observation. Consider
two arrays X and X′. Suppose that these two arrays differ by a constant at each position; for example, the arrays
might be 1, 2, 3, 4, 3, 2, . . . and 3, 4, 5, 6, 5, 4, . . .. Then the RMQ answers, which give the index of the minimum
element, will be the same for these two arrays. Hence we can “share” the preprocessing used for these two arrays!

Another way to explain this is that in the ±1 RMQ problem, the initial value of the array does not matter; only
the sequence of +1 and −1 differences is necessary to determine the answer. Now, how many different such sequences
are there? Since there are only (log n)/2 elements in a block, there are only (log n)/2 − 1 values in the sequence of +1


and −1 values. Hence there are only 2^{(log n)/2 − 1} = √n/2 possible sequences. This number is so small that we can afford
to compute and store tables for every possible sequence! Even if we use quadratic preprocessing time and memory
for each block-sized table, these tables take only O(√n log² n) time to preprocess and O(√n log² n) memory. For each block in A, we have to
determine which table to use; this can easily be done in linear time.
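A sketch of the table sharing, reusing preprocess_rmq_quadratic from the earlier sketch: blocks are keyed by their sequence of consecutive differences, so all blocks with the same +1/−1 pattern share a single in-block table.

    def shared_block_tables(A, block_size):
        tables, block_keys = {}, []
        for start in range(0, len(A), block_size):
            block = A[start:start + block_size]
            sig = tuple(block[i + 1] - block[i] for i in range(len(block) - 1))
            if sig not in tables:            # first block with this pattern
                tables[sig] = preprocess_rmq_quadratic(block)
            block_keys.append(sig)
        return tables, block_keys

The sharing is valid because arrays that differ by a constant at each position have identical argmin indices, which is exactly the observation above.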

21.4 Back to the standard RMQ

We have shown that ±1 RMQ problems can be solved with linear time preprocessing, and therefore we have a linear
time preprocessing solution for LCA. What about the general RMQ problem? It turns out that we can also reduce
the RMQ problem to the LCA problem in linear time. So we can obtain a linear time solution to the general RMQ
problem, by turning it into an LCA problem, and solving that as a ±1 RMQ problem! The details of this reduction
are omitted here.
CS 124 Lecture 22 Spring 2000

Suffix trees are an old data structure that has become new again, thanks to a relatively recent linear time algorithm
for constructing suffix trees, due to Ukkonen, that has proven useful for many applications. Here, we will describe
suffix trees and discuss their classical use, pattern matching.

22.1 Definition

A suffix tree T is built for a string S[1 . . . m]. The tree is rooted and directed with m leaves, which are numbered from
1 to m. Each edge is labeled with a nonempty substring of S. The internal nodes of the tree (other than the root)
all have at least two outgoing edges, and the labels of the outgoing edges of any node begin with different characters. By
following the path from the root to leaf i and concatenating the edge labels, one obtains the suffix S[i . . . m].

An example of a suffix tree for the string xyzxzxy$ is given in Figure 22.1. The figure helps understand some
important points about the suffix tree. First, each internal node has two or more children with different starting
characters along the edges, since otherwise the node could be removed or moved in order to make this the case.
Also, it is important that the last character of the string be a “unique” character, as this guarantees that the suffix tree
as defined actually exists. For example, suppose our string was just xyzxzxy. The suffix tree would remain largely
the same. In particular, in the not-quite-suffix tree in Figure 22.1 the path for the suffix xy does not end at a leaf,
violating the definition. The problem is that the suffix xy is also the prefix of the string. This problem can be avoided
by terminating the string with a special character that does not appear elsewhere, since then no suffix can also be a
prefix (except for the entire string itself). Hence, from now on, we will assume all strings end with a special character
$.

It is also worth noting that a more convenient representation of the suffix tree does not actually label the edges
with characters. Instead, these labels can be represented by a pair of indices; labeling an edge [i, j] represents that
the edge label corresponds to the characters S[i . . . j]. Besides saving space and ensuring that each edge is conveniently
represented by two numbers, this scheme is important for the linear time algorithm for suffix tree construction.


22.2 Construction algorithm

To see that constructing suffix trees is possible, let us consider a simple O(m²) algorithm. Before beginning, we
emphasize that in this case the O notation is being used to hide a potentially substantial constant that depends on
the size of the alphabet. That is, if our alphabet is Σ, the O notation is hiding some factor dependent on |Σ|.

The goal is to build up the tree one suffix at a time. We think of the intermediate results we get at each stage as
partial trees T1, T2, . . . , Tm. Initially the tree T1 consists of one edge, with label S[1 . . . m]; the end node is labeled
1. To build tree Ti, we modify the tree so that the suffix S[i . . . m] is handled properly. To do this, we start from the root
and follow the path down the tree matching characters from S[i . . . m] as far as possible. This just requires character
comparisons, and the path followed is necessarily unique, since no two edges leading out of a node are labeled with
strings that begin with the same character.

(Note, however, that whenever we reach an intermediate node in the tree, we have to look at all the branches
and decide which one, if any, to follow. Since there is at most one branch for each character, there are at most |Σ|
branches; since |Σ| is a constant, this takes only constant time! In practice, one might want to set up a hash table,
keyed on a node and the first character of an edge, in order to make finding the right branch out of a node
more efficient.)

At some point, no further matches are possible. Note that this cannot happen at a leaf node, because we end our
string with the special character $. Therefore it happens either when our character matching is in the middle
of an edge or when it is at a node. In the first case, we break the edge into two edges by inserting a new node. The edge into
the new node contains the characters that matched along the old edge, and the edge out of the new node
contains the remaining characters from the old edge. With this addition, we can now add the remainder of the suffix
S[i . . . m] by adding another edge from the new node. If instead we are at a node when no further matches are possible,
we can simply add a new edge from that node with the remainder of the suffix S[i . . . m]. In both cases, when
we add the new edge, we label the new leaf with the value i. The time to add each suffix is proportional to the length
of the suffix, leading to an O(m²) algorithm.
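A sketch of the quadratic algorithm in Python. For readability, edge labels here are stored as explicit substrings; Section 22.4 explains why the efficient representation uses index pairs instead.

    class Node:
        def __init__(self):
            self.children = {}      # first character of edge label -> (label, child)
            self.leaf_label = None  # suffix number, set at leaves

    def build_suffix_tree(s):
        # Assumes s ends with a unique terminating character such as '$'.
        root = Node()
        for i in range(len(s)):
            node, rest = root, s[i:]
            while True:
                c = rest[0]
                if c not in node.children:       # no edge starts with c: new leaf
                    leaf = Node()
                    leaf.leaf_label = i + 1      # leaves are numbered from 1
                    node.children[c] = (rest, leaf)
                    break
                label, child = node.children[c]
                j = 0                            # length of the common prefix
                while j < len(label) and j < len(rest) and label[j] == rest[j]:
                    j += 1
                if j == len(label):              # matched the whole edge: descend
                    node, rest = child, rest[j:]
                else:                            # mismatch mid-edge: split the edge
                    mid = Node()
                    node.children[c] = (label[:j], mid)
                    mid.children[label[j]] = (label[j:], child)
                    leaf = Node()
                    leaf.leaf_label = i + 1
                    mid.children[rest[j]] = (rest[j:], leaf)
                    break
        return root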

Although this algorithm is very simple, the quadratic construction time is extremely limiting. Suffix trees are
used, for example, for large pattern matching problems, where the input strings might be DNA strands of thousands
or even millions of characters. Quadratic time will not suffice for these applications.

Fortunately, there are slightly more complex construction algorithms that require only O(m) time. We will not
discuss the algorithm at this point; the details and the subsequent analysis would require a non-trivial amount of
time. A reasonable introduction to the algorithm, however, has been written by Mark Nelson and has appeared in
Dr. Dobb’s Journal. You can currently find it at
http://www.dogma.net/markn/articles/suffixt/suffixt.htm.

22.3 Using suffix trees for pattern matching

Once we have constructed our suffix tree, we can use it to efficiently solve pattern matching problems. There are of
course other methods for pattern matching, but using suffix trees has an interesting advantage. Once the suffix tree
has been constructed, finding all the occurrences of any pattern P[1 . . . n] in the string S takes time O(n + k), where
k is the number of times that the pattern P appears in S. So by incurring a one-time preprocessing charge to
establish the suffix tree, we can handle any pattern matching problem after that in time essentially proportional to
the length of the pattern, independent of the length of the original string! This is quite powerful, particularly for
things like DNA databases, where the underlying database is large and fixed but must be able to deal with lots of
queries.

Suppose that P lies in the string S; for example, suppose P corresponds to S[i . . . i + n − 1]. Then P is a prefix
of the suffix S[i . . . m]. Hence, if we start matching characters in P against the labels in the suffix tree for S, we
will follow part of the path from the root to the leaf labeled i. Hence, to find all occurrences of P in S, start at
the root, and match down the tree as far as possible. This takes time O(n). If P does not match some path in the tree,
then P does not lie in S. If P does match some path in the tree, it matches down to some point z. All the leaves in the
subtree below z correspond to suffixes for which P is a prefix, so the labels on these leaves correspond to locations
that begin an occurrence of P. To find these positions, we just traverse the subtree below z, using for example depth
first search. If there are k leaves, the depth first search takes only O(k) time.
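Continuing the sketch above, the whole search is a walk down the tree followed by a traversal of the subtree below the match point:

    def find_occurrences(root, pattern):
        node, rest = root, pattern
        while rest:                              # match down the tree: O(n)
            if rest[0] not in node.children:
                return []
            label, child = node.children[rest[0]]
            k = min(len(label), len(rest))
            if label[:k] != rest[:k]:
                return []
            node, rest = child, rest[k:]
        positions, stack = [], [node]
        while stack:                             # DFS below the match point: O(k)
            n = stack.pop()
            if n.leaf_label is not None:
                positions.append(n.leaf_label)
            stack.extend(child for _, child in n.children.values())
        return positions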

22.4 Representation

An important point about suffix trees: to make sure everything takes linear time, it is important to use the correct
representation. For example, we do not explicitly label each edge with a group of characters; this could take
Ω(m²) time just to write down! Instead, each edge is labeled with a pair of values representing characters.
For example, an edge labeled [a, b] should be thought of as being labeled by the characters S[a] . . . S[b]. Hence each
edge is just labeled by two numbers, and only linear space is required.

[Figure: two tree diagrams, the suffix tree for xyzxzxy$ (top) and the corresponding tree for xyzxzxy without the terminating $ (bottom); edge labels and leaf numbers are as described in the text.]

Figure 22.1: A true suffix tree (top); why we need the $ character (bottom).

22.5 Generalized suffix trees

You may want to put a set {S1, S2, . . . , Sk} of strings in a suffix tree data structure. (Note: we assume each string ends
with the special character $.) The structure in this case is called a generalized suffix tree. There are two primary
differences. First, each leaf node may now contain multiple pairs of numbers. Each pair of numbers identifies a
string Si and a position in Si at which the suffix spelled by the root-to-leaf path begins. Note that multiple strings can have
suffixes that share a leaf node! Second, each edge label must be represented by three numbers: a number i and a pair
[a, b] represent that the characters on the edge label are Si[a] . . . Si[b].

Constructing a generalized suffix tree can easily be done by extending our quadratic time algorithm. However,
the linear time algorithm for suffix trees can also be used to build a generalized suffix tree. Hence if m = ∑_{i=1}^k |Si|,
constructing the generalized suffix tree can be done in O(m) time.

22.6 Longest common extension

Using generalized suffix trees and the LCA algorithm, we can solve a very general problem called the longest
common extension problem. Given strings S1 and S2, we wish to preprocess the strings so that we can answer
questions of the following form: given a pair (i, j), find the longest substring of S1 that begins at position i and
matches a substring of S2 that begins at position j.

We will use linear time pre-processing and linear space, after which we can answer queries in constant time.

The solution is to build a generalized suffix tree for S1 and S2. When we build this tree, we should also compute
the string depth of each node. The string depth of a node is simply the number of characters along the edges from
the root to that node. Notice the string depth is not the same as the tree depth. Also, after building the tree, we
precompute the information necessary to do LCA queries on the tree.

Given a pair (i, j), we compute the least common ancestor u of the leaf nodes corresponding to the suffix
beginning at i in S1 and the suffix beginning at j in S2. The path from the root to u spells the longest common extension, and
hence the string depth of this node is all we need.

22.7 Maximal palindromes

A palindrome is a string that reads the same forwards as backwards, such as axbccbxa.

A substring U of a string S is a maximal palindrome if and only if it is a palindrome and extending it one
character in both directions yields a string that is not a palindrome. Generally we separate even-length maximal
palindromes, or even palindromes for short, and odd-length maximal palindromes (odd palindromes) for conve-
nience. For example, in S = axbccbbbaa, the maximal even palindromes are bccb, bb, and aa. The string bbb is
a maximal odd palindrome, and we will skip writing the maximal odd palindromes of length 1. Note that every
palindrome is contained in a maximal palindrome.

Here is a simple way to find all even-length maximal palindromes in linear time. (Finding odd-length maximal
palindromes is similar.)

Consider S and S^r, the reversal of S, and let n be the length of S. There is a palindrome of length 2k with its middle just after position q if
the string of length k starting from position q + 1 of S matches the string of length k starting from position n − q + 1
of S^r. In particular, this palindrome will be maximal if k is the length of the longest match from these positions.

Thus, solving the even-length maximal palindrome problem corresponds to computing the longest common
extension of (q + 1, n − q + 1) for all possible q. The data structure can be preprocessed in linear time, and each of the
linear number of queries can be answered in constant time, so the total time is linear.
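For illustration, here is the reduction in Python with a naive longest common extension; the suffix tree and LCA machinery described above replaces lce with a constant time query after linear preprocessing:

    def even_palindromes(s):
        n, r = len(s), s[::-1]
        def lce(i, j):                  # naive: compare character by character
            k = 0
            while i + k < n and j + k < n and s[i + k] == r[j + k]:
                k += 1
            return k
        out = []
        for q in range(1, n):           # center between s[q-1] and s[q] (0-indexed)
            k = lce(q, n - q)
            if k > 0:
                out.append(s[q - k:q + k])
        return out

On S = axbccbbbaa this returns the maximal even palindromes bccb, bb, bb, and aa (the two copies of bb come from the two centers inside bbb).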
