
Root-finding algorithm

A root-finding algorithm is a numerical method, or algorithm, for finding a value x such that f(x) = 0, for a given function f. Such an x is called a root of the function f. This article is concerned with finding scalar, real or complex roots, approximated as floating-point numbers. Finding integer roots or exact algebraic roots are separate problems, whose algorithms have little in common with those discussed here (see Diophantine equation for integer roots).

Finding a root of f(x) − g(x) = 0 is the same as solving the equation f(x) = g(x). Here, x is called the unknown in the equation. Conversely, any equation can be put into the canonical form f(x) = 0, so equation solving is the same thing as computing (or finding) a root of a function.

Numerical root-finding methods use iteration, producing a sequence of numbers that hopefully converges towards a limit (the so-called "fixed point"), which is a root. The first values of this sequence are initial guesses; the method computes subsequent values from the previous ones and the function f. The behaviour of root-finding algorithms is studied in numerical analysis. Algorithms perform best when they take advantage of known characteristics of the given function. Thus an algorithm to find isolated real roots of a low-degree polynomial in one variable may bear little resemblance to an algorithm for complex roots of a "black-box" function which is not even known to be differentiable. Questions include the ability to separate close roots, robustness in achieving reliable answers despite inevitable numerical errors, and rate of convergence.

Specific algorithms
The simplest root-finding algorithm is the bisection method. It works when f is a continuous function and it requires previous knowledge of two initial guesses, a and b, such that f(a) and f(b) have opposite signs. Although it is reliable, it converges slowly, gaining one bit of accuracy with each iteration.

Newton's method assumes the function f to have a continuous derivative. Newton's method may not converge if started too far away from a root. However, when it does converge, it is faster than the bisection method, and the convergence is usually quadratic. Newton's method is also important because it readily generalizes to higher-dimensional problems. Newton-like methods with higher orders of convergence are Householder's methods; the first one after Newton's method is Halley's method, with cubic order of convergence.

Replacing the derivative in Newton's method with a finite difference, we get the secant method. This method does not require the computation (nor the existence) of a derivative, but the price is slower convergence (the order is approximately 1.6). The false position method, also called the regula falsi method, is like the secant method. However, instead of retaining the last two points, it makes sure to keep one point on either side of the root. The false position method is faster than the bisection method and more robust than the secant method, but requires the two starting points to bracket the root.

The secant method also arises if one approximates the unknown function f by linear interpolation. When quadratic interpolation is used instead, one arrives at Müller's method. It converges faster than the secant method. A particular feature of this method is that the iterates xn may become complex. This can be avoided by interpolating the inverse of f, resulting in the inverse quadratic interpolation method. Again, convergence is asymptotically faster than the secant method, but inverse quadratic interpolation often behaves poorly when the iterates are not close to the root. Finally, Brent's method is a combination of the bisection method, the secant method and inverse quadratic interpolation. At every iteration, Brent's method decides which method out of these three is likely to do best, and proceeds by doing a step according to that method. This gives a robust and fast method, which therefore enjoys considerable popularity.

Bisection method
In mathematics, the bisection method is a root-finding algorithm which repeatedly bisects an interval and then selects a subinterval in which a root must lie for further processing. It is a very simple and robust method, but it is also relatively slow.

The method
The method is applicable when we wish to solve the equation f(x) = 0 for the scalar variable x, where f is a continuous function.

A few steps of the bisection method applied over the starting range [a1; b1]. The bigger red dot is the root of the function.

The bisection method requires two initial points a and b such that f(a) and f(b) have opposite signs. This is called a bracket of a root, for by the intermediate value theorem the continuous function f must have at least one root in the interval (a, b). The method now divides the interval in two by computing the midpoint c = (a + b) / 2 of the interval. Unless c is itself a root (which is very unlikely, but possible), there are now two possibilities: either f(a) and f(c) have opposite signs and bracket a root, or f(c) and f(b) have opposite signs and bracket a root. We select the subinterval that is a bracket, and apply the same bisection step to it. In this way the interval that might contain a zero of f is reduced in width by 50% at each step. We continue until we have a bracket sufficiently small for our purposes. Explicitly, if f(a)·f(c) < 0, then the method sets b equal to c, and if f(b)·f(c) < 0, then the method sets a equal to c. In both cases, the new f(a) and f(b) have opposite signs, so the method is applicable to this smaller interval. A practical implementation of this method must guard against the uncommon occurrence that the midpoint is indeed a solution.

Analysis
If f is a continuous function on the interval [a, b] and f(a)·f(b) < 0, then the bisection method converges to a root of f. In fact, the absolute error is halved at each step. Thus, the method converges linearly, which is quite slow. On the other hand, the method is guaranteed to converge if f(a) and f(b) have different signs. The bisection method gives only a range where the root exists, rather than a single estimate for the root's location. Without using any other information, the best estimate for the location of the root is the midpoint of the smallest bracket found. In that case, the absolute error after n steps is at most

|b − a| / 2^(n+1).

If either endpoint of the interval is used, then the maximum absolute error is

|b − a| / 2^n,

the entire length of the interval. These formulas can be used to determine in advance the number of iterations that the bisection method would need to converge to a root to within a certain tolerance. Using the second formula for the error, the number of iterations n has to satisfy

n > log2(|b − a| / ε)

to ensure that the error is smaller than the tolerance ε. If f has several simple roots in the interval [a, b], then the bisection method will find one of them.

Practical considerations
For robust usage in practice, some care is required due to the properties of floating-point arithmetic.

1. The expression f(left) * f(midpoint) is very likely to underflow to 0, since both arguments are approaching a zero of f. To avoid this possibility, the two function values should be tested for sign separately rather than multiplied.

2. If epsilon is too small, the value of abs(right - left) might never become as small as 2*epsilon, as left and right can get stuck at adjacent non-equal floating-point values. This possibility can be avoided by disallowing epsilon to be too small (depending on the precision of the arithmetic) or by adding extra tests to detect the stuck condition. Another method relies on the assumption that (right+left)/2 exactly equals left or right if left and right are adjacent non-equal floating-point values. This is true for most hardware, including IEEE arithmetic in the absence of overflow, but can be violated if intermediate expressions are calculated to greater precision than stored variables (depending on compiler optimization).
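The two precautions above can be folded directly into an implementation. The following Python sketch is illustrative only (the function name, tolerance and iteration cap are assumptions, not part of any standard library); it tests signs separately instead of multiplying function values.

```python
def bisect(f, a, b, tol=1e-12, max_iter=200):
    """Bisection: f must be continuous and f(a), f(b) must have opposite signs (a < b)."""
    fa, fb = f(a), f(b)
    if fa == 0.0:
        return a
    if fb == 0.0:
        return b
    if (fa < 0) == (fb < 0):
        raise ValueError("f(a) and f(b) must have opposite signs")
    for _ in range(max_iter):
        c = (a + b) / 2.0
        fc = f(c)
        # Guard against the uncommon case that the midpoint is an exact root,
        # and stop once the bracket is small enough.
        if fc == 0.0 or (b - a) / 2.0 < tol:
            return c
        # Compare signs directly rather than forming fa * fc, which could underflow.
        if (fa < 0) != (fc < 0):
            b, fb = c, fc
        else:
            a, fa = c, fc
    return (a + b) / 2.0

# Example: the root of x^3 - x - 2 bracketed by [1, 2]
print(bisect(lambda x: x**3 - x - 2, 1.0, 2.0))
```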

Newton's method
In numerical analysis, Newton's method (also known as the Newton–Raphson method), named after Isaac Newton and Joseph Raphson, is perhaps the best known method for finding successively better approximations to the zeroes (or roots) of a real-valued function. Newton's method can often converge remarkably quickly, especially if the iteration begins "sufficiently near" the desired root. Just how near "sufficiently near" needs to be, and just how quickly "remarkably quickly" can be, depends on the problem. This is discussed in detail below. Unfortunately, when iteration begins far from the desired root, Newton's method can easily lead an unwary user astray with little warning. Thus, good implementations of the method embed it in a routine that also detects and perhaps overcomes possible convergence failures. Given a function f(x) and its derivative f′(x), we begin with a first guess x0. A better approximation x1 is

x_1 = x_0 − f(x_0) / f′(x_0).

An important and somewhat surprising application is Newton–Raphson division, which can be used to quickly find the reciprocal of a number using only multiplication and subtraction. The algorithm is first in the class of Householder's methods, succeeded by Halley's method.

Description of the method
The idea of the method is as follows: one starts with an initial guess which is reasonably close to the true root, then the function is approximated by its tangent line (which can be computed using the tools of calculus), and one computes the x-intercept of this tangent line (which is easily done with elementary algebra). This x-intercept will typically be a better approximation to the function's root than the original guess, and the method can be iterated.

An illustration of one iteration of Newton's method (the function f is shown in blue and the tangent line is in red). We see that xn+1 is a better approximation than xn for the root x of the function f.

Suppose f : [a, b] → R is a differentiable function defined on the interval [a, b] with values in the real numbers R. The formula for converging on the root can be easily derived. Suppose we have some current approximation xn. Then we can derive the formula for a better approximation, xn+1, by referring to the diagram on the right. We know from the definition of the derivative at a given point that it is the slope of the tangent at that point. That is

f′(x_n) = Δy / Δx = (f(x_n) − 0) / (x_n − x_{n+1}).

Here, f′ denotes the derivative of the function f. Then by simple algebra we can derive

x_{n+1} = x_n − f(x_n) / f′(x_n).

We start the process off with some arbitrary initial value x0. (The closer to the zero, the better. But, in the absence of any intuition about where the zero might lie, a "guess and check" method might narrow the possibilities to a reasonably small interval by appealing to the intermediate value theorem.) The method will usually converge, provided this initial guess is close enough to the unknown zero, and that f′(x0) ≠ 0. Furthermore, for a zero of multiplicity 1, the convergence is at least quadratic in a neighbourhood of the zero, which intuitively means that the number of correct digits roughly at least doubles in every step. More details can be found in the analysis section below.

Application to minimization and maximization problems
Newton's method can also be used to find a minimum or maximum of a function. The derivative is zero at a minimum or maximum, so minima and maxima can be found by applying Newton's method to the derivative. The iteration becomes:

x_{n+1} = x_n − f′(x_n) / f″(x_n).
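Written as code, the basic iteration x_{n+1} = x_n − f(x_n)/f′(x_n) is only a few lines. The sketch below is an illustrative Python version (names and tolerances are assumptions); it already includes an iteration cap and a check for a vanishing derivative, anticipating the practical considerations discussed next.

```python
def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Newton's method for f(x) = 0, starting from the initial guess x0."""
    x = x0
    for _ in range(max_iter):
        fx, dfx = f(x), fprime(x)
        if dfx == 0.0:
            raise ZeroDivisionError("zero derivative; choose another starting point")
        x_new = x - fx / dfx          # x_{n+1} = x_n - f(x_n)/f'(x_n)
        if abs(x_new - x) < tol * (1.0 + abs(x_new)):
            return x_new
        x = x_new
    raise RuntimeError("no convergence within max_iter iterations")

# Example: sqrt(2) as the positive root of x^2 - 2, starting from x0 = 1
print(newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0))
```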

Practical considerations
Newton's method is an extremely powerful technique -- in general the convergence is quadratic: the error is essentially squared at each step (that is, the number of accurate digits roughly doubles in each step). However, there are some difficulties with the method.

1. Newton's method requires that the derivative be calculated directly. In most practical problems, the function in question may be given by a long and complicated formula, and hence an analytical expression for the derivative may not be easily obtainable. In these situations, it may be appropriate to approximate the derivative by using the slope of a line through two points on the function. In this case, the secant method results. This has slightly slower convergence than Newton's method but does not require the existence of derivatives.

2. If the initial value is too far from the true zero, Newton's method may fail to converge. For this reason, Newton's method is often referred to as a local technique. Most practical implementations of Newton's method put an upper limit on the number of iterations and perhaps on the size of the iterates.

3. If the derivative of the function is not continuous, the method may fail to converge.

4. It is clear from the formula for Newton's method that it will fail in cases where the derivative is zero. Similarly, when the derivative is close to zero, the tangent line is nearly horizontal and hence may "shoot" wildly past the desired root.

5. If the root being sought has multiplicity greater than one, the convergence rate is merely linear (errors reduced by a constant factor at each step) unless special steps are taken. When there are two or more roots that are close together, it may take many iterations before the iterates get close enough to one of them for the quadratic convergence to be apparent.

Since the most serious of the problems above is the possibility of a failure of convergence, Press et al. (1992) present a version of Newton's method that starts at the midpoint of an interval in which the root is known to lie and stops the iteration if an iterate is generated that lies outside the interval. Developers of large-scale computer systems involving root finding tend to prefer the secant method over Newton's method because the use of a difference quotient in place of the derivative in Newton's method means that the additional code to compute the derivative need not be maintained. In practice, the advantages of maintaining a smaller code base usually outweigh the superior convergence characteristics of Newton's method.

Analysis
Suppose that the function f has a zero at α, i.e., f(α) = 0. If f is continuously differentiable and its derivative is nonzero at α, then there exists a neighborhood of α such that for all starting values x0 in that neighborhood, the sequence {xn} will converge to α.

If the function f is continuously differentiable and its derivative is not 0 at α and it has a second derivative at α, then the convergence is quadratic or faster. If the second derivative is not 0 at α, then the convergence is merely quadratic. If the derivative is 0 at α, then the convergence is usually only linear. Specifically, if f is twice continuously differentiable, f′(α) = 0 and f″(α) ≠ 0, then there exists a neighborhood of α such that for all starting values x0 in that neighborhood, the sequence of iterates converges linearly, with rate log10 2. Alternatively, if f′(α) = 0 and f′(x) ≠ 0 for x ≠ α, with x in a neighborhood U of α and α a zero of multiplicity r, and if f ∈ C^r(U), then there exists a neighborhood of α such that for all starting values x0 in that neighborhood, the sequence of iterates converges linearly. However, even linear convergence is not guaranteed in pathological situations.

In practice these results are local, and the neighborhood of convergence is not known a priori, but there are also some results on global convergence: for instance, given a right neighborhood U+ of α, if f is twice differentiable in U+ and if f′ ≠ 0 and f″ > 0 in U+, then, for each x0 in U+, the sequence xk is monotonically decreasing to α.

Counterexamples
Newton's method is only guaranteed to converge if certain conditions are satisfied, so depending on the shape of the function and the starting point it may or may not converge. In some cases the conditions on the function necessary for convergence are satisfied, but the point chosen as the initial point is not in the interval where the method converges. In such cases a different method, such as bisection, should be used to obtain a better estimate for the zero to use as an initial point. If we start iterating from a stationary point x0 (where the derivative is zero), x1 will be undefined. The same issue occurs if, instead of the starting point, any iteration point is stationary. Even if the derivative is not zero but merely small, the next iterate may be far away from the desired zero. For some functions, some starting points may enter an infinite cycle, preventing convergence. In general, the behavior of the sequence can be very complex. If the function is not continuously differentiable in a neighborhood of the root, then it is possible that Newton's method will always diverge, unless the solution is guessed on the first try. In such cases a different method should be used. A simple example of a function where Newton's method diverges is the cube root, which is continuous and infinitely differentiable, except at x = 0, where its derivative is undefined (this, however, does not affect the algorithm, since it will never require the derivative unless the solution is already found). If the derivative is not continuous at the root, then convergence may fail to occur in any neighborhood of the root. In some cases the iterates converge but do not converge as quickly as promised; in these cases simpler methods converge just as quickly as Newton's method. If the first derivative is zero at the root, then convergence will not be quadratic. If there is no second derivative at the root, then convergence may fail to be quadratic.

Generalizations
When dealing with complex functions, Newton's method can be directly applied to find their zeroes. Each zero has a basin of attraction, the set of all starting values that cause the method to converge to that particular zero. These sets can be mapped as in the image shown. For many complex functions, the boundary of the basins of attraction is a fractal. In some cases there are regions in the complex plane which are not in any of these basins of attraction, meaning the iterates do not converge.

One may also use Newton's method to solve systems of k (non-linear) equations, which amounts to finding the zeroes of continuously differentiable functions F: R^k → R^k. In the formulation given above, one then has to left-multiply with the inverse of the k-by-k Jacobian matrix J_F(x_n) instead of dividing by f′(x_n). Rather than actually computing the inverse of this matrix, one can save time by solving the system of linear equations

J_F(x_n) (x_{n+1} − x_n) = −F(x_n)

for the unknown x_{n+1} − x_n. Again, this method only works if the initial value x0 is close enough to the true zero. Typically, a well-behaved region is located first with some other method and Newton's method is then used to "polish" a root which is already known approximately. Another generalization is Newton's method to find a root of a function F defined in a Banach space. In this case the formulation is

X_{n+1} = X_n − [F′(X_n)]^(−1) F(X_n),

where F′(X_n) is the Fréchet derivative evaluated at the point X_n. One needs the Fréchet derivative to be boundedly invertible at each X_n in order for the method to be applicable. A condition for existence of and convergence to a root is given by the Newton–Kantorovich theorem.

Householder's method
In numerical analysis, the class of Householder's methods are root-finding algorithms used for functions of one real variable with continuous derivatives up to some order d + 1, where d is the order of the Householder's method. The algorithm is iterative and has rate of convergence d + 1.

Method
Like any root-finding method, Householder's method is a numerical algorithm for solving the nonlinear equation f(x) = 0. In this case, the function f has to be a function of one real variable. The method consists of a sequence of iterations

x_{n+1} = x_n + d · (1/f)^(d−1)(x_n) / (1/f)^(d)(x_n),

beginning with an initial guess x0.

If f is a (d+1)-times continuously differentiable function and a is a zero of f but not of its derivative, then, in a neighborhood of a, the iterates xn satisfy

|x_{n+1} − a| ≤ K · |x_n − a|^(d+1), for some K > 0.

This means that the iterates converge to the zero if the initial guess is sufficiently close, and that the convergence has order d + 1.

Motivation
An approximate idea of the origin of Householder's method derives from the geometric series. Let the real-valued, continuously differentiable function f(x) have a simple zero at x = a, that is, f(a) = 0 while f′(a) ≠ 0. Then 1/f(x) has a simple pole at a, and close to a the behavior of 1/f(x) is dominated by the factor 1/(x − a). Approximately, one gets

Here f′(a) ≠ 0 because a is a simple zero of f(x). The coefficient of degree d has the value C a^(−d). Thus, one can now reconstruct the zero a by dividing the coefficient of degree d − 1 by the coefficient of degree d. Since this geometric series is an approximation to the Taylor expansion of 1/f(x), one can get estimates of the zero of f(x) without prior knowledge of its location by dividing the corresponding coefficients of the Taylor expansion of 1/f(x) or, more generally, of 1/f(b + x). From that one gets, for any integer d, and if the corresponding derivatives exist,

a ≈ b + d · (1/f)^(d−1)(b) / (1/f)^(d)(b).

The methods of lower order
The Householder's method of order 1 is just Newton's method, since

(1/f)′(x) = −f′(x) / f(x)², and hence x_{n+1} = x_n + (1 / f(x_n)) / (−f′(x_n) / f(x_n)²) = x_n − f(x_n) / f′(x_n).

For the Householder's method of order 2 one gets Halley's method, since the identities

(1/f)′(x) = −f′(x) / f(x)² and (1/f)″(x) = (2 f′(x)² − f(x) f″(x)) / f(x)³

result in

x_{n+1} = x_n + 2 (1/f)′(x_n) / (1/f)″(x_n) = x_n − 2 f(x_n) f′(x_n) / (2 f′(x_n)² − f(x_n) f″(x_n)) = x_n − h_n / (1 − (f″(x_n) / (2 f′(x_n))) · h_n).

In the last line, hn = f(xn) / f′(xn) is the update of the Newton iteration at the point xn. This line was added to show where the difference from the simple Newton's method lies. The third-order method is obtained from the identity for the third derivative of 1/f,

(1/f)‴(x) = −f‴(x) / f(x)² + 6 f′(x) f″(x) / f(x)³ − 6 f′(x)³ / f(x)⁴,

and has the formula

x_{n+1} = x_n + 3 (1/f)″(x_n) / (1/f)‴(x_n) = x_n − (6 f(x_n) f′(x_n)² − 3 f(x_n)² f″(x_n)) / (6 f′(x_n)³ − 6 f(x_n) f′(x_n) f″(x_n) + f(x_n)² f‴(x_n)),

and so on...

Halley's method
In numerical analysis, Halley's method is a root-finding algorithm used for functions of one real variable with a continuous second derivative, i.e., C2 functions. It is named after its inventor Edmond Halley, who also discovered Halley's Comet. The algorithm is second in the class of Householder's methods, right after Newton's method. Like the latter, it iteratively produces a sequence of approximations to the root; their rate of convergence to the root is cubic. Multidimensional versions of this method exist.

Method
Like any root-finding method, Halley's method is a numerical algorithm for solving the nonlinear equation f(x) = 0. In this case, the function f has to be a function of one real variable. The method consists of a sequence of iterations

x_{n+1} = x_n − 2 f(x_n) f′(x_n) / (2 f′(x_n)² − f(x_n) f″(x_n)),


beginning with an initial guess x0. If f is a thrice continuously differentiable function and a is a zero of f but not of its derivative, then, in a neighborhood of a, the iterates xn satisfy

|x_{n+1} − a| ≤ K · |x_n − a|³, for some K > 0.

This means that the iterates converge to the zero if the initial guess is sufficiently close, and that the convergence is cubic.

Derivation
Consider the function

g(x) = f(x) / √|f′(x)|.

Any root of f which is not a root of its derivative is a root of g, and any root of g is a root of f. Applying Newton's method to g gives

x_{n+1} = x_n − g(x_n) / g′(x_n)

with

g′(x) = (2 f′(x)² − f(x) f″(x)) / (2 f′(x) √|f′(x)|),

and the result follows. Notice that if f′(c) = 0, then one cannot apply this at c because g(c) would be undefined.

Cubic convergence
Suppose a is a root of f but not of its derivative, and suppose that the third derivative of f exists and is continuous in a neighborhood of a, with xn in that neighborhood. Then Taylor's theorem implies:

0 = f(a) = f(x_n) + f′(x_n)(a − x_n) + (f″(x_n)/2)(a − x_n)² + (f‴(ξ)/6)(a − x_n)³

and also

0 = f(a) = f(x_n) + f′(x_n)(a − x_n) + (f″(η)/2)(a − x_n)²,


where ξ and η are numbers lying between a and xn. Multiply the first equation by 2f′(xn) and subtract from it the second equation times f″(xn)(a − xn) to give:

Canceling f′(xn)f″(xn)(a − xn)² and re-organizing terms yields:

Put the second term on the left side and divide through by 2[f′(xn)]² − f(xn)f″(xn) to get:

Thus:

The limit of the coefficient on the right side as xn approaches a is:

If we take K to be a little larger than the absolute value of this, we can take absolute values of both sides of the formula and replace the absolute value of the coefficient by its upper bound near a to get:

which is what was to be proved.
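The iteration itself is short; the following Python sketch (illustrative names, not a library routine) applies the Halley update x_{n+1} = x_n − 2 f f′ / (2 f′² − f f″) with a simple stopping test.

```python
def halley(f, df, d2f, x0, tol=1e-12, max_iter=50):
    """Halley's method; f, df and d2f are the function and its first two derivatives."""
    x = x0
    for _ in range(max_iter):
        fx, dfx, d2fx = f(x), df(x), d2f(x)
        denom = 2.0 * dfx * dfx - fx * d2fx
        if denom == 0.0:
            raise ZeroDivisionError("degenerate denominator in Halley step")
        x_new = x - 2.0 * fx * dfx / denom
        if abs(x_new - x) < tol * (1.0 + abs(x_new)):
            return x_new
        x = x_new
    raise RuntimeError("no convergence")

# Example: the cube root of 2 as the root of x^3 - 2
print(halley(lambda x: x**3 - 2.0, lambda x: 3.0 * x * x, lambda x: 6.0 * x, 1.0))
```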

Secant method
In numerical analysis, the secant method is a root-finding algorithm that uses a succession of roots of secant lines to better approximate a root of a function f.


The method
The secant method is defined by the recurrence relation

x_{n+1} = x_n − f(x_n) · (x_n − x_{n−1}) / (f(x_n) − f(x_{n−1})).

As can be seen from the recurrence relation, the secant method requires two initial values, x0 and x1, which should ideally be chosen to lie close to the root.
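A direct translation of the recurrence into Python might look like the following sketch (the names and the stopping rule are illustrative assumptions):

```python
import math

def secant(f, x0, x1, tol=1e-12, max_iter=100):
    """Secant method: x_{n+1} = x_n - f(x_n)(x_n - x_{n-1}) / (f(x_n) - f(x_{n-1}))."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if f1 == f0:
            raise ZeroDivisionError("f(x_n) == f(x_{n-1}); secant step undefined")
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x2 - x1) < tol * (1.0 + abs(x2)):
            return x2
        x0, f0 = x1, f1
        x1, f1 = x2, f(x2)
    raise RuntimeError("no convergence")

# Example: the root of cos(x) - x, starting from x0 = 0.5, x1 = 1.0
print(secant(lambda x: math.cos(x) - x, 0.5, 1.0))
```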

The first two iterations of the secant method. The red curve shows the function f and the blue lines are the secants.

Derivation of the method
Given x_{n−1} and x_n, we construct the line through the points (x_{n−1}, f(x_{n−1})) and (x_n, f(x_n)), as demonstrated in the picture. Note that this line is a secant or chord of the graph of the function f. In point-slope form, it can be defined as

y − f(x_n) = ((f(x_n) − f(x_{n−1})) / (x_n − x_{n−1})) · (x − x_n).

We now choose x_{n+1} to be the root of this line, so x_{n+1} is chosen such that

f(x_n) + ((f(x_n) − f(x_{n−1})) / (x_n − x_{n−1})) · (x_{n+1} − x_n) = 0.

Solving this equation gives the recurrence relation for the secant method.

Convergence
The iterates xn of the secant method converge to a root of f if the initial values x0 and x1 are sufficiently close to the root. The order of convergence is φ, where

φ = (1 + √5) / 2 ≈ 1.618


is the golden ratio. In particular, the convergence is superlinear. This result only holds under some technical conditions, namely that f be twice continuously differentiable and the root in question be simple (i.e., with multiplicity 1). If the initial values are not close to the root, then there is no guarantee that the secant method converges.

Comparison with other root-finding methods
The secant method does not require that the root remain bracketed like the bisection method does, and hence it does not always converge. The false position method uses the same formula as the secant method. However, it does not apply the formula to x_{n−1} and x_n, like the secant method, but to x_n and the last iterate x_k such that f(x_k) and f(x_n) have a different sign. This means that the false position method always converges. The recurrence formula of the secant method can be derived from the formula for Newton's method,

x_{n+1} = x_n − f(x_n) / f′(x_n),

by using the finite difference approximation

f′(x_n) ≈ (f(x_n) − f(x_{n−1})) / (x_n − x_{n−1}).

If we compare Newton's method with the secant method, we see that Newton's method converges faster (order 2 against 1.6). However, Newton's method requires the evaluation of both f and its derivative at every step, while the secant method only requires the evaluation of f. Therefore, the secant method may well be faster in practice. For instance, if we assume that evaluating f takes as much time as evaluating its derivative and we neglect all other costs, we can do two steps of the secant method (decreasing the logarithm of the error by a factor 2.6) for the same cost as one step of Newton's method (decreasing the logarithm of the error by a factor 2), so the secant method is faster.

Generalizations
Broyden's method is a generalization of the secant method to more than one dimension.

False position method


In numerical analysis, the false position method or regula falsi method is a root-finding algorithm that combines features from the bisection method and the secant method.


The method
Like the bisection method, the false position method starts with two points a0 and b0 such that f(a0) and f(b0) are of opposite signs, which implies by the intermediate value theorem that the function f has a root in the interval [a0, b0]. The method proceeds by producing a sequence of shrinking intervals [ak, bk] that all contain a root of f.

The first two iterations of the false position method. The red curve shows the function f and the blue lines are the secants.

At iteration number k, the number

c_k = b_k − f(b_k)(b_k − a_k) / (f(b_k) − f(a_k)) = (a_k f(b_k) − b_k f(a_k)) / (f(b_k) − f(a_k))

is computed. As explained below, ck is the root of the secant line through (ak, f(ak)) and (bk, f(bk)). If f(ak) and f(ck) have the same sign, then we set ak+1 = ck and bk+1 = bk; otherwise we set ak+1 = ak and bk+1 = ck. This process is repeated until the root is approximated sufficiently well. The above formula is also used in the secant method, but the secant method always retains the last two computed points, while the false position method retains two points which certainly bracket a root. On the other hand, the only difference between the false position method and the bisection method is that the latter uses ck = (ak + bk) / 2.

Finding the root of the secant
Given ak and bk, we construct the line through the points (ak, f(ak)) and (bk, f(bk)), as demonstrated in the picture on the right. Note that this line is a secant or chord of the graph of the function f. In point-slope form, it can be defined as

y − f(b_k) = ((f(b_k) − f(a_k)) / (b_k − a_k)) · (x − b_k).


We now choose ck to be the root of this line, so ck is chosen such that

f(b_k) + ((f(b_k) − f(a_k)) / (b_k − a_k)) · (c_k − b_k) = 0.

Solving this equation gives the above equation for ck.

Analysis
If the initial end-points a0 and b0 are chosen such that f(a0) and f(b0) are of opposite signs, then one of the end-points will converge to a root of f. Asymptotically, the other end-point will remain fixed for all subsequent iterations while the converging endpoint becomes updated. As a result, unlike the bisection method, the width of the bracket does not tend to zero. As a consequence, the linear approximation to f(x), which is used to pick the false position, does not improve in its quality. One example of this phenomenon is the function f(x) = 2x³ − 4x² + 3x on the initial bracket [−1, 1]. The left end, −1, is never replaced and thus the width of the bracket never falls below 1. Hence, the right endpoint approaches 0 at a linear rate (with a rate of convergence of 2/3). While it is a misunderstanding to think that the method of false position is a good method, it is equally a mistake to think that it is unsalvageable. The failure mode is easy to detect (the same end-point is retained twice in a row) and easily remedied by next picking a modified false position, such as

c_k = ((1/2) f(b_k) a_k − f(a_k) b_k) / ((1/2) f(b_k) − f(a_k))

or

c_k = (f(b_k) a_k − (1/2) f(a_k) b_k) / (f(b_k) − (1/2) f(a_k)),

down-weighting one of the endpoint values to force the next ck to occur on that side of the function. The factor of 2 above looks like a hack, but it guarantees superlinear convergence (asymptotically, the algorithm will perform two regular steps after any modified step). There are other ways to pick the rescaling which give even better superlinear convergence rates.
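One common realization of this idea is the Illinois variant, sketched below in Python (the function name and tolerances are illustrative, and other down-weighting rules exist): whenever the same endpoint is retained twice in a row, its stored function value is halved before the next false-position step.

```python
def illinois(f, a, b, tol=1e-12, max_iter=100):
    """Regula falsi with the Illinois modification; f(a) and f(b) must differ in sign."""
    fa, fb = f(a), f(b)
    if (fa < 0) == (fb < 0):
        raise ValueError("f(a) and f(b) must have opposite signs")
    retained = None
    c = a
    for _ in range(max_iter):
        c = (a * fb - b * fa) / (fb - fa)        # root of the secant line
        fc = f(c)
        if fc == 0.0 or abs(b - a) < tol:
            return c
        if (fc < 0) == (fa < 0):
            # Root lies in [c, b]: replace a, so b is the retained endpoint.
            a, fa = c, fc
            if retained == "b":
                fb /= 2.0                        # down-weight the stagnant endpoint
            retained = "b"
        else:
            # Root lies in [a, c]: replace b, so a is the retained endpoint.
            b, fb = c, fc
            if retained == "a":
                fa /= 2.0
            retained = "a"
    return c
```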


Ford (1995) summarizes and analyzes the superlinear variants of the modified method of false position. Judging from the bibliography, modified regula falsi methods were well known in the 1970s and have been subsequently forgotten or misremembered in current textbooks.

Müller's method
Müller's method is a root-finding algorithm, a numerical method for solving equations of the form f(x) = 0. It was first presented by D. E. Müller in 1956. Müller's method is based on the secant method, which constructs at every iteration a line through two points on the graph of f. Instead, Müller's method uses three points, constructs the parabola through these three points, and takes the intersection of the x-axis with the parabola to be the next approximation.

Recurrence relation
The three initial values needed are denoted as xk, xk−1 and xk−2. The parabola going through the three points (xk, f(xk)), (xk−1, f(xk−1)) and (xk−2, f(xk−2)), when written in the Newton form, is

y = f(x_k) + f[x_k, x_{k−1}](x − x_k) + f[x_k, x_{k−1}, x_{k−2}](x − x_k)(x − x_{k−1}),

where f[xk, xk−1] and f[xk, xk−1, xk−2] denote divided differences. This can be rewritten as

y = f(x_k) + w(x − x_k) + f[x_k, x_{k−1}, x_{k−2}](x − x_k)²,

where

w = f[x_k, x_{k−1}] + f[x_k, x_{k−2}] − f[x_{k−1}, x_{k−2}].

The next iterate is now given by the root of the quadratic equation y = 0. This yields the recurrence relation

x_{k+1} = x_k − 2 f(x_k) / (w ± √(w² − 4 f(x_k) f[x_k, x_{k−1}, x_{k−2}])).

In this formula, the sign should be chosen such that the denominator is as large as possible in magnitude. We do not use the standard formula for solving quadratic equations because that may lead to loss of significance. Note that xk+1 can be complex, even if the previous iterates were all real. This is in contrast with other root-finding algorithms like the secant method or Newton's method, whose iterates will remain real if one starts with real numbers. Having complex iterates can be an advantage (if one is looking for complex roots) or a disadvantage (if it is known that all roots are real), depending on the problem.
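A Python sketch of the iteration follows (names are illustrative). The square root is taken with cmath so that, as noted above, the iterates are allowed to become complex.

```python
import cmath

def muller(f, x0, x1, x2, tol=1e-12, max_iter=100):
    """Muller's method: fit a parabola through the last three iterates."""
    for _ in range(max_iter):
        f0, f1, f2 = f(x0), f(x1), f(x2)
        d01 = (f1 - f0) / (x1 - x0)              # f[x_{k-2}, x_{k-1}]
        d12 = (f2 - f1) / (x2 - x1)              # f[x_{k-1}, x_k]
        d012 = (d12 - d01) / (x2 - x0)           # f[x_{k-2}, x_{k-1}, x_k]
        w = d12 + d012 * (x2 - x1)
        disc = cmath.sqrt(w * w - 4.0 * f2 * d012)
        # Choose the sign that makes the denominator largest in magnitude.
        denom = w + disc if abs(w + disc) > abs(w - disc) else w - disc
        if denom == 0:
            raise ZeroDivisionError("degenerate parabola")
        x3 = x2 - 2.0 * f2 / denom
        if abs(x3 - x2) < tol:
            return x3
        x0, x1, x2 = x1, x2, x3
    raise RuntimeError("no convergence")

# Example: a complex root of x^2 + 1 found from purely real starting values
print(muller(lambda x: x * x + 1.0, 0.0, 1.0, 2.0))
```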


Speed of convergence
The order of convergence of Müller's method is approximately 1.84. This can be compared with 1.62 for the secant method and 2 for Newton's method. So, the secant method makes less progress per iteration than Müller's method and Newton's method makes more progress. More precisely, if ξ denotes a simple root of f (so f(ξ) = 0 and f′(ξ) ≠ 0), f is thrice continuously differentiable, and the initial guesses x0, x1, and x2 are taken sufficiently close to ξ, then the iterates satisfy

where p ≈ 1.84 is the positive root of x³ − x² − x − 1 = 0.

Inverse quadratic interpolation


In numerical analysis, inverse quadratic interpolation is a root-finding algorithm, meaning that it is an algorithm for solving equations of the form f(x) = 0. The idea is to use quadratic interpolation to approximate the inverse of f. This algorithm is rarely used on its own, but it is important because it forms part of the popular Brent's method.

The method
The inverse quadratic interpolation algorithm is defined by the recurrence relation

x_{n+1} = (f_{n−1} f_n / ((f_{n−2} − f_{n−1})(f_{n−2} − f_n))) x_{n−2} + (f_{n−2} f_n / ((f_{n−1} − f_{n−2})(f_{n−1} − f_n))) x_{n−1} + (f_{n−2} f_{n−1} / ((f_n − f_{n−2})(f_n − f_{n−1}))) x_n,

where fk = f(xk). As can be seen from the recurrence relation, this method requires three initial values, x0, x1 and x2.

Explanation of the method
We use the three preceding iterates, x_{n−2}, x_{n−1} and x_n, with their function values, f_{n−2}, f_{n−1} and f_n. Applying the Lagrange interpolation formula to do quadratic interpolation on the inverse of f yields

f^(−1)(y) ≈ ((y − f_{n−1})(y − f_n) / ((f_{n−2} − f_{n−1})(f_{n−2} − f_n))) x_{n−2} + ((y − f_{n−2})(y − f_n) / ((f_{n−1} − f_{n−2})(f_{n−1} − f_n))) x_{n−1} + ((y − f_{n−2})(y − f_{n−1}) / ((f_n − f_{n−2})(f_n − f_{n−1}))) x_n.

We are looking for a root of f, so we substitute y = f(x) = 0 in the above equation, and this results in the above recursion formula.

Behaviour
The asymptotic behaviour is very good: generally, the iterates xn converge quickly to the root once they get close. However, performance is often quite poor if the iterates do not start very close to the actual root. For instance, if two of the function values f_{n−2}, f_{n−1} and f_n coincide, the algorithm fails completely. Thus, inverse quadratic interpolation is seldom used as a stand-alone algorithm.

Comparison with other root-finding methods
As noted in the introduction, inverse quadratic interpolation is used in Brent's method. Inverse quadratic interpolation is also closely related to some other root-finding methods. Using linear interpolation instead of quadratic interpolation gives the secant method. Interpolating f instead of the inverse of f gives Müller's method.
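A Python sketch of the recurrence (illustrative names; in practice the step is embedded in a hybrid scheme such as Brent's method rather than iterated on its own):

```python
def inverse_quadratic(f, x0, x1, x2, tol=1e-12, max_iter=100):
    """Inverse quadratic interpolation; fails if two of the function values coincide."""
    f0, f1, f2 = f(x0), f(x1), f(x2)
    for _ in range(max_iter):
        if f0 == f1 or f0 == f2 or f1 == f2:
            raise ZeroDivisionError("two function values coincide")
        x3 = (f1 * f2 / ((f0 - f1) * (f0 - f2)) * x0
              + f0 * f2 / ((f1 - f0) * (f1 - f2)) * x1
              + f0 * f1 / ((f2 - f0) * (f2 - f1)) * x2)
        if abs(x3 - x2) < tol:
            return x3
        x0, f0, x1, f1 = x1, f1, x2, f2
        x2, f2 = x3, f(x3)
    raise RuntimeError("no convergence (start closer to the root)")

# Example: the root of x^3 - x - 2 with starting values near 1.5
print(inverse_quadratic(lambda x: x**3 - x - 2, 1.0, 1.4, 1.6))
```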

Brent's method
In numerical analysis, Brent's method is a complicated but popular root-finding algorithm combining the bisection method, the secant method and inverse quadratic interpolation. It has the reliability of bisection but it can be as quick as some of the less reliable methods. The idea is to use the secant method or inverse quadratic interpolation if possible, because they converge faster, but to fall back to the more robust bisection method if necessary. Brent's method is due to Richard Brent (1973) and builds on an earlier algorithm of Theodorus Dekker (1969). Jack Crenshaw says a variation on Brent's method -- one that alternates between bisection and attempting inverse quadratic interpolation -- is the best root finder he has found. A variation on Brent's method was implemented in one of the first available software libraries.

Dekker's method
The idea to combine the bisection method with the secant method goes back to Dekker. Suppose that we want to solve the equation f(x) = 0. As with the bisection method, we need to initialize Dekker's method with two points, say a0 and b0, such that f(a0) and f(b0) have opposite signs. If f is continuous on [a0, b0], the intermediate value theorem guarantees the existence of a solution between a0 and b0. Three points are involved in every iteration:

bk is the current iterate, i.e., the current guess for the root of f.
ak is the "contrapoint," i.e., a point such that f(ak) and f(bk) have opposite signs, so the interval [ak, bk] contains the solution. Furthermore, |f(bk)| should be less than or equal to |f(ak)|, so that bk is a better guess for the unknown solution than ak.
bk−1 is the previous iterate (for the first iteration, we set bk−1 = a0).

Two provisional values for the next iterate are computed. The first one is given by the secant method,

s = b_k − f(b_k)(b_k − b_{k−1}) / (f(b_k) − f(b_{k−1})), provided f(b_k) ≠ f(b_{k−1}),

and the second one is given by the bisection method,

m = (a_k + b_k) / 2.

If the result of the secant method, s, lies between bk and m, then it becomes the next iterate (bk+1 = s), otherwise the midpoint is used (bk+1 = m). Then, the value of the new contrapoint is chosen such that f(ak+1) and f(bk+1) have opposite signs. If f(ak) and f(bk+1) have opposite signs, then the contrapoint remains the same: ak+1 = ak. Otherwise, f(bk+1) and f(bk) have opposite signs, so the new contrapoint becomes ak+1 = bk. Finally, if |f(ak+1)| < |f(bk+1)|, then ak+1 is probably a better guess for the solution than bk+1, and hence the values of ak+1 and bk+1 are exchanged. This ends the description of a single iteration of Dekker's method.
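The iteration just described translates into a short routine. The Python sketch below is an illustrative rendering of Dekker's method (names and the stopping rule are assumptions), without Brent's additional safeguards:

```python
def dekker(f, a, b, tol=1e-12, max_iter=200):
    """Dekker's method: secant step when acceptable, otherwise bisection."""
    fa, fb = f(a), f(b)
    if (fa < 0) == (fb < 0):
        raise ValueError("f(a) and f(b) must have opposite signs")
    if abs(fa) < abs(fb):                 # make b the better guess, a the contrapoint
        a, b, fa, fb = b, a, fb, fa
    b_old, fb_old = a, fa                 # previous iterate b_{k-1}
    for _ in range(max_iter):
        if fb == 0.0 or abs(b - a) < tol:
            return b
        m = (a + b) / 2.0                                     # bisection proposal
        if fb != fb_old:
            s = b - fb * (b - b_old) / (fb - fb_old)          # secant proposal
        else:
            s = m
        b_new = s if min(b, m) < s < max(b, m) else m         # accept s only between b and m
        fb_new = f(b_new)
        b_old, fb_old = b, fb
        if (fa < 0) == (fb_new < 0):
            a, fa = b, fb                 # contrapoint becomes the old iterate
        b, fb = b_new, fb_new
        if abs(fa) < abs(fb):             # keep b as the better of the two guesses
            a, b, fa, fb = b, a, fb, fa
    return b
```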

Brent's method
Dekker's method performs well if the function f is reasonably well-behaved. However, there are circumstances in which every iteration employs the secant method, but the iterates bk converge very slowly (in particular, |bk − bk−1| may be arbitrarily small). Dekker's method requires far more iterations than the bisection method in this case. Brent proposed a small modification to avoid this problem. He inserted an additional test which must be satisfied before the result of the secant method is accepted as the next iterate. Two inequalities must be simultaneously satisfied:

given a specific numerical tolerance δ, if the previous step used the bisection method, the inequality |δ| < |b_k − b_{k−1}| must hold, otherwise the bisection method is performed and its result used for the next iteration. If the previous step performed interpolation, then the inequality |δ| < |b_{k−1} − b_{k−2}| is used instead.

Also, if the previous step used the bisection method, the inequality

|s − b_k| < (1/2) |b_k − b_{k−1}|

must hold, otherwise the bisection method is performed and its result used for the next iteration. If the previous step performed interpolation, then the inequality

|s − b_k| < (1/2) |b_{k−1} − b_{k−2}|

is used instead. This modification ensures that at the kth iteration, a bisection step will be performed in at most 2·log2(|b_{k−1} − b_{k−2}| / δ) additional iterations, because the above conditions force consecutive interpolation step sizes to halve every two iterations, and after at most 2·log2(|b_{k−1} − b_{k−2}| / δ) iterations, the step size will be smaller than δ, which invokes a bisection step. Brent proved that his method requires at most N² iterations, where N denotes the number of iterations for the bisection method. If the function f is well-behaved, then Brent's method will usually proceed by either inverse quadratic or linear interpolation, in which case it will converge superlinearly. Furthermore, Brent's method uses inverse quadratic interpolation instead of linear interpolation (as used by the secant method) if f(bk), f(ak) and f(bk−1) are distinct. This slightly increases the efficiency. As a consequence, the condition for accepting s (the value proposed by either linear interpolation or inverse quadratic interpolation) has to be changed: s has to lie between (3ak + bk) / 4 and bk.
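In practice, Brent's method is usually taken from a library rather than re-implemented. For example, SciPy exposes it as scipy.optimize.brentq; a minimal usage sketch (the cubic and the bracket are illustrative choices):

```python
from scipy.optimize import brentq

# Root of x^3 - 2x - 5 bracketed by [2, 3].
root = brentq(lambda x: x**3 - 2.0 * x - 5.0, 2.0, 3.0, xtol=1e-12)
print(root)   # roughly 2.0945515
```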


Interpolation and extrapolation


Interpolation
In the mathematical subfield of numerical analysis, interpolation is a method of constructing new data points within the range of a discrete set of known data points. In engineering and science one often has a number of data points, as obtained by sampling or experimentation, and tries to construct a function which closely fits those data points. This is called curve fitting or regression analysis. Interpolation is a specific case of curve fitting, in which the function must go exactly through the data points. A different problem which is closely related to interpolation is the approximation of a complicated function by a simple function. Suppose we know the function but it is too complex to evaluate efficiently. Then we could pick a few known data points from the complicated function, creating a lookup table, and try to interpolate those data points to construct a simpler function. Of course, when using the simple function to calculate new data points we usually do not receive the same result as when using the original function, but depending on the problem domain and the interpolation method used the gain in simplicity might offset the error. There are many different interpolation methods, some of which are described below. Some of the concerns to take into account when choosing an appropriate algorithm are: How accurate is the method? How expensive is it? How smooth is the interpolant? How many data points are needed?

Linear interpolation
Linear interpolation is a method of curve fitting using linear polynomials. It is heavily employed in mathematics (particularly numerical analysis) and in numerous applications, including computer graphics. It is a simple form of interpolation. Linear interpolation is quick and easy, but it is not very precise. Another disadvantage is that the interpolant is not differentiable at the data points xk. Linear interpolation is often used to fill the gaps in a table. Suppose you have a table listing the population of some country in 1970, 1980, 1990 and 2000, and that you want to estimate the population in 1994. Linear interpolation gives you an easy way to do this. The basic operation of linear interpolation between two values is so commonly used in computer graphics that it is sometimes called a lerp in that field's jargon. The term can be used as a verb or noun for the operation, e.g. "Bresenham's algorithm lerps incrementally between the two endpoints of the line." Lerp operations are built into the hardware of all modern computer graphics processors. They are often used as building blocks for more complex operations: for example, a bilinear interpolation can be accomplished in two lerps. Because this operation is cheap, it is also a good way to implement accurate lookup tables with quick lookup for smooth functions without having too many table entries.


Linear interpolation between two known points
If the two known points are given by the coordinates (x0, y0) and (x1, y1), the linear interpolant is the straight line between these points. For a value x in the interval (x0, x1), the value y along the straight line is given by the equation

(y − y0) / (x − x0) = (y1 − y0) / (x1 − x0),

which can be derived geometrically from the figure below.

Given the two red points, the blue line is the linear interpolant between the points, and the value y at x may be found by linear interpolation.

Solving this equation for y, which is the unknown value at x, gives

y = y0 + (y1 − y0) · (x − x0) / (x1 − x0),

which is the formula for linear interpolation in the interval (x0, x1). Outside this interval, the formula is identical to linear extrapolation.

Interpolation of a data set
Linear interpolation on a set of data points (x0, y0), (x1, y1), ..., (xn, yn) is defined as the concatenation of linear interpolants between each pair of data points. This results in a continuous curve, with a discontinuous derivative.
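In code, the two-point formula and its concatenation over a data set are straightforward; the following Python sketch is illustrative (function names and the sample table are assumptions, not measured data):

```python
def lerp(x0, y0, x1, y1, x):
    """Linear interpolation between (x0, y0) and (x1, y1) at x."""
    t = (x - x0) / (x1 - x0)
    return y0 + t * (y1 - y0)

def interpolate(xs, ys, x):
    """Piecewise linear interpolation on data points sorted by xs."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x is outside the data range (that would be extrapolation)")
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            return lerp(xs[i], ys[i], xs[i + 1], ys[i + 1], x)

# Example in the spirit of the population table above (made-up values, in millions)
years = [1970, 1980, 1990, 2000]
pops = [203.2, 226.5, 248.7, 281.4]
print(interpolate(years, pops, 1994))
```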


Linear interpolation on a data set (red points) consists of pieces of linear interpolants (blue lines).

Linear interpolation as approximation
Linear interpolation is often used to approximate a value of some function f using two known values of that function at other points. The error of this approximation is defined as

R_T = f(x) − p(x),

where p denotes the linear interpolation polynomial defined above,

p(x) = f(x0) + ((f(x1) − f(x0)) / (x1 − x0)) · (x − x0).

It can be proven using Rolle's theorem that if f has a continuous second derivative, the error is bounded by

|R_T| ≤ ((x1 − x0)² / 8) · max_{x0 ≤ x ≤ x1} |f″(x)|.

As this bound shows, the approximation between two points on a given function gets worse with the second derivative of the function that is approximated. This is intuitively correct as well: the "curvier" the function is, the worse the approximations made with simple linear interpolation become. In words, the error is proportional to the square of the distance between the data points. The error of some other methods, including polynomial interpolation and spline interpolation (described below), is proportional to higher powers of the distance between the data points. These methods also produce smoother interpolants.

Multivariate
Linear interpolation as described here is for data points in one spatial dimension. For two spatial dimensions, the extension of linear interpolation is called bilinear interpolation, and in

three dimensions, trilinear interpolation. Other extensions of linear interpolation can be applied to other kinds of mesh such as triangular and tetrahedral meshes.

Polynomial interpolation
In the mathematical subfield of numerical analysis, polynomial interpolation is the interpolation of a given data set by a polynomial. In other words, given some data points (such as obtained by sampling), the aim is to find a polynomial which goes exactly through these points. Polynomial interpolation is a generalization of linear interpolation. Note that the linear interpolant is a linear function. We now replace this interpolant by a polynomial of higher degree. Polynomials can be used to approximate more complicated curves, for example, the shapes of letters in typography, given a few points. A related application is the evaluation of the natural logarithm and trigonometric functions: pick a few known data points, create a lookup table, and interpolate between those data points. This results in significantly faster computations. Polynomial interpolation also forms the basis for algorithms in numerical quadrature and numerical ordinary differential equations. Generally, if we have n data points, there is exactly one polynomial of degree at most n − 1 going through all the data points. The interpolation error is proportional to the distance between the data points to the power n. Furthermore, the interpolant is a polynomial and thus infinitely differentiable. So, we see that polynomial interpolation overcomes the main problems of linear interpolation. However, polynomial interpolation also has some disadvantages. Calculating the interpolating polynomial is computationally expensive compared to linear interpolation. Furthermore, polynomial interpolation may not be so exact after all, especially at the end points (Runge's phenomenon). These disadvantages can be avoided by using spline interpolation.

Constructing the interpolation polynomial
Suppose that the interpolation polynomial is in the form

p(x) = a_n x^n + a_{n−1} x^{n−1} + ⋯ + a_1 x + a_0.     (1)

The statement that p interpolates the data points means that

p(x_i) = y_i for all i = 0, 1, ..., n.

If we substitute equation (1) in here, we get a system of linear equations in the coefficients ak. The system in matrix-vector form reads

[ x_0^n  x_0^{n−1}  ⋯  x_0  1 ]   [ a_n     ]   [ y_0 ]
[ x_1^n  x_1^{n−1}  ⋯  x_1  1 ]   [ a_{n−1} ]   [ y_1 ]
[  ⋮        ⋮            ⋮   ⋮ ] · [    ⋮    ] = [  ⋮  ]
[ x_n^n  x_n^{n−1}  ⋯  x_n  1 ]   [ a_0     ]   [ y_n ]


We have to solve this system for ak to construct the interpolant p(x). The matrix on the left is commonly referred to as a Vandermonde matrix. Its determinant is nonzero, which proves the unisolvence theorem: there exists a unique interpolating polynomial. The condition number of the Vandermonde matrix may be large, causing large errors when computing the coefficients ai if the system of equations is solved using Gaussian elimination. Several authors have therefore proposed algorithms which exploit the structure of the Vandermonde matrix to compute numerically stable solutions in O(n²) operations instead of the O(n³) required by Gaussian elimination. These methods rely on constructing first a Newton interpolation of the polynomial and then converting it to the monomial form above.
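For small data sets, the Vandermonde system can simply be formed and solved numerically. The NumPy sketch below illustrates this (the data values are arbitrary); for larger n the warnings above about conditioning apply.

```python
import numpy as np

xs = np.array([0.0, 1.0, 2.0, 3.0])        # interpolation nodes (illustrative)
ys = np.array([1.0, 2.0, 0.0, 5.0])        # values to interpolate

V = np.vander(xs)                          # each row is [x^3, x^2, x, 1], highest power first
coeffs = np.linalg.solve(V, ys)            # monomial coefficients a_n, ..., a_0

print(coeffs)
print(np.polyval(coeffs, xs))              # reproduces ys up to rounding
```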

The red dots denote the data points (xk, yk), while the blue curve shows the interpolation polynomial.

Non-Vandermonde solutions
We are trying to construct our unique interpolation polynomial in the vector space Πn of polynomials of degree at most n. When using a monomial basis for Πn we have to solve the Vandermonde system to construct the coefficients ak for the interpolation polynomial. This can be a very costly operation (as counted in clock cycles of a computer trying to do the job). By choosing another basis for Πn we can simplify the calculation of the coefficients, but then we have to do additional calculations when we want to express the interpolation polynomial in terms of a monomial basis. One method is to write the interpolation polynomial in the Newton form and use the method of divided differences to construct the coefficients, e.g. Neville's algorithm. The cost is O(n²) operations, while Gaussian elimination costs O(n³) operations. Furthermore, you only need to do O(n) extra work if an extra point is added to the data set, while for the other methods, you have to redo the whole computation. Another method is to use the Lagrange form of the interpolation polynomial. The resulting formula immediately shows that the interpolation polynomial exists under the conditions stated in the above theorem.


The Bernstein form was used in a constructive proof of the Weierstrass approximation theorem by Bernstein and has nowadays gained great importance in computer graphics in the form of Bézier curves.

Interpolation error
When interpolating a given function f by a polynomial of degree n at the nodes x0, ..., xn we get the error

f(x) − p_n(x) = f[x_0, ..., x_n, x] · ∏_{i=0}^{n} (x − x_i),

where

f[x_0, ..., x_n, x]

is the notation for divided differences. When f is n+1 times continuously differentiable on the smallest interval I which contains the nodes xi and x, then we can write the error in the Lagrange form as

f(x) − p_n(x) = (f^(n+1)(ξ) / (n+1)!) · ∏_{i=0}^{n} (x − x_i)

for some ξ in I. Thus the remainder term in the Lagrange form of the Taylor theorem is a special case of interpolation error when all interpolation nodes xi are identical. In the case of equally spaced interpolation nodes xi = x0 + ih, it follows that the interpolation error is O(h^(n+1)). The above error bound suggests choosing the interpolation points xi such that the product ∏ |x − xi| is as small as possible. The Chebyshev nodes achieve this.

Lagrange polynomial
In numerical analysis, a Lagrange polynomial, named after Joseph Louis Lagrange, is the interpolating polynomial for a given set of data points in the Lagrange form. It was first discovered by Edward Waring in 1779 and later rediscovered by Leonhard Euler in 1783. Notice that, for any given set of data points, there is only one polynomial (of least possible degree) that interpolates these points. Thus, it is more appropriate to call it "the Lagrange form of the interpolation polynomial" rather than "the Lagrange interpolation polynomial".

Definition
Given a set of k + 1 data points

(x_0, y_0), (x_1, y_1), ..., (x_k, y_k),


where no two xj are the same, the interpolation polynomial in the Lagrange form is a linear combination

L(x) = Σ_{j=0}^{k} y_j · l_j(x)

of Lagrange basis polynomials

l_j(x) = ∏_{0 ≤ m ≤ k, m ≠ j} (x − x_m) / (x_j − x_m).

Proof
The function we are looking for has to be a polynomial function L(x) of degree less than or equal to k with

L(x_j) = y_j for all j = 0, ..., k.

The Lagrange polynomial is a solution to the interpolation problem. As can be seen: 1. each l_j(x) is a polynomial and has degree k; 2. l_j(x_i) equals 1 if i = j and 0 otherwise. Thus the function L(x) is a polynomial with degree at most k and satisfies L(x_i) = y_i.

There can be only one solution to the interpolation problem, since the difference of two such solutions would be a polynomial with degree at most k and k + 1 zeros. This is only possible if the difference is identically zero, so L(x) is the unique polynomial interpolating the given data.

Main idea
Solving an interpolation problem leads to a problem in linear algebra where we have to solve a system of linear equations. Using a standard monomial basis for our interpolation polynomial we get the Vandermonde matrix. By choosing another basis, the Lagrange basis, we get the much simpler identity matrix (l_j(x_i) = δ_ij), which we can solve instantly: the Lagrange basis inverts the Vandermonde matrix.

Notes
The Lagrange form of the interpolation polynomial shows the linear character of polynomial interpolation and the uniqueness of the interpolation polynomial. Therefore, it is preferred in

proofs and theoretical arguments. Uniqueness can also be seen from the invertibility of the Vandermonde matrix, due to the non-vanishing of the Vandermonde determinant. But, as can be seen from the construction, each time a node xk changes, all Lagrange basis polynomials have to be recalculated. A better form of the interpolation polynomial for practical (or computational) purposes is the barycentric form of the Lagrange interpolation (see below) or Newton polynomials. Lagrange and other interpolation at equally spaced points, as in the example above, yield a polynomial oscillating above and below the true function. This behaviour tends to grow with the number of points, leading to a divergence known as Runge's phenomenon; the problem may be eliminated by choosing interpolation points at Chebyshev nodes. The Lagrange basis polynomials can be used in numerical integration to derive the Newton–Cotes formulas.

Barycentric interpolation
Using the quantity

l(x) = (x − x_0)(x − x_1) ⋯ (x − x_k)

we can rewrite the Lagrange basis polynomials as

l_j(x) = l(x) / ((x − x_j) · ∏_{m ≠ j} (x_j − x_m)),

or, by defining the barycentric weights

w_j = 1 / ∏_{m ≠ j} (x_j − x_m),

we can simply write

l_j(x) = l(x) · w_j / (x − x_j),

which is commonly referred to as the first form of the barycentric interpolation formula. The advantage of this representation is that the interpolation polynomial may now be evaluated as

L(x) = l(x) · Σ_{j=0}^{k} (w_j / (x − x_j)) · y_j,


which, if the weights wj have been pre-computed, requires only O(n) operations (evaluating l(x) and the weights wj / (x − xj)) as opposed to O(n²) for evaluating the Lagrange basis polynomials lj(x) individually. The barycentric interpolation formula can also easily be updated to incorporate a new node x_{k+1} by dividing each of the wj, j = 0, ..., k, by (x_j − x_{k+1}) and constructing the new w_{k+1} as above. We can further simplify the first form by first considering the barycentric interpolation of the constant function g(x) ≡ 1:

g(x) = 1 = l(x) · Σ_{j=0}^{k} w_j / (x − x_j).

Dividing L(x) by g(x) does not modify the interpolation, yet yields

L(x) = ( Σ_{j=0}^{k} (w_j / (x − x_j)) · y_j ) / ( Σ_{j=0}^{k} w_j / (x − x_j) ),

which is referred to as the second form or true form of the barycentric interpolation formula. This second form has the advantage that l(x) need not be evaluated for each evaluation of L(x).

Newton polynomial
In the mathematical field of numerical analysis, a Newton polynomial, named after its inventor Isaac Newton, is the interpolation polynomial for a given set of data points in the Newton form. The Newton polynomial is sometimes called Newton's divided differences interpolation polynomial because the coefficients of the polynomial are calculated using divided differences. As there is only one interpolation polynomial for a given set of data points, it is a bit misleading to call it the Newton interpolation polynomial; the more precise name is interpolation polynomial in the Newton form.

Definition
Given a set of k + 1 data points

(x_0, y_0), (x_1, y_1), ..., (x_k, y_k),

where no two xj are the same, the interpolation polynomial in the Newton form is a linear combination of Newton basis polynomials

N(x) = Σ_{j=0}^{k} a_j · n_j(x),


with the Newton basis polynomials defined as

n_j(x) = ∏_{i=0}^{j−1} (x − x_i), with n_0(x) ≡ 1,

and the coefficients defined as

a_j = [y_0, ..., y_j],

where

[y_0, ..., y_j]

is the notation for divided differences. Thus the Newton polynomial can be written as

N(x) = [y_0] + [y_0, y_1](x − x_0) + ⋯ + [y_0, ..., y_k](x − x_0)(x − x_1) ⋯ (x − x_{k−1}).

The Newton polynomial above can be expressed in a simplified form when x0, x1, ..., xk are arranged consecutively with equal spacing. Introducing the notation h = x_{i+1} − x_i for each i = 0, 1, ..., k−1 and x = x0 + sh, the difference x − xi can be written as (s − i)h. So the Newton polynomial above becomes:

N(x) = Σ_{i=0}^{k} (s choose i) · i! · h^i · [y_0, ..., y_i]

is called the Newton forward divided difference formula. If the nodes are reordered as x_k, x_{k−1}, ..., x_0, the Newton polynomial becomes:

If x_k, x_{k−1}, ..., x_0 are equally spaced with x = x_k + sh and x_i = x_k − (k − i)h for i = 0, 1, ..., k, then

is called the Newton backward divided difference formula. An advantage of Newton's formula is that one can add more terms, for higher degree interpolation, by using additional data points at one end, without re-doing previous


calculations. Newton's forward formula can add new points to the right, and Newton's backward formula can add new points to the left. Another advantage of Newton's formula is that it is the straightforward and natural differences version of Taylor's polynomial. Taylor's polynomial tells where a function will go, based on its derivatives (its rate of change, and the rate of change of its rate of change, etc.) at one particular point. Newton's formula is Taylor's polynomial, but using finite differences instead of instantaneous rates of change. The difference methods all put the same polynomial through a set of data points -- in fact so do other polynomial interpolating formulas, such as that of Lagrange. So (disregarding roundoff) they all have the same accuracy. Still, among the difference formulas, there are others which, while not as straightforwardly related to Taylor's polynomial as Newton's is, can make the evaluation, and the accuracy prediction based on inspection of the differences, easier: greatest accuracy is achieved when the interpolated point is in the middle of the data points, and greatest ease of evaluation and accuracy prediction is achieved when the interpolated point is near the formula's "x zero" data point. So Stirling's formula puts the x-zero data point near the middle of the data points instead of at one end (which is where Newton's puts it); this helps especially when the interpolated point is near a data point. When the interpolated point is nearer the middle between data points, Bessel's formula achieves those goals slightly better.

General case
For the special case of xi = i, there is a closely related set of polynomials, also called the Newton polynomials, that are simply the binomial coefficients for general argument. That is, one also has the Newton polynomials pn(z) given by

p_n(z) = (z choose n) = z(z − 1) ⋯ (z − n + 1) / n!

In this form, the Newton polynomials generate the Newton series. These are in turn a special case of the general difference polynomials which allow the representation of analytic functions through generalized difference equations.

Main idea
Solving an interpolation problem leads to a problem in linear algebra where we have to solve a system of linear equations. Using a standard monomial basis for our interpolation polynomial we get the very complicated Vandermonde matrix. By choosing another basis, the Newton basis, we get a system of linear equations with a much simpler lower triangular matrix which can be solved faster.


For k + 1 data points we construct the Newton basis as

Using these polynomials as a basis for Π_k, the space of polynomials of degree at most k, we have to solve

\sum_{j=0}^{k} a_j n_j(x_i) = y_i, \qquad i = 0, \ldots, k
to solve the polynomial interpolation problem. This system of equations can be solved recursively by solving

Taylor polynomial The limit of the Newton polynomial if all nodes coincide is a Taylor polynomial, because the divided differences become derivatives.

Application As can be seen from the definition of the divided differences new data points can be added to the data set to create a new interpolation polynomial without recalculating the old coefficients. And when a data point changes we usually do not have to recalculate all coefficients. Furthermore if the xi are distributed equidistantly the calculation of the divided differences becomes significantly easier. Therefore the Newton form of the interpolation polynomial is usually preferred over the Lagrange form for practical purposes. Related concepts Runge's phenomenon shows that for high values of n, the interpolation polynomial may oscillate wildly between the data points. This problem is commonly resolved by the use of spline interpolation. Here, the interpolant is not a polynomial but a spline: a chain of several polynomials of a lower degree.


Using harmonic functions to interpolate a periodic function is usually done using Fourier series, for example in the discrete Fourier transform. This can be seen as a form of polynomial interpolation with harmonic base functions, see trigonometric interpolation and trigonometric polynomial. Hermite interpolation problems are those where not only the values of the polynomial p at the nodes are given, but also all derivatives up to a given order. This turns out to be equivalent to a system of simultaneous polynomial congruences, and may be solved by means of the Chinese remainder theorem for polynomials. Birkhoff interpolation is a further generalization where only derivatives of some orders are prescribed, not necessarily all orders from 0 to k. Collocation methods for the solution of differential and integral equations are based on polynomial interpolation. The technique of rational function modeling is a generalization that considers ratios of polynomial functions.

Spline interpolation
In the mathematical field of numerical analysis, spline interpolation is a form of interpolation where the interpolant is a special type of piecewise polynomial called a spline. Spline interpolation is preferred over polynomial interpolation because the interpolation error can be made small even when using low-degree polynomials for the spline. Thus, spline interpolation avoids the problem of Runge's phenomenon which occurs when using high-degree polynomials. Recall that linear interpolation uses a linear function on each of the intervals [x_k, x_{k+1}]. Spline interpolation uses low-degree polynomials in each of the intervals, and chooses the polynomial pieces such that they fit smoothly together. The resulting function is called a spline. For instance, the natural cubic spline is piecewise cubic and twice continuously differentiable. Furthermore, its second derivative is zero at the end points. Like polynomial interpolation, spline interpolation incurs a smaller error than linear interpolation and the interpolant is smoother. However, the interpolant is easier to evaluate than the high-degree polynomials used in polynomial interpolation. It also does not suffer from Runge's phenomenon. Spline interpolant Using polynomial interpolation, the polynomial of degree n which interpolates the data set is uniquely defined by the data points. The spline of degree n which interpolates the same data set is not uniquely defined, and we have to fill in n − 1 additional degrees of freedom to construct a unique spline interpolant.


Linear spline interpolation Linear spline interpolation is the simplest form of spline interpolation and is equivalent to linear interpolation. The data points are graphically connected by straight lines. The resultant spline would be a polygon if the end point were connected to the beginning point. Algebraically, each S_i is a linear function constructed as

S_i(x) = y_i + \frac{y_{i+1} - y_i}{x_{i+1} - x_i} (x - x_i)
The spline must be continuous at each data point, that is

S_{i-1}(x_i) = S_i(x_i), \qquad i = 1, \ldots, n - 1
This is the case as we can easily see

Quadratic spline interpolation The quadratic spline can be constructed as

Q_i(x) = y_i + z_i (x - x_i) + \frac{z_{i+1} - z_i}{2(x_{i+1} - x_i)} (x - x_i)^2
The coefficients can be found by choosing a z_0 and then using the recurrence relation:

z_{i+1} = -z_i + 2\,\frac{y_{i+1} - y_i}{x_{i+1} - x_i}
The coefficients z above are basically a running approximation of the derivative. Since only two points are used to calculate the next iteration's curve (instead of three), this method is susceptible to severe oscillation effects when a signal change is quickly followed by a steady signal. Without some sort of damping, these oscillation effects make the above method a poor choice. A more stable alternative to this method is a cubic spline passing through the middle points of the original data instead. To calculate S_i(x), first select a j such that x_{j-1} < x < x_{j+1} and such that |x_{j-1} - x| + |x_{j+1} - x| is minimized. Then, find a, b, c and d of S(x) = ax^3 + bx^2 + cx + d such that:


In effect, this yields a set of curves that are continuous in the first derivative, highly stable (i.e. not subject to oscillation effects) and do not require solving a large matrix system. Cubic spline interpolation For a data set {x_i} of n + 1 points, we can construct a cubic spline with n piecewise cubic polynomials between the data points. If

represents the spline function interpolating the function f, we require:


- the interpolating property, S(x_i) = f(x_i);
- the splines to join up, S_{i-1}(x_i) = S_i(x_i), i = 1, ..., n − 1;
- twice continuously differentiable, S'_{i-1}(x_i) = S'_i(x_i) and S''_{i-1}(x_i) = S''_i(x_i), i = 1, ..., n − 1.

For the n cubic polynomials comprising S, this means that to determine these polynomials we need 4n conditions (since each polynomial of degree three has four coefficients to choose). However, the interpolating property gives us n + 1 conditions, and the conditions on the interior data points give us n + 1 − 2 = n − 1 conditions each, summing to 4n − 2 conditions. We require two other conditions, and these can be imposed upon the problem for different reasons. One such choice results in the so-called clamped cubic spline, with

S'(x_0) = u, \qquad S'(x_n) = v

for given values u and v. Alternatively, we can set

S''(x_0) = 0, \qquad S''(x_n) = 0

resulting in the natural cubic spline. The natural cubic spline is approximately the same curve as created by the spline device. Amongst all twice continuously differentiable functions, clamped and natural cubic splines yield the least oscillation about the function f which is interpolated. Another choice gives the periodic cubic spline if

Another choice gives the complete cubic spline if

Interpolation using natural cubic spline It can be defined as

and . The coefficients can be found by solving this system of equations:

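One common way to carry out this construction numerically is to solve a tridiagonal system for the interior second derivatives M_i of the natural spline (with M_0 = M_n = 0) and then evaluate each cubic piece from them. The sketch below is a hedged illustration: the variable names and sample data are invented, and a small dense solver is used instead of a dedicated tridiagonal routine.

```python
import numpy as np

def natural_cubic_spline(x, y):
    """Return a function S(t) evaluating the natural cubic spline through (x_i, y_i)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x) - 1
    h = np.diff(x)                       # interval lengths h_i = x_{i+1} - x_i
    # Tridiagonal system for the interior second derivatives M_1 .. M_{n-1}.
    A = np.zeros((n - 1, n - 1))
    rhs = np.zeros(n - 1)
    for i in range(1, n):
        if i > 1:
            A[i - 1, i - 2] = h[i - 1]
        A[i - 1, i - 1] = 2.0 * (h[i - 1] + h[i])
        if i < n - 1:
            A[i - 1, i] = h[i]
        rhs[i - 1] = 6.0 * ((y[i + 1] - y[i]) / h[i] - (y[i] - y[i - 1]) / h[i - 1])
    M = np.zeros(n + 1)                  # natural end conditions: M_0 = M_n = 0
    M[1:n] = np.linalg.solve(A, rhs)

    def S(t):
        i = np.clip(np.searchsorted(x, t) - 1, 0, n - 1)
        dxr = x[i + 1] - t               # distance to the right knot
        dxl = t - x[i]                   # distance to the left knot
        return (M[i] * dxr**3 + M[i + 1] * dxl**3) / (6.0 * h[i]) \
             + (y[i] / h[i] - M[i] * h[i] / 6.0) * dxr \
             + (y[i + 1] / h[i] - M[i + 1] * h[i] / 6.0) * dxl
    return S

S = natural_cubic_spline([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 0.0, 1.0])
print(S(1.5))   # smooth value between the middle knots
```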
Trigonometric interpolation
In mathematics, trigonometric interpolation is interpolation with trigonometric polynomials. Interpolation is the process of finding a function which goes through some given data points. For trigonometric interpolation, this function has to be a trigonometric polynomial, that is, a sum of sines and cosines of given periods. This form is especially suited for interpolation of periodic functions.


An important special case is when the given data points are equally spaced, in which case the solution is given by the discrete Fourier transform.

Formulation of the interpolation problem A trigonometric polynomial of degree n has the form

p(x) = a_0 + \sum_{k=1}^{n} \left( a_k \cos(kx) + b_k \sin(kx) \right)

This expression contains 2n + 1 coefficients, a_0, a_1, ..., a_n, b_1, ..., b_n, and we wish to compute those coefficients so that the function passes through N points:

Since the trigonometric polynomial is periodic with period 2π, it makes sense to assume that

0 \le x_0 < x_1 < \cdots < x_{N-1} < 2\pi
(Note that we do not in general require these points to be equally spaced.) The interpolation problem is now to find coefficients such that the trigonometric polynomial p satisfies the interpolation conditions. Solution of the problem Under the above conditions, there exists a solution to the problem for any given set of data points {x_k, p(x_k)} as long as N, the number of data points, is not larger than the number of coefficients in the polynomial, i.e., N ≤ 2n + 1 (a solution may or may not exist if N > 2n + 1, depending upon the particular set of data points). Moreover, the interpolating polynomial is unique if and only if the number of adjustable coefficients is equal to the number of data points, i.e., N = 2n + 1. In the remainder of this article, we will assume this condition to hold true. The solution can be written in a form similar to the Lagrange formula for polynomial interpolation:

This can be shown to be a trigonometric polynomial by employing the multiple-angle formula and other identities for sin(x − x_m). Formulation in the complex plane The problem becomes more natural if we formulate it in the complex plane. We can rewrite the formula for a trigonometric polynomial as


where i is the imaginary unit. If we set z = e^{ix}, then this becomes

This reduces the problem of trigonometric interpolation to that of polynomial interpolation on the unit circle. Existence and uniqueness for trigonometric interpolation now follows immediately from the corresponding results for polynomial interpolation. Equidistant nodes and the discrete Fourier transform The special case in which the points xk are equally spaced is especially important. In this case, we have

The transformation that maps the data points y_k to the coefficients a_m, b_m is known as the discrete Fourier transform (DFT) of order 2n + 1. (Because of the way the problem was formulated above, we have restricted ourselves to odd numbers of points. This is not strictly necessary; for even numbers of points, one includes another cosine term corresponding to the Nyquist frequency.) The case of the cosine-only interpolation for equally spaced points, corresponding to a trigonometric interpolation when the points have even symmetry, was treated by Alexis Clairaut in 1754. In this case the solution is equivalent to a discrete cosine transform. The sine-only expansion for equally spaced points, corresponding to odd symmetry, was solved by Joseph Louis Lagrange in 1762, for which the solution is a discrete sine transform. The full cosine and sine interpolating polynomial, which gives rise to the DFT, was solved by Carl Friedrich Gauss in unpublished work around 1805, at which point he also derived a fast Fourier transform algorithm to evaluate it rapidly. Clairaut, Lagrange, and Gauss were all concerned with studying the problem of inferring the orbit of planets, asteroids, etc., from a finite set of observation points; since the orbits are periodic, a trigonometric interpolation was a natural choice.
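For an odd number N = 2n + 1 of equally spaced samples at x_k = 2πk/N, the interpolant can be evaluated directly from the DFT of the samples. The following is a minimal NumPy sketch; the sample signal is invented for illustration.

```python
import numpy as np

def trig_interpolant(y):
    """Trigonometric interpolation of y_k sampled at x_k = 2*pi*k/N, with N odd."""
    y = np.asarray(y, dtype=float)
    N = len(y)
    assert N % 2 == 1, "this sketch assumes an odd number of samples"
    c = np.fft.fft(y)                    # c[m] = sum_k y_k * exp(-2*pi*1j*m*k/N)
    n = (N - 1) // 2

    def p(x):
        # p(x) = (1/N) * sum_{m=-n..n} c_m exp(1j*m*x), with c_{-m} stored at c[N-m]
        total = c[0] * np.ones_like(np.asarray(x, dtype=complex))
        for m in range(1, n + 1):
            total += c[m] * np.exp(1j * m * x) + c[N - m] * np.exp(-1j * m * x)
        return (total / N).real
    return p

# Interpolate a periodic signal sampled at 5 equidistant nodes on [0, 2*pi).
N = 5
xk = 2 * np.pi * np.arange(N) / N
yk = np.sin(xk) + 0.5 * np.cos(2 * xk)
p = trig_interpolant(yk)
print(np.allclose(p(xk), yk))            # True: the interpolant reproduces the data
```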

Extrapolation
In mathematics, extrapolation is the process of constructing new data points outside a discrete set of known data points. It is similar to the process of interpolation, which constructs new points between known points, but the results of extrapolations are often less meaningful, and are subject to greater uncertainty. It may also mean extension of a method, assuming similar methods will be applicable. Extrapolation may also apply to human experience to project, extend, or expand known experience into an area not known or previously experienced so as
to arrive at a (usually conjectural) knowledge of the unknown (e.g. a driver extrapolates road conditions beyond his sight while driving).

Example illustration of the extrapolation problem, consisting of assigning a meaningful value at the blue box, at x = 7, given the red data points. Extrapolation methods A sound choice of which extrapolation method to apply relies on a prior knowledge of the process that created the existing data points. Crucial questions are, for example, whether the data can be assumed to be continuous, smooth, possibly periodic, etc. Linear extrapolation Linear extrapolation means creating a tangent line at the end of the known data and extending it beyond that limit. Linear extrapolation will only provide good results when used to extend the graph of an approximately linear function or not too far beyond the known data. If the two data points nearest the point x* to be extrapolated are (x_{k-1}, y_{k-1}) and (x_k, y_k), linear extrapolation gives the function (identical to linear interpolation if x_{k-1} < x* < x_k)

y(x^*) = y_{k-1} + \frac{x^* - x_{k-1}}{x_k - x_{k-1}} (y_k - y_{k-1})
It is possible to include more than two points, averaging the slope of the linear interpolant by regression-like techniques over the data points chosen to be included. This is similar to linear prediction. Polynomial extrapolation A polynomial curve can be created through the entire known data or just near the end. The resulting curve can then be extended beyond the end of the known data. Polynomial extrapolation is typically done by means of Lagrange interpolation or using Newton's method of finite differences to create a Newton series that fits the data. The resulting polynomial may be used to extrapolate the data.
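A minimal sketch of linear extrapolation from the two points nearest the target (the numbers below are invented for illustration, loosely echoing the x = 7 example above):

```python
def linear_extrapolate(x1, y1, x2, y2, x_star):
    """Extend the line through (x1, y1) and (x2, y2) to x_star."""
    return y1 + (x_star - x1) * (y2 - y1) / (x2 - x1)

# Known points (5, 2.0) and (6, 2.3); estimate the value at x = 7.
print(linear_extrapolate(5.0, 2.0, 6.0, 2.3, 7.0))   # 2.6
```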


High-order polynomial extrapolation must be used with due care. For the example data set and problem in the figure above, anything above order 1 (linear extrapolation) will possibly yield unusable values; an error estimate of the extrapolated value will grow with the degree of the polynomial extrapolation. This is related to Runge's phenomenon. Quality of extrapolation Typically, the quality of a particular method of extrapolation is limited by the assumptions about the function made by the method. If the method assumes the data are smooth, then a non-smooth function will be poorly extrapolated. Even for proper assumptions about the function, the extrapolation can diverge strongly from the function. The classic example is truncated power series representations of sin(x) and related trigonometric functions. For instance, taking only data from near x = 0, we may estimate that the function behaves as sin(x) ~ x. In the neighborhood of x = 0, this is an excellent estimate. Away from x = 0, however, the extrapolation moves arbitrarily away from the x-axis while sin(x) remains in the interval [−1, 1]; that is, the error increases without bound. Taking more terms in the power series of sin(x) around x = 0 will produce better agreement over a larger interval near x = 0, but will produce extrapolations that eventually diverge away from the x-axis even faster than the linear approximation. This divergence is a specific property of extrapolation methods and is only circumvented when the functional forms assumed by the extrapolation method (inadvertently or intentionally due to additional information) accurately represent the nature of the function being extrapolated. For particular problems, this additional information may be available, but in the general case, it is impossible to satisfy all possible function behaviors with a workably small set of potential behaviors. Richardson extrapolation In numerical analysis, Richardson extrapolation is a sequence acceleration method, used to improve the rate of convergence of a sequence. It is named after Lewis Fry Richardson, who introduced the technique in the early 20th century. In the words of Birkhoff and Rota, "... its usefulness for practical computations can hardly be overestimated." Practical applications of Richardson extrapolation include Romberg integration, which applies Richardson extrapolation to the trapezium rule, and the Bulirsch–Stoer algorithm for solving ordinary differential equations. Example of Richardson extrapolation Suppose that A(h) is an estimation of order h^n for A, i.e. A − A(h) = a_n h^n + O(h^{n+1}).

Then

R(h) := \frac{2^n A(h/2) - A(h)}{2^n - 1}

is called the Richardson extrapolate of A(h); it is an estimate of order h^m for A, with m > n. More generally, the factor 2 can be replaced by any other factor, as shown below. Very often, it is much easier to obtain a given precision by using R(h) rather than A(h') with a much smaller h', which can cause problems due to limited precision (rounding errors) and/or due to the increasing number of calculations needed (see examples below). General formula Let A(h) be an approximation of A that depends on a positive step size h with an error formula of the form
A(h) = A + a_0 h^{k_0} + a_1 h^{k_1} + a_2 h^{k_2} + \cdots

where the a_i are unknown constants and the k_i are known constants such that h^{k_i} > h^{k_{i+1}}. The exact value sought can be given by

which can be simplified with Big O notation to be

Using the step sizes h and h / t for some t, the two formulas for A are:

Multiplying the second equation by t^{k_0} and subtracting the first equation gives

which can be solved for A to give

A = \frac{t^{k_0} A(h/t) - A(h)}{t^{k_0} - 1} + O(h^{k_1})

By this process, we have achieved a better approximation of A by subtracting the largest term in the error, which was O(h^{k_0}). This process can be repeated to remove more error terms to get even better approximations. A general recurrence relation can be defined for the approximations by

A_{i+1}(h) = \frac{t^{k_i} A_i(h/t) - A_i(h)}{t^{k_i} - 1}
such that

with A_0(h) = A(h). The Richardson extrapolation can be considered as a linear sequence transformation. Example Using Taylor's theorem,

the derivative of f(x) is given by

If the initial approximations of the derivative are chosen to be

then k_i = i + 1. For t = 2, the first formula extrapolated for A would be

For the new approximation


we can extrapolate again to obtain

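The following sketch applies this scheme to the forward-difference estimate of a derivative, for which the error exponents are k_i = i + 1 as in the example above. The test function, step size, and number of levels are arbitrary illustrative choices.

```python
import numpy as np

def richardson_table(A, h, levels, t=2.0, k0=1):
    """Repeated Richardson extrapolation of A(h), assuming error terms h^k0, h^(k0+1), ..."""
    R = [[A(h / t**i) for i in range(levels)]]          # row 0: raw estimates at h, h/t, ...
    for j in range(1, levels):
        k = k0 + j - 1                                   # exponent of the term removed at this level
        prev = R[-1]
        R.append([(t**k * prev[i + 1] - prev[i]) / (t**k - 1)
                  for i in range(len(prev) - 1)])
    return R

# Forward-difference estimate of d/dx sin(x) at x = 1 (error terms h, h^2, ...).
f, x0 = np.sin, 1.0
A = lambda h: (f(x0 + h) - f(x0)) / h
table = richardson_table(A, h=0.1, levels=4)
print(table[0][0], table[-1][0], np.cos(x0))   # raw vs. extrapolated vs. exact value
```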


Curve fitting
Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points, possibly subject to constraints. Curve fitting can involve either interpolation, where an exact fit to the data is required, or smoothing, in which a "smooth" function is constructed that approximately fits the data. A related topic is regression analysis, which focuses more on questions of statistical inference such as how much uncertainty is present in a curve that is fit to data observed with random errors. Fitted curves can be used as an aid for data visualization, to infer values of a function where no data are available, and to summarize the relationships among two or more variables. Extrapolation refers to the use of a fitted curve beyond the range of the observed data, and is subject to a greater degree of uncertainty since it may reflect the method used to construct the curve as much as it reflects the observed data.

The result of fitting a set of data points with a quadratic function.

Different types of curve fitting


Fitting lines and polynomial curves to data points Let's start with a first degree polynomial equation:

y = ax + b
This is a line with slope a. We know that a line will connect any two points. So, a first degree polynomial equation is an exact fit through any two points. If we increase the order of the equation to a second degree polynomial, we get:

y = ax^2 + bx + c
This will exactly fit a simple curve to three points. If we increase the order of the equation to a third degree polynomial, we get:

y = ax^3 + bx^2 + cx + d
This will exactly fit four points. A more general statement would be to say it will exactly fit four constraints. Each constraint can be a point, angle, or curvature (which is the reciprocal of the radius of an osculating circle). Angle and curvature constraints are most often added to the ends of a curve, and in such cases are called end conditions. Identical end conditions are frequently used to ensure a smooth transition between polynomial curves contained within a single spline. Higher-order constraints, such as "the change in the rate of curvature", could also be added. This, for example, would be useful in highway cloverleaf design to understand the forces applied to a car, as it follows the cloverleaf, and to set reasonable speed limits, accordingly. Bearing this in mind, the first degree polynomial equation could also be an exact fit for a single point and an angle while the third degree polynomial equation could also be an exact fit for two points, an angle constraint, and a curvature constraint. Many other combinations of constraints are possible for these and for higher order polynomial equations. If we have more than n + 1 constraints (n being the degree of the polynomial), we can still run the polynomial curve through those constraints. An exact fit to all the constraints is not certain (but might happen, for example, in the case of a first degree polynomial exactly fitting three collinear points). In general, however, some method is then needed to evaluate each approximation. The least squares method is one way to compare the deviations. Now, you might wonder why we would ever want to get an approximate fit when we could just increase the degree of the polynomial equation and get an exact match. There are several reasons:

- Even if an exact match exists, it does not necessarily follow that we can find it. Depending on the algorithm used, we may encounter a divergent case, where the exact fit cannot be calculated, or it might take too much computer time to find the solution. Either way, you might end up having to accept an approximate solution.
- We may actually prefer the effect of averaging out questionable data points in a sample, rather than distorting the curve to fit them exactly.
- Runge's phenomenon: high-order polynomials can be highly oscillatory. If we run a curve through two points A and B, we would expect the curve to run somewhat near the midpoint of A and B as well. This may not happen with high-order polynomial curves; they may even have values that are very large in positive or negative magnitude. With low-order polynomials, the curve is more likely to fall near the midpoint (it is even guaranteed to run exactly through the midpoint on a first degree polynomial).
- Low-order polynomials tend to be smooth and high-order polynomial curves tend to be "lumpy". To define this more precisely, the maximum number of ogee/inflection points possible in a polynomial curve is n-2, where n is the order of the polynomial equation. An inflection point is a location on the curve where it switches from a positive radius to negative. We can also say this is where it transitions from "holding water" to "shedding water". Note that it is only "possible" that high-order polynomials will be lumpy; they could also be smooth, but there is no guarantee of this, unlike with low-order polynomial curves. A fifteenth degree polynomial could have, at most, thirteen inflection points, but could also have twelve, eleven, or any number down to zero.

Now that we have talked about using a degree too low for an exact fit, let's also discuss what happens if the degree of the polynomial curve is higher than needed for an exact fit. This is bad for all the reasons listed previously for high-order polynomials, but it also leads to a case where there are an infinite number of solutions. For example, a first degree polynomial (a line) constrained by only a single point, instead of the usual two, would give us an infinite number of solutions. This brings up the problem of how to compare and choose just one solution, which can be a problem for software and for humans as well. For this reason, it is usually best to choose as low a degree as possible for an exact match on all constraints, and perhaps an even lower degree if an approximate fit is acceptable. Algebraic fit versus geometric fit for curves For algebraic analysis of data, "fitting" usually means trying to find the curve that minimizes the vertical (i.e. y-axis) displacement of a point from the curve (e.g. ordinary least squares). However, for graphical and image applications, geometric fitting seeks to provide the best visual fit, which usually means trying to minimize the orthogonal distance to the curve (e.g. total least squares), or to otherwise include both axes of displacement of a point from the curve. Geometric fits are not popular because they usually require non-linear and/or iterative calculations, although they have the advantage of a more aesthetic and geometrically accurate result.
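To illustrate the trade-off discussed above between an approximate low-degree fit and an exact high-degree fit, the following sketch (with invented data) fits both and compares them; NumPy's polynomial fitting is used here purely for convenience.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 0.9, 2.1, 2.9, 4.2, 4.8])   # roughly linear data with noise

# Degree 1: an approximate least squares fit that averages out the noise.
p1 = np.polynomial.Polynomial.fit(x, y, deg=1)
# Degree 5: interpolates all six points (up to rounding), but may oscillate between them.
p5 = np.polynomial.Polynomial.fit(x, y, deg=5)

print(p1(2.5), p5(2.5))   # both are reasonable inside the data range
print(p1(6.0), p5(6.0))   # extrapolation: the high-degree fit is far less trustworthy
```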

Least squares
The method of least squares is applied to approximate solutions of overdetermined systems, i.e. systems of equations in which there are more equations than unknowns. Least squares is often applied in statistical contexts, particularly regression analysis. Least squares may be interpreted as a method of fitting data. The best fit, between modeled data and observed data, in its least-squares sense, is an instance of the model for which the sum of squared residuals has its least value, where a residual is the difference between an observed value and the value provided by the model. The method was first described by Carl Friedrich Gauss around 1794. Least squares corresponds to the maximum likelihood criterion if the experimental errors have a normal distribution and can also be derived as a method of moments estimator. Regression analysis is available in most statistical software packages. The discussion is mostly presented in terms of linear functions but the use of least-squares is valid and practical for more general families of functions. For example, the Fourier series approximation of degree n is optimal in the least-squares sense, amongst all approximations in terms of trigonometric polynomials of degree n. Also, by iteratively applying local quadratic approximation to the likelihood (through the Fisher information), the least-squares method may be used to fit a generalized linear model.


The method itself Carl Friedrich Gauss is credited with developing the fundamentals of the basis for least-squares analysis in 1795 at the age of eighteen. Legendre was the first to publish the method, however. An early demonstration of the strength of Gauss's method came when it was used to predict the future location of the newly discovered asteroid Ceres. On January 1, 1801, the Italian astronomer Giuseppe Piazzi discovered Ceres and was able to track its path for 40 days before it was lost in the glare of the sun. Based on this data, it was desired to determine the location of Ceres after it emerged from behind the sun without solving Kepler's complicated nonlinear equations of planetary motion. The only predictions that successfully allowed Hungarian astronomer Franz Xaver von Zach to relocate Ceres were those performed by the 24-year-old Gauss using least-squares analysis. Gauss did not publish the method until 1809, when it appeared in volume two of his work on celestial mechanics, Theoria Motus Corporum Coelestium in sectionibus conicis solem ambientium. In 1829, Gauss was able to state that the least-squares approach to regression analysis is optimal in the sense that in a linear model where the errors have a mean of zero, are uncorrelated, and have equal variances, the best linear unbiased estimator of the coefficients is the least-squares estimator. This result is known as the Gauss–Markov theorem. The idea of least-squares analysis was also independently formulated by the Frenchman Adrien-Marie Legendre in 1805 and the American Robert Adrain in 1808. Problem statement The objective consists of adjusting the parameters of a model function to best fit a data set. A simple data set consists of n points (data pairs) (x_i, y_i), i = 1, ..., n, where x_i is an independent variable and y_i is a dependent variable whose value is found by observation. The model function has the form f(x, β), where the m adjustable parameters are held in the vector β. The parameter values for which the model "best" fits the data need to be found. The least squares method finds its optimum when the sum, S, of squared residuals

S = \sum_{i=1}^{n} r_i^2
is a minimum. A residual is defined as the difference between the value of the dependent variable and the predicted value from the estimated model,

r_i = y_i - f(x_i, \boldsymbol{\beta})
An example of a model is that of the straight line. Denoting the intercept as β_0 and the slope as β_1, the model function is given by

f(x, \boldsymbol{\beta}) = \beta_0 + \beta_1 x
A data point may consist of more than one independent variable. For example, when fitting a plane to a set of height measurements, the plane is a function of two independent variables,

x and z, say. In the most general case there may be one or more independent variables and one or more dependent variables at each data point. Solving the least squares problem Least squares problems fall into two categories, linear and non-linear. The linear least squares problem has a closed form solution, but the non-linear problem does not and is usually solved by iterative refinement; at each iteration the system is approximated by a linear one, thus the core calculation is similar in both cases. The minimum of the sum of squares is found by setting the gradient to zero. Since the model contains m parameters there are m gradient equations.
\frac{\partial S}{\partial \beta_j} = 2 \sum_i r_i \frac{\partial r_i}{\partial \beta_j} = 0, \qquad j = 1, \ldots, m
and since
r_i = y_i - f(x_i, \boldsymbol{\beta})
the gradient equations become
-2 \sum_i r_i \frac{\partial f(x_i, \boldsymbol{\beta})}{\partial \beta_j} = 0, \qquad j = 1, \ldots, m
The gradient equations apply to all least squares problems. Each particular problem requires particular expressions for the model and its partial derivatives. Linear least squares A regression model is a linear one when the model comprises a linear combination of the parameters, i.e.
f(x, \boldsymbol{\beta}) = \sum_{j=1}^{m} \beta_j \varphi_j(x)

where the functions, φ_j, are functions of x_i. Letting

X_{ij} = \varphi_j(x_i)

we can then see that in that case the least squares estimate (or estimator, in the context of a random sample) is given by

\hat{\boldsymbol{\beta}} = (X^{\mathsf T} X)^{-1} X^{\mathsf T} y
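A small numerical sketch of this closed-form estimate for a straight-line model; the data are invented, and np.linalg.lstsq is shown as a numerically safer alternative to forming X^T X explicitly.

```python
import numpy as np

# Observations (x_i, y_i), invented for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Design matrix for the straight-line model f(x, beta) = beta_0 + beta_1 * x,
# i.e. X_ij = phi_j(x_i) with phi_0(x) = 1 and phi_1(x) = x.
X = np.column_stack([np.ones_like(x), x])

# Closed-form solution of the normal equations: beta_hat = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                             # [intercept, slope] of the fitted line

# The same problem via a QR/SVD-based solver, which avoids squaring the condition number.
print(np.linalg.lstsq(X, y, rcond=None)[0])
```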

Non-linear least squares There is no closed-form solution to a non-linear least squares problem. Instead, numerical algorithms are used to find the value of the parameters which minimize the objective. Most algorithms involve choosing initial values for the parameters. Then, the parameters are refined iteratively, that is, the values are obtained by successive approximation.
\beta_j^{k+1} = \beta_j^{k} + \Delta\beta_j

where k is an iteration number and the vector of increments, Δβ, is known as the shift vector. In some commonly used algorithms, at each iteration the model may be linearized by approximation to a first-order Taylor series expansion about β^k

The Jacobian, J, is a function of constants, the independent variable and the parameters, so it changes from one iteration to the next. The residuals are given by

and the gradient equations become

which, on rearrangement, become m simultaneous linear equations, the normal equations.

The normal equations are written in matrix notation as
(\mathbf{J}^{\mathsf T} \mathbf{J})\, \Delta\boldsymbol{\beta} = \mathbf{J}^{\mathsf T}\, \Delta\mathbf{y}
These are the defining equations of the Gauss–Newton algorithm. Differences between linear and non-linear least squares

- The model function, f, in LLSQ (linear least squares) is a linear combination of parameters of the form f = X_{i1} β_1 + X_{i2} β_2 + ⋯. The model may represent a straight line, a parabola or any other polynomial-type function. In NLLSQ (non-linear least squares) the parameters appear as functions, such as β^2, e^{βx} and so forth. If the derivatives are either constant or depend only on the values of the independent variable, the model is linear in the parameters. Otherwise the model is non-linear.
- Many solution algorithms for NLLSQ require initial values for the parameters; LLSQ does not.
- Many solution algorithms for NLLSQ require that the Jacobian be calculated. Analytical expressions for the partial derivatives can be complicated; if analytical expressions are impossible to obtain, the partial derivatives must be calculated by numerical approximation.
- In NLLSQ non-convergence (failure of the algorithm to find a minimum) is a common phenomenon, whereas the LLSQ objective is globally convex, so non-convergence is not an issue.
- NLLSQ is usually an iterative process. The iterative process has to be terminated when a convergence criterion is satisfied. LLSQ solutions can be computed using direct methods, although problems with large numbers of parameters are typically solved with iterative methods, such as the Gauss–Seidel method.
- In LLSQ the solution is unique, but in NLLSQ there may be multiple minima in the sum of squares.
- Under the condition that the errors are uncorrelated with the predictor variables, LLSQ yields unbiased estimates, but even under that condition NLLSQ estimates are generally biased.
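As a hedged sketch of the iterative procedure described above (linearize, solve the normal equations for the shift vector, update), here is a tiny Gauss–Newton loop for an invented exponential model; a production implementation would add damping and a convergence test.

```python
import numpy as np

def gauss_newton(f, jac, x, y, beta0, iterations=15):
    """Minimal Gauss-Newton iteration: at each step solve (J^T J) delta = J^T r."""
    beta = np.array(beta0, dtype=float)
    for _ in range(iterations):
        r = y - f(x, beta)                  # residuals at the current parameters
        J = jac(x, beta)                    # Jacobian of f with respect to beta
        delta = np.linalg.solve(J.T @ J, J.T @ r)
        beta += delta                       # shift vector update
    return beta

# Model y = b0 * exp(b1 * x); noiseless data generated from b0 = 2, b1 = 0.5.
f = lambda x, b: b[0] * np.exp(b[1] * x)
jac = lambda x, b: np.column_stack([np.exp(b[1] * x),
                                    b[0] * x * np.exp(b[1] * x)])
x = np.linspace(0.0, 2.0, 8)
y = 2.0 * np.exp(0.5 * x)
print(gauss_newton(f, jac, x, y, beta0=[1.0, 0.3]))   # approaches [2.0, 0.5]
```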

These differences must be considered whenever the solution to a non-linear least squares problem is being sought. Least squares, regression analysis and statistics The methods of least squares and regression analysis are conceptually different. However, the method of least squares is often used to generate estimators and other statistics in regression analysis. Consider a simple example drawn from physics. A spring should obey Hooke's law which states that the extension of a spring is proportional to the force, F, applied to it.
x = kF
constitutes the model, where F is the independent variable. To estimate the force constant, k, a series of n measurements with different forces will produce a set of data (F_i, y_i), i = 1, ..., n, where y_i is a measured spring extension. Each experimental observation will contain some error. If we denote this error ε_i, we may specify an empirical model for our observations,

y_i = k F_i + \varepsilon_i
There are many methods we might use to estimate the unknown parameter k. Noting that the n equations in the m variables in our data comprise an overdetermined system with one unknown and n equations, we may choose to estimate k using least squares. The sum of squares to be minimized is
S = \sum_{i=1}^{n} \left( y_i - k F_i \right)^2

The least squares estimate of the force constant, k, is given by
\hat{k} = \frac{\sum_{i} F_i y_i}{\sum_{i} F_i^2}
Here it is assumed that application of the force causes the spring to expand and, having derived the force constant by least squares fitting, the extension can be predicted from Hooke's law. In regression analysis the researcher specifies an empirical model. For example, a very common model is the straight line model which is used to test if there is a linear relationship between dependent and independent variable. If a linear relationship is found to exist, the variables are said to be correlated. However, correlation does not prove causation, as both variables may be correlated with other, hidden, variables, or the dependent variable may "reverse" cause the independent variables, or the variables may be otherwise spuriously correlated. For example, suppose there is a correlation between deaths by drowning and the volume of ice cream sales at a particular beach. Yet, both the number of people going swimming and the volume of ice cream sales increase as the weather gets hotter, and presumably the number of deaths by drowning is correlated with the number of people going swimming. Perhaps an increase in swimmers causes both the other variables to increase. In order to make statistical tests on the results it is necessary to make assumptions about the nature of the experimental errors. A common (but not necessary) assumption is that the errors belong to a Normal distribution. The central limit theorem supports the idea that this is a good assumption in many cases.
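A quick numerical check of the spring example, with invented measurements:

```python
import numpy as np

F = np.array([1.0, 2.0, 3.0, 4.0])          # applied forces
y = np.array([0.52, 0.98, 1.55, 2.01])      # measured extensions (invented data)

# One-parameter least squares estimate: k_hat = sum(F_i * y_i) / sum(F_i^2).
k_hat = np.sum(F * y) / np.sum(F * F)
print(k_hat)                                 # close to 0.5 for these data
```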

The Gauss–Markov theorem. In a linear model in which the errors have expectation zero conditional on the independent variables, are uncorrelated and have equal variances, the best linear unbiased estimator of any linear combination of the observations is its least-squares estimator. "Best" means that the least squares estimators of the parameters have minimum variance. The assumption of equal variance is valid when the errors all belong to the same distribution. In a linear model, if the errors belong to a normal distribution the least squares estimators are also the maximum likelihood estimators.

However, if the errors are not normally distributed, a central limit theorem often nonetheless implies that the parameter estimates will be approximately normally distributed so long as the sample is reasonably large. For this reason, given the important property that the error is mean independent of the independent variables, the distribution of the error term is not an important issue in regression analysis. Specifically, it is not typically important whether the error term follows a normal distribution. In a least squares calculation with unit weights, or in linear regression, the variance on the jth parameter, denoted , is usually estimated with


where the true residual variance σ² is replaced by an estimate based on the minimised value of the sum of squares objective function S. Confidence limits can be found if the probability distribution of the parameters is known, or if an asymptotic approximation is made or assumed. Likewise, statistical tests on the residuals can be made if the probability distribution of the residuals is known or assumed. The probability distribution of any linear combination of the dependent variables can be derived if the probability distribution of experimental errors is known or assumed. Inference is particularly straightforward if the errors are assumed to follow a normal distribution, which implies that the parameter estimates and residuals will also be normally distributed conditional on the values of the independent variables. Weighted least squares The expressions given above are based on the implicit assumption that the errors are uncorrelated with each other and with the independent variables and have equal variance. The Gauss–Markov theorem shows that, when this is so, the least-squares estimator is a best linear unbiased estimator (BLUE). If, however, the measurements are uncorrelated but have different uncertainties, a modified approach might be adopted. Aitken showed that when a weighted sum of squared residuals is minimized, the estimator is BLUE if each weight is equal to the reciprocal of the variance of the measurement.

The gradient equations for this sum of squares are

which, in a linear least squares system give the modified normal equations

or

When the observational errors are uncorrelated the weight matrix, W, is diagonal. If the errors are correlated, the resulting estimator is BLUE if the weight matrix is equal to the inverse of the variance-covariance matrix of the observations.


When the errors are uncorrelated, it is convenient to simplify the calculations to factor the weight matrix as . The normal equations can then be written as

where

For non-linear least squares systems a similar argument shows that the normal equations should be modified as follows.

Note that for empirical tests, the appropriate W is not known for sure and must be estimated. For this, Feasible Generalized Least Squares (FGLS) techniques may be used. Principal components The first principal component about the mean of a set of points is equivalent to the linear least squares solution. One of the most computationally efficient ways to solve a linear least squares problem is to use the EM technique to compute the first principal component about the mean of the data. This algorithm can be trivially modified to compute a weighted least squares solution as well. Lasso method In some contexts a regularized version of the least squares solution may be preferable. The LASSO algorithm, for example, finds a least-squares solution with the constraint that ‖β‖_1, the L1-norm of the parameter vector, is no greater than a given value. Equivalently, it may solve an unconstrained minimization of the least-squares penalty with λ‖β‖_1 added, where λ is a constant. (This is the Lagrangian form of the constrained problem.) This problem may be solved using quadratic programming or more general convex optimization methods. The L1-regularized formulation is useful in some contexts due to its tendency to prefer solutions with fewer nonzero parameter values, effectively reducing the number of variables upon which the given solution is dependent.

Regression analysis
In statistics, regression analysis includes any techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps us understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables that is, the average value of the dependent variable when the independent variables are held fixed. Less commonly, the focus is on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent


variables. In all cases, the estimation target is a function of the independent variables called the regression function. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function, which can be described by a probability distribution. Regression analysis is widely used for prediction (including forecasting of time-series data). Use of regression analysis for prediction has substantial overlap with the field of machine learning. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. A large body of techniques for carrying out regression analysis has been developed. Familiar methods such as linear regression and ordinary least squares regression are parametric, in that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data. Nonparametric regression refers to techniques that allow the regression function to lie in a specified set of functions, which may be infinite-dimensional. The performance of regression analysis methods in practice depends on the form of the datagenerating process, and how it relates to the regression approach being used. Since the true form of the data-generating process is not known, regression analysis depends to some extent on making assumptions about this process. These assumptions are sometimes (but not always) testable if a large amount of data is available. Regression models for prediction are often useful even when the assumptions are moderately violated, although they may not perform optimally. However when carrying out inference using regression models, especially involving small effects or questions of causality based on observational data, regression methods must be used cautiously as they can easily give misleading results. Underlying assumptions Classical assumptions for regression analysis include:

- The sample must be representative of the population for the inference prediction.
- The error is assumed to be a random variable with a mean of zero conditional on the explanatory variables.
- The independent variables are error-free. If this is not so, modeling may be done using errors-in-variables model techniques.
- The predictors must be linearly independent, i.e. it must not be possible to express any predictor as a linear combination of the others.
- The errors are uncorrelated, that is, the variance-covariance matrix of the errors is diagonal and each non-zero element is the variance of the error.
- The variance of the error is constant across observations (homoscedasticity). If not, weighted least squares or other methods might be used.

These conditions are sufficient (though not all of them are necessary) for the least-squares estimator to possess desirable properties; in particular, these assumptions imply that the parameter estimates will be unbiased, consistent, and efficient in the class of linear unbiased estimators. Many of these assumptions may be relaxed in more advanced treatments.


Assumptions include the geometrical support of the variables. Independent and dependent variables often refer to values measured at point locations. There may be spatial trends and spatial autocorrelation in the variables that violate statistical assumptions of regression. Geographically weighted regression is one technique to deal with such data. Also, variables may include values aggregated by areas. With aggregated data the Modifiable Areal Unit Problem can cause extreme variation in regression parameters. When analyzing data aggregated by political boundaries, postal codes or census areas, results may be very different with a different choice of units. Regression equation It is convenient to assume an environment in which an experiment is performed: the dependent variable is then the outcome of a measurement. The regression equation deals with the following variables:

- The unknown parameters, denoted as β; this may be a scalar or a vector of length k.
- The independent variables, X.
- The dependent variable, Y.

The regression equation is a function of the variables X and β:

Y \approx f(X, \beta)
The user of regression analysis must make an intelligent guess about this function. Sometimes the form of this function is known, and sometimes a trial-and-error process must be applied. Assume now that the vector of unknown parameters, β, is of length k. In order to perform a regression analysis the user must provide information about the dependent variable Y:

- If the user performs the measurement N times, where N < k, regression analysis cannot be performed: not enough information is provided to do so.
- If the user performs N independent measurements, where N = k, then the problem reduces to solving a set of N equations with N unknowns β.
- If, on the other hand, the user provides results of N independent measurements, where N > k, regression analysis can be performed. Such a system is also called an overdetermined system.

In the last case, the regression analysis provides the tools for:
1. Finding a solution for the unknown parameters β that will, for example, minimize the distance between the measured and predicted values of the dependent variable Y (also known as the method of least squares).
2. Under certain statistical assumptions, using the surplus of information to provide statistical information about the unknown parameters β and the predicted values of the dependent variable Y.


Independent measurements Quantitatively, this is explained by the following example: Consider a logistic regression model, which has three unknown parameters, β_0, β_1, and β_2. An experimenter performed 10 measurements all at exactly the same value of the independent variable X. In this case, regression analysis fails to give a unique value for the three unknown parameters; the experimenter did not provide enough information. The best one can do is to calculate the average value of the dependent variable Y and its standard deviation. Similarly, measuring at two different values of X would give enough data for a linear or a power regression (two unknowns), but not a logistic (three unknowns) or cubic (four unknowns). If the experimenter had performed measurements at X_1, X_2 and X_3, where X_1, X_2, and X_3 are different values of X, then regression analysis would provide a unique solution to the unknown parameters β. In the case of general linear regression, the above statement is equivalent to the requirement that the matrix X^T X is regular (that is, it has an inverse matrix). Statistical assumptions When the number of measurements, N, is larger than the number of unknown parameters, k, and the measurement errors ε_i are normally distributed, then the excess of information contained in (N − k) measurements is used to make statistical predictions about the unknown parameters. Linear regression In linear regression, the model specification is that the dependent variable, y_i, is a linear combination of the parameters (but need not be linear in the independent variables). For example, in simple linear regression for modeling N data points there is one independent variable, x_i, and two parameters, β_0 and β_1: straight line: y_i = β_0 + β_1 x_i + ε_i. In multiple linear regression, there are several independent variables or functions of independent variables. For example, adding a term in x_i^2 to the preceding regression gives: parabola: y_i = β_0 + β_1 x_i + β_2 x_i^2 + ε_i. This is still linear regression; although the expression on the right hand side is quadratic in the independent variable x_i, it is linear in the parameters β_0, β_1 and β_2. In both cases, ε_i is an error term and the subscript i indexes a particular observation. Given a random sample from the population, we estimate the population parameters and obtain the sample linear regression model:

\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i

The term e_i is the residual, e_i = y_i − \hat{y}_i. One method of estimation is ordinary least squares. This method obtains parameter estimates that minimize the sum of squared residuals, SSE:

SSE = \sum_{i=1}^{N} e_i^2
Minimization of this function results in a set of normal equations, a set of simultaneous linear equations in the parameters, which are solved to yield the parameter estimators \hat{\beta}_0 and \hat{\beta}_1.

Illustration of linear regression on a data set. In the case of simple regression, the formulas for the least squares estimates are

\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

where \bar{x} is the mean (average) of the x values and \bar{y} is the mean of the y values. Under the assumption that the population error term has a constant variance, the estimate of that variance is given by:
\hat{\sigma}^2_\varepsilon = \frac{SSE}{N - 2}

This is called the mean square error (MSE) of the regression. The standard errors of the parameter estimates are given by


Under the further assumption that the population error term is normally distributed, the researcher can use these estimated standard errors to create confidence intervals and conduct hypothesis tests about the population parameters. General linear data model In the more general multiple regression model, there are p independent variables:
y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i
The least square parameter estimates are obtained by p normal equations. The residual can be written as

The normal equations are

Note that for the normal equations depicted above there is no intercept term β_0. In matrix notation, the normal equations are written as

(\mathbf{X}^{\mathsf T} \mathbf{X})\, \hat{\boldsymbol{\beta}} = \mathbf{X}^{\mathsf T} \mathbf{Y}
Regression diagnostics Once a regression model has been constructed, it may be important to confirm the goodness of fit of the model and the statistical significance of the estimated parameters. Commonly used checks of goodness of fit include the R-squared, analyses of the pattern of residuals and hypothesis testing. Statistical significance can be checked by an F-test of the overall fit, followed by t-tests of individual parameters. Interpretations of these diagnostic tests rest heavily on the model assumptions. Although examination of the residuals can be used to invalidate a model, the results of a t-test or F-test are sometimes more difficult to interpret if the model's assumptions are violated. For example, if the error term does not have a normal distribution, in small samples the estimated parameters will not follow normal distributions and complicate inference. With relatively large samples, however, a central limit theorem can be invoked such that hypothesis testing may proceed using asymptotic approximations.


Nonlinear regression In statistics, nonlinear regression is a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables. The data are fitted by a method of successive approximations. General The data consist of error-free independent variables (explanatory variable), x, and their associated observed dependent variables (response variable), y. Each y is modeled as a random variable with a mean given by a nonlinear function f(x, β). Systematic error may be present but its treatment is outside the scope of regression analysis. If the independent variables are not error-free, this is an errors-in-variables model, also outside this scope. For example, the Michaelis–Menten model for enzyme kinetics

v = \frac{V_{\max}\,[S]}{K_m + [S]}

can be written as

f(x, \boldsymbol{\beta}) = \frac{\beta_1 x}{\beta_2 + x}

where β_1 is the parameter V_max, β_2 is the parameter K_m, and [S] is the independent variable, x. This function is nonlinear because it cannot be expressed as a linear combination of the βs.

Michaelis–Menten kinetics. Other examples of nonlinear functions include exponential functions, logarithmic functions, trigonometric functions, power functions, Gaussian functions, and Lorentzian curves. Some functions, such as the exponential or logarithmic functions, can be transformed so that they are linear. When so transformed, standard linear regression can be performed but must be applied with caution. See Linearization, below, for more details.

In general, there is no closed-form expression for the best-fitting parameters, as there is in linear regression. Usually numerical optimization algorithms are applied to determine the best-fitting parameters. Again in contrast to linear regression, there may be many local minima of the function to be optimized. In practice, estimated values of the parameters are used, in conjunction with the optimization algorithm, to attempt to find the global minimum of a sum of squares. Regression statistics The assumption underlying this procedure is that the model can be approximated by a linear function.

where

. It follows from this that the least squares estimators are given by
\hat{\boldsymbol{\beta}} \approx (\mathbf{J}^{\mathsf T}\mathbf{J})^{-1}\mathbf{J}^{\mathsf T}\mathbf{y}
The nonlinear regression statistics are computed and used as in linear regression statistics, but using J in place of X in the formulas. The linear approximation introduces bias into the statistics. Therefore, more caution than usual is required in interpreting statistics derived from a nonlinear model. Linearization Some nonlinear regression problems can be moved to a linear domain by a suitable transformation of the model formulation. For example, consider the nonlinear regression problem (ignoring the error):

y = a e^{b x}
If we take a logarithm of both sides, it becomes
\ln(y) = \ln(a) + b x
suggesting estimation of the unknown parameters by a linear regression of ln(y) on x, a computation that does not require iterative optimization. However, use of a linear transformation requires caution. The influences of the data values will change, as will the error structure of the model and the interpretation of any inferential results. These may not be desired effects. On the other hand, depending on what the largest source of error is, a linear transformation may distribute the errors in a normal fashion, so the choice to perform a linear transformation must be informed by modeling considerations. For Michaelis–Menten kinetics, the linear Lineweaver–Burk plot


of 1/v against 1/[S] has been much used. However, since it is very sensitive to data error and is strongly biased toward fitting the data in a particular range of the independent variable, [S], its use is strongly discouraged. Ordinary least squares In statistics and econometrics, ordinary least squares (OLS) is a technique for estimating the unknown parameters in a linear regression model. This method minimizes the sum of squared distances between the observed responses in a set of data, and the fitted responses from the regression model. The linear least squares computational technique provides simple expressions for the estimated parameters in an OLS analysis, and hence for associated statistical values such as the standard errors of the parameters. OLS can mathematically be shown to be an optimal estimator in certain situations, and is closely related to the generalized least squares (GLS) estimation approach that is optimal in a broader set of situations. OLS can be derived as a maximum likelihood estimator under the assumption that the data are normally distributed, however the method has good statistical properties for a much broader class of distributions. Total least squares Total least squares, also known as errors in variables, rigorous least squares, or orthogonal regression, is a least squares data modeling technique in which observational errors on both dependent and independent variables are taken into account. It is a generalization of Deming regression, and can be applied to both linear and non-linear models.

The red lines show the error in both x and y. This is different from the traditional least squares method, which accumulates error only on the y axis. Geometrical interpretation When the independent variable is error-free, a residual represents the "vertical" distance between the observed data point and the fitted curve (or surface). In total least squares a residual represents the distance between a data point and the fitted curve measured along some direction. In fact, if both variables are measured in the same units and the errors on both


variables are the same, then the residual represents the shortest distance between the data point and the fitted curve, that is, the residual vector is perpendicular to the tangent of the curve. A serious difficulty arises if the variables are not measured in the same units. First consider measuring the distance between a data point and the curve: what are the measurement units for this distance? If we consider measuring distance based on Pythagoras' theorem, then it is clear that we shall be adding quantities measured in different units, and so this leads to meaningless results. Secondly, if we rescale one of the variables, e.g. measure in grams rather than kilograms, then we shall end up with different results (a different curve). To avoid this problem of incommensurability it is sometimes suggested that we convert to dimensionless variables; this may be called normalization or standardization. However, there are various ways of doing this, and these lead to fitted models which are not equivalent to each other. One approach is to normalize by known measurement precision, thereby minimizing the Mahalanobis distance from the points to the line and providing a maximum-likelihood solution. Maximum likelihood Maximum likelihood estimation (MLE) is a popular statistical method used for fitting a statistical model to data, and providing estimates for the model's parameters. The method of maximum likelihood corresponds to many well-known estimation methods in statistics. For example, suppose you are interested in the heights of Americans. You have a sample of some number of Americans, but not the entire population, and record their heights. Further, you are willing to assume that heights are normally distributed with some unknown mean and variance. The sample mean is then the maximum likelihood estimator of the population mean, and the sample variance is a close approximation to the maximum likelihood estimator of the population variance. For a fixed set of data and underlying probability model, maximum likelihood picks the values of the model parameters that make the data "more likely" than any other values of the parameters would make them. Maximum likelihood estimation gives a unique and easy way to determine the solution in the case of the normal distribution and many other problems, although in very complex problems this may not be the case. If a uniform prior distribution is assumed over the parameters, the maximum likelihood estimate coincides with the most probable values thereof.
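For the normal example just described, the maximum likelihood estimates have closed forms; a small sketch with an invented sample (note that the MLE of the variance divides by n rather than n − 1):

```python
import numpy as np

heights = np.array([172.0, 168.5, 181.2, 175.3, 169.8, 177.4])   # invented sample

mu_mle = heights.mean()                        # MLE of the population mean
var_mle = np.mean((heights - mu_mle) ** 2)     # MLE of the variance (divides by n)
print(mu_mle, var_mle)
```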



Optimization
In mathematics, optimization, or mathematical programming, refers to choosing the best element from some set of available alternatives. In the simplest case, this means solving problems in which one seeks to minimize or maximize a real function by systematically choosing the values of real or integer variables from within an allowed set. This formulation, using a scalar, real-valued objective function, is probably the simplest example; the generalization of optimization theory and techniques to other formulations comprises a large area of applied mathematics. More generally, it means finding "best available" values of some objective function given a defined domain, including a variety of different types of objective functions and different types of domains. Optimization problems An optimization problem can be represented in the following way:
Given: a function f : A → R from some set A to the real numbers
Sought: an element x0 in A such that f(x0) ≤ f(x) for all x in A ("minimization") or such that f(x0) ≥ f(x) for all x in A ("maximization").
Such a formulation is called an optimization problem or a mathematical programming problem (a term not directly related to computer programming, but still in use for example in linear programming). Many real-world and theoretical problems may be modeled in this general framework. Problems formulated using this technique in the fields of physics and computer vision may refer to the technique as energy minimization, speaking of the value of the function f as representing the energy of the system being modeled. Typically, A is some subset of the Euclidean space R^n, often specified by a set of constraints, equalities or inequalities that the members of A have to satisfy. The domain A of f is called the search space or the choice set, while the elements of A are called candidate solutions or feasible solutions. The function f is called, variously, an objective function, cost function, energy function, or energy functional. A feasible solution that minimizes (or maximizes, if that is the goal) the objective function is called an optimal solution. Generally, when the feasible region or the objective function of the problem does not present convexity, there may be several local minima and maxima, where a local minimum x* is defined as a point for which there exists some δ > 0 so that for all x such that

‖x − x*‖ ≤ δ

the expression f(x*) ≤ f(x)

holds; that is to say, on some region around x* all of the function values are greater than or equal to the value at that point. Local maxima are defined similarly.


A large number of algorithms proposed for solving non-convex problems, including the majority of commercially available solvers, are not capable of making a distinction between locally optimal solutions and globally optimal solutions, and will treat the former as actual solutions to the original problem. The branch of applied mathematics and numerical analysis that is concerned with the development of deterministic algorithms that are capable of guaranteeing convergence in finite time to the actual optimal solution of a non-convex problem is called global optimization. How can an optimum be found? Fermat's theorem states that optima of unconstrained problems are found at stationary points, where the first derivative or the gradient of the objective function is zero. More generally, they may be found at critical points, where the first derivative or gradient of the objective function is zero or is undefined, or on the boundary of the choice set. An equation stating that the first derivative equals zero at an interior optimum is sometimes called a 'first-order condition'. Optima of inequality-constrained problems are instead found by a generalization of the Lagrange multiplier method. This method calculates a system of inequalities called the 'Karush-Kuhn-Tucker conditions' or 'complementary slackness conditions', which may then be used to calculate the optimum. While the first derivative test identifies points that might be optima, it cannot distinguish a point which is a minimum from one that is a maximum or one that is neither. When the objective function is twice differentiable, these cases can be distinguished by checking the second derivative or the matrix of second derivatives (called the Hessian matrix) in unconstrained problems, or a matrix of second derivatives of the objective function and the constraints called the bordered Hessian. The conditions that distinguish maxima and minima from other stationary points are sometimes called 'second-order conditions'.
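As a sketch of these second-order conditions, the Hessian at a stationary point can be classified by its eigenvalues; the two functions below are invented examples with a stationary point at the origin:

import numpy as np

def classify(hessian, tol=1e-8):
    """Classify a stationary point by the eigenvalues of the Hessian."""
    eig = np.linalg.eigvalsh(hessian)
    if np.all(eig > tol):
        return "local minimum"
    if np.all(eig < -tol):
        return "local maximum"
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"
    return "inconclusive (singular Hessian)"

# Two invented examples with a stationary point at the origin.
# f1(x, y) = x**2 + x*y + 2*y**2  -> gradient zero at (0, 0), Hessian [[2, 1], [1, 4]]
# f2(x, y) = x**2 - y**2          -> gradient zero at (0, 0), Hessian [[2, 0], [0, -2]]
H1 = np.array([[2.0, 1.0], [1.0, 4.0]])
H2 = np.array([[2.0, 0.0], [0.0, -2.0]])

print("f1 at (0, 0):", classify(H1))  # local minimum
print("f2 at (0, 0):", classify(H2))  # saddle point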

Computational optimization techniques


Broadly, methods are divided according to the number of variables: single-variable optimization (SVO) and multi-variable optimization (MVO). For twice-differentiable functions, unconstrained problems can be solved by finding the points where the gradient of the objective function is zero (that is, the stationary points) and using the Hessian matrix to classify the type of each point. If the Hessian is positive definite, the point is a local minimum; if negative definite, a local maximum; and if indefinite, it is some kind of saddle point. The existence of derivatives is not always assumed and many methods were devised for specific situations. The basic classes of methods, based on smoothness of the objective function, are:

Combinatorial methods
Derivative-free methods

First-order methods
Second-order methods

Actual methods falling somewhere among the categories above include:


Bundle methods
Conjugate gradient method
Ellipsoid method
Frank-Wolfe method
Gradient descent, aka steepest descent or steepest ascent
Interior point methods
Line search - a technique for one-dimensional optimization, usually used as a subroutine for other, more general techniques
Nelder-Mead method, aka the Amoeba method
Newton's method
Quasi-Newton methods
Simplex method
Subgradient method - similar to the gradient method in case there are no gradients

Should the objective function be convex over the region of interest, then any local minimum will also be a global minimum. There exist robust, fast numerical techniques for optimizing twice differentiable convex functions. Constrained problems can often be transformed into unconstrained problems with the help of Lagrange multipliers. Here are a few other popular methods:

Filled function method
Ant colony optimization
Beam search
Bees algorithm
Differential evolution
Dynamic relaxation
Evolution strategy
Genetic algorithms
Harmony search
Hill climbing
Particle swarm optimization
Quantum annealing
Simulated annealing
Stochastic tunneling
Tabu search

Combinatorial optimization Combinatorial optimization is a branch of optimization. Its domain is optimization problems where the set of feasible solutions is discrete or can be reduced to a discrete one, and the goal is to find the best possible solution.


It is a branch of applied mathematics and computer science, related to operations research, algorithm theory and computational complexity theory, that sits at the intersection of several fields, including artificial intelligence, mathematics and software engineering. Some research literature considers discrete optimization to consist of integer programming together with combinatorial optimization (which in turn is composed of optimization problems dealing with graphs, matroids, and related structures) although all of these topics have closely intertwined research literature. Conjugate gradient method In mathematics, the conjugate gradient method is an algorithm for the numerical solution of particular systems of linear equations, namely those whose matrix is symmetric and positive-definite. The conjugate gradient method is an iterative method, so it can be applied to sparse systems that are too large to be handled by direct methods such as the Cholesky decomposition. Such systems often arise when numerically solving partial differential equations. The conjugate gradient method can also be used to solve unconstrained optimization problems such as energy minimization. The biconjugate gradient method provides a generalization to non-symmetric matrices. Various nonlinear conjugate gradient methods seek minima of nonlinear equations. Ellipsoid method The ellipsoid method is an algorithm for solving convex optimization problems. It was introduced by Naum Z. Shor, Arkady Nemirovsky, and David B. Yudin in 1972, and used by Leonid Khachiyan to prove the polynomial-time solvability of linear programs. At the time, the ellipsoid method was the only algorithm for solving linear programs whose runtime was provably polynomial. However, the interior-point method and variants of the simplex algorithm are much faster than the ellipsoid method in practice. Karmarkar's algorithm is also faster in the worst case. The algorithm works by enclosing the minimizer of a convex function in a sequence of ellipsoids whose volume decreases at each iteration. The ellipsoid method is rarely used in practice due to poor practical performance and is used almost exclusively as an educational tool to prove the polynomial complexity of linear programs. Gradient descent Gradient descent is a first-order optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. If instead one takes steps proportional to the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent. Gradient descent is also known as steepest descent, or the method of steepest descent. When known as the latter, gradient descent should not be confused with the method of steepest descent for approximating integrals.
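A minimal sketch of the conjugate gradient iteration for a small symmetric positive-definite system (the matrix and right-hand side are made up for illustration; a practical code would add preconditioning):

import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A by the conjugate gradient method."""
    x = np.zeros_like(b) if x0 is None else np.asarray(x0, dtype=float)
    r = b - A @ x          # residual
    p = r.copy()           # initial search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)      # step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p  # new A-conjugate search direction
        rs_old = rs_new
    return x

# Small symmetric positive-definite test system (values chosen arbitrarily).
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)
print(x, "residual norm:", np.linalg.norm(b - A @ x))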


Gradient descent works in spaces of any number of dimensions, even in infinite-dimensional ones. In the latter case the search space is typically a function space, and one calculates the Gâteaux derivative of the functional to be minimized to determine the descent direction. Two weaknesses of gradient descent are: 1. The algorithm can take many iterations to converge towards a local minimum, if the curvature in different directions is very different. 2. Finding the optimal step size γ per step can be time-consuming. Conversely, using a fixed step size can yield poor results. Methods based on Newton's method and inversion of the Hessian using conjugate gradient techniques are often a better alternative. A more powerful algorithm is given by the BFGS method, which consists of calculating on every step a matrix by which the gradient vector is multiplied to go into a "better" direction, combined with a more sophisticated line search algorithm, to find the "best" value of γ. Gradient descent is in fact Euler's method for solving ordinary differential equations applied to a gradient flow. As the goal is to find the minimum, not the flow line, the error in finite methods is less significant. Interior point method Interior point methods (also referred to as barrier methods) are a certain class of algorithms to solve linear and nonlinear convex optimization problems. These algorithms have been inspired by Karmarkar's algorithm, developed by Narendra Karmarkar in 1984 for linear programming. The basic elements of the method consist of a self-concordant barrier function used to encode the convex set. Contrary to the simplex method, it reaches an optimal solution by traversing the interior of the feasible region. Any convex optimization problem can be transformed into minimizing (or maximizing) a linear function over a convex set. The idea of encoding the feasible set using a barrier and designing barrier methods was studied in the early 1960s by, amongst others, Anthony V. Fiacco and Garth P. McCormick. These ideas were mainly developed for general nonlinear programming, but they were later abandoned due to the presence of more competitive methods for this class of problems (e.g. sequential quadratic programming). Yurii Nesterov and Arkadii Nemirovskii came up with a special class of such barriers that can be used to encode any convex set. They guarantee that the number of iterations of the algorithm is bounded by a polynomial in the dimension and accuracy of the solution. Karmarkar's breakthrough revitalized the study of interior point methods and barrier problems, showing that it was possible to create an algorithm for linear programming characterized by polynomial complexity and, moreover, that was competitive with the simplex method. Khachiyan's ellipsoid method was already a polynomial-time algorithm; however, in practice it was too slow to be of practical interest. The class of primal-dual path-following interior point methods is considered the most successful. Mehrotra's predictor-corrector algorithm provides the basis for most implementations of this class of methods.
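A minimal fixed-step gradient descent sketch on an invented quadratic objective; as noted above, the fixed step size trades off divergence (too large) against slow convergence (too small):

import numpy as np

def gradient_descent(grad, x0, step=0.1, tol=1e-8, max_iter=10_000):
    """Fixed-step gradient descent: x_{n+1} = x_n - step * grad(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - step * g
    return x

# Invented objective f(x, y) = (x - 1)**2 + 10 * (y + 2)**2, minimum at (1, -2).
grad_f = lambda v: np.array([2 * (v[0] - 1), 20 * (v[1] + 2)])

print(gradient_descent(grad_f, x0=[0.0, 0.0], step=0.05))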


Nelder-Mead method The Nelder-Mead method or downhill simplex method or amoeba method is a commonly used nonlinear optimization technique, which is a well-defined numerical method for twice differentiable and unimodal problems. However, the Nelder-Mead technique is only a heuristic, since it can converge to non-stationary points on problems that can be solved by alternative methods. The Nelder-Mead technique was proposed by John Nelder & R. Mead (1965) and is a technique for minimizing an objective function in a many-dimensional space. The method uses the concept of a simplex, which is a special polytope of N + 1 vertices in N dimensions. Examples of simplices include a line segment on a line, a triangle on a plane, a tetrahedron in three-dimensional space and so forth. The method approximates a local optimum of a problem with N variables when the objective function varies smoothly and is unimodal. For example, a suspension bridge engineer has to choose how thick each strut, cable, and pier must be. Clearly these all link together, but it is not easy to visualize the impact of changing any specific element. The engineer can use the Nelder-Mead method to generate trial designs which are then tested on a large computer model. As each run of the simulation is expensive, it is important to make good decisions about where to look. Nelder-Mead generates a new test position by extrapolating the behavior of the objective function measured at each test point arranged as a simplex. The algorithm then chooses to replace one of these test points with the new test point and so the technique progresses. The simplest step is to replace the worst point with a point reflected through the centroid of the remaining N points. If this point is better than the best current point, then we can try stretching exponentially out along this line. On the other hand, if this new point isn't much better than the previous value, then we are stepping across a valley, so we shrink the simplex towards a better point. Unlike modern optimization methods, the Nelder-Mead heuristic can converge to a non-stationary point unless the problem satisfies stronger conditions than are necessary for modern methods. Modern improvements over the Nelder-Mead heuristic have been known since 1979. Many variations exist depending on the actual nature of the problem being solved. A common variant uses a constant-size, small simplex that roughly follows the gradient direction (which gives steepest ascent). Visualize a small triangle on an elevation map flip-flopping its way up a hill to a local peak. This method is also known as the Flexible Polyhedron Method. This, however, tends to perform poorly against the method described in this article because it makes small, unnecessary steps in areas of little interest. Newton's method in optimization In mathematics, Newton's method is a well-known algorithm for finding roots of equations in one or more dimensions. It can also be used to find local maxima and local minima of functions by noticing that if a real number x* is a stationary point of a function f(x), then x* is


a root of the derivative f'(x), and therefore one can solve for x* by applying Newton's method to f'(x). The Taylor expansion of f(x) about an iterate x_n,

f(x_n + Δx) ≈ f(x_n) + f'(x_n) Δx + (1/2) f''(x_n) Δx^2,

attains its extremum when Δx solves the linear equation:

f'(x_n) + f''(x_n) Δx = 0.

Thus, provided that f(x) is a twice-differentiable function and the initial guess x0 is chosen close enough to x*, the sequence (x_n) defined by

x_{n+1} = x_n − f'(x_n) / f''(x_n),  n = 0, 1, 2, ...

will converge towards x*. This iterative scheme can be generalized to several dimensions by replacing the derivative with the gradient ∇f(x), and the reciprocal of the second derivative with the inverse of the Hessian matrix Hf(x). One obtains the iterative scheme

x_{n+1} = x_n − [Hf(x_n)]^(-1) ∇f(x_n),  n = 0, 1, 2, ...

Usually Newton's method is modified to include a small step size γ > 0 instead of γ = 1:

x_{n+1} = x_n − γ [Hf(x_n)]^(-1) ∇f(x_n).

This is often done to ensure that the Wolfe conditions are satisfied at each step of the iteration. The geometric interpretation of Newton's method is that at each iteration one approximates f(x) by a quadratic function around x_n, and then takes a step towards the maximum/minimum of that quadratic function. (If f(x) happens to be a quadratic function, then the exact extremum is found in one step.) Newton's method converges much faster towards a local maximum or minimum than gradient descent. In fact, every local minimum has a neighborhood N such that, if we start with x0 in N, Newton's method with step size γ = 1 converges quadratically (if the Hessian is invertible in that neighborhood). Finding the inverse of the Hessian is an expensive operation, so the linear equation

[Hf(x_n)] (x_{n+1} − x_n) = −∇f(x_n)


is often solved approximately (but to great accuracy) using a method such as conjugate gradient. There also exist various quasi-Newton methods, where an approximation for the Hessian is used instead. If the Hessian is close to a non-invertible matrix, the inverted Hessian can be numerically unstable and the solution may diverge. In this case, certain workarounds have been tried in the past, which have varied success with certain problems. One can, for example, modify the Hessian by adding a correction matrix Bn so as to make Hf(x_n) + Bn positive definite. One approach is to diagonalize Hf and choose Bn so that Hf(x_n) + Bn has the same eigenvectors as Hf, but with each negative eigenvalue replaced by ε > 0.
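A one-dimensional sketch of the iteration x_{n+1} = x_n − f'(x_n)/f''(x_n) described above, on an invented function; a robust implementation would add the step-size and Hessian-modification safeguards just discussed:

def newton_minimize(df, d2f, x0, tol=1e-10, max_iter=50):
    """Newton's method applied to f'(x) = 0: x_{n+1} = x_n - f'(x_n) / f''(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = df(x) / d2f(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Invented example: f(x) = x**4 - 3*x**2 + x has a local minimum near x = 1.1.
df = lambda x: 4 * x**3 - 6 * x + 1   # f'(x)
d2f = lambda x: 12 * x**2 - 6         # f''(x)

print(newton_minimize(df, d2f, x0=1.5))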

Some functions are poorly approximated by quadratics, particularly when far from a maximum or minimum. In these cases, approximations other than quadratic may be more appropriate. Quasi-Newton method In optimization, quasi-Newton methods (also known as variable metric methods) are well-known algorithms for finding local maxima and minima of functions. Quasi-Newton methods are based on Newton's method to find the stationary point of a function, where the gradient is 0. Newton's method assumes that the function can be locally approximated as a quadratic in the region around the optimum, and uses the first and second derivatives (gradient and Hessian) to find the stationary point. In quasi-Newton methods the Hessian matrix of second derivatives of the function to be minimized does not need to be computed. The Hessian is updated by analyzing successive gradient vectors instead. Quasi-Newton methods are a generalization of the secant method to find the root of the first derivative for multidimensional problems. In multiple dimensions the secant equation is under-determined, and quasi-Newton methods differ in how they constrain the solution, typically by adding a simple low-rank update to the current estimate of the Hessian. The first quasi-Newton algorithm was proposed by W.C. Davidon, a physicist working at Argonne National Laboratory. He developed the first quasi-Newton algorithm in 1959: the DFP updating formula, which was later popularized by Fletcher and Powell in 1963, but is rarely used today. The most common quasi-Newton algorithms are currently the SR1 formula (for symmetric rank one) and the widespread BFGS method, which was suggested independently by Broyden, Fletcher, Goldfarb, and Shanno in 1970. The Broyden class is a linear combination of the DFP and BFGS methods. The SR1 formula does not guarantee the update matrix to maintain positive-definiteness and can be used for indefinite problems. Broyden's method does not require the update matrix to be symmetric and is used to find the root of a general system of equations (rather than the gradient) by updating the Jacobian (rather than the Hessian). As in Newton's method, one uses a second-order approximation to find the minimum of a function f(x). The Taylor series of f(x) around an iterate x_k is:

f(x_k + Δx) ≈ f(x_k) + ∇f(x_k)T Δx + (1/2) ΔxT B Δx,

where ∇f(x_k) is the gradient and B an approximation to the Hessian matrix. The gradient of this approximation (with respect to Δx) is

∇f(x_k + Δx) ≈ ∇f(x_k) + B Δx,

and setting this gradient to zero provides the Newton step:

Δx = −B^(-1) ∇f(x_k).

The Hessian approximation B is chosen to satisfy

∇f(x_k + Δx) = ∇f(x_k) + B Δx,

which is called the secant equation, but this condition is not sufficient to determine B. In one dimension, solving for B and applying the Newton step with the updated value is equivalent to the secant method. In multiple dimensions B is under-determined. Various methods are used to find the solution to the secant equation that is symmetric (BT = B) and closest to the current approximate value Bk according to some metric, min_B ||B − Bk||. An approximate initial value of B0 = I is often sufficient to achieve rapid convergence. The unknown x_k is updated by applying the Newton step calculated using the current approximate Hessian matrix Bk. Simplex algorithm In mathematical optimization theory, the simplex algorithm, created by the American mathematician George Dantzig in 1947, is a popular algorithm for numerically solving linear programming problems. The journal Computing in Science and Engineering listed it as one of the top 10 algorithms of the century. The method uses the concept of a simplex, which is a polytope of N + 1 vertices in N dimensions: a line segment in one dimension, a triangle in two dimensions, a tetrahedron in three-dimensional space and so forth. Consider a linear programming problem: maximize cTx subject to Ax ≤ b and x ≥ 0, with x the variables of the problem, c a vector representing the linear form to optimize, A a rectangular p × n matrix, and b the vector of linear constraints.

In geometric terms, each inequality specifies a half-space in n-dimensional Euclidean space, and their intersection is the set of all feasible values the variables can take. The region is convex and either empty, unbounded, or a polytope.
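A small invented instance of this setup can be solved with scipy.optimize.linprog; note that linprog minimizes by convention (so the objective is negated) and that current SciPy versions default to a HiGHS solver rather than Dantzig's original simplex method:

from scipy.optimize import linprog

# Invented problem: maximize 3*x1 + 2*x2
#   subject to  x1 +   x2 <= 4
#               x1 + 3*x2 <= 6
#               x1, x2 >= 0
c = [-3.0, -2.0]                  # negate because linprog minimizes
A_ub = [[1.0, 1.0], [1.0, 3.0]]
b_ub = [4.0, 6.0]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print("optimal vertex:", res.x, "maximum value:", -res.fun)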


The set of points where the objective function obtains a given value v is defined by the hyperplane cTx = v. We are looking for the largest v such that the hyperplane still intersects the feasible region. As v increases, the hyperplane translates in the direction of the vector c. Intuitively, and indeed it can be shown by convexity, the last hyperplane to intersect the feasible region will either just graze a vertex of the polytope, or a whole edge or face. In the latter two cases, it is still the case that the endpoints of the edge or face will achieve the optimum value. Thus, the optimum value will always be achieved on one of the vertices of the polytope. The simplex algorithm applies this insight by walking along edges of the (possibly unbounded) polytope to vertices with higher objective function value. When a local maximum is reached, by convexity it is also the global maximum and the algorithm terminates. It also finishes when an unbounded edge is visited, concluding that the problem has no solution. The algorithm always terminates because the number of vertices in the polytope is finite; moreover, since we jump between vertices always in the same direction (that of the objective function), we hope that the number of vertices visited will be small. Usually more than one adjacent vertex improves the objective function, so a pivot rule must be specified to determine which vertex to pick. The selection of this rule has a great impact on the runtime of the algorithm. In 1972, Klee and Minty gave an example of a linear programming problem in which the polytope P is a distortion of an n-dimensional cube. They showed that the simplex method as formulated by Dantzig visits all 2^n vertices before arriving at the optimal vertex. This shows that the worst-case complexity of the algorithm is exponential time. Since then it has been shown that for almost every deterministic rule there is a family of simplices on which it performs badly. It is an open question if there is a pivot rule with polynomial time, or even sub-exponential worst-case complexity. Nevertheless, the simplex method is remarkably efficient in practice. It has been known since the 1970s that it has polynomial-time average-case complexity under various distributions. These results on "random" matrices still didn't quite capture the desired intuition that the method works well on "typical" matrices. In 2001 Spielman and Teng introduced the notion of smoothed complexity to provide a more realistic analysis of the performance of algorithms. Subgradient method Subgradient methods are algorithms for solving convex optimization problems. Originally developed by Naum Z. Shor and others in the 1960s and 1970s, subgradient methods can be used with a non-differentiable objective function. When the objective function is differentiable, subgradient methods for unconstrained problems use the same search direction as the method of steepest descent. Although subgradient methods can be much slower than interior-point methods and Newton's method in practice, they can be immediately applied to a far wider variety of problems and require much less memory. Moreover, by combining the subgradient method with primal or dual decomposition techniques, it is sometimes possible to develop a simple distributed algorithm for a problem.
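A minimal subgradient-method sketch for an invented non-differentiable objective, using the classical diminishing step sizes 1/k and keeping the best point seen (the method is not a descent method):

import numpy as np

def subgradient_method(subgrad, f, x0, steps=500):
    """Plain subgradient method with diminishing step sizes 1/k,
    returning the best point found so far."""
    x = np.asarray(x0, dtype=float)
    best_x, best_f = x.copy(), f(x)
    for k in range(1, steps + 1):
        x = x - (1.0 / k) * subgrad(x)
        if f(x) < best_f:
            best_x, best_f = x.copy(), f(x)
    return best_x, best_f

# Invented non-differentiable objective: f(x) = sum |x_i - c_i|, minimized at c.
c = np.array([1.0, -2.0, 0.5])
f = lambda x: np.abs(x - c).sum()
subgrad = lambda x: np.sign(x - c)   # a valid subgradient of f at x

x_best, f_best = subgradient_method(subgrad, f, x0=np.zeros(3))
print(x_best, f_best)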


Ant colony optimization The ant colony optimization algorithm (ACO), is a probabilistic technique for solving computational problems which can be reduced to finding good paths through graphs. This algorithm is a member of ant colony algorithms family, in swarm intelligence methods, and it constitutes some metaheuristic optimizations. Initially proposed by Marco Dorigo in 1992 in his PhD thesis, the first algorithm was aiming to search for an optimal path in a graph; based on the behavior of ants seeking a path between their colony and a source of food. The original idea has since diversified to solve a wider class of numerical problems, and as a result, several problems have emerged, drawing on various aspects of the behavior of ants. In the real world, ants (initially) wander randomly, and upon finding food return to their colony while laying down pheromone trails. If other ants find such a path, they are likely not to keep travelling at random, but to instead follow the trail, returning and reinforcing it if they eventually find food. Over time, however, the pheromone trail starts to evaporate, thus reducing its attractive strength. The more time it takes for an ant to travel down the path and back again, the more time the pheromones have to evaporate. A short path, by comparison, gets marched over faster, and thus the pheromone density remains high as it is laid on the path as fast as it can evaporate. Pheromone evaporation has also the advantage of avoiding the convergence to a locally optimal solution. If there were no evaporation at all, the paths chosen by the first ants would tend to be excessively attractive to the following ones. In that case, the exploration of the solution space would be constrained. Thus, when one ant finds a good (i.e., short) path from the colony to a food source, other ants are more likely to follow that path, and positive feedback eventually leads all the ants following a single path. The idea of the ant colony algorithm is to mimic this behavior with "simulated ants" walking around the graph representing the problem to solve. Beam search Beam search is a heuristic search algorithm that is an optimization of best-first search that reduces its memory requirement. Best-first search is a graph search which orders all partial solutions (states) according to some heuristic which attempts to predict how close a partial solution is to a complete solution (goal state). In beam search, only a predetermined number of best partial solutions are kept as candidates. Beam search uses breadth-first search to build its search tree. At each level of the tree, it generates all successors of the states at the current level, sorts them in order of increasing heuristic values. However, it only stores a predetermined number of states at each level (called the beam width). The smaller the beam width, the more states are pruned. Therefore, with an infinite beam width, no states are pruned and beam search is identical to breadth-first search. The beam width bounds the memory required to perform the search, at the expense of risking completeness (possibility that it will not terminate) and optimality (possibility that it will not find the best solution). The reason for this risk is that the goal state could potentially be pruned.


The beam width can either be fixed or variable. In a fixed beam width, a maximum number of successor states is kept. In a variable beam width, a threshold is set around the current best state. All states that fall outside this threshold are discarded. Thus, in places where the best path is obvious, a minimal number of states is searched. In places where the best path is ambiguous, many paths will be searched. Bees algorithm The Bees Algorithm is a population-based search algorithm first developed in 2005. It mimics the food foraging behaviour of swarms of honey bees. In its basic version, the algorithm performs a kind of neighbourhood search combined with random search and can be used for both combinatorial optimisation and functional optimisation. A colony of honey bees can extend itself over long distances (up to 14 km) and in multiple directions simultaneously to exploit a large number of food sources. A colony prospers by deploying its foragers to good fields. In principle, flower patches with plentiful amounts of nectar or pollen that can be collected with less effort should be visited by more bees, whereas patches with less nectar or pollen should receive fewer bees. The foraging process begins in a colony by scout bees being sent to search for promising flower patches. Scout bees move randomly from one patch to another. During the harvesting season, a colony continues its exploration, keeping a percentage of the population as scout bees. When they return to the hive, those scout bees that found a patch which is rated above a certain quality threshold (measured as a combination of some constituents, such as sugar content) deposit their nectar or pollen and go to the "dance floor" to perform a dance known as the waggle dance. This dance is essential for colony communication, and contains three pieces of information regarding a flower patch: the direction in which it will be found, its distance from the hive and its quality rating (or fitness). This information helps the colony to send its bees to flower patches precisely, without using guides or maps. Each individuals knowledge of the outside environment is gleaned solely from the waggle dance. This dance enables the colony to evaluate the relative merit of different patches according to both the quality of the food they provide and the amount of energy needed to harvest it. After waggle dancing on the dance floor, the dancer (i.e. the scout bee) goes back to the flower patch with follower bees that were waiting inside the hive. More follower bees are sent to more promising patches. This allows the colony to gather food quickly and efficiently. While harvesting from a patch, the bees monitor its food level. This is necessary to decide upon the next waggle dance when they return to the hive. If the patch is still good enough as a food source, then it will be advertised in the waggle dance and more bees will be recruited to that source. Differential evolution Differential evolution (DE) is a method of mathematical optimization of multidimensional functions and belongs to the class of evolution strategy optimizers. DE finds the global


minimum of a multidimensional, multimodal (i.e. exhibiting more than one minimum) function with good probability. The DE community has been growing since the mid-1990s and today more researchers are working on and with DE. The crucial idea behind DE is a scheme for generating trial parameter vectors. DE adds the weighted difference between two population vectors to a third vector. This way no separate probability distribution has to be used, which makes the scheme completely self-organizing. Further information on DE can be found in. Differential evolution is a simple and efficient adaptive scheme for global optimization over continuous spaces. The key element distinguishing DE from the other population-based techniques is the differential mutation mechanism. The first attempt to guide the differential mutation was presented by Price, where "semi-directed" mutation was realized by a special pre-selection operation. Later, Price analysed the strategies and noted that the strategy may consist of differential mutation and arithmetic crossover. This, in turn, gives the different dynamic effects of search. The ideas of "directions" were grasped spontaneously by H.-Y. Fan and J. Lampinen. In 2001, they proposed the alternations of the classical strategy (the first strategy suggested by K. Price) with a triangle mutation scheme and, in 2003, the alternations with a weighted directed strategy, where they used two difference vectors. These methods give some improvements, but it is worth noting that the percentage of using novel strategies is quite moderate. Subsequently, mixed variables were introduced. In 1999, I. Zelinka and J. Lampinen described a simple and, at the same time, an efficient way of handling simultaneously continuous, integer, and discrete variables. They applied this method to design engineering problems and obtained results that outperformed all the other mixed-variables methods used in engineering design. As a particular case of mixed-variable problems, in 2003, V. Feoktistov implemented DE to the binary-continuous large-scale application in the frame of the ROADEF2003 challenge. Dynamic relaxation Dynamic relaxation is a numerical method, which, among other things, can be used to do "form-finding" for cable and fabric structures. The aim is to find a geometry where all forces are in equilibrium. In the past this was done by direct modelling, using hanging chains and weights, or by using soap films, which have the property of adjusting to find a "minimal surface". The dynamic relaxation method is based on discretizing the continuum under consideration by lumping the mass at nodes and defining the relationship between nodes in terms of stiffness. The system oscillates about the equilibrium position under the influence of loads. An iterative process is followed by simulating a pseudo-dynamic process in time, with each iteration based on an update of the geometry. Considering Newton's second law of motion (force is mass multiplied by acceleration) in the x direction at the ith node at time t:

R_ix(t) = M_i · A_ix(t)

Where: R is the residual force, M is the nodal mass and A is the nodal acceleration. Note that fictitious nodal masses may be chosen to speed up the process of form-finding. The relationship between the speed V, the geometry X and the residuals can be obtained by performing a double numerical integration of the acceleration (here in central finite difference form):

V_ix(t + Δt/2) = V_ix(t − Δt/2) + (Δt / M_i) · R_ix(t)
X_ix(t + Δt) = X_ix(t) + Δt · V_ix(t + Δt/2)

Where: Δt is the time interval between two updates. By the principle of equilibrium of forces, the relationship between the residuals and the geometry can be obtained:

R_ix(t) = P_ix(t) + Σ_m (T_m / l_m) · (X_jx(t) − X_ix(t))

where: P is the applied load component, T is the tension in link m between nodes i and j, and l is the length of the link. The sum must cover the forces in all the connections between the node and other nodes. By repeating the use of the relationship between the residuals and the geometry, and the relationship between the geometry and the residual, the pseudo-dynamic process is simulated. It is possible to make dynamic relaxation more computationally efficient (reducing the number of iterations) by using damping. There are two methods of damping:

Viscous damping, which assumes that connection between the nodes has a viscous force component. Kinetic energy damping, where the coordinates at peak kinetic energy are calculated (the equilibrium position), then updates the geometry to this position and resets the velocity to zero.
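A minimal sketch of the pseudo-dynamic iteration above with viscous damping, for an invented problem of a single free node linked to two fixed anchors under a vertical load; all numerical values are made up for illustration:

import numpy as np

# Invented example: one free node linked by two elastic links to fixed anchors,
# loaded vertically; dynamic relaxation iterates a damped pseudo-dynamic process.
anchors = np.array([[0.0, 0.0], [4.0, 0.0]])   # fixed nodes
x = np.array([2.0, 0.0])                       # free node, initial guess
mass, dt, damping = 1.0, 0.05, 0.95            # fictitious mass, time step, viscous damping
stiffness, rest_length = 50.0, 1.5             # link properties (made up)
load = np.array([0.0, -10.0])                  # applied load (e.g. self-weight)

v = np.zeros(2)
for _ in range(5000):
    residual = load.copy()
    for a in anchors:
        d = a - x
        length = np.linalg.norm(d)
        tension = stiffness * (length - rest_length)   # axial force in the link
        residual += tension * d / length               # T/l times the link vector
    v = damping * v + (dt / mass) * residual           # velocity update with viscous damping
    x = x + dt * v                                     # geometry update
    if np.linalg.norm(residual) < 1e-8:
        break

print("equilibrium position of the free node:", x)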

The advantage of viscous damping is that it represents the reality of a cable with viscous properties. Moreover it is easy to realize because the speed is already computed. The kinetic energy damping is an artificial damping which is not a real effect, but offers a drastic

reduction in the number of iterations required to find a solution. However, there is a computational penalty in that the kinetic energy and peak location must be calculated, after which the geometry has to be updated to this position. Evolution strategy In computer science, evolution strategy (ES) is an optimization technique based on ideas of adaptation and evolution. It was created in the early 1960s and developed further during the 1970s and later by Ingo Rechenberg, Hans-Paul Schwefel and his co-workers, and belongs to the more general class of evolutionary computation or artificial evolution. Evolution strategies use natural problem-dependent representations, and primarily mutation and selection as search operators. As is common with evolutionary algorithms, the operators are applied in a loop. An iteration of the loop is called a generation. The sequence of generations is continued until a termination criterion is met. As far as real-valued search spaces are concerned, mutation is normally performed by adding a normally distributed random value to each vector component. The step size or mutation strength (i.e. the standard deviation of the normal distribution) is often governed by self-adaptation. Individual step sizes for each coordinate or correlations between coordinates are either governed by self-adaptation or by covariance matrix adaptation (CMA-ES). The (environmental) selection in evolution strategies is deterministic and only based on the fitness rankings, not on the actual fitness values. The simplest ES operates on a population of size two: the current point (parent) and the result of its mutation. Only if the mutant's fitness is at least as good as that of the parent does it become the parent of the next generation. Otherwise the mutant is disregarded. This is a (1+1)-ES. More generally, λ mutants can be generated and compete with the parent, called (1 + λ)-ES. In a (1, λ)-ES the best mutant becomes the parent of the next generation while the current parent is always disregarded. Contemporary derivatives of evolution strategy often use a population of μ parents and also recombination as an additional operator (called (μ/ρ, λ)-ES or (μ/ρ + λ)-ES). This is believed to make them less prone to get stuck in local optima. Genetic algorithm A genetic algorithm (GA) is a search technique used in computing to find exact or approximate solutions to optimization and search problems. Genetic algorithms are categorized as global search heuristics. Genetic algorithms are a particular class of evolutionary algorithms (EA) that use techniques inspired by evolutionary biology such as inheritance, mutation, selection, and crossover. Genetic algorithms are implemented in a computer simulation in which a population of abstract representations (called chromosomes or the genotype of the genome) of candidate solutions (called individuals, creatures, or phenotypes) to an optimization problem evolves toward better solutions. Traditionally, solutions are represented in binary as strings of 0s and 1s, but other encodings are also possible. The evolution usually starts from a population of randomly generated individuals and happens in generations. In each generation, the fitness of every individual in the population is evaluated, multiple individuals are stochastically selected from the current population (based on their fitness), and modified (recombined and possibly


randomly mutated) to form a new population. The new population is then used in the next iteration of the algorithm. Commonly, the algorithm terminates when either a maximum number of generations has been produced, or a satisfactory fitness level has been reached for the population. If the algorithm has terminated due to a maximum number of generations, a satisfactory solution may or may not have been reached. Genetic algorithms find application in bioinformatics, phylogenetics, computational science, engineering, economics, chemistry, manufacturing, mathematics, physics and other fields. A typical genetic algorithm requires: 1. a genetic representation of the solution domain, 2. a fitness function to evaluate the solution domain. A standard representation of the solution is as an array of bits. Arrays of other types and structures can be used in essentially the same way. The main property that makes these genetic representations convenient is that their parts are easily aligned due to their fixed size, which facilitates simple crossover operations. Variable length representations may also be used, but crossover implementation is more complex in this case. Tree-like representations are explored in genetic programming and graph-form representations are explored in evolutionary programming. The fitness function is defined over the genetic representation and measures the quality of the represented solution. The fitness function is always problem dependent. For instance, in the knapsack problem one wants to maximize the total value of objects that can be put in a knapsack of some fixed capacity. A representation of a solution might be an array of bits, where each bit represents a different object, and the value of the bit (0 or 1) represents whether or not the object is in the knapsack. Not every such representation is valid, as the size of objects may exceed the capacity of the knapsack. The fitness of the solution is the sum of values of all objects in the knapsack if the representation is valid, or 0 otherwise. In some problems, it is hard or even impossible to define the fitness expression; in these cases, interactive genetic algorithms are used. Once we have the genetic representation and the fitness function defined, GA proceeds to initialize a population of solutions randomly, then improve it through repetitive application of mutation, crossover, inversion and selection operators. Harmony search Harmony search (HS) is a metaheuristic algorithm (also known as soft computing algorithm or evolutionary algorithm) mimicking the improvisation process of musicians. In the process, each musician (= decision variable) plays (= generates) a note (= value) for finding a best harmony (= global optimum) all together. The Harmony Search algorithm has a novel stochastic derivative (for discrete variable) based on musician's experience, rather than gradient (for continuous variable) in differential calculus. HS has several advantages when compared with traditional gradient-based mathematical optimization techniques as follows:

HS does not require complex calculus, thus it is free from divergence.


HS does not require initial value settings for the decision variables, thus it may escape local optima. HS can handle discrete variables as well as continuous variables, while gradient-based techniques handle continuous variables only.

Also, the HS algorithm could overcome the drawback of genetic algorithm's building block theory by considering the relationship among decision variables using its ensemble operation. Hill climbing In computer science, hill climbing is a mathematical optimization technique which belongs to the family of local search. It is relatively simple to implement, making it a popular first choice. Although more advanced algorithms may give better results, in some situations hill climbing works just as well. Hill climbing can be used to solve problems that have many solutions, some of which are better than others. It starts with a random (potentially poor) solution, and iteratively makes small changes to the solution, each time improving it a little. When the algorithm cannot see any improvement anymore, it terminates. Ideally, at that point the current solution is close to optimal, but it is not guaranteed that hill climbing will ever come close to the optimal solution. For example, hill climbing can be applied to the traveling salesman problem. It is easy to find a solution that visits all the cities but will be very poor compared to the optimal solution. The algorithm starts with such a solution and makes small improvements to it, such as switching the order in which two cities are visited. Eventually, a much better route is obtained. Hill climbing is used widely in artificial intelligence, for reaching a goal state from a starting node. Choice of next node and starting node can be varied to give a list of related algorithms. Hill climbing attempts to maximize (or minimize) a function f(x), where x are discrete states. These states are typically represented by vertices in a graph, where edges in the graph encode nearness or similarity of a graph. Hill climbing will follow the graph from vertex to vertex, always locally increasing (or decreasing) the value of f, until a local maximum (or local minimum) xm is reached. Hill climbing can also operate on a continuous space: in that case, the algorithm is called gradient ascent (or gradient descent if the function is minimized). In simple hill climbing, the first closer node is chosen, whereas in steepest ascent hill climbing all successors are compared and the closest to the solution is chosen. Both forms fail if there is no closer node, which may happen if there are local maxima in the search space which are not solutions. Steepest ascent hill climbing is similar to best-first search, which tries all possible extensions of the current path instead of only one. Stochastic hill climbing does not examine all neighbors before deciding how to move. Rather, it selects a neighbor at random, and decides (based on the amount of improvement in that neighbor) whether to move to that neighbor or to examine another. Random-restart hill climbing is a meta-algorithm built on top of the hill climbing algorithm. It is also known as Shotgun hill climbing. It iteratively does hill-climbing, each time with a


random initial condition x0. The best xm is kept: if a new run of hill climbing produces a better xm than the stored state, it replaces the stored state. Random-restart hill climbing is a surprisingly effective algorithm in many cases. It turns out that it is often better to spend CPU time exploring the space, than carefully optimizing from an initial condition. A problem with hill climbing is that it will find only local maxima. Unless the heuristic is convex, it may not reach a global maximum. Other local search algorithms try to overcome this problem such as stochastic hill climbing, random walks and simulated annealing. A ridge is a curve in the search place that leads to a maximum, but the orientation of the ridge compared to the available moves that are used to climb is such that each move will lead to a smaller point. In other words, each point on a ridge looks to the algorithm like a local maximum, even though the point is part of a curve leading to a better optimum. Another problem with hill climbing is that of a plateau, which occurs when we get to a "flat" part of the search space, i.e. we have a path where the heuristics are all very close together. This kind of flatness can cause the algorithm to cease progress and wander aimlessly. Particle swarm optimization Particle swarm optimization (PSO) is an algorithm modelled on swarm intelligence that finds a solution to an optimization problem in a search space, or model and predict social behavior in the presence of objectives. The PSO belongs to the class of direct search methods used to find an optimal solution to an objective function (aka fitness function) in a search space. Direct search methods are usually derivative-free, meaning that they depend only on the evaluation of the objective function. The particle swarm optimization algorithm is simple, in the sense that even the basic form of the algorithm yields results, it can be implemented by a programmer in short duration, and it can be used by anyone with an understanding of objective functions and the problem at hand without needing an extensive background in mathematical optimization theory. The PSO is a stochastic, population-based computer algorithm modeled on swarm intelligence. Swarm intelligence is based on social-psychological principles and provides insights into social behavior, as well as contributing to engineering applications. The particle swarm optimization algorithm was first described in 1995 by James Kennedy and Russell C. Eberhart. Social influence and social learning enable a person to maintain cognitive consistency. People solve problems by talking with other people about them, and as they interact their beliefs, attitudes, and behaviors change; the changes could typically be depicted as the individuals moving toward one another in a socio-cognitive space. Particle swarm optimization is inspired by this kind of social optimization. A problem is given, and some way to evaluate a proposed solution to it exists in the form of a fitness function. A communication structure or social network is also defined, assigning neighbors for each individual to interact with. Then a population of individuals defined as random guesses at the problem solutions is initialized. These individuals are candidate solutions. They


are also known as the particles, hence the name particle swarm. An iterative process to improve these candidate solutions is set in motion. The particles iteratively evaluate the fitness of the candidate solutions and remember the location where they had their best success. The individual's best solution is called the particle best or the local best. Each particle makes this information available to their neighbors. They are also able to see where their neighbors have had success. Movements through the search space are guided by these successes, with the population usually converging, by the end of a trial, on a problem solution better than that of non-swarm approach using the same methods. The swarm is typically modelled by particles in multidimensional space that have a position and a velocity. These particles fly through hyperspace and have two essential reasoning capabilities: their memory of their own best position and knowledge of the global or their neighborhood's best. In a minimization optimization problem, problems are formulated so that "best" simply means the position with the smallest objective value. Members of a swarm communicate good positions to each other and adjust their own position and velocity based on these good positions. So a particle has the following information to make a suitable change in its position and velocity:

A global best that is known to all and immediately updated when a new best position is found by any particle in the swarm. Neighborhood best that the particle obtains by communicating with a subset of the swarm. The local best, which is the best solution that the particle has seen.

The particle position and velocity update equations in the simplest form that govern the PSO are given by

v_{i,j} ← ω v_{i,j} + c1 r1 (globalbest_j − x_{i,j}) + c2 r2 (neighborhoodbest_j − x_{i,j}) + c3 r3 (localbest_{i,j} − x_{i,j})
x_{i,j} ← x_{i,j} + v_{i,j}

where ω is the inertial coefficient of the velocity, c1, c2, c3 are attraction coefficients, and r1, r2, r3 are random numbers drawn uniformly from [0, 1] at each update.

In Standard PSO (available on the Particle Swarm Central), the parameter c1 is set to zero. When neighborhoodbestj = localbesti,j, r3 is also (temporarily) set to zero. As the swarm iterates, the fitness of the global best solution improves (decreases for a minimization problem). It can happen that all particles, being influenced by the global best, eventually approach the global best, and from then on the fitness never improves no matter how many further iterations the PSO performs. The particles also cluster in close proximity to the global best without exploring the rest of the search space. This phenomenon is called 'convergence'. If the inertial coefficient of the velocity is small, all particles could slow down until they approach zero velocity at the global best. The selection of coefficients in the velocity update equations affects the convergence and the ability of the swarm to find the optimum. One way to come out of the situation is to reinitialize the particles' positions at intervals or when convergence is detected. Some research approaches investigated the application of constriction coefficients and inertia weights. There are numerous techniques for preventing premature convergence. Many variations on the social network topology, parameter-free, fully adaptive swarms, and some highly simplified models have been created. The algorithm has been analyzed as a dynamical system, and has been used in hundreds of engineering applications; it is used to compose music, to model markets and organizations, and in art installations.
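A minimal global-best particle swarm sketch (inertia weight plus personal-best and global-best attraction, a common simplification of the update equations above; the coefficient names here are local to the sketch and do not match the c1, c2, c3 above), minimizing an invented shifted sphere function; the parameter values are typical textbook choices, not prescriptions:

import numpy as np

def pso_minimize(f, dim, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best PSO: each particle is pulled toward its own best and the swarm's best."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5.0, 5.0, size=(n_particles, dim))   # positions
    v = np.zeros_like(x)                                   # velocities
    pbest = x.copy()
    pbest_val = np.apply_along_axis(f, 1, x)
    gbest = pbest[pbest_val.argmin()].copy()
    for _ in range(iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = x + v
        vals = np.apply_along_axis(f, 1, x)
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()

# Invented test function: shifted sphere, minimum at (1, -2, 3).
sphere = lambda z: float(((z - np.array([1.0, -2.0, 3.0])) ** 2).sum())

best_x, best_f = pso_minimize(sphere, dim=3)
print(best_x, best_f)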


The particle swarm optimization in its basic form is best suited for continuous variables, that is, the objective function can be evaluated for even the tiniest increment. The method has been adapted as a binary PSO to also optimize binary variables which take only one of two values. Several methods exist to handle discrete variables which may be in one of multiple states, the simplest being rounding an internal continuous representation of the solution to the closest coordinates at which the objective function can be evaluated. Methods also exist to extend the particle swarm to search combinatorial variables where moving from state to state does not have the same meaning as moving in a coordinate space. Quantum annealing In mathematics and applications, quantum annealing (QA) is a general method for finding the global minimum of a given objective function over a given set of candidate solutions (the search space), by a process analogous to quantum fluctuations. It is used mainly for problems where the search space is discrete (combinatorial optimization problems) with many local minima, such as finding the ground states of a glassy system. In quantum annealing, a "current state" (the current candidate solution) s is randomly replaced by a randomly selected neighbor state s' if the latter has a lower "energy" (value of the objective function). The process is controlled by the "tunneling field strength" T, a parameter that determines the extent of the neighborhood of states explored by the method. The tunneling field starts high, so that the neighborhood extends over the whole search space; and is slowly reduced through the computation, until the neighborhood shrinks to those few states that differ minimally from the current states. Quantum annealing can be compared to simulated annealing (SA), whose "temperature" parameter plays a similar role to QA's tunneling field strength. However, in SA the neighborhood stays the same throughout the search, and the temperature determines the probability of moving to a state of higher "energy". In QA, the tunneling field strength determines instead the neighborhood radius, i.e. the mean distance between the next candidate s' and the current candidate s. In more elaborate SA variants (such as Adaptive simulated annealing), the neighborhood radius is also varied using acceptance rate percentages or the temperature value. The tunneling field is basically a kinetic energy term that does not commute with the classical potential energy part of the original glass. The whole process can be simulated in a computer using quantum Monte Carlo (or other stochastic technique), thus obtaining a heuristic algorithm for finding the ground state of the classical glass. It is speculated that in a quantum computer, such simulations would be much more efficient and exact than those done on a classical computer, due to quantum parallelism realized by the actual superposition of all the classical configurations at any instant. By then, the system finds a very deep (likely the global) minimum and settles there. At the end, we are left with the classical system at its global minimum. In the case of annealing a purely mathematical objective function, one may consider the variables in the problem to be classical degrees of freedom, and the cost functions to be the potential energy function (classical Hamiltonian). Then a suitable term consisting of non-commuting variable(s) (i.e. variables that have a non-zero commutator with the variables of the


original mathematical problem) has to be introduced artificially in the Hamiltonian to play the role of the tunneling field (kinetic part). Then one may carry out the simulation with the quantum Hamiltonian thus constructed (the original function + non-commuting part) just as described above. Here, there is a choice in selecting the non-commuting term and the efficiency of annealing may depend on that. It has been demonstrated experimentally as well as theoretically that quantum annealing can indeed outperform thermal annealing in certain cases, especially where the potential energy (cost) landscape consists of very high but thin barriers surrounding shallow local minima. Since thermal transition probabilities (proportional to exp(−Δ/kBT), where Δ is the barrier height, T the temperature and kB the Boltzmann constant) depend only on the height of the barriers, it is very difficult for thermal fluctuations to get the system out from such local minima. But quantum tunneling probabilities through a barrier depend not only on the height of the barrier, but also on its width w; if the barriers are thin enough, quantum fluctuations may bring the system out of the shallow local minima surrounded by them. Simulated annealing Simulated annealing (SA) is a generic probabilistic metaheuristic for the global optimization problem of applied mathematics, namely locating a good approximation to the global minimum of a given function in a large search space. It is often used when the search space is discrete (e.g., all tours that visit a given set of cities). For certain problems, simulated annealing may be more effective than exhaustive enumeration provided that the goal is merely to find an acceptably good solution in a fixed amount of time, rather than the best possible solution. The name and inspiration come from annealing in metallurgy, a technique involving heating and controlled cooling of a material to increase the size of its crystals and reduce their defects. The heat causes the atoms to become unstuck from their initial positions (a local minimum of the internal energy) and wander randomly through states of higher energy; the slow cooling gives them more chances of finding configurations with lower internal energy than the initial one. By analogy with this physical process, each step of the SA algorithm replaces the current solution by a random "nearby" solution, chosen with a probability that depends on the difference between the corresponding function values and on a global parameter T (called the temperature), that is gradually decreased during the process. The dependency is such that the current solution changes almost randomly when T is large, but increasingly "downhill" as T goes to zero. The allowance for "uphill" moves saves the method from becoming stuck at local minima, which are the bane of greedier methods. The method was independently described by S. Kirkpatrick, C. D. Gelatt and M. P. Vecchi in 1983, and by V. Černý in 1985. The method is an adaptation of the Metropolis-Hastings algorithm, a Monte Carlo method to generate sample states of a thermodynamic system, invented by N. Metropolis et al. in 1953. In the simulated annealing (SA) method, each point s of the search space is analogous to a state of some physical system, and the function E(s) to be minimized is analogous to the internal energy of the system in that state. The goal is to bring the system, from an arbitrary initial state, to a state with the minimum possible energy.
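A minimal simulated annealing sketch using the Metropolis-style acceptance rule exp(−(e' − e)/T) for uphill moves, as described below, and a simple geometric cooling schedule; the one-dimensional energy function is invented for illustration:

import math
import random

random.seed(0)

# Invented multimodal energy function; its global minimum lies near x = -0.5.
def energy(x):
    return 0.1 * x * x + math.sin(3 * x)

x = 8.0                  # arbitrary starting state
e = energy(x)
T = 5.0                  # initial temperature
for _ in range(20_000):
    x_new = x + random.uniform(-1.0, 1.0)   # candidate "nearby" state
    e_new = energy(x_new)
    # Accept downhill moves always, uphill moves with probability exp(-(e_new - e) / T).
    if e_new < e or random.random() < math.exp(-(e_new - e) / T):
        x, e = x_new, e_new
    T *= 0.9995          # geometric cooling schedule

print(f"final state x = {x:.3f}, energy = {e:.3f}")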


At each step, the SA heuristic considers some neighbour s' of the current state s, and probabilistically decides between moving the system to state s' or staying in state s. The probabilities are chosen so that the system ultimately tends to move to states of lower energy. Typically this step is repeated until the system reaches a state that is good enough for the application, or until a given computation budget has been exhausted.

The neighbours of each state (the candidate moves) are specified by the user, usually in an application-specific way. For example, in the traveling salesman problem, each state is typically defined as a particular tour (a permutation of the cities to be visited), and one could define the neighbours of a tour as those tours that can be obtained from it by exchanging any pair of consecutive cities.

The probability of making the transition from the current state s to a candidate new state s' is specified by an acceptance probability function P(e, e', T) that depends on the energies e = E(s) and e' = E(s') of the two states, and on a global time-varying parameter T called the temperature.

One essential requirement for the probability function P is that it must be nonzero when e' > e, meaning that the system may move to the new state even when it is worse (has a higher energy) than the current one. It is this feature that prevents the method from becoming stuck in a local minimum, a state that is worse than the global minimum yet better than any of its neighbours. On the other hand, when T goes to zero, the probability P(e, e', T) must tend to zero if e' > e, and to a positive value if e' < e. That way, for sufficiently small values of T, the system will increasingly favor moves that go "downhill" (to lower energy values) and avoid those that go "uphill". In particular, when T becomes 0, the procedure reduces to the greedy algorithm, which makes a move only if it goes downhill.

In the original description of SA, the probability P(e, e', T) was defined as 1 when e' < e, i.e., the procedure always moved downhill when it found a way to do so, irrespective of the temperature. Many descriptions and implementations of SA still take this condition as part of the method's definition. However, this condition is not essential for the method to work, and one may argue that it is both counterproductive and contrary to its spirit. The P function is usually chosen so that the probability of accepting a move decreases when the difference e' − e increases; that is, small uphill moves are more likely than large ones. However, this requirement is not strictly necessary, provided that the above requirements are met.

Given these properties, the evolution of the state s depends crucially on the temperature T. Roughly speaking, the evolution of s is sensitive to coarser energy variations when T is large, and to finer variations when T is small.

Another essential feature of the SA method is that the temperature is gradually reduced as the simulation proceeds. Initially, T is set to a high value (or infinity), and it is decreased at each step according to some annealing schedule, which may be specified by the user but must end with T = 0 towards the end of the allotted time budget. In this way, the system is expected to wander initially towards a broad region of the search space containing good solutions, ignoring small features of the energy function; then drift towards low-energy regions that become narrower and narrower; and finally move downhill according to the steepest descent heuristic.
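The loop just described can be written down in a few lines. The sketch below is one possible realization, not the definitive algorithm: the Metropolis-style acceptance function exp(−(e' − e)/T), the geometric cooling schedule (which approaches but never reaches T = 0), and the names energy, random_neighbour and s0 are assumptions chosen for illustration.

```python
import math
import random

def simulated_annealing(energy, random_neighbour, s0,
                        t_start=1.0, t_end=1e-3, steps=10_000):
    """Generic SA loop: always accept downhill moves, accept uphill moves with
    probability exp(-(e' - e)/T), and cool T geometrically towards t_end."""
    s, e = s0, energy(s0)
    best, e_best = s, e
    cooling = (t_end / t_start) ** (1.0 / steps)
    t = t_start
    for _ in range(steps):
        s_new = random_neighbour(s)      # candidate move, supplied by the user
        e_new = energy(s_new)
        # Acceptance probability P(e, e', T): 1 if downhill, exp(-(e' - e)/T) if uphill
        if e_new < e or random.random() < math.exp(-(e_new - e) / t):
            s, e = s_new, e_new
            if e < e_best:
                best, e_best = s, e
        t *= cooling                     # annealing schedule
    return best, e_best

# Illustrative use on a simple 1-D landscape
f = lambda x: x * x + 10 * math.sin(x)
neighbour = lambda x: x + random.uniform(-1.0, 1.0)
print(simulated_annealing(f, neighbour, s0=5.0))
```

At high T the exponential acceptance term is close to 1 even for large uphill steps, so the state wanders almost freely; as T shrinks, uphill moves are accepted ever more rarely and the search behaves increasingly like greedy descent.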


It can be shown that, for any given finite problem, the probability that the simulated annealing algorithm terminates with the globally optimal solution approaches 1 as the annealing schedule is extended. This theoretical result, however, is not particularly helpful, since the time required to ensure a significant probability of success will usually exceed the time required for a complete search of the solution space.

Stochastic tunneling

Stochastic tunneling (STUN) is an approach to global optimization based on Monte Carlo sampling of the function to be minimized. Monte Carlo based optimization techniques sample the objective function by randomly "hopping" from the current solution vector to another one with a difference ΔE in the function value. The acceptance probability of such a trial jump is in most cases chosen to be min(1, exp(−β ΔE)) (the Metropolis criterion), with an appropriate parameter β.

The general idea of STUN is to circumvent the slow dynamics of ill-shaped energy functions, encountered for example in spin glasses, by tunneling through such barriers. This goal is achieved by Monte Carlo sampling of a transformed function that lacks this slow dynamics. In the "standard form" the transformation reads

    f_STUN(x) = 1 − exp(−γ (f(x) − f_0)),

where f_0 is the lowest function value found so far and γ controls the steepness of the transformation. This transformation preserves the loci of the minima.

Tabu search

Tabu search is a mathematical optimization method belonging to the class of local search techniques. Tabu search enhances the performance of a local search method by using memory structures: once a potential solution has been determined, it is marked as "taboo" ("tabu" being a different spelling of the same word) so that the algorithm does not visit that possibility repeatedly. Tabu search is attributed to Fred Glover.

Tabu search is a metaheuristic algorithm that can be used for solving combinatorial optimization problems, such as the traveling salesman problem (TSP). Tabu search uses a local or neighbourhood search procedure to iteratively move from a solution x to a solution x' in the neighbourhood of x, until some stopping criterion has been satisfied. To explore regions of the search space that would be left unexplored by the plain local search procedure, tabu search modifies the neighbourhood structure of each solution as the search progresses. The solutions admitted to N*(x), the new neighbourhood, are determined through the use of memory structures. The search then progresses by iteratively moving from a solution x to a solution x' in N*(x).

Perhaps the most important type of memory structure used to determine the solutions admitted to N*(x) is the tabu list. In its simplest form, a tabu list is a short-term memory which contains the solutions that have been visited in the recent past (less than n iterations ago, where n, the number of previous solutions to be stored, is also called the tabu tenure).


Tabu search excludes solutions in the tabu list from N*(x). A variation of the tabu list prohibits solutions that have certain attributes (e.g., solutions to the traveling salesman problem which include undesirable arcs) or prevents certain moves (e.g., an arc that was added to a TSP tour cannot be removed in the next n moves). Selected attributes in recently visited solutions are labeled "tabu-active", and solutions that contain tabu-active elements are tabu. This type of short-term memory is also called "recency-based" memory.

Tabu lists containing attributes can be more effective for some domains, although they raise a new problem. When a single attribute is marked as tabu, this typically results in more than one solution being tabu. Some of these solutions, which must now be avoided, could be of excellent quality and might not have been visited yet. To mitigate this problem, "aspiration criteria" are introduced: these override a solution's tabu state, thereby including the otherwise excluded solution in the allowed set. A commonly used aspiration criterion is to allow solutions which are better than the currently known best solution.
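The following is a minimal sketch of the scheme just described: a recency-based tabu list of whole solutions with a fixed tenure, plus the best-solution aspiration criterion. The functions cost and neighbours, the choice of storing complete solutions rather than attributes, and the tenure value are all illustrative assumptions rather than a prescribed implementation.

```python
from collections import deque

def tabu_search(cost, neighbours, x0, tenure=7, max_iters=1000):
    """Basic tabu search: move to the best admissible neighbour each iteration,
    keeping recently visited solutions on a fixed-length tabu list."""
    current = x0
    best, best_cost = x0, cost(x0)
    tabu = deque([x0], maxlen=tenure)            # short-term, recency-based memory
    for _ in range(max_iters):
        candidates = []
        for cand in neighbours(current):
            c = cost(cand)
            # Aspiration criterion: a tabu move is allowed if it beats the best known solution
            if cand not in tabu or c < best_cost:
                candidates.append((c, cand))
        if not candidates:
            break                                # the entire neighbourhood is tabu
        c, current = min(candidates, key=lambda pair: pair[0])
        tabu.append(current)                     # oldest entry drops off automatically
        if c < best_cost:
            best, best_cost = current, c
    return best, best_cost

# Illustrative use: minimise a simple function over the integers
cost = lambda x: (x - 3) ** 2
neighbours = lambda x: [x - 1, x + 1]
print(tabu_search(cost, neighbours, x0=-10))
```

Note that the chosen neighbour may be worse than the current solution; it is the tabu list, not a temperature parameter as in simulated annealing, that keeps the search from immediately cycling back into the local minimum it just left.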

