
CHAPTER 1 (4 LECTURES)

NUMERICAL ALGORITHMS AND ROUNDOFF ERRORS

1. Numerical analysis
Numerical analysis is the branch of mathematics that studies and develops
algorithms which use numerical approximation to solve the problems of mathematical analysis (continuous mathematics). Numerical techniques are widely
used by scientists and engineers to solve their problems. A major advantage
of numerical techniques is that a numerical answer can be obtained even
when a problem has no analytical solution. The result of a numerical
computation is, in general, an approximation, but it can usually be made as
accurate as desired; for example, we can approximate the value of √2 to any
required number of digits.


In this chapter, we introduce and discuss some basic concepts of scientific
computing. We begin with a discussion of floating-point representation and
then discuss the most fundamental source of imperfection in numerical
computing, namely roundoff errors. We also discuss sources of error and the
stability of numerical algorithms.
2. Numerical analysis and the art of scientific computing
Scientific computing is a discipline concerned with the development and
study of numerical algorithms for solving mathematical problems that arise
in various disciplines in science and engineering. Typically, the starting point
is a given mathematical model which has been formulated in an attempt
to explain and understand an observed phenomenon in biology, chemistry,
physics, economics, or any engineering or scientific discipline. We will concentrate on those mathematical models which are continuous (or piecewise
continuous) and are difficult or impossible to solve analytically: this is usually the case in practice. Relevant application areas within computer science
include graphics, vision and motion analysis, image and signal processing,
search engines and data mining, machine learning, hybrid and embedded
systems, and many more. In order to solve such a model approximately on
a computer, the (continuous, or piecewise continuous) problem is approximated by a discrete one. Continuous functions are approximated by finite
arrays of values. Algorithms are then sought which approximately solve the
mathematical problem efficiently, accurately and reliably.
3. Floating-point representation of numbers
Any real number is represented by an infinite sequence of digits. For example,

8/3 = 2.66666... = (2/10 + 6/10^2 + 6/10^3 + ...) × 10^1.

This is an infinite series, but computers use a finite amount of memory to
represent numbers. Thus only a finite number of digits may be used to
represent any number, no matter what representation method is used.
For example, we can chop the infinite decimal representation of 8/3 after 4
digits:

8/3 ≈ (2/10 + 6/10^2 + 6/10^3 + 6/10^4) × 10^1 = 0.2666 × 10^1.

Generalizing this, we say that the number has n decimal digits and call n
the precision.
For each real number x, we associate a floating-point representation, denoted
by fl(x), given by

fl(x) = ±(0.a_1 a_2 ... a_n)_β × β^e,

where the β-based fraction is called the mantissa (with all a_i integers,
0 ≤ a_i ≤ β − 1) and e is known as the exponent. This representation is
called the base-β floating-point representation of x.
For example,

42.965 = 4 × 10^1 + 2 × 10^0 + 9 × 10^(−1) + 6 × 10^(−2) + 5 × 10^(−3)
       = 0.42965 × 10^2,
0.00234 = 0.234 × 10^(−2).

0 is written as 0.00...0 × β^e. Likewise, we can use the binary number system:
any real x can be written as

x = ±q × 2^m

with 1/2 ≤ q < 1 and some integer m. Both q and m are expressed in
terms of binary numbers. For example,

(1001.1101)_2 = 1 × 2^3 + 1 × 2^0 + 1 × 2^(−1) + 1 × 2^(−2) + 1 × 2^(−4)
             = (9.8125)_10.

Definition 3.1 (Normal form). A non-zero floating-point number is in normal
form if the mantissa q satisfies 1/β ≤ |q| < 1, i.e., its value lies in
(−1, −1/β] or [1/β, 1).
Remark: Without this restriction the representation is not unique.
For example, 0.2666 × 10^1 = 0.02666 × 10^2.
Therefore, we normalize the representation by requiring a_1 ≠ 0. Not only is
the precision limited to a finite number of digits, but the range of the
exponent is also restricted: there are integers m and M such that m ≤ e ≤ M.
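The normalized representation is easy to experiment with. The following sketch (a helper of my own, not the text's notation) chops a number to an n-digit normalized base-10 mantissa using Python's `decimal` module:

```python
from decimal import Decimal

# Chop x to an n-digit normalized base-10 mantissa; returns (sign, mantissa, e)
# with x ~ sign * mantissa * 10**e and leading digit a1 nonzero.
def normalize(x, n):
    if x == 0:
        return (1, "0." + "0" * n, 0)
    sign, digits, exp = Decimal(str(abs(x))).normalize().as_tuple()
    e = len(digits) + exp                                # exponent of the 0.d1 d2 ... form
    mant = "".join(map(str, digits[:n])).ljust(n, "0")   # chop after n digits
    return (-1 if x < 0 else 1, "0." + mant, e)

print(normalize(8 / 3, 4))    # (1, '0.2666', 1), i.e. 0.2666 * 10^1
print(normalize(42.965, 5))   # (1, '0.42965', 2)
print(normalize(0.00234, 3))  # (1, '0.234', -2)
```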


3.1. Rounding and chopping. Let x be any real number and fl(x) its
machine approximation. There are two ways to do the cutting needed to store
a real number

x = ±(0.a_1 a_2 ... a_n a_{n+1} ...)_β × β^e,   a_1 ≠ 0.

(1) Chopping: we ignore the digits after a_n and write the number as

fl(x) = ±(0.a_1 a_2 ... a_n)_β × β^e.

(2) Rounding: rounding is defined as

fl(x) = ±(0.a_1 a_2 ... a_n)_β × β^e,                        if 0 ≤ a_{n+1} < β/2 (rounding down),
fl(x) = ±[(0.a_1 a_2 ... a_n)_β + (0.00...01)_β] × β^e,      if β/2 ≤ a_{n+1} < β (rounding up).
Example 1. With two decimal digits,

fl(6/7) = 0.86 × 10^0 (rounding),   fl(6/7) = 0.85 × 10^0 (chopping).
Rules for rounding off numbers:
(1) If the digit to be dropped is greater than 5, the last retained digit is
increased by one. For example,
12.6 is rounded to 13.
(2) If the digit to be dropped is less than 5, the last remaining digit is left
as it is. For example,
12.4 is rounded to 12.
(3) If the digit to be dropped is 5, and if any digit following it is not zero,
the last remaining digit is increased by one. For example,
12.51 is rounded to 13.
(4) If the digit to be dropped is 5 and is followed only by zeros, the last
remaining digit is increased by one if it is odd, but left as it is if even. For
example,
11.5 is rounded to 12, and 12.5 is rounded to 12.
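Rules (1) through (4) together are the "round half to even" convention. Python's built-in round() follows the same rule, so the examples above can be checked directly:

```python
# Each print matches one of the rounding rules stated above.
print(round(12.6))   # 13  (rule 1: dropped digit 6 > 5)
print(round(12.4))   # 12  (rule 2: dropped digit 4 < 5)
print(round(12.51))  # 13  (rule 3: 5 followed by a nonzero digit)
print(round(11.5))   # 12  (rule 4: last retained digit odd, round up)
print(round(12.5))   # 12  (rule 4: last retained digit even, leave it)
```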
Definition 3.2 (Absolute and relative error). If x* is the approximation to
the exact value x, then the absolute error is |x − x*|, and the relative
error is

|x − x*| / |x|.

Remark: As a measure of accuracy, the absolute error may be misleading;
the relative error is more meaningful.
Definition 3.3 (Overflow and underflow). An overflow occurs when a number
is too large to fit into the floating-point system in use, i.e., e > M.
An underflow occurs when a number is too small, i.e., e < m.
When overflow occurs in the course of a calculation, it is generally fatal.
But underflow is non-fatal: the system usually sets the number to 0 and
continues. (Matlab does this quietly.)
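Both conditions are easy to provoke in IEEE double precision, which Python floats use (largest finite value about 1.8 × 10^308, smallest positive subnormal about 5 × 10^−324). Note that rather than aborting, IEEE arithmetic signals overflow with an infinity:

```python
import math

# Overflow: pushing past the largest finite double yields inf.
big = 1.0e308
print(big * 10, math.isinf(big * 10))  # inf True

# Underflow: a result below the subnormal range is quietly set to 0.
small = 1.0e-308
print(small * 1e-100)  # 0.0
```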


4. Errors in numerical approximations

Let x be any real number we want to represent in a computer, and let fl(x)
be the representation of x in the computer. What is the largest possible
value of |x − fl(x)| / |x|? In the worst case, how much information do we
lose to roundoff errors or chopping errors?
Chopping errors: Let

x = ±(0.a_1 a_2 ... a_n a_{n+1} ...)_β × β^e = ± ( Σ_{i=1}^∞ a_i/β^i ) β^e,   a_1 ≠ 0,

fl(x) = ±(0.a_1 a_2 ... a_n)_β × β^e = ± ( Σ_{i=1}^n a_i/β^i ) β^e.

Therefore

|x − fl(x)| = ( Σ_{i=n+1}^∞ a_i/β^i ) β^e,

that is,

β^(−e) |x − fl(x)| = Σ_{i=n+1}^∞ a_i/β^i.

Now since each a_i ≤ β − 1,

β^(−e) |x − fl(x)| ≤ (β − 1) Σ_{i=n+1}^∞ 1/β^i
                  = (β − 1) [1/β^(n+1) + 1/β^(n+2) + ...]
                  = (β − 1) · (1/β^(n+1)) / (1 − 1/β)
                  = 1/β^n.

Now

|x| = (0.a_1 a_2 ... )_β × β^e ≥ (1/β) β^e,

since a_1 ≥ 1. Therefore

|x − fl(x)| / |x| ≤ (β^(−n) β^e) / (β^(−1) β^e) = β^(1−n).
Rounding errors: For rounding,

fl(x) = (0.a_1 a_2 ... a_n)_β × β^e = ( Σ_{i=1}^n a_i/β^i ) β^e,                     if a_{n+1} < β/2,
fl(x) = (0.a_1 a_2 ... a_{n−1} [a_n + 1])_β × β^e = ( 1/β^n + Σ_{i=1}^n a_i/β^i ) β^e,   if a_{n+1} ≥ β/2.

For a_{n+1} < β/2,

β^(−e) |x − fl(x)| = Σ_{i=n+1}^∞ a_i/β^i = a_{n+1}/β^(n+1) + Σ_{i=n+2}^∞ a_i/β^i
                  ≤ (β/2 − 1)/β^(n+1) + (β − 1) Σ_{i=n+2}^∞ 1/β^i
                  = (β/2 − 1)/β^(n+1) + 1/β^(n+1)
                  = (1/2) β^(−n).

For a_{n+1} ≥ β/2, since Σ_{i=n+1}^∞ a_i/β^i ≤ 1/β^n,

β^(−e) |x − fl(x)| = | Σ_{i=n+1}^∞ a_i/β^i − 1/β^n |
                  = 1/β^n − Σ_{i=n+1}^∞ a_i/β^i
                  = 1/β^n − a_{n+1}/β^(n+1) − Σ_{i=n+2}^∞ a_i/β^i
                  ≤ 1/β^n − a_{n+1}/β^(n+1).

Since a_{n+1} ≥ β/2, therefore

β^(−e) |x − fl(x)| ≤ 1/β^n − (β/2)/β^(n+1) = 1/β^n − 1/(2β^n) = (1/2) β^(−n).

Therefore, for both cases,

|x − fl(x)| ≤ (1/2) β^(e−n).

Now

|x − fl(x)| / |x| ≤ ((1/2) β^(−n) β^e) / (β^(−1) β^e) = (1/2) β^(1−n).
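The two bounds just derived, β^(1−n) for chopping and (1/2)β^(1−n) for rounding, can be checked empirically in base β = 10 using the decimal module; this sketch samples random values and verifies that neither bound is violated:

```python
import random
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_EVEN

# Empirical check (base 10) of the bounds derived above:
#   chopping: |x - fl(x)|/|x| <= 10**(1-n)
#   rounding: |x - fl(x)|/|x| <= (1/2) * 10**(1-n)
n = 4
chop = Context(prec=n, rounding=ROUND_DOWN)       # chop to n digits
rnd = Context(prec=n, rounding=ROUND_HALF_EVEN)   # round to n digits

random.seed(0)
for _ in range(1000):
    x = Decimal(str(random.uniform(1e-3, 1e3)))
    rel_chop = abs((x - chop.plus(x)) / x)
    rel_round = abs((x - rnd.plus(x)) / x)
    assert rel_chop <= Decimal(10) ** (1 - n)
    assert rel_round <= Decimal("0.5") * Decimal(10) ** (1 - n)
print("both bounds hold on 1000 random samples")
```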
5. Significant Figures

All measurements are approximations. No measuring device can give


perfect measurements without experimental uncertainty. By convention, a
mass measured to 13.2 g is said to have an absolute uncertainty of plus or
minus 0.1 g and is said to have been measured to the nearest 0.1 g. In
other words, we are somewhat uncertain about that last digit: it could be a
2, but then again, it could be a 1 or a 3. A mass of 13.20 g indicates an
absolute uncertainty of plus or minus 0.01 g.
The number of significant figures in a result is simply the number of figures
that are known with some degree of reliability.
The number 25.4 is said to have 3 significant figures; the number 25.40
is said to have 4 significant figures.
Rules for deciding the number of significant figures in a measured quantity:
(1) All nonzero digits are significant: 1.234 has 4 significant figures,
1.2 has 2 significant figures.


(2) Zeros between nonzero digits are significant: 1002 has 4 significant figures.
(3) Leading zeros to the left of the first nonzero digits are not significant;
such zeros merely indicate the position of the decimal point: 0.001 has only
1 significant figure.
(4) Trailing zeros that are also to the right of a decimal point in a number
are significant: 0.0230 has 3 significant figures.
(5) When a number ends in zeros that are not to the right of a decimal
point, the zeros are not necessarily significant: 190 may be 2 or 3 significant
figures, 50600 may be 3, 4, or 5 significant figures.
The potential ambiguity in the last rule can be avoided by the use of standard
exponential, or scientific, notation. For example, depending on whether
the number of significant figures is 3, 4, or 5, we would write 50600 calories
as:
5.06 × 10^4 (3 significant figures),
5.060 × 10^4 (4 significant figures), or
5.0600 × 10^4 (5 significant figures).
What is an exact number? Some numbers are exact because they are
known with complete certainty. Most exact numbers are integers: exactly
12 inches are in a foot, there might be exactly 23 students in a class. Exact
numbers are often found as conversion factors or as counts of objects. Exact
numbers can be considered to have an infinite number of significant figures.
Thus, the number of apparent significant figures in any exact number can be
ignored as a limiting factor in determining the number of significant figures
in the result of a calculation.

6. Rules for mathematical operations


In carrying out calculations, the general rule is that the accuracy of a
calculated result is limited by the least accurate measurement involved in
the calculation. In addition and subtraction, the result is rounded off so
that it has the same number of decimal places as the measurement having the
fewest decimal places. For example,
100 (assume 3 significant figures) + 23.643 (5 significant figures) = 123.643,
which should be rounded to 124 (3 significant figures). Note, however, that
it is possible for two numbers to have no common digits (significant figures
in the same digit columns).
In multiplication and division, the result should be rounded off so as to have
the same number of significant figures as in the component with the least
number of significant figures. For example,
3.0 (2 significant figures) × 12.60 (4 significant figures) = 37.800,
which should be rounded to 38 (2 significant figures).
Let X = f(x_1, x_2, ..., x_n) be a function of n variables. We want to
determine the error ΔX in X due to the errors Δx_1, Δx_2, ..., Δx_n,
respectively:

X + ΔX = f(x_1 + Δx_1, x_2 + Δx_2, ..., x_n + Δx_n).
Error in addition of numbers. Let X = x_1 + x_2 + ... + x_n.
Therefore

X + ΔX = (x_1 + Δx_1) + (x_2 + Δx_2) + ... + (x_n + Δx_n)
       = (x_1 + x_2 + ... + x_n) + (Δx_1 + Δx_2 + ... + Δx_n).

Therefore,

ΔX = Δx_1 + Δx_2 + ... + Δx_n;

this is the absolute error. Dividing by X we get

ΔX/X = Δx_1/X + Δx_2/X + ... + Δx_n/X,

which is the relative error. Now,

|ΔX/X| ≤ |Δx_1/X| + |Δx_2/X| + ... + |Δx_n/X|,

which is the maximum relative error. Therefore, when the given numbers are
added, the magnitude of the absolute error in the result is at most the sum
of the magnitudes of the absolute errors in those numbers.
Error in subtraction of numbers. As in the case of addition, we can
obtain the maximum absolute error for subtraction of numbers:

|ΔX| ≤ |Δx_1| + |Δx_2|.

Also

|ΔX/X| ≤ |Δx_1/X| + |Δx_2/X|,

which is the maximum relative error in subtraction of numbers.
Error in product of numbers. Let X = x_1 x_2 ... x_n. Then, using the
general formula for the error,

ΔX = Δx_1 (∂X/∂x_1) + Δx_2 (∂X/∂x_2) + ... + Δx_n (∂X/∂x_n).

We have

ΔX/X = (Δx_1/X)(∂X/∂x_1) + (Δx_2/X)(∂X/∂x_2) + ... + (Δx_n/X)(∂X/∂x_n).

Now

(1/X)(∂X/∂x_1) = (x_2 x_3 ... x_n) / (x_1 x_2 x_3 ... x_n) = 1/x_1,
(1/X)(∂X/∂x_2) = (x_1 x_3 ... x_n) / (x_1 x_2 x_3 ... x_n) = 1/x_2,
...
(1/X)(∂X/∂x_n) = (x_1 x_2 ... x_{n−1}) / (x_1 x_2 x_3 ... x_n) = 1/x_n.

Therefore

ΔX/X = Δx_1/x_1 + Δx_2/x_2 + ... + Δx_n/x_n.

Therefore the maximum relative and absolute errors are given by

E_r = |ΔX/X| ≤ |Δx_1/x_1| + |Δx_2/x_2| + ... + |Δx_n/x_n|,
E_a = |ΔX| = |ΔX/X| · |X| = |ΔX/X| · |x_1 x_2 ... x_n|.
Error in division of numbers. Let X = x_1/x_2. Then

ΔX = Δx_1 (∂X/∂x_1) + Δx_2 (∂X/∂x_2).

We have

ΔX/X = (Δx_1/X)(∂X/∂x_1) + (Δx_2/X)(∂X/∂x_2) = Δx_1/x_1 − Δx_2/x_2.

Therefore the relative error is

E_r = |ΔX/X| ≤ |Δx_1/x_1| + |Δx_2/x_2|,

and the absolute error is

E_a = |ΔX/X| · |X|.
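As a quick sanity check of the product rule ΔX/X ≈ Δx_1/x_1 + ... + Δx_n/x_n, the following sketch perturbs three hypothetical factors by small errors and compares the actual relative error of the product with the first-order prediction:

```python
# First-order check of the product rule: the relative error of X = x1*x2*x3
# is approximately the sum of the factors' relative errors.
# The values and errors below are hypothetical, chosen only for illustration.
x1, x2, x3 = 2.0, 5.0, 8.0
dx1, dx2, dx3 = 1e-4, -2e-4, 3e-4

X = x1 * x2 * x3
X_perturbed = (x1 + dx1) * (x2 + dx2) * (x3 + dx3)

actual_rel = (X_perturbed - X) / X
predicted_rel = dx1 / x1 + dx2 / x2 + dx3 / x3
print(actual_rel, predicted_rel)  # agree up to second-order terms (~1e-9)
```

The residual difference comes from the products of two or more Δx terms, which the first-order formula ignores.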
Example 2. Add the following floating-point numbers: 0.4546e3 and 0.5433e7.
Sol. The exponents are unequal. To add these floating-point numbers, shift
the operand with the smaller exponent to match the larger one:
0.4546e3 = 0.00004546e7, which becomes 0.0000e7 in a 4-digit mantissa. Hence
0.5433e7 + 0.0000e7 = 0.5433e7.
Example 3. Add the following floating-point numbers: 0.6434e3 and 0.4845e3.
Sol. The exponents are equal, but on adding we get 1.1279e3; the mantissa
now has 5 digits and is greater than 1, so it is shifted right one place.
Hence we get the resultant value 0.1127e4.
Example 4. Subtract the following floating-point numbers:
1. 0.5424e−99 from 0.5452e−99
2. 0.3862e−7 from 0.9682e−7
Sol. 1. On subtracting we get 0.0028e−99. This is a floating-point
number but it is not in normalized form. To normalize it, we shift the
mantissa to the left by one place, giving 0.0280e−100; the exponent now
lies below the minimum allowed, so this is an underflow condition.
2. Similarly, after subtraction we get 0.5820e−7.


Example 5. Multiply the following floating-point numbers:
1. 0.1111e74 and 0.2000e80
2. 0.1234e−49 and 0.1111e−54
Sol. 1. On multiplying we have 0.1111e74 × 0.2000e80 = 0.2222e153. The
exponent exceeds the maximum allowed, which shows the overflow condition
for normalized floating-point numbers.
2. Similarly, the second multiplication gives 0.1370e−104, which shows the
underflow condition for floating-point numbers.
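Examples 3 and 5 can be reproduced in a 4-digit chopping sandbox built from a decimal Context of my own choosing. (decimal's exponent range is far wider than the two-digit exponents assumed above, so the overflow itself is not reproduced, only the mantissa/exponent arithmetic.)

```python
from decimal import Context, Decimal, ROUND_DOWN

# 4-digit mantissa with chopping, mimicking the worked examples.
ctx = Context(prec=4, rounding=ROUND_DOWN)

# Example 3: 0.6434e3 + 0.4845e3 = 1.1279e3, renormalized to 0.1127e4.
s = ctx.add(Decimal("0.6434E3"), Decimal("0.4845E3"))
print(s)  # 1127, i.e. 0.1127e4

# Example 5.1: mantissas multiply, exponents add, giving 0.2222e153.
p = ctx.multiply(Decimal("0.1111E74"), Decimal("0.2000E80"))
print(p)  # 2.222E+152, i.e. 0.2222e153
```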
7. Loss of significance, stability and conditioning
Roundoff errors are inevitable and difficult to control. Other types of
errors which occur in computation may be under our control. The subject of
numerical analysis is largely preoccupied with understanding and controlling
errors of various kinds. Here we examine some of them.
7.1. Loss of significance. One of the most common error-producing
calculations involves the cancellation of significant digits due to the
subtraction of nearly equal numbers (or the addition of one very large
number and one very small number). The phenomenon can be illustrated with
the following example.
Example 6. If x = 0.3721478693 and y = 0.3720230572, what is the relative
error in the computation of x − y using five decimal digits of accuracy?
Sol. We can compute with ten decimal digits of accuracy and take that as
exact:

x − y = 0.0001248121.

Both x and y will be rounded to five digits before subtraction. Thus

fl(x) = 0.37215,   fl(y) = 0.37202,
fl(x) − fl(y) = 0.13000 × 10^(−3).

The relative error, therefore, is

E_r = |(x − y) − (fl(x) − fl(y))| / |x − y| ≈ 0.04 = 4%.

Example 7. Consider the two equivalent forms f(x) = x(√(x+1) − √x) and
g(x) = x/(√(x+1) + √x). Compare the evaluations of f(500) and g(500)
using 6 digits and rounding.
Sol.

f(500) = 0.500000 × 10^3 × (√501 − √500)
       = 0.500000 × 10^3 × (0.223830 × 10^2 − 0.223607 × 10^2)
       = 0.111500 × 10^2.

g(500) = 500 / (√501 + √500)
       = (0.500000 × 10^3) / (0.223830 × 10^2 + 0.223607 × 10^2)
       = (0.500000 × 10^3) / (0.447437 × 10^2)
       = 0.111748 × 10^2.

If more digits are used,

f(500) = 500 (√501 − √500) = 11.1747583.

Hence, with the same number of digits, g(x) is better.
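The same comparison can be reproduced by setting the decimal module to 6 significant digits:

```python
from decimal import Decimal, getcontext

# Re-running Example 7 with 6-digit decimal arithmetic.
getcontext().prec = 6
x = Decimal(500)
r1 = (x + 1).sqrt()   # 22.3830
r2 = x.sqrt()         # 22.3607
f = x * (r1 - r2)     # cancellation in r1 - r2
g = x / (r1 + r2)     # no cancellation
print(f, g)  # 11.1500 11.1748 (true value 11.1747583...)
```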


Example 8. The quadratic formula is used for computing the roots of the
equation ax^2 + bx + c = 0, a ≠ 0, which are given by

x = (−b ± √(b^2 − 4ac)) / (2a).

Consider the equation x^2 + 62.10x + 1 = 0 and discuss the numerical results.
Sol. Using the quadratic formula and 8-digit rounding arithmetic, we obtain
the two roots

x_1 = −0.01610723,   x_2 = −62.08390.

We use these values as exact values. Now we perform the calculations with
4-digit rounding arithmetic. We have

√(b^2 − 4ac) = √(62.10^2 − 4.000) = √(3856 − 4.000) = √3852 = 62.06,

and

fl(x_1) = (−62.10 + 62.06)/2.000 = −0.02000.

The relative error in computing x_1 is

|fl(x_1) − x_1| / |x_1| = |−0.02000 + 0.01610723| / |−0.01610723| = 0.2417.

In calculating x_2,

fl(x_2) = (−62.10 − 62.06)/2.000 = −62.10.

The relative error in computing x_2 is

|fl(x_2) − x_2| / |x_2| = |−62.10 + 62.08390| / |−62.08390| = 0.259 × 10^(−3).

In this equation b^2 = 62.10^2 is much larger than 4ac = 4, hence b and
√(b^2 − 4ac) are two nearly equal numbers. The calculation of x_1 involves
the subtraction of these nearly equal numbers, while x_2 involves their
addition, which does not cause a serious loss of significant figures.
To obtain a more accurate 4-digit rounding approximation for x_1, we change
the formulation by rationalizing the numerator, that is,

x_1 = −2c / (b + √(b^2 − 4ac)).

Then

fl(x_1) = −2.000/(62.10 + 62.06) = −2.000/124.2 = −0.01610.

The relative error in computing x_1 is now reduced to 0.62 × 10^(−3).
However, if we rationalize the numerator in x_2 to get

x_2 = −2c / (b − √(b^2 − 4ac)),

the use of this formula involves not only the subtraction of two nearly
equal numbers but also division by their small difference. This degrades
the accuracy:

fl(x_2) = −2.000/(62.10 − 62.06) = −2.000/0.04000 = −50.00.

The relative error in x_2 becomes 0.19.
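The remedy generalizes: compute the larger-magnitude root from the formula whose numerator adds like-signed terms, then recover the other root from the product of roots x_1 x_2 = c/a. A sketch (assuming real roots):

```python
import math

# Cancellation-free quadratic solver: copysign keeps b and the square root
# from cancelling; the second root comes from the product of roots c/a.
def stable_roots(a, b, c):
    """Roots of a*x**2 + b*x + c = 0, assuming b*b - 4*a*c >= 0."""
    d = math.sqrt(b * b - 4 * a * c)
    q = -(b + math.copysign(d, b)) / 2   # b and copysign(d, b) share a sign
    return q / a, c / q

x1, x2 = stable_roots(1.0, 62.10, 1.0)
print(x1, x2)  # close to -62.08390 and -0.01610723 from Example 8
```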

Example 9. How do we evaluate y = x − sin x when x is small?
Sol. Since sin x ≈ x when x is small, direct subtraction causes a loss of
significant figures. Alternatively, if we use the Taylor series for sin x,
we obtain

y = x − (x − x^3/3! + x^5/5! − x^7/7! + ...)
  = x^3/6 − x^5/(6 · 20) + x^7/(6 · 20 · 42) − ...
  = (x^3/6) (1 − (x^2/20)(1 − (x^2/42)(1 − (x^2/72)(...)))),

which can be evaluated without cancellation.
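A short comparison of the naive and nested-series evaluations (the truncation depth and test point are arbitrary illustrative choices):

```python
import math

# Naive evaluation of y = x - sin(x) versus the nested Taylor form above,
# truncated after the x**7/72 factor.
def naive(x):
    return x - math.sin(x)

def series(x):
    x2 = x * x
    return (x**3 / 6) * (1 - (x2 / 20) * (1 - (x2 / 42) * (1 - x2 / 72)))

x = 1e-4
print(naive(x))   # cancellation: the leading digits of x and sin(x) wipe out
print(series(x))  # about 1.66666e-13, accurate to full precision
```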

7.2. Numerical stability. Another theme that occurs repeatedly in numerical
analysis is the distinction between numerical algorithms that are stable
and those that are not. Informally speaking, a numerical process is unstable
if small errors made at one stage of the process are magnified and propagated
in subsequent stages, seriously degrading the accuracy of the overall
calculation. Whether a process is stable or unstable should be decided on
the basis of the relative error.
7.3. Conditioning. The words condition and conditioning are used to indicate how sensitive the solution of a problem may be to small changes in
the input data. A problem is ill-conditioned if small changes in the data
can produce large changes in the results. For certain types of problems,
a condition number can be defined. If that number is large, it indicates an
ill-conditioned problem. In contrast, if the number is modest, the problem
is recognized as a well-conditioned problem.
The condition number can be calculated in the following manner:

κ = |relative change in output| / |relative change in input|
  = |(f(x*) − f(x))/f(x)| / |(x* − x)/x|
  ≈ |x f′(x)/f(x)|.

For example, if f(x) = 10/(1 − x^2), then the condition number can be
calculated as

κ = |x f′(x)/f(x)| = 2x^2 / |1 − x^2|.

The condition number can be quite large for |x| ≈ 1. Therefore, the function
is ill-conditioned there.
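The closed form 2x^2/|1 − x^2| can be compared against a direct finite-difference estimate of the ratio of relative changes; the test point x and step h below are arbitrary illustrative choices:

```python
# Checking kappa(x) = 2x^2/|1 - x^2| for f(x) = 10/(1 - x^2) against a
# finite-difference estimate of |rel. change in output / rel. change in input|.
def f(x):
    return 10 / (1 - x * x)

def kappa(x):
    return 2 * x * x / abs(1 - x * x)

x, h = 0.999, 1e-9
est = abs((f(x + h) - f(x)) / f(x)) / abs(h / x)
print(kappa(x), est)  # both roughly 1000: ill-conditioned near |x| = 1
```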
Remarks
(1) Accuracy refers to the closeness of the computed solution to the true
solution of the problem. Accuracy depends on the conditioning of the
problem as well as the stability of the algorithm.
(2) Stability alone does not guarantee accurate results: it is applying a
stable algorithm to a well-conditioned problem that yields an accurate
solution.
(3) Inaccuracy can result from applying a stable algorithm to an
ill-conditioned problem, or an unstable algorithm to a well-conditioned
problem.
Exercises
(1) Assume a 3-digit mantissa with rounding.
(a) Evaluate y = x^3 − 3x^2 + 4x + 0.21 for x = 2.73.
(b) Evaluate y = [(x − 3)x + 4]x + 0.21 for x = 2.73.
Compare and discuss the errors obtained in parts (a) and (b).
(2) Given f(x) = 1/x − 1/(x + 1), assume a 3-decimal mantissa with rounding.
(a) Evaluate f(1000) directly.
(b) Evaluate f(1000) as accurately as possible using an alternative approach.
(c) Find the relative error of f(1000) in parts (a) and (b).
(3) Associativity does not necessarily hold for floating-point addition (or
multiplication).
Let a = 0.8567 × 10^0, b = 0.1325 × 10^4, c = −0.1325 × 10^4; then
a + (b + c) = 0.8567 × 10^0, while (a + b) + c = 0.1000 × 10^1.
The two answers are NOT the same! Show the calculations.
(4) Find the smaller root of the equation
x^2 − 400x + 1 = 0
using four-digit rounding arithmetic.
(5) Discuss the condition number of the polynomial function f(x) =
2x^2 + x − 1.
