
CONTENTS

ABSTRACT
LIST OF SYMBOLS AND ACRONYMS
VHDL
INTRODUCTION TO FLOATING POINT NUMBERS
INTRODUCTION
HISTORY
RANGE OF FPN
FPN PRECISION
IEEE 754 FPN STANDARD
FPN REPRESENTATION
COMPUTER REPRESENTATION
IEEE FPN REPRESENTATION
ATTRIBUTES & ROUNDING
FPN ARITHMETIC
FPN REPRESENTATION
FORMAT PARAMETERS FOR THE IEEE 754 FLOATING-POINT STANDARD
FPN MULTIPLICATION
DENORMALS
FPN MULTIPLICATION ALGORITHM
HARDWARE OF FLOATING POINT MULTIPLIER
UNSIGNED MULTIPLIER
ADDITION PROCESS
NORMALIZER
UNDERFLOW/OVERFLOW DETECTION
MULTIPLICATION FLOWCHART
STRUCTURE OF MULTIPLICATION
FLOATING POINT MULTIPLIER ARCHITECTURE
PROPOSED CIFM ARCHITECTURE
Real-life application
Optimization criteria
Application
APPENDIX
CONCLUSION
REFERENCES

FLOATING POINT MULTIPLICATION USING VHDL


A report submitted in partial fulfillment of the requirements for the Degree of Bachelor of Technology in Electronics and Communication Engineering

Under the Guidance of


Manas Ranjan Tripathy

Department of Electronics and Communication Engineering


INSTITUTE OF TECHNICAL EDUCATION & RESEARCH, BHUBANESWAR
(SIKSHA O ANUSANDHAN UNIVERSITY, ODISHA)

2012

Submitted by: Bibhu Bhushan Panda (0911016214), Sadbhab Patra (0911016231), Chandrakanta Parida (1021016041), Sweta Chandan (0911016244)

INSTITUTE OF TECHNICAL EDUCATION AND RESEARCH

CERTIFICATE

This is to certify that the project titled FLOATING POINT MULTIPLICATION USING VHDL is the bona fide work of group C4, carried out in partial fulfillment of the requirements for the award of the Degree of Bachelor of Technology in Electronics and Communication Engineering, conducted under my supervision.

Project guide:

Mr. Manas Ranjan Tripathy (Lecturer)
Department of Electronics and Communication Engineering
ITER, Bhubaneswar

DECLARATION

We certify that:
a. The work contained in this report is original and has been done by us under the guidance of our supervisor.
b. The work has not been submitted to any other institute for any degree or diploma.
c. We have followed the guidelines provided by the Institute in preparing the report.
d. We have conformed to the norms and guidelines given in the Ethical Code of Conduct of the Institute.
e. Whenever we have used materials (data, theoretical analysis, figures, and text) from other sources, we have given due credit to them by citing them in the text of the report and giving their details in the references. Further, we have taken permission from the copyright owners of the sources, whenever necessary.

BIBHU BHUSHAN PANDA (0911016214)

SADBHAB PATRA (0911016231)

CHANDRAKANTA PARIDA (1021016041)

SWETA CHANDAN (0911016244)

ACKNOWLEDGMENT

We would like to thank Mr. Manas Ranjan Tripathy for providing us this opportunity to present the project on FLOATING POINT MULTIPLICATION USING VHDL.


We would like to thank Prof. Bibhu Prasad Mohanty (HOD), Prof. Dr. Niva Das (Associate Dean) and Mr. Manas Ranjan Tripathy for their constant support and guidance. We would also like to extend our gratitude to the faculty and staff of the Department of Electronics and Communication Engineering for their valuable insights, which made this project a success. Lastly, we would like to thank one and all who helped in building this project and guided us in all aspects towards its success.

BIBHU BHUSHAN PANDA (0911016214), SADBHAB PATRA (0911016231), CHANDRAKANTA PARIDA (1021016041), SWETA CHANDAN (0911016244)

ABSTRACT
Shrinking feature sizes give designers more headroom to extend the functionality of microprocessors. As processor support for decimal floating-point arithmetic emerges, it is important to investigate efficient algorithms and hardware designs for common decimal floating-point arithmetic operations. This report presents designs for a decimal floating-point adder. Binary floating-point arithmetic is usually sufficient for scientific and statistical applications; however, it is not sufficient for many commercial applications and database systems, in which operations often need to mirror manual calculations. Therefore, these applications often use software to perform decimal floating-point arithmetic operations. The IEEE 754 standard provides a method for computation with floating-point numbers that will yield the same result whether the processing is done in hardware, software, or a combination of the two. The results of the computation will be identical, independent of implementation, given the same input data. Errors, and error conditions, in the mathematical processing will be reported in a consistent manner regardless of implementation.

Keywords: exponent, normalized value, subnormal numbers.

LIST OF SYMBOLS

Serial No.   Symbol   Meaning
1            X        Real number
2            M        Significand
3            E        Exponent

LIST OF ACRONYMS

Serial No.   Acronym   Meaning
1            OFL       Overflow level
2            UFL       Underflow level
3            NaN       Not a Number

Chapter 1

1. VHDL
The VHSIC (very high speed integrated circuits) Hardware Description Language (VHDL) was first proposed in 1981. The development of VHDL was started by IBM, Texas Instruments, and Intermetrics in 1983. The result, contributed to by many participating EDA (Electronic Design Automation) groups, was adopted as the IEEE 1076 standard in December 1987. VHDL is intended to provide a tool that can be used by the digital systems community to distribute their designs in a standard format. Using VHDL, designers are able to describe their complex digital circuits to each other in a common language without having to reveal low-level implementation details. As a standard description of digital systems, VHDL is used as input and output to various simulation, synthesis, and layout tools. The language provides the ability to describe systems, networks, and components at a very high behavioural level as well as at a very low gate level. It also supports a top-down design methodology and environment. Simulations can be carried out at any level, from a general functional analysis to a very detailed gate-level waveform analysis.
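To give a flavour of the language, the following is a small illustrative example of our own (not taken from any standard library): a 4-bit adder described at the behavioural level with a single concurrent statement, which synthesis tools can then map down to gates.

```vhdl
-- Illustrative sketch only: a 4-bit unsigned adder described behaviourally.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity adder4 is
  port (a, b : in  unsigned(3 downto 0);
        sum  : out unsigned(4 downto 0));   -- one extra bit for the carry
end entity adder4;

architecture behavioural of adder4 is
begin
  -- One concurrent statement captures the intent; the tools infer the gate-level detail.
  sum <= resize(a, 5) + resize(b, 5);
end architecture behavioural;
```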

1.1 INTRODUCTION TO FLOATING POINT NUMBERS


1. INTRODUCTION
In computing, floating point describes a method of representing real numbers in a way that can support a wide range of values. Numbers are, in general, represented approximately to a fixed number of significant digits and scaled using an exponent. The base for the scaling is normally 2, 10 or 16. The typical number that can be represented exactly is of the form:

significant digits × base^exponent

The term floating point refers to the fact that the radix point (decimal point, or, more commonly in computers, binary point) can "float"; that is, it can be placed anywhere relative to the significant digits of the number. This position is indicated separately in the internal representation, and floating-point representation can thus be thought of as a computer realization of scientific notation. Over the years, a variety of floating-point representations have been used in computers. However, since the 1990s, the most commonly encountered representation is that defined by the IEEE 754 Standard. The advantage of floating-point representation over fixed-point and integer representation is that it can support a much wider range of values. For example, a fixed-point representation that has seven decimal digits with two decimal places can represent the numbers 12345.67, 123.45, 1.23 and so on, whereas a floating-point representation (such as the IEEE 754 decimal32 format) with seven decimal digits could in addition represent 1.234567, 123456.7, 0.00001234567, 1234567000000000, and so on. The floating-point format needs slightly more storage (to encode the position of the radix point), so when stored in the same space, floating-point numbers achieve their greater range at the expense of precision.
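As a small worked instance of the form given above (an example of our own), the same value can be written with either a decimal or a binary base:

```latex
0.15625 \;=\; 1.5625 \times 10^{-1} \;=\; 1.01_{2} \times 2^{-3}
```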

i. History
Leonardo Torres y Quevedo, in 1914, designed an electro-mechanical version of Charles Babbage's Analytical Engine which included floating-point arithmetic. In 1938, Konrad Zuse of Berlin completed the Z1, the first mechanical binary programmable computer; it was, however, unreliable in operation. It worked with 22-bit binary floating-point numbers having a 7-bit signed exponent, a 15-bit significand (including one implicit bit), and a sign bit. The memory used sliding metal parts to store 64 words of such numbers. The relay-based Z3, completed in 1941, had representations for plus and minus infinity. It implemented defined operations with infinity, such as 1/∞ = 0, and stopped on undefined operations such as 0 × ∞. It also implemented the square root operation in hardware. Zuse also proposed, but did not complete, carefully rounded floating-point arithmetic that would have included ±∞ and NaNs, anticipating features of the IEEE Standard for floating-point arithmetic by four decades. By contrast, von Neumann recommended against floating point for the 1951 IAS machine, arguing that fixed-point arithmetic was preferable. The first commercial computer with floating-point hardware was Zuse's Z4 computer, designed in 1942-1945. The Bell Laboratories Mark V computer implemented decimal floating point in 1946. Prior to the IEEE-754 standard, computers used many different forms of floating point. These differed in the word sizes, the format of the representations, and the rounding behaviour of operations. These differing systems implemented different parts of the arithmetic in hardware and software, with varying accuracy. The IEEE-754 standard was created in the early 1980s after word sizes of 32 bits (or 16 or 64) had been generally settled upon. This was based on a proposal from Intel, who were designing the i8087 numerical coprocessor. Prof. W. Kahan was the primary architect behind this proposal, along with his student Jerome Coonen at U.C. Berkeley and visiting Prof. Harold Stone, and for this work Kahan was awarded the 1989 Turing Award. Among the innovations are these:

A precisely specified encoding of the bits, so that all compliant computers would interpret bit patterns the same way. This made it possible to transfer floating-point numbers from one computer to another.

A precisely specified behavior of the arithmetic operations: arithmetic operations were required to be correctly rounded, i.e. to give the same result as if infinitely precise arithmetic was used and then rounded. This meant that a given program, with given data, would always produce the same result on any compliant computer. This helped reduce the almost mystical reputation that floating-point computation had for seemingly nondeterministic behavior.

The ability of exceptional conditions (overflow, divide by zero, etc.) to propagate through a computation in a benign manner and be handled by the software in a controlled way.

ii. Range of floating-point numbers


By allowing the radix point to be adjustable, floating-point notation allows calculations over a wide range of magnitudes, using a fixed number of digits, while maintaining good precision. For example, in a decimal floating-point system with three digits, the multiplication that humans would write as 0.12 × 0.12 = 0.0144 would be expressed as (1.2 × 10^-1) × (1.2 × 10^-1) = 1.44 × 10^-2. In a fixed-point system with the decimal point at the left, it would be 0.120 × 0.120 = 0.014. A digit of the result was lost because of the inability of the digits and decimal point to 'float' relative to each other within the digit string. The range of floating-point numbers depends on the number of bits or digits used for representation of the significand (the significant digits of the number) and for the exponent. On a typical computer system, a 'double precision' (64-bit) binary floating-point number has a coefficient of 53 bits (one of which is implied), an exponent of 11 bits, and one sign bit. Positive floating-point numbers in this format have an approximate range of 10^-308 to 10^308, because the range of the exponent is [-1022, 1023] and 308 is approximately log10(2^1023). The complete range of the format is from about -10^308 through +10^308. The number of normalized floating-point numbers in a system F(B, P, L, U) (where B is the base of the system, P is the precision of the system to P digits, L is the smallest exponent representable in the system, and U is the largest exponent used in the system) is 2(B - 1)B^(P-1)(U - L + 1).

There is a smallest positive normalized floating-point number,

Underflow level = UFL = B^L,

which has a 1 as the leading digit and 0 for the remaining digits of the significand, and the smallest possible value for the exponent.

There is a largest floating-point number,

Overflow level = OFL = (1 - B^-P) B^(U+1),

which has B - 1 as the value for each digit of the significand and the largest possible value for the exponent. In addition, there are representable values strictly between -UFL and UFL: namely zero and negative zero, as well as subnormal numbers.
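For example (a worked check of ours, not part of the source text), instantiating these expressions for the IEEE single-precision system, with B = 2, P = 24, L = -126 and U = 127, gives:

```latex
\mathrm{UFL} = 2^{-126} \approx 1.18 \times 10^{-38}, \qquad
\mathrm{OFL} = \bigl(1 - 2^{-24}\bigr)\,2^{128} \approx 3.40 \times 10^{38}
```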

iii. Floating-point precisions

Floating-point precision formats include:
IEEE 754: 16-bit Half (binary16); 32-bit Single (binary32), decimal32; 64-bit Double (binary64), decimal64; 128-bit Quadruple (binary128), decimal128; extended precision formats.
Other: Minifloat, arbitrary precision.

The IEEE has standardized the computer representation for binary floating-point numbers in IEEE 754 (also known as IEC 60559). This standard is followed by almost all modern machines. Notable exceptions include IBM mainframes, which support IBM's own format (in addition to the IEEE 754 binary and decimal formats), and Cray vector machines, where the T90 series had an IEEE version, but the SV1 still uses Cray floating-point format. The standard provides for many closely related formats, differing in only a few details. Five of these formats are called basic formats and others are termed extended formats; three of these are especially widely used in computer hardware and languages:

Single precision, called "float" in the C language family, and "real" or "real*4" in Fortran. This is a binary format that occupies 32 bits (4 bytes) and its significand has a precision of 24 bits (about 7 decimal digits).

Double precision, called "double" in the C language family, and "double precision" or "real*8" in Fortran. This is a binary format that occupies 64 bits (8 bytes) and its significand has a precision of 53 bits (about 16 decimal digits).

Double extended format, an 80-bit floating point value. This is implemented on most personal computers but not on other devices. Sometimes "long double" is used for this in the C language family (Annex F "IEC 60559 floating-point arithmetic" of the C99 and C11 standards recommends that the 80-bit extended format be provided as "long double" when available), though "long double" may be a synonym for "double" or may stand for quadruple precision. Extended precision can help minimise accumulation of round-off error in intermediate calculations.

Any integer with absolute value less than or equal to 2^24 can be exactly represented in the single precision format, and any integer with absolute value less than or equal to 2^53 can be exactly represented in the double precision format. Furthermore, a wide range of powers of 2 times such a number can be represented. These properties are sometimes used for purely integer data, to get 53-bit integers on platforms that have double precision floats but only 32-bit integers. To a rough approximation, the bit representation of an IEEE binary floating-point number is proportional to its base 2 logarithm.
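A quick check of the integer claim (our own example): with a 24-bit significand, 2^24 is the last point at which consecutive integers are all exactly representable in single precision:

```latex
2^{24} = 16\,777\,216 \text{ is representable exactly, but }
2^{24} + 1 = 16\,777\,217 \text{ is not, since it would need 25 significand bits.}
```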

Chapter 2

2. IEEE-754 FLOATING-POINT STANDARD


In the early days of digital computers, it was quite common for machines from different vendors to have different word lengths and unique floating-point formats. This caused many problems, especially in the porting of programs between different machines (designs). A main objective in developing a floating-point representation standard is to make numerical programs predictable and completely portable, in the sense of producing identical results when run on different machines. The IEEE-754 floating-point standard, formally named ANSI/IEEE Std 754-1985 and introduced in 1985, tried to solve these problems. A key objective of the standard is that an implementation of a floating-point system conforming to it can be realized entirely in software, entirely in hardware, or in any combination of software and hardware. The standard specifies formats for floating-point numbers in single precision and double precision, together with the basic operations on them, such as addition, subtraction, multiplication and division. Finally, it describes the different floating-point exceptions and their handling, including non-numbers (NaNs).

Table 1: Features of the ANSI/IEEE Standard Floating-Point Representation
Feature                  Single                       Double
Word length (bits)       32                           64
Significand bits         23 + 1 (hidden)              52 + 1 (hidden)
Significand range        [1, 2 - 2^-23]               [1, 2 - 2^-52]
Exponent bits            8                            11
Exponent bias            127                          1023
Zero (±0)                e + bias = 0, f = 0          e + bias = 0, f = 0
Denormal                 e + bias = 0, f ≠ 0          e + bias = 0, f ≠ 0
Infinity (±∞)            e + bias = 255, f = 0        e + bias = 2047, f = 0
Not-a-Number (NaN)       e + bias = 255, f ≠ 0        e + bias = 2047, f ≠ 0
Minimum                  2^-126 ≈ 1.2 × 10^-38        2^-1022 ≈ 2.2 × 10^-308
Maximum                  ≈ 2^128 ≈ 3.4 × 10^38        ≈ 2^1024 ≈ 1.8 × 10^308
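The special cases in Table 1 are easy to detect from the exponent and fraction fields alone. The sketch below (entity and signal names are our own, single precision only) shows one way this could be written in VHDL; a full multiplier would use such flags to steer its special-case handling.

```vhdl
-- Hedged sketch: classify a single-precision operand according to Table 1.
library ieee;
use ieee.std_logic_1164.all;

entity fp_classify is
  port (x : in  std_logic_vector(31 downto 0);
        is_zero, is_denormal, is_infinity, is_nan : out std_logic);
end entity fp_classify;

architecture rtl of fp_classify is
  alias exp  : std_logic_vector(7 downto 0)  is x(30 downto 23);
  alias frac : std_logic_vector(22 downto 0) is x(22 downto 0);
  constant ALL_ZERO : std_logic_vector(22 downto 0) := (others => '0');
begin
  is_zero     <= '1' when exp = "00000000" and frac =  ALL_ZERO else '0';
  is_denormal <= '1' when exp = "00000000" and frac /= ALL_ZERO else '0';
  is_infinity <= '1' when exp = "11111111" and frac =  ALL_ZERO else '0';
  is_nan      <= '1' when exp = "11111111" and frac /= ALL_ZERO else '0';
end architecture rtl;
```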

PROBLEMS ASSOCIATED WITH FLOATING POINT ADDITION

The exponents of the two inputs may be dissimilar, and numbers with dissimilar exponents cannot be added directly. So the first problem is equalizing the exponents: the exponent of the smaller number must be increased until it equals that of the larger number, and its significand shifted right correspondingly; then the significands are added. The fixed sizes of the mantissa and the exponent of a floating-point number cause several further problems during addition and subtraction. The second problem is overflow of the mantissa; it can be solved by rounding the result. The third problem is overflow and underflow of the exponent: the former occurs when the mantissa overflows and an adjustment of the exponent is attempted, while underflow can occur when normalizing a small result. Unlike the case of fixed-point addition, an overflow in the mantissa is not disabling; simply shifting the mantissa and increasing the exponent can compensate for such an overflow. Another problem is associated with normalization after addition and subtraction: the sum or difference of two significands may be a number which is not in normalized form, so it should be normalized before the result is returned.

2.1 Floating Point Representation


i. Computer Representation of Numbers
Computers which work with real arithmetic use a system called floating point. Suppose a real number x has the binary expansion

x = ±m × 2^E, where m = (b0.b1b2b3 ...)_2 and 1 ≤ m < 2.

To store a number in floating point representation, a computer word is divided into 3 fields, representing the sign, the exponent E, and the significand m respectively. A 32-bit word could be divided into fields as follows: 1 bit for the sign, 8 bits for the exponent and 23 bits for the significand. Since the exponent field is 8 bits, it can be used to represent exponents between -128 and 127. The significand field can store the first 23 bits of the binary representation of m, namely b0, b1, ..., b22.

FORMATS: This clause defines floating-point formats, which are used to represent a finite subset of real numbers. Formats are characterized by their radix, precision, and exponent range, and each format can represent a unique set of floating-point data. All formats can be supported as arithmetic formats; that is, they may be used to represent floating-point operands. Specific fixed-width encodings for binary and decimal formats are defined in this clause for a subset of the formats. These interchange formats are identified by their size and can be used for the exchange of floating-point data between implementations. Five basic formats are defined: three binary formats, with encodings in lengths of 32, 64, and 128 bits, and two decimal formats, with encodings in lengths of 64 and 128 bits. Additional arithmetic formats are recommended for extending these basic formats. The choice of which of this standard's formats to support is language-defined or, if the relevant language standard is silent or defers to the implementation, implementation-defined. The names used for formats in this standard are not necessarily those used in programming environments.

ii. IEEE Floating Point Representation


In the 1960s and 1970s, each computer manufacturer developed its own floating point system, leading to a lot of inconsistency in how the same program behaved on different machines. For example, although most machines used binary floating point systems, the IBM 360/370 series, which dominated computing during this period, used a hexadecimal base, i.e. numbers were represented as ±m × 16^E. Other machines, such as HP calculators, used a decimal floating point system. Through the efforts of several computer scientists, particularly W. Kahan, a binary floating point standard was developed in the early 1980s and, most importantly, followed very carefully by the principal manufacturers of floating point chips for personal computers, namely Intel and Motorola. This standard has become known as the IEEE floating point standard since it was developed and endorsed by a working committee of the Institute of Electrical and Electronics Engineers. The IEEE standard has three very important requirements:

-- consistent representation of floating point numbers across all machines adopting the standard
-- correctly rounded arithmetic
-- consistent and sensible treatment of exceptional situations such as division by zero

We start with the following observation. In the last section, we chose to normalize a nonzero number x so that x = ±m × 2^E with m = (b0.b1b2b3 ...)_2, where 1 ≤ m < 2, i.e. with b0 = 1. In the simple floating point model, we stored the leading nonzero bit b0 in the first position of the field provided for m. Note, however, that since we know this bit has the value one, it is not necessary to store it. Consequently, we can use the 23 bits of the significand field to store b1, b2, ..., b23 instead of b0, b1, ..., b22, changing the machine precision from 2^-22 to 2^-23. Since the bitstring stored in the significand field is now actually the fractional part of the significand, we shall refer henceforth to the field as the fraction field. Given a string of bits in the fraction field, it is necessary to imagine that the symbols "1." appear in front of the string, even though these symbols are not stored. This technique is called hidden bit normalization and was used by Digital for the Vax machine in the late 1970s.
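As an illustration of hidden bit normalization (a worked example of our own, using the 1/8/23 field split described above and the exponent bias of 127 from Table 1), consider encoding the value 6.5:

```latex
6.5 = 1.101_{2} \times 2^{2}
\;\Rightarrow\;
\text{sign} = 0,\quad
\text{exponent field} = 2 + 127 = 129 = 10000001_{2},\quad
\text{fraction field} = 1010000\ldots0_{2}
```

The leading 1 of the significand 1.101 is the hidden bit and is not stored; only the bits 101, padded with zeros, appear in the fraction field.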

iii. Attributes and rounding

Attribute specification: An attribute is logically associated with a program block to modify its numerical and exception semantics. A user can specify a constant value for an attribute parameter. Some attributes have the effect of an implicit parameter to most individual operations of this standard; language standards shall specify rounding-direction attributes and should specify alternate exception handling attributes. Other attributes change the mapping of language expressions into operations of this standard; language standards that permit more than one such mapping should provide support for preferredWidth attributes, value-changing optimization attributes and reproducibility attributes. For attribute specification, the implementation shall provide language-defined means, such as compiler directives, to specify a constant value for the attribute parameter for all standard operations in a block; the scope of the attribute value is the block with which it is associated. Language standards shall provide for constant specification of the default and each specific value of the attribute.

Rounding and Correctly Rounded Arithmetic:

We use the terminology floating point numbers" to mean all acceptable numbers in a given IEEE floating point arithmetic format. This set consists of 0, subnormal and normalized numbers, and , but not NaN values, and is a finite subset of the reals. We have seen that most real

numbers, such as 1/10 and pi, cannot be represented exactly as floating point numbers. For ease of expression we will say a general real number is normalized if its modulus lies between the smallest and largest positive normalized floating point numbers, with a corresponding use of the word subnormal. In both cases the representations we give for these numbers will parallel the floating point number representations in that b0 = 1 for normalized numbers, and b0 = 0 with E = -126 for subnormal numbers. For any number x which is not a floating point number, there are two obvious choices for the floating point approximation to x: the closest floating point number less than x, and the closest floating point number greater than x. The IEEE standard defines the correctly rounded value of x, which we shall denote round(x), as follows. If x happens to be a floating point number, then round(x) = x. Otherwise, the correctly rounded value depends on which of the following four rounding modes is in effect: Round down round(x) = x_: Round up round(x) = x+: Round towards zero round(x) is either x_ or x+, whichever is between zero and x.

Round to nearest round(x) is either x_ or x+, whichever is nearer to x. In the case of a tie, the one with its least significant bit equal to zero is chosen. If x is positive, then x_ is between zero and x, so round down and round towards zero have the same effect. If x is negative, then x+ is between zero and x, so it is round up and round towards zero which have the same effect. In either case, round towards zero simply requires truncating the binary expansion, i.e. discarding bits. The most useful rounding mode, and the one which is almost always used, is round to nearest, since this produces the floating point number which is closest to x. In the case of toy precision, with x = 1=7, it is clear that round to nearest gives a rounded value of x equal to 1.75. When the word round is used without any qualification, it almost always means round to nearest. In the more familiar decimal context, if we round the number pi= 3.14159 to four decimal digits, we obtain the result 3.142, which is closer to pi than the truncated result 3.141.
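Round to nearest (with ties broken towards even) is also the mode most relevant to the multiplier described later. A minimal VHDL sketch of the decision, using our own entity name and assuming a 24-bit significand whose discarded lower bits have already been summarised into a round bit and a sticky bit, could look like this:

```vhdl
-- Hedged sketch of round-to-nearest-even for a 24-bit significand.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity round_nearest_even is
  port (sig_in    : in  unsigned(23 downto 0);  -- truncated significand
        round_bit : in  std_logic;              -- first discarded bit
        sticky    : in  std_logic;              -- OR of all lower discarded bits
        sig_out   : out unsigned(24 downto 0)); -- one extra bit for a rounding carry
end entity round_nearest_even;

architecture rtl of round_nearest_even is
begin
  process (sig_in, round_bit, sticky)
  begin
    -- Round up when the discarded part is more than half an ulp, or exactly half
    -- and the kept result is odd (the tie is broken towards the even neighbour).
    if round_bit = '1' and (sticky = '1' or sig_in(0) = '1') then
      sig_out <= resize(sig_in, 25) + 1;
    else
      sig_out <= resize(sig_in, 25);
    end if;
  end process;
end architecture rtl;
```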

iv. Floating Point Arithmetic


Although integers provide an exact representation for numeric values, they suffer from two major drawbacks: the inability to represent fractional values, and a limited dynamic range. Floating point arithmetic solves these two problems at the expense of accuracy and, on some processors, speed. Most programmers are aware of the speed loss associated with floating point arithmetic; however, they are blithely unaware of the problems with accuracy. For many applications, the benefits of floating point outweigh the disadvantages. A big problem with floating point arithmetic is that it does not follow the standard rules of algebra. Nevertheless, many programmers apply normal algebraic rules when using floating point arithmetic. This is a source of bugs in many programs. One of the primary goals of this section is to describe the limitations of floating point arithmetic so it can be properly used. Normal algebraic rules apply only to infinite precision arithmetic.

Let us consider the simple statement x := x + 1, where x is an integer. On any modern computer this statement follows the normal rules of algebra as long as overflow does not occur. That is, this statement is valid only for certain values of x (minint <= x < maxint). Most programmers do not have a problem with this because they are well aware of the fact that integers in a program do not follow the standard algebraic rules (e.g., 5/2 is not 2.5). Integers do not follow the standard rules of algebra because the computer represents them with a finite number of bits. We cannot represent any of the (integer) values above the maximum integer or below the minimum integer. Floating point values suffer from this same problem, only worse. After all, the integers are a subset of the real numbers. Therefore, the floating point values must represent the same infinite set of integers. However, there are an infinite number of values between any two real values, so this problem is infinitely worse. Therefore, as well as having to limit values between a maximum and minimum range, you cannot represent all the values between those two ranges, either.

To represent real numbers, most floating point formats employ scientific notation and use some number of bits to represent a mantissa and a smaller number of bits to represent an exponent. The end result is that floating point numbers can only represent numbers with a specific number of significant digits. This has a big impact on how floating point arithmetic operates. To easily see the impact of limited precision arithmetic, we will adopt a simplified decimal floating point format for our examples. Our floating point format will provide a mantissa with three significant digits and a decimal exponent with two digits. The mantissa and exponent are both signed values.

When adding and subtracting two numbers in scientific notation, we must adjust the two values so that their exponents are the same. For example, when adding 1.23e1 and 4.56e0, we must adjust the values so they have the same exponent. One way to do this is to convert 4.56e0 to 0.456e1 and then add. This produces 1.686e1. Unfortunately, the result does not fit into three significant digits, so we must either round or truncate the result to three significant digits. Rounding generally produces the most accurate result, so let's round the result to obtain 1.69e1. As we can see, the lack of precision (the number of digits or bits we maintain in a computation) affects the accuracy (the correctness of the computation). In the previous example, we were able to round the result because we maintained four significant digits during the calculation. If our floating point calculation were limited to three significant digits during computation, we would have had to truncate the last digit of the smaller number, obtaining 1.68e1, which is even less correct. Extra digits available during a computation are known as guard digits (or guard bits in the case of a binary format). They greatly enhance accuracy during a long chain of computations.

The accuracy loss during a single computation usually isn't enough to worry about unless we are greatly concerned about the accuracy of our computations. However, if we compute a value which is the result of a sequence of floating point operations, the error can accumulate and greatly affect the computation itself. For example, suppose we were to add 1.23e3 and 1.00e0. Adjusting the numbers so their exponents are the same before the addition produces 1.23e3 + 0.001e3. The sum of these two values, even after rounding, is 1.23e3. This might seem perfectly reasonable; after all, we can only maintain three significant digits, so adding in a small value shouldn't affect the result at all. However, suppose we were to add 1.00e0 to 1.23e3 ten times. The first time we add 1.00e0 to 1.23e3 we get 1.23e3. Likewise, we get this same result the second, third, fourth, and tenth time we add 1.00e0 to 1.23e3. On the other hand, had we added 1.00e0 to itself ten times, then added the result (1.00e1) to 1.23e3, we would have gotten a different result, 1.24e3. This is the most important thing to know about limited precision arithmetic: the order of evaluation can affect the accuracy of the result. We get more accurate results if the relative magnitudes (that is, the exponents) are close to one another. Whenever a chain calculation involving addition and subtraction is being performed, an attempt should be made to group the values appropriately.

Another problem with addition and subtraction is that we can wind up with false precision. Consider the computation 1.23e0 - 1.22e0. This produces 0.01e0. Although this is mathematically equivalent to 1.00e-2, this latter form suggests that the last two digits are exactly zero. Unfortunately, we've only got a single significant digit at this time. Indeed, some FPUs or floating point software packages might actually insert random digits (or bits) into the least significant positions. This brings up a second important rule concerning limited precision arithmetic: whenever subtracting two numbers with the same signs or adding two numbers with different signs, the accuracy of the result may be less than the precision available in the floating point format.

Multiplication and division do not suffer from the same problems as addition and subtraction, since we do not have to adjust the exponents before the operation; all we need to do is add the exponents and multiply the mantissas (or subtract the exponents and divide the mantissas). By themselves, multiplication and division do not produce particularly poor results. However, they tend to multiply any error which already exists in a value. For example, if we multiply 1.23e0 by two when we should be multiplying 1.24e0 by two, the result is even less accurate. This brings up a third important rule when working with limited precision arithmetic: when performing a chain of calculations involving addition, subtraction, multiplication, and division, try to perform the multiplication and division operations first. Often, by applying normal algebraic transformations, we can arrange a calculation so the multiply and divide operations occur first. For example, suppose we want to compute x*(y+z). Normally we would add y and z together and multiply their sum by x. However, we can get a little more accuracy if we transform x*(y+z) into x*y + x*z and compute the result by performing the multiplications first.

Multiplication and division are not without their own problems. When multiplying two very large or very small numbers, it is quite possible for overflow or underflow to occur. The same situation occurs when dividing a small number by a large number or dividing a large number by a small number. This brings up a fourth rule we should attempt to follow when multiplying or dividing values: when multiplying and dividing sets of numbers, try to arrange the multiplications so that they multiply large and small numbers together; likewise, try to divide numbers that have the same relative magnitudes.

Comparing floating point numbers is very dangerous. Given the inaccuracies present in any computation (including converting an input string to a floating point value), two floating point values should never be compared to see if they are equal. In a binary floating point format, different computations which produce the same (mathematical) result may differ in their least significant bits. For example, adding 1.31e0 + 1.69e0 should produce 3.00e0. Likewise, adding 2.50e0 + 1.50e0 should produce 3.00e0. However, were we to compare (1.31e0 + 1.69e0) against (2.50e0 + 1.50e0), we might find that these sums are not equal to one another. The test for equality succeeds if and only if all bits (or digits) in the two operands are exactly the same. Since this is not necessarily true after two different floating point computations which should produce the same result, a straight test for equality may not work. The standard way to test for equality between floating point numbers is to determine how much error (or tolerance) you will allow in a comparison and check to see if one value is within this error range of the other. The straightforward way to do this is to use a test like the following:

if Value1 >= (Value2 - error) and Value1 <= (Value2 + error) then ...

Another common way to handle this same comparison is to use a statement of the form:

if abs(Value1 - Value2) <= error then ...

When discussing floating point comparisons, we shouldn't stop immediately after discussing the problem with floating point equality, assuming that other forms of comparison are perfectly okay with floating point numbers. This isn't true. If we assume that x = y whenever x is within error of y, then a simple bitwise comparison of x and y will claim that x < y if y is greater than x but less than x + error. However, in such a case x should really be treated as equal to y, not less than y. Therefore, we must always compare two floating point numbers using ranges, regardless of the actual comparison we want to perform. Trying to compare two floating point numbers directly can lead to an error. To compare two floating point numbers, x and y, against one another, one of the following forms should be used:

=   if abs(x - y) <= error then ...
<>  if abs(x - y) > error then ...
<   if (x - y) < -error then ...
<=  if (x - y) <= error then ...
>   if (x - y) > error then ...
>=  if (x - y) >= -error then ...

Great care should be taken while choosing the value of error. This should be a value slightly greater than the largest amount of error which will creep into the computations; the exact value depends upon the particular floating point format used. The final rule we will state in this section is: when comparing two floating point numbers, always compare one value to see if it is in the range given by the second value plus or minus some small error value.
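In VHDL simulation code (testbenches use the predefined real type), the same tolerance-based equality test can be packaged as a small helper function. The package and function names below are our own; the tolerance is supplied by the caller, following the advice above.

```vhdl
-- Minimal simulation-side sketch: tolerance-based comparison of reals.
package fp_compare_pkg is
  function approx_equal (x, y, tolerance : real) return boolean;
end package fp_compare_pkg;

package body fp_compare_pkg is
  function approx_equal (x, y, tolerance : real) return boolean is
  begin
    -- Equivalent to the "abs(Value1 - Value2) <= error" test described above.
    return abs(x - y) <= tolerance;
  end function approx_equal;
end package body fp_compare_pkg;
```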

Floating point multiplication


The simplest floating-point operation is multiplication, so we discuss it first. A binary floating-point number x is represented as a significand and an exponent, x = s × 2^e. The formula

(s1 × 2^e1) × (s2 × 2^e2) = (s1 × s2) × 2^(e1+e2)

shows that a floating-point multiply algorithm has several parts. The first part multiplies the significands using ordinary integer multiplication. Because floating-point numbers are stored in sign-magnitude form, the multiplier need only deal with unsigned numbers (although we have seen that Booth recoding handles signed two's complement numbers painlessly). The second part rounds the result. If the significands are unsigned p-bit numbers (e.g., p = 24 for single precision), then the product can have as many as 2p bits and must be rounded to a p-bit number. The third part computes the new exponent. Because exponents are stored with a bias, this involves subtracting the bias from the sum of the biased exponents.

Example: How does the multiplication of the single-precision numbers

1 10000010 000... = -1 × 2^3
0 10000011 000... = +1 × 2^4

proceed in binary?

Answer: When unpacked, the significands are both 1.0, their product is 1.0, and so the result is of the form

1 ???????? 000...

To compute the exponent, use the formula

biased exp(e1 + e2) = biased exp(e1) + biased exp(e2) - bias

The bias is 127 = 01111111_2, so in two's complement -127 is 10000001_2. Thus the biased exponent of the product is

    10000010
  + 10000011
  + 10000001
  ----------
    10000110

Since this is 134 decimal, it represents an exponent of 134 - bias = 134 - 127 = 7, as expected.

The interesting part of floating-point multiplication is rounding. Since the cases are similar in all bases, the examples here use human-friendly base 10 rather than base 2. For floating-point multiplication it is necessary to know about floating-point addition as well, because while performing floating-point multiplication we have to perform addition to obtain the final result. In performing this addition a carry may be generated, after which the result must be renormalized, and precision bits may be lost; for that we have to keep three extra bits: guard, round and sticky. Hence it is important to know how addition occurs before multiplication. The next chapter describes how addition is done and what procedures are followed in order to obtain the final result.
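Putting the three parts together, a much-simplified single-precision multiplier can be sketched in VHDL as below. The entity and signal names are our own, and rounding, exponent overflow/underflow detection, denormals, infinities and NaNs are deliberately omitted (truncation stands in for the rounding step discussed above); it is a sketch of the datapath, not a complete IEEE-754 implementation.

```vhdl
-- Hedged sketch of the single-precision multiply datapath.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fp_mult_sketch is
  port (a, b    : in  std_logic_vector(31 downto 0);
        product : out std_logic_vector(31 downto 0));
end entity fp_mult_sketch;

architecture rtl of fp_mult_sketch is
begin
  process (a, b)
    variable sig_a, sig_b : unsigned(23 downto 0);
    variable sig_p        : unsigned(47 downto 0);
    variable exp_p        : unsigned(9 downto 0);
    variable frac_p       : std_logic_vector(22 downto 0);
  begin
    -- Recover the significands by prepending the hidden '1' bit.
    sig_a := '1' & unsigned(a(22 downto 0));
    sig_b := '1' & unsigned(b(22 downto 0));

    -- Multiply the significands (plain unsigned multiply) and add the biased
    -- exponents, subtracting one bias (127).
    sig_p := sig_a * sig_b;
    exp_p := resize(unsigned(a(30 downto 23)), 10)
           + resize(unsigned(b(30 downto 23)), 10) - 127;

    -- Normalize: the product of two significands in [1,2) lies in [1,4),
    -- so at most one right shift and exponent increment is needed.
    if sig_p(47) = '1' then
      frac_p := std_logic_vector(sig_p(46 downto 24));  -- truncation, not true rounding
      exp_p  := exp_p + 1;
    else
      frac_p := std_logic_vector(sig_p(45 downto 23));
    end if;

    -- The sign of the product is the XOR of the operand signs.
    product <= (a(31) xor b(31)) & std_logic_vector(exp_p(7 downto 0)) & frac_p;
  end process;
end architecture rtl;
```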

Chapter 3

3. ADDITION ALGORITHM

Let a1 and a2 be the two numbers to be added. The notations ei and si are used for the exponent and significand of the addends ai. This means that the floating-point inputs have been unpacked and that si has an explicit leading bit. To add a1 and a2, perform these eight steps (a hardware sketch of the alignment in step 3 is given after the list):

1. If e1 < e2, swap the operands. This ensures that the difference of the exponents satisfies d = e1 - e2 >= 0. Tentatively set the exponent of the result to e1.

2. If the signs of a1 and a2 differ, replace s2 by its two's complement.

3. Place s2 in a p-bit register and shift it d = e1 - e2 places to the right (shifting in 1s if s2 was complemented in the previous step). From the bits shifted out, set g to the most-significant bit, set r to the next most-significant bit, and set the sticky bit s to the OR of the rest.

4. Compute a preliminary significand S = s1 + s2 by adding s1 to the p-bit register containing s2. If the signs of a1 and a2 are different, the most-significant bit of S is 1, and there was no carry-out, then S is negative. Replace S with its two's complement. This can only happen when d = 0.

5. Shift S as follows. If the signs of a1 and a2 are the same and there was a carry-out in step 4, shift S right by one, filling the high-order position with a one (the carry-out). Otherwise shift it left until it is normalized. When left shifting, on the first shift fill in the low-order position with the g bit; after that, shift in zeros. Adjust the exponent of the result accordingly.

6. Adjust r and s. If S was shifted right in step 5, set r := low-order bit of S before shifting and s := g OR r OR s. If there was no shift, set r := g, s := r OR s. If there was a single left shift, don't change r and s. If there were two or more left shifts, set r := 0, s := 0. (In the last case, two or more shifts can only happen when a1 and a2 have opposite signs and the same exponent, in which case the computation s1 + s2 in step 4 will be exact.)

7. Compute the sign of the result. If a1 and a2 have the same sign, this is the sign of the result. If a1 and a2 have different signs, then the sign of the result depends on which of a1 and a2 is negative, whether there was a swap in step 1, and whether S was replaced by its two's complement in step 4.

8. Round S using the rounding mode in effect, together with the bits r and s computed in step 6.
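As promised above, here is a hedged VHDL sketch of the alignment in step 3 (entity and signal names are our own). It keeps two extra low-order positions for the guard and round bits and OR-reduces everything shifted past them into the sticky bit; for clarity it assumes the shift distance d is at most 24, and it omits the complementing of s2 from step 2.

```vhdl
-- Hedged sketch of the significand alignment (step 3 of the addition algorithm).
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity align_significand is
  port (s2_in   : in  unsigned(23 downto 0);      -- smaller operand's significand
        d       : in  natural range 0 to 24;      -- exponent difference e1 - e2
        s2_out  : out unsigned(23 downto 0);      -- aligned significand
        g, r, s : out std_logic);                 -- guard, round, sticky
end entity align_significand;

architecture rtl of align_significand is
begin
  process (s2_in, d)
    -- Two extra low-order positions hold the guard and round bits; anything
    -- shifted past them is OR-reduced into the sticky bit.
    variable wide   : unsigned(25 downto 0);
    variable sticky : std_logic;
  begin
    wide   := s2_in & "00";
    sticky := '0';
    for i in 1 to 24 loop
      exit when i > d;
      sticky := sticky or wide(0);
      wide   := '0' & wide(25 downto 1);
    end loop;
    s2_out <= wide(25 downto 2);
    g      <= wide(1);
    r      <= wide(0);
    s      <= sticky;
  end process;
end architecture rtl;
```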

3.1 ABOUT FLOATING POINT ARITHMETIC


Arithmetic operations on floating-point numbers consist of addition, subtraction, multiplication and division. The operations are done with algorithms similar to those used on sign-magnitude integers (because of the similarity of representation) -- for example, only numbers of the same sign are added; if the numbers are of opposite sign, a subtraction must be done.

ADDITION

Example on decimal values given in scientific notation:

  3.25 x 10**3  +  2.63 x 10**-1

first step: align decimal points
second step: add

      3.25     x 10**3
  +   0.000263 x 10**3
  --------------------
      3.250263 x 10**3

(presumes use of infinite precision, without regard for accuracy)
third step: normalize the result (already normalized!)

Example on floating-point values given in binary:

  .25 = 0 01111101 00000000000000000000000
  100 = 0 10000101 10010000000000000000000

To add these floating-point representations:

step 1: align radix points
  Shifting the mantissa LEFT by 1 bit DECREASES THE EXPONENT by 1.
  Shifting the mantissa RIGHT by 1 bit INCREASES THE EXPONENT by 1.
  We want to shift the mantissa right, because the bits that fall off the end should come from the least significant end of the mantissa, so we choose to shift .25, since we want to increase its exponent. Shift by

      10000101
    - 01111101
    ----------
      00001000  (8) places.

  0 01111101 00000000000000000000000  (original value)
  0 01111110 10000000000000000000000  (shifted 1 place; note that the hidden bit is shifted into the msb of the mantissa)
  0 01111111 01000000000000000000000  (shifted 2 places)
  0 10000000 00100000000000000000000  (shifted 3 places)
  0 10000001 00010000000000000000000  (shifted 4 places)
  0 10000010 00001000000000000000000  (shifted 5 places)
  0 10000011 00000100000000000000000  (shifted 6 places)
  0 10000100 00000010000000000000000  (shifted 7 places)
  0 10000101 00000001000000000000000  (shifted 8 places)

step 2: add (the hidden bit for 100 should not be forgotten)

      0 10000101 1.10010000000000000000000  (100)
    + 0 10000101 0.00000001000000000000000  (.25)
    ---------------------------------------
      0 10000101 1.10010001000000000000000

step 3: normalize the result (get the "hidden bit" to be a 1)
  It already is for this example.

result: 0 10000101 10010001000000000000000
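As a check of the result (our own decoding of the final bit pattern): the exponent field 10000101 is 133, so the unbiased exponent is 133 - 127 = 6, and restoring the hidden bit gives the significand 1.10010001_2, so that

```latex
1.10010001_{2} \times 2^{6} \;=\; 1.56640625 \times 64 \;=\; 100.25 \;=\; 100 + 0.25
```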

CONCLUSION

For floating-point multiplication it is necessary to know about floating-point addition, because while performing floating-point multiplication we have to perform addition in order to get the final result. In performing this addition a carry may be generated, after which the result must be renormalized, and precision bits may be lost; for that we have to keep three extra bits: guard, round and sticky. It is therefore important to know how addition occurs before studying multiplication. This report has described how addition is done and what procedures are followed in order to get the final result. We have now studied and gathered ideas about floating-point addition, which will be helpful while doing the multiplication part in our next semester as our major project.

REFERENCES

[1] Liang-Kai Wang and Michael J. Schulte, "Decimal Floating-Point Adder and Multifunction Unit with Injection-Based Rounding," 18th IEEE Symposium on Computer Arithmetic (ARITH'07), 2007.

[2] G. Even and P. M. Seidel, "A Comparison of Three Rounding Algorithms for IEEE Floating-Point Multiplication," IEEE Transactions on Computers, 49(7), July 2000.

[3] N. Burgess, "Renormalization Rounding in IEEE Floating-Point Operations Using a Flagged Prefix Adder," IEEE Transactions on VLSI Systems, 13(2):266-277, Feb 2005.

[4] IEEE Standard for Floating-Point Arithmetic, IEEE Computer Society, New York, 29 August 2008 (sponsored by the Microprocessor Standards Committee).

[5] M. S. Schmookler and A. W. Weinberger, "High Speed Decimal Addition," IEEE Transactions on Computers, C-20:862-867, Aug 1971.
