
REDUCING LOOKUP TABLE SIZE USED FOR BITCOUNTING ALGORITHM

Eyas El-Qawasmeh and Wafa'a Al-Qarqaz


Computer Science Dept., Jordan University of Science and Technology, P.O. Box 3030, Irbid 22110, Jordan. eyas@just.edu.jo, walqarqaz@yahoo.com

ABSTRACT Bit-counting is the operation of counting the number of ones in a given computer word or binary vector. Several solutions to this problem exist; among them is the use of a lookup table. However, a lookup table cannot be used for large binary vectors or computer words. This paper presents a new lookup-table-based implementation of bit-counting whose advantage is that it avoids this size limitation. This is achieved by exploiting the regular behavior of the number of set bits across all possible values of a computer word; this regular pattern enables us to reduce the size of the lookup table. Performance results showed that the suggested techniques outperform other existing methods.
Keywords: Bit-Counting, Lookup Table, RemainingBits, NumberOfOnes, Comrade Group.

1. Introduction
Bit-counting, which is also called popcount, refers to the operation of counting the number of ones in a given computer word or binary vector. Bit-counting has been used in many applications, including information retrieval systems, file processing systems, coding theory [Berkovich, et al., 2000], genetic algorithms, and game theory. For example, information retrieval systems may represent search results in the form of a single-bit attribute matrix representing tens or hundreds of thousands of documents, indicating whether each document satisfies one or more search criteria. In this case, the bit-counting operation is used to determine the number of documents satisfying the search criteria. Likewise, file comparison routines may be used to compare files having large numbers of elements; a count of the number of matching elements may be required to compute an overall match metric. Genetic algorithms use the bit-counting operation in many of their procedures [Goldberg, et al., 1992].

Currently, there are several software implementations of the bit-counting operation: a) sequential shifting, b) arithmetic logic counting (AL), c) emulated popcount, d) lookup table, e) Hamming distance bit vertical counter (HC), and f) frequency division [Berkovich et al., 2000] [Berkovich et al., 1998] [El-Qawasmeh, Hemidi, 2000] [Gutman, 2000] [Reingold et al., 1977]. These schemes compete in both time and space efficiency. For example, the lookup table algorithm beats the sequential shifting, arithmetic logic, and parallel methods in time efficiency; on the other hand, its space requirement becomes critical when the binary vector (or computer word) size grows large, since it occupies a table of 2^(vector size) entries. This paper is concerned with the lookup table algorithm, which has a severe problem: the limitation of its table size. For example, we cannot in general use a lookup table of 2^32 entries. The main objective of this paper is to introduce a new enhancement to the lookup table algorithm by reducing the size of the lookup table. The reduction takes advantage of the regular behavior of the number of ones in any four consecutive binary values (the smallest of which is a multiple of four) to gain more control over the table size. In addition, the performance of the suggested technique is analyzed and compared with other implementations of popcount. Experimental results showed that the proposed scheme is the best among all known methods. The organization of this paper is as follows: section 2 describes some of the existing software methods, section 3 describes the new suggested method, section 4 presents a performance analysis of the suggested algorithm, and the conclusions are presented in section 5.

2. Current Related Work

Currently, there are many different software implementations for bit-counting that vary in their level of efficiency. Following is a listing of the most common methods:

2.1 Sequential Shifting
It loops and checks each bit individually until the number becomes zero. The running time for sequential shifting is O(n), where n is the number of bits in the computer word. The algorithm is shown in figure (1) below:

Counter = 0;
While the Number is not 0 do
    If the lowest bit of Number is 1 then increase Counter by one.
    Shift the Number to the right one bit.
End while
Figure (1): Sequential shifting method

2.2 Arithmetic Logic Counting (AL)
This method depends on applying the mask operation (AND) to a number with itself after subtracting one from it. The same logic operation is repeated as long as the number does not equal zero. The algorithm for arithmetic logic counting is shown in figure (2). The running time of this algorithm is proportional to the number of 1s rather than to the length of the computer word; in other words, it is O(ones(w)), where w is a computer word. This is because the While loop executes once for each 1 in the given computer word. This means that its performance is better for binary vectors that are sparse in 1s rather than dense in 1s [El-Qawasmeh, 2001].

Counter = 0;
While the Number is not 0 do
    Number = Number AND (Number - 1)   // AND is the bitwise logical operation
    Increase Counter by one.
End while
Figure (2): Arithmetic Logic Counting (AL) method
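A minimal Python sketch of the two methods above (the function names are ours, not from the paper):

```python
def sequential_shift_count(number):
    """Figure (1): test the lowest bit, then shift right, until zero."""
    counter = 0
    while number != 0:
        if number & 1:          # lowest bit is 1
            counter += 1
        number >>= 1            # shift the number right one bit
    return counter

def al_count(number):
    """Figure (2): number AND (number - 1) clears the lowest set bit,
    so the loop runs once per 1-bit in the word."""
    counter = 0
    while number != 0:
        number &= number - 1
        counter += 1
    return counter
```

For a sparse word such as 0x80000000, al_count performs a single iteration while sequential_shift_count performs 32, which illustrates the O(ones(w)) versus O(n) behavior.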

2.3 Parallel Count
Parallel counting successively groups the bits into sub-groups of 2, 4, 8, 16, and 32 bits, while maintaining a count of 1s in each group. The algorithm can be summarized as follows:
1- Partition the register into groups of 2 bits. Compute the population count for each 2-bit group and store the result in that group. In order to handle all 2-bit groups simultaneously, we must mask appropriately to prevent spilling from one group into the next lower group.
2- Add the population counts of adjacent 2-bit groups and store the sum in the 4-bit group formed by merging those adjacent 2-bit groups. To do this simultaneously for all groups, we mask out the odd-numbered groups, mask out the even-numbered groups, and then add the odd-numbered groups to the even-numbered groups.
3- Now, for the first time, the value in each k-bit field is small enough that adding two k-bit fields yields a value that still fits in a k-bit field. The result is four 8-bit fields whose lower halves hold the correct sums and whose upper halves contain incorrect values that must be masked out.
4- Add the adjacent 8-bit fields together and store the sums in the 16-bit fields created by merging the adjacent 8-bit fields.
5- Add the values of the two 16-bit groups to produce the final result, which is stored in the six least significant bits.
The above steps are applicable to 32-bit machines; for 64-bit machines an extra step, similar to step 5, is required to add the values of the two adjacent 32-bit groups.
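The five steps above can be sketched in Python for a 32-bit value (the mask constants are the standard ones for this technique; the function name is ours):

```python
def parallel_count(x):
    """Count 1-bits in a 32-bit value by summing within ever-wider groups."""
    x &= 0xFFFFFFFF
    x = (x & 0x55555555) + ((x >> 1) & 0x55555555)  # step 1: 2-bit groups
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333)  # step 2: 4-bit groups
    x = (x + (x >> 4)) & 0x0F0F0F0F                 # step 3: 8-bit groups
    x = x + (x >> 8)                                # step 4: 16-bit groups
    x = x + (x >> 16)                               # step 5: final sum
    return x & 0x3F                                 # six least significant bits
```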

2.4 Lookup Table
The lookup table algorithm is based on storing the number of ones for each possible word value in a lookup table. To get the popcount value for a given number, a single access to the lookup table, indexed by the number itself, directly returns the result. The lookup technique is fast; it runs in constant time since there are no mathematical calculations or logical operations. However, the lookup table technique is efficient only for small computer words (8 or 16 bits); if the table size is large, it is not appropriate. For example, if the computer word is 32 bits, then we would need to store in RAM a table of 2^32 entries, which is not practical these days. Therefore, an improvement to this critical point is needed. To combat this rapid growth in lookup table size, an enhancement technique was suggested that splits a 32-bit or 64-bit computer word into groups, fetching one value from a table of 2^(group size) entries for each group, instead of making a single access to a table of 2^(word size) entries. Figure (3) shows the algorithm of this enhancement.

Create and fill the lookup table of size 2^(group size)
NumberOfOnes = Table [No. AND 0xff]
             + Table [(No. shifted right 8 bits) AND 0xff]
             + Table [(No. shifted right 16 bits) AND 0xff]
             + Table [No. shifted right 24 bits]
Figure (3): Enhanced lookup table method

Note that in figure (3), AND represents the logical bitwise AND operation. The value 0xff is the hexadecimal representation of the binary value with ones in its least significant 8 bits (each f corresponds to 4 ones).
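A sketch of the enhanced lookup method of figure (3) in Python, using an 8-bit group size (the table and function names are ours):

```python
# 256-entry table: BYTE_ONES[v] = number of 1-bits in the byte value v
BYTE_ONES = [bin(v).count("1") for v in range(256)]

def lookup_count(no):
    """Figure (3): four table accesses, one per byte of a 32-bit word."""
    return (BYTE_ONES[no & 0xFF]
            + BYTE_ONES[(no >> 8) & 0xFF]
            + BYTE_ONES[(no >> 16) & 0xFF]
            + BYTE_ONES[(no >> 24) & 0xFF])
```

The table occupies 2^8 = 256 entries rather than 2^32, at the cost of four accesses per word instead of one.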

3. The Suggested Algorithm


In the proposed algorithm, we use a lookup table of a size less than 2^(word size) while still operating with a single table access. In addition, we make recursive calls to the function RemainingBits shown in figure (5-b). The RemainingBits function is constructed from simple logical operations that determine the number of ones for sub-parts of the problem. The basis for this improvement is the regular behavior of the number of ones in any four consecutive binary values within the same comrade group. A comrade group is a set of four consecutive binary values starting with a value that is a multiple of four. Within each comrade group the pattern x-1, x, x, x+1 appears, where x is the number of ones in the second (or third) value of the group. This property is shown in figure (4). Making use of this property gives us more flexibility in constructing the lookup table: instead of storing the bit-counting (popcount) value for every possible value of a computer word or binary vector of a specific size, we store the bit-counting value of a single element of each comrade group and then solve the problem for the other elements of that group with the RemainingBits function. Two versions of the algorithm will be introduced. The first explains the main point of the algorithm, with the gain of reducing the lookup table size to one fourth while keeping the constant running time (i.e., O(c)). The second version gains more control over the table size, but with higher running time complexity caused by the while loop that iterates over the sub-parts of the given vector.

The procedure is as follows: for each computer word (e.g., 32 bits), the most significant n bits are used as an index into a lookup table of 2^n entries, instead of a table of 2^(word size) entries, with each entry in the table being the x value (as shown in figure (4)). Then the RemainingBits function (figure (5-b) or figure (7-b), for version one and version two respectively) is invoked, either one time if version one is used or T times if version two is used, where:

T = (word size - number of index bits) / 4

Note that the number of RemainingBits function calls is determined by the number of index bits: each time the index is shortened by two bits, the table size is reduced by a factor of 4, at the cost of more invocations of the RemainingBits function.

Figure (4): The first 32 possible values for a computer word of size 8 bits; the popcount value corresponding to each is shown in the "Number of Ones" columns.

3.1 Version One
Version one partitions the computer word into two parts. The first is the most significant K bits, where K is the word size minus 2; this part is used as an index into the lookup table to fetch the corresponding value. The second is the least significant 2 bits, which are sent to the RemainingBits function. The two results are then added together. The algorithm is shown in figure (5) below.

Create and fill a lookup table of size 2^(word size - 2)
NumberOfOnes = Table [number shifted right 2 bits] + RemainingBits(number AND 0x3)
Figure (5-a): The new suggested algorithm: Version one
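The comrade-group property underlying this table can be verified directly; a minimal Python check over all 8-bit values (the helper name ones is ours):

```python
def ones(v):
    """Reference popcount via Python's binary string representation."""
    return bin(v).count("1")

# A comrade group is {4g, 4g+1, 4g+2, 4g+3}. Its popcounts follow the
# pattern x-1, x, x, x+1, where x = ones(4g + 1), because the two low
# bits cycle 00, 01, 10, 11 while the high bits stay fixed.
for base in range(0, 256, 4):
    x = ones(base + 1)
    assert [ones(base + k) for k in range(4)] == [x - 1, x, x, x + 1]
```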

The RemainingBits function algorithm is shown in figure (5-b) below:


RemainingBits (number)
    Set counter = 0
    if ((number AND 3) equals 3) then counter = counter + 1    // AND is the bitwise logical operation
    if ((number OR 0) equals 0) then counter = counter - 1
    return counter
Figure (5-b): First proposed algorithm using the RemainingBits function
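A Python sketch of version one for an 8-bit word (the names and the 8-bit word size are illustrative choices, not from the paper):

```python
WORD_BITS = 8

# One table entry per comrade group: entry g stores x = popcount(4g + 1),
# the count for the group's second member (the x value of figure 4).
X_TABLE = [bin(4 * g + 1).count("1") for g in range(2 ** (WORD_BITS - 2))]

def remaining_bits(number):
    """Figure (5-b): correct x according to the two low bits of the word."""
    counter = 0
    if (number & 3) == 3:   # low bits 11 -> count is x + 1
        counter += 1
    if number == 0:         # low bits 00 -> count is x - 1
        counter -= 1
    return counter

def popcount_v1(number):
    """Figure (5-a): one table access plus the low-bit correction."""
    return X_TABLE[number >> 2] + remaining_bits(number & 0x3)
```

For the number 16 (section 3.1's example), X_TABLE[4] is 2 and remaining_bits(0) is -1, giving the correct count of 1; the table has 64 entries instead of 256.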

Let us take an example to clarify this idea. Consider the number 16 represented in 8 bits. Under the suggested approach, a table of 2^6 entries is constructed; the most significant 6 bits are the index into the table and the remaining 2 bits are tested by the RemainingBits function. See figure (6).

(most significant 6 bits: index to the lookup table; least significant 2 bits: sent to the RemainingBits function)
Figure (6): Version one example

The lookup table will return the value 2 and the RemainingBits function will return -1. The total is 1, which is the bit-counting value of 16. It should be noted that the table has 64 elements instead of 256 elements (reduced to one fourth).

Going back to the table of figure (4) and examining it more carefully, we note that the comrade-group property still holds for each set of four consecutive comrade groups (i.e., a comrade group of comrade groups): taking the first entry of each comrade group, for any four consecutive groups starting at a multiple of 16, the property still holds. Exploiting this observation, with some changes to the code of version one, gives us more control over the table size by again reducing it to one fourth; on the other hand, the running time complexity becomes larger.

3.2 Version Two
In version two, the computer word is partitioned into two parts. The first is the index bits, which are used as an index into the lookup table. The second is the remaining bits of the computer word after excluding the index bits. The RemainingBits function is invoked with this second part and a Boolean flag as its parameters. The code of version two is shown in figure (7-a) below.

Create and fill a lookup table of size 2^(number of index bits)
counter2 = 0
flag = false                         // Boolean flag, initially set to false
counter1 = Table [number shifted right by (word size - number of index bits) bits]
number = number AND 0xfffff          // mask of (word size - index bits) ones; 0xfffff for a 32-bit word with 12 index bits
while (number not equal to zero)
    counter1 = counter1 + RemainingBits(number AND 0xf, flag)
    if (counter2 > 1) then counter1 = counter1 + 1
    end if
    counter2 = counter2 + 1
    number = number shifted right 4 bits
end while
Figure (7-a): The new suggested algorithm (Version Two)
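A Python sketch in the spirit of version two, for 32-bit words with 12 index bits. It keeps a reduced (comrade-group) table for the index field and resolves the remaining bits four at a time, two bits per step; this substitutes a direct two-bit adjustment for the flag/counter2 bookkeeping of figures (7-a) and (7-b), so it is an illustration of the structure rather than the paper's exact code, and all names are ours:

```python
WORD_BITS = 32
INDEX_BITS = 12
LOW_BITS = WORD_BITS - INDEX_BITS   # 20 remaining (non-index) bits

# Reduced table for the 12-bit index field: entry g stores the x value
# popcount(4g + 1), one entry per comrade group -> 2^10 entries.
INDEX_TABLE = [bin(4 * g + 1).count("1") for g in range(2 ** (INDEX_BITS - 2))]

def adjust(two_bits):
    """Comrade-group correction: +1 for pattern 11, -1 for 00, 0 otherwise."""
    if two_bits == 3:
        return 1
    if two_bits == 0:
        return -1
    return 0

def popcount_v2(number):
    index = number >> LOW_BITS                          # most significant 12 bits
    count = INDEX_TABLE[index >> 2] + adjust(index & 0x3)  # single table access
    rest = number & ((1 << LOW_BITS) - 1)               # remaining 20 bits
    while rest:                                         # 4 bits per iteration
        nib = rest & 0xF
        # each 2-bit field of the nibble contributes 1 + adjust(field) ones
        count += (1 + adjust(nib & 0x3)) + (1 + adjust(nib >> 2))
        rest >>= 4
    return count
```

The loop runs at most (32 - 12) / 4 = 5 times, matching the T formula of section 3, while the table holds 2^10 entries instead of 2^32.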

The RemainingBits function algorithm for version two is shown in figure (7-b) below:

RemainingBits (number, flag)
    counter = 0
    if (number greater than 3) then
        return RemainingBits(number AND 0x3, flag)
             + RemainingBits(number shifted right 2 bits, NOT flag)   // NOT: logical negation
    else
        if ((number AND 0x3) equals 3) then counter = counter + 1
        if ((number OR 0x0) equals 0) then counter = counter - 1
    end if
    if (flag equals TRUE) then counter = counter + 1
    return counter
Figure (7-b): The new suggested algorithm: RemainingBits function

For example, let a number be represented in 32 bits and assume that we have a table of size 2^12. The most significant 12 bits are taken as the index bits into the lookup table, and the remaining 20 bits are sent to the RemainingBits function, to be solved with recursive calls for each four bits. See figure (8).

00010001101000101110000100110010
(the most significant 12 bits index the lookup table; for each remaining 4 bits, the RemainingBits function is called)
Figure (8): Version two example

4. Performance Analysis
In version one of the suggested algorithm, the constant running time of the original lookup table algorithm is preserved, with a single access to a lookup table one fourth the original size. This version is most effective for small computer words. For example, if the computer word is 8 bits, the table size will be 64 entries instead of 256 entries, and the algorithm still runs in constant time. This version is also useful in combination with the enhanced version of the lookup table algorithm. However, reducing the space requirement to one fourth is not enough for larger computer words: for a 32-bit word, a table of 1,073,741,824 (= 2^32 / 4) entries is still far too large. In version two, we can reduce the lookup table to a pre-specified number of entries and then account for the cost of resolving the remaining bits after excluding the index bits; this cost is related to the chosen table size. For example, if the computer word is 32 bits, we can construct a table of 2^12 elements and then iterate in the while loop over the remaining 20 bits. The complexity is no longer constant: it comes from the while loop, which iterates (n - 12) / 4 times for an n-bit number. Each iteration calls the RemainingBits function, which calls itself recursively two times for each 4 bits. The overall time complexity is therefore 5 while-loop iterations for 32 bits, each taking constant time. Five iterations for 32 bits is considered a good running time, with the note that the complexity varies with the computer word size, as the following analysis shows. With a computer word of 8 bits, version one is a good choice: the table has 64 entries instead of 256 and the running time is constant. Version two is also applicable, reducing the table size again to one fourth (i.e., 16 entries) while keeping the running time efficient. With a computer word of 16 bits, we can apply version two by choosing a table of 2^8 instead of 2^16 entries; with that choice the while loop iterates two times, once for each four bits of the remaining (n - 8) bits. With a computer word of 64 bits, we can apply version two by choosing a table of 2^12 instead of 2^64 entries; with that choice the while loop iterates thirteen times, once for each four bits of the remaining (n - 12) bits.

To measure the performance of version two on 32-bit binary vectors (or computer words), a 32-bit machine was simulated by randomly generating hundreds of thousands of 32-bit binary vectors according to a probability of ones ranging from zero to 1. A Pentium 4 machine was used to generate the values and execute the second version of the new algorithm on them. The execution was repeated several hundreds of times, and the average execution time was taken. The same procedure

was applied to the sequential shifting, parallel, arithmetic logic, and enhanced lookup table algorithms. A comparison between these methods is shown in figure (9). As shown in figure (9), the suggested enhancement of the lookup table algorithm is faster than many of the existing bit-counting algorithms, while also gaining control over the lookup table size. The door for improving the performance of this algorithm is still open: by making more use of the comrade-group property discussed above, it could be exploited not only in the procedure code but also in the construction of the lookup table itself. Restructuring the lookup table according to the repeating patterns that occur when n is a power of 2, and resolving them with a specific mathematical rule, could raise the algorithm's space efficiency to a higher level.
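A sketch of such a measurement in Python (using the standard timeit module; the generator and the AL method from figure (2) stand in for the full set of algorithms, and all names are ours):

```python
import random
import timeit

def random_words(n, p):
    """n random 32-bit words whose individual bits are 1 with probability p."""
    return [sum((random.random() < p) << b for b in range(32)) for _ in range(n)]

def al_count(number):
    """Arithmetic logic counting (figure 2) as a representative method."""
    count = 0
    while number:
        number &= number - 1
        count += 1
    return count

words = random_words(10_000, 0.5)
elapsed = timeit.timeit(lambda: [al_count(w) for w in words], number=10)
print(f"AL count: {elapsed * 100:.1f} ms per pass")  # average of 10 repetitions
```

Repeating this for each probability of ones from 0 to 1, and for each algorithm, yields the curves compared in figure (9).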
(Figure (9) plot: running time in ms, 0 to 60, versus probability of ones, 0 to 1; curves for BitCount, parallel, Arithmetic Logic, and Enhanced Lookup.)
Figure (9): Time comparisons between different bit-counting methods and the new suggested method

5. Conclusion
Since bit-counting has become an important topic, many new algorithms have evolved to meet the demands of bit-counting applications, and many existing ones have been enhanced for time efficiency. One of the most popular time-efficient algorithms is the lookup table algorithm, whose drawback is its space requirement for large computer words and binary vectors. In this paper a new algorithm was presented that overcomes the heavy space requirement of the lookup table algorithm while keeping it time efficient. Both versions of the algorithm combine the lookup table technique with the idea of comrade groups to operate on a computer word or binary vector of any size. This implementation makes the trade-off between running time and required space more flexible: it acts as a kind of slider between table size and running time, set according to both the nature of the application and the available resources.

References
[1] E. El-Qawasmeh, Beating the Popcount, International Journal of Information Technology, Vol. 9, No. 1, 2001, pp. 1-18.
[2] E. El-Qawasmeh, I. Hemidi, Performance Investigation of Hamming Distance Bit Vertical Counter Applied to Access Methods in Information Retrieval, Journal of the American Society for Information Science, Vol. 51, No. 5, 2000, pp. 427-432.
[3] S. Berkovich, E. El-Qawasmeh, G. Lapir, M. Mack, and C. Zincke, Organization of Near Matching in Bit Attribute Matrix Applied to Associative Access Methods in Information Retrieval, Proc. of the 16th IASTED International Conference Applied Informatics, Garmisch-Partenkirchen, Germany, 1998, pp. 62-64.
[4] E. El-Qawasmeh, Performance Investigation of Bit-Counting Algorithms with a Speedup to Lookup Table, Journal of Research and Practice in Information Technology, Australia, Vol. 32(3/4), 2001, pp. 215-230.
[5] R. Gutman, Exploiting 64-Bit Parallelism, Dr. Dobb's Journal, Vol. 25, No. 9, 2000, pp. 133-134.
[6] E. Reingold, J. Nievergelt, and N. Deo, Combinatorial Algorithms: Theory and Practice, Prentice Hall, Englewood Cliffs, New Jersey, 1977.
[7] E. Goldberg, K. Deb, and H. Clark, Genetic Algorithms, Noise, and the Sizing of Populations, Complex Systems, Vol. 6, 1992, pp. 333-362.
[8] S. Berkovich, G. Lapir, and M. Mack, A Bit-Counting Algorithm Using the Frequency Division Principle, Software: Practice and Experience, Vol. 30, No. 14, 2000, pp. 1531-1540.
