
SIMULATION AND SYNTHESIS OF AN EFFICIENT 32-BIT LOSSLESS COMPRESSION METHOD USING VHDL/VERILOG

Objective
With the increase in silicon densities, it is becoming feasible for multiple compression systems to be implemented in parallel on a single chip. A 32-bit system with a distributed memory architecture is based on having multiple data compression and decompression engines working independently on different data at the same time, with the data stored in memory distributed to each processor. The objective of the project is to design a lossless data compression system which operates at high speed to achieve a high compression rate. By using an array of compressors, the data compression rates are significantly improved, and the architecture is inherently scalable. The main parts of the system are the XMatchPro-based data compressors and the control blocks providing control signals for the data compressors. Each data compressor can process four bytes of data into and from a block of data every clock cycle. The data entering the system needs to be clocked in at a rate of 4n bytes every clock cycle, where n is the number of compressors in the system. This ensures that adequate data is present for all compressors to process, rather than leaving them idle.

Requirements
1. ModelSim simulator (Student Edition)
2. VHDL or Verilog programming

XMatchPro Based System


The lossless data compression system is a derivative of the XMatchPro algorithm, which originates from the authors' previous research [15] and advances in FPGA technology. The flexibility provided by this technology is of great interest, since the chip can easily be adapted to the requirements of a particular application. The drawbacks of some previous methods are overcome by using the XMatchPro algorithm in the design. The objective is to obtain better compression ratios while still maintaining a high throughput, so that the compression/decompression processes do not slow the original system down.

Block Diagram of XMatchPro Based 32-Bit Compression / Decompression System

Usage of XMatchPro Algorithm


The lossless parallel data compression system designed here uses the XMatchPro algorithm. The XMatchPro algorithm uses a fixed-width dictionary of previously seen data and attempts to match the current data element with an entry in the dictionary. It works by taking a 4-byte word and trying to fully or partially match this word with past data. This past data is stored in a dictionary, which is constructed from a content addressable memory. As each entry is 4 bytes wide, several types of matches are possible. If none of the bytes match any data present in the dictionary, the tuple is transmitted literally, prefixed by a miss bit. If the bytes are matched, the match location and match type are coded and transmitted, and the matched entry is moved to the front of the dictionary. The dictionary is maintained using a move-to-front strategy, whereby a new tuple is placed at the front of the dictionary while the rest move down one position. When the dictionary becomes full, the tuple in the last position is discarded, leaving space for a new one. The coding function for a match codes several fields, as follows: a zero followed by
1. Match location: the binary code associated with the matching location.
2. Match type: indicates which bytes of the incoming tuple have matched.
3. Characters that did not match, transmitted in literal form.
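As an illustration, the following Verilog sketch implements this coding step behaviorally for a single tuple. The field widths are assumptions chosen for clarity (a 4-bit location for a 16-entry dictionary and a plain 4-bit byte mask as the match type); the actual design codes these fields more compactly. The module and port names are likewise hypothetical.

module xmatch_coder (
    input  wire [31:0] tuple,       // incoming 4-byte tuple
    input  wire        hit,         // a full or partial match was found
    input  wire [3:0]  match_loc,   // location of the best match (16-entry dictionary assumed)
    input  wire [3:0]  match_type,  // one bit per byte: 1 = byte matched
    output reg  [32:0] code,        // left-aligned output code word
    output reg  [5:0]  code_len     // number of valid bits in 'code'
);
    integer i;
    always @* begin
        if (!hit) begin
            // Miss: a '1' bit followed by the whole tuple in literal form.
            code     = {1'b1, tuple};
            code_len = 6'd33;
        end else begin
            // Match: a '0' bit, the match location and the match type,
            // followed by any bytes that did not match, sent literally.
            code     = {1'b0, match_loc, match_type, 24'b0};
            code_len = 6'd9;
            for (i = 3; i >= 0; i = i - 1)
                if (!match_type[i]) begin
                    code     = code | ({25'b0, tuple[i*8 +: 8]} << (25 - code_len));
                    code_len = code_len + 6'd8;
                end
        end
    end
endmodule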

The XMatchPro-Based Compression System

Previous research on the lossless XMatchPro data compressor has focused on optimising and implementing the XMatchPro algorithm for speed, complexity and compression in hardware. The XMatchPro algorithm uses a fixed-width dictionary of previously seen data and attempts to match the current data element with an entry in the dictionary. It works by taking a 4-byte word and trying to match this word with past data. This past data is stored in a dictionary, which is constructed from a content addressable memory. Initially all the entries in the dictionary are empty; 4 bytes are added to the front of the dictionary, while the rest move one position down, whenever a full match has not occurred. The larger the dictionary, the greater the number of address bits needed to identify each memory location, which reduces compression performance. Since the number of bits needed to code each location address is a function of the current dictionary size, greater compression is obtained in comparison to the case where a fixed-size dictionary uses fixed address codes for a partially full dictionary. In the parallel XMatchPro system, the data stream to be compressed enters the compression system, where it is partitioned and routed to the compressors. For parallel compression systems, it is important to ensure all compressors are supplied with sufficient data by managing the supply so that neither stall conditions nor data overflow occurs.
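The adaptive location coding described above can be sketched as a small Verilog helper that returns ceil(log2(n)) for the current dictionary occupancy, so that a partially full dictionary uses shorter location codes. The 16-entry dictionary and the module and port names are illustrative assumptions.

module loc_width_calc (
    input  wire [4:0] entries_used,  // current dictionary occupancy, 0..16
    output wire [2:0] width          // bits needed for the location code
);
    // Returns ceil(log2(value)): while the dictionary is only partially
    // full, the match location is coded with just enough bits to address
    // the entries used so far, instead of a fixed-width code.
    function [2:0] clog2;
        input [4:0] value;
        integer n;
        begin
            clog2 = 1;
            for (n = 2; n < value; n = n * 2)
                clog2 = clog2 + 1;
        end
    endfunction

    assign width = clog2(entries_used);
endmodule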

The Main Component: Content Addressable Memory


Dictionary-based schemes copy repetitive or redundant data into a lookup table (such as a CAM) and output the dictionary address as a code to replace the data. The compression architecture is based around a block of CAM which realises the dictionary. This is necessary since the search operation must be done in parallel across all the entries in the dictionary to allow high and data-independent throughput. The number of bits in a CAM word is usually large, with existing implementations ranging from 36 to 144 bits. A typical CAM employs a table size ranging from a few hundred entries to 32K entries, corresponding to an address space ranging from 7 bits to 15 bits. The length of the CAM varies between three possible values of 16, 32 or 64 tuples, trading complexity for compression. The number of tuples present in the dictionary has an important effect on compression. In principle, the larger the dictionary, the higher the probability of having a match, improving compression. On the other hand, a bigger dictionary uses more bits to code its locations, degrading compression when processing small data blocks that only use a fraction of the dictionary length available. The width of the CAM is fixed at 4 bytes per word.

A content addressable memory compares input search data against a table of stored data and returns the address of the matching data. CAMs have single-clock-cycle throughput, making them faster than other hardware- and software-based search systems. The input to the system is the search word, which is broadcast onto the searchlines to the table of stored data. Each stored word has a matchline that indicates whether the search word and stored word are identical (the match case) or different (a mismatch case, or miss). The matchlines are fed to an encoder that generates a binary match location corresponding to the matchline that is in the match state. An encoder is used in systems where only a single match is expected. The overall function of a CAM is therefore to take a search word and return the matching memory location.

Managing Dictionary Entries

Since the initialization of a compression CAM sets all words to zero, an input word formed by zeros will generate multiple full matches in different locations. The XMatchPro compression system simply selects the full match closest to the top. This operational mode initializes the dictionary to a state where all the words with a location address bigger than zero are effectively invalid, without the need for extra logic: location x can never generate a match until the data contents of location x-1 differ from zero, because locations closer to the top have higher priority when generating matches. Also, to increase dictionary efficiency, at most one dictionary position contains repeated information and, in the best case, all the dictionary positions contain different data.
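The dictionary CAM and its priority selection can be sketched in Verilog as follows: all stored words are compared with the search word in parallel, the matchlines drive a priority encoder that picks the match closest to the top (location 0), and a simplified shift-in models the dictionary update. The 16-tuple depth and the always-shift update are simplifying assumptions; the full design moves a matched entry to the front rather than always shifting.

module cam_dictionary (
    input  wire        clk,
    input  wire        shift_en,     // insert the search word at the top
    input  wire [31:0] search_word,
    output wire        hit,          // at least one full match
    output reg  [3:0]  match_loc     // location of the highest-priority match
);
    reg  [31:0] store [0:15];        // 16 tuples of 4 bytes (smallest variant)
    wire [15:0] matchline;
    integer i;
    genvar g;

    // Parallel compare: one matchline per stored word (the match case).
    generate
        for (g = 0; g < 16; g = g + 1) begin : cmp
            assign matchline[g] = (store[g] == search_word);
        end
    endgenerate

    assign hit = |matchline;

    // Priority encoder: the match closest to the top (location 0) wins,
    // which also makes an all-zero initialised dictionary behave as if
    // only location 0 were valid, as described above.
    always @* begin
        match_loc = 4'd0;
        for (i = 15; i >= 0; i = i - 1)
            if (matchline[i]) match_loc = i[3:0];
    end

    // Simplified update: shift entries down and insert the new word at
    // the front.
    always @(posedge clk)
        if (shift_en) begin
            for (i = 15; i > 0; i = i - 1)
                store[i] <= store[i-1];
            store[0] <= search_word;
        end
endmodule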

XMATCHPRO LOSSLESS COMPRESSION SYSTEM


DESIGN METHODOLOGY
The XMatchPro algorithm is efficient at compressing the small blocks of data associated with the cache- and page-based memory hierarchies found in computer systems, and it is suitable for high-performance hardware implementation. The XMatchPro hardware achieves a throughput 2-3 times greater than other high-performance hardware implementations. The core component of the system is the XMatchPro-based compression/decompression engine. XMatchPro is a high-speed lossless dictionary-based data compressor. The algorithm works by taking an incoming four-byte tuple of data and attempting to fully or partially match the tuple with past data.

FUNCTIONAL DESCRIPTION
The XMatchPro algorithm maintains a dictionary of data previously seen and attempts to match the current data element with an entry in the dictionary, replacing it with a shorter code referencing the match location. Data elements that do not produce a match are transmitted in full (literally), prefixed by a single bit. Each data element is exactly 4 bytes wide and is referred to as a tuple. This feature gives a guaranteed input data rate during compression, and thus also guaranteed data rates during decompression, irrespective of the data mix. The 4-byte tuple size also gives an inherently higher throughput than other algorithms, which tend to operate on a byte stream. The dictionary is maintained using a move-to-front strategy, whereby the current tuple is placed at the front of the dictionary and the other tuples move down by one location as necessary to make space. The move-to-front strategy aims to exploit locality in the input data. If the dictionary becomes full, the tuple occupying the last location is simply discarded.

A full match occurs when all characters in the incoming tuple fully match a dictionary entry. A partial match occurs when at least any two of the characters in the incoming tuple match exactly with a dictionary entry, with the characters that do not match being transmitted literally. The use of partial matching improves the compression ratio when compared with allowing only 4-byte matches, but still maintains high throughput. If neither a full nor a partial match occurs, then a miss is registered and a single miss bit of 1 is transmitted, followed by the tuple itself in literal form. The only exception to this is the first tuple in any compression operation, which will always generate a miss as the dictionary begins in an empty state; in this case no miss bit is required to prefix the tuple. At the beginning of each compression operation, the dictionary size is reset to zero. The dictionary then grows by one location as each incoming tuple is placed at the front of the dictionary and all other entries move down by one location. A full match does not grow the dictionary, but the move-to-front rule is still applied. This growth of the dictionary means that code words are short during the early stages of compressing a block. Because the XMatchPro algorithm allows partial matches, a decision must be made about which of the locations provides the best overall match, with the selection criterion being the shortest possible number of output bits.
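A per-dictionary-entry match unit following this description might look as below in Verilog: each byte is compared separately, a partial match requires at least two matching bytes, and an assumed cost model (header bits plus eight bits per literal byte) supports the shortest-output-bits decision. All widths and the cost model are illustrative assumptions, not the exact coding of the original design.

module match_type_unit (
    input  wire [31:0] tuple,        // incoming 4-byte tuple
    input  wire [31:0] dict_word,    // one dictionary entry
    output wire [3:0]  match_type,   // 1 = corresponding byte matched
    output wire        full_match,
    output wire        partial_match,
    output wire [5:0]  cost_bits     // output bits if this entry is chosen
);
    // Byte-wise comparison of tuple against the dictionary entry.
    assign match_type[0] = (tuple[7:0]   == dict_word[7:0]);
    assign match_type[1] = (tuple[15:8]  == dict_word[15:8]);
    assign match_type[2] = (tuple[23:16] == dict_word[23:16]);
    assign match_type[3] = (tuple[31:24] == dict_word[31:24]);

    wire [2:0] n_match = match_type[0] + match_type[1]
                       + match_type[2] + match_type[3];

    assign full_match    = (n_match == 3'd4);
    assign partial_match = (n_match >= 3'd2) && !full_match;

    // Assumed cost model: a 9-bit match header (miss bit + 4-bit location
    // + 4-bit type) plus 8 bits per non-matching literal byte.
    assign cost_bits = 6'd9 + ((6'd4 - {3'b0, n_match}) << 3);
endmodule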

Parallel XMatchPro Compression


The input router of the system divides the data to be processed, and the output router concatenates the compressed results to form the output of the parallel compression system. The data split by the input router is sent to the XMatchPro compression engines, where it is compressed, and then passed to the output router, which merges the compressed streams and sends them out as the compressed data.

For multiple compression systems, it is important to ensure all compressors are supplied with sufficient data by managing the supply so that neither stall conditions nor data overflow occurs. There are several approaches by which data can be routed into and out of the compressors. The basic method for input routing used in this project is to accept twice the input width of a single XMatchPro compressor: the lower 32 bits are given to Compressor 0 and the higher 32 bits are given to Compressor 1. A similar method is used for output routing, with additional output pins assigned to both Compressor 0 and Compressor 1.
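The 64-bit split just described amounts to a purely combinational input router. A minimal Verilog sketch, with assumed port names, is:

module input_router (
    input  wire [63:0] data_in,      // 8 bytes per clock cycle (4n bytes, n = 2)
    output wire [31:0] comp0_data,   // tuple for XMatchPro Compressor 0
    output wire [31:0] comp1_data    // tuple for XMatchPro Compressor 1
);
    assign comp0_data = data_in[31:0];   // lower 32 bits -> Compressor 0
    assign comp1_data = data_in[63:32];  // upper 32 bits -> Compressor 1
endmodule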

DATA FLOW FOR PARALLEL XMATCHPRO COMPRESSOR


The figure below shows graphically the general concept of this approach. The data stream to be compressed enters the compression system, where it is partitioned and routed to the compressors. Appropriate methods for routing the data are discussed below; to achieve good compression performance, it is important that the partitioning mechanism supplies the compressors with sufficient data to keep them active for as great a proportion of the time that the stream is entering the system as possible. As the compressors operate independently, each producing its own compressed data stream, a mechanism is required to merge these streams in such a way that subsequent decompression can be performed correctly. Subsequent decompression also needs to be capable of operating in an appropriately parallel fashion; otherwise, a disparity in compression and decompression speeds will occur, reducing overall throughput. The data flow for the parallel compression system is given in Figure 3 below.

INPUT ROUTING
As per the algorithm, XMatchPro can process four bytes of data per clock cycle, so to ensure that all engines are busy, data must enter the system at a rate of 4n bytes per clock cycle, where n is the number of compressors in the system. This can be achieved by two methods:
1. Interleaved input method
2. Blocked input method

INTERLEAVED INPUT METHOD

In the interleaved input approach, the router divides the input data into 4-byte-wide data streams that are fed to the compressors. This is illustrated in the figure below for two compressors, but the technique can be extended to supply data to any required number of compressors.

[Fig. 4.3. Interleaved Input Routing: the input router steers words 1, 3, 5, 7, ... to one XMatchPro compressor and words 2, 4, 6, 8, ... to the other.]

The interleaved method avoids the need for input buffering as data are continuously fed to the compressors and acted upon immediately on arrival. This minimization of latency is an important advantage of the approach.
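A minimal Verilog sketch of interleaved routing for two compressors is shown below. For clarity it accepts one 4-byte word per clock cycle and alternates destinations with a toggle; at the full 4n-byte rate the input bus would simply be widened. Port and signal names are assumptions.

module interleaved_router (
    input  wire        clk,
    input  wire        rst,
    input  wire        valid_in,
    input  wire [31:0] word_in,      // one 4-byte word per clock cycle
    output reg         comp0_valid,
    output reg  [31:0] comp0_data,
    output reg         comp1_valid,
    output reg  [31:0] comp1_data
);
    reg sel;  // 0: next word goes to compressor 0, 1: to compressor 1

    always @(posedge clk) begin
        if (rst) begin
            sel         <= 1'b0;
            comp0_valid <= 1'b0;
            comp1_valid <= 1'b0;
        end else begin
            comp0_valid <= valid_in & ~sel;
            comp1_valid <= valid_in &  sel;
            if (valid_in) begin
                if (sel) comp1_data <= word_in;
                else     comp0_data <= word_in;
                sel <= ~sel;   // alternate words between the two engines
            end
        end
    end
endmodule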

BLOCKED INPUT METHOD

In the blocked input approach, a fixed length block of data is sent from the incoming data stream to each of the compressors in turn, as shown in the following figure. In this scheme, the data has to arrive at the dedicated memory of the compressor at a rate slower than it can be processed, thereby allowing the memory to be filled with data.

To minimize the latency introduced in blocked mode, compressors need to start processing data as soon as it arrives. It is also important to ensure that sufficient data are available for each compressor to work on while data are being routed to the other compressors, as no new data can be added to its dedicated memory until this process has been completed.
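A blocked input router can be sketched in Verilog with a block counter that steers a fixed-length run of words to each compressor in turn. The block length, counter width and port names are illustrative assumptions.

module blocked_router #(
    parameter BLOCK_WORDS = 256      // fixed block length, in 4-byte words
) (
    input  wire        clk,
    input  wire        rst,
    input  wire        valid_in,
    input  wire [31:0] word_in,
    output wire        comp0_valid,
    output wire        comp1_valid,
    output wire [31:0] word_out
);
    reg [8:0] count;   // counts words within the current block
    reg       sel;     // which compressor receives the current block

    assign word_out    = word_in;
    assign comp0_valid = valid_in & ~sel;
    assign comp1_valid = valid_in &  sel;

    always @(posedge clk) begin
        if (rst) begin
            count <= 9'd0;
            sel   <= 1'b0;
        end else if (valid_in) begin
            if (count == BLOCK_WORDS - 1) begin
                count <= 9'd0;
                sel   <= ~sel;    // the next block goes to the other compressor
            end else begin
                count <= count + 9'd1;
            end
        end
    end
endmodule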

PROPOSED INPUT ROUTING

In this project, the blocked input routing method is used for inputting data to the compression system, as it is more advantageous than the interleaved input approach. The advantage of this method is that the complexity of design and coding is reduced, which helps in achieving a superior compression ratio. At the same time, the number of input pins increases, as another set of pins is assigned for the second XMatchPro compressor. The input data size for one XMatchPro compressor is 32 bits, so another 32 bits are required for the second XMatchPro compressor. To achieve this, the parallel compressor is designed with a 64-bit input: the lower-order 32 bits are sent to one XMatchPro compressor and the higher-order 32 bits are sent to the second. Both XMatchPro compressors are thus supplied with data simultaneously, which increases the speed of compression.

OUTPUT ROUTING

The lengths of the compressed data output blocks from an array of parallel compressors will generally not be constant, due to the variability of redundancy in the data. Since, during decompression, the system would not know the data boundaries of each block, these data cannot be sent directly to the output bus, and additional manipulation is needed in order to guarantee that the original data can be recovered. This is achieved by one of three methods:
1. Single compressed block
2. Multiple compressed block
3. Interleaved compressed block

SINGLE COMPRESSED BLOCK

In this method, it is assumed that the data enter the system using the blocked mode technique and that the compressed data are collected in the compressors' output buffers. The buffer outputs are routed in strict order of compressor number, and a boundary tag containing information on the block length is added so as to precede the data. As the tag enters the decompression system first, the system will know the length of the compressed data belonging to any given decompression engine. The introduction of tags is detrimental to the compression ratio, but this effect diminishes as the block length is increased, since the overhead of one tag per block of compressed data is largely constant.

One of the drawbacks of this approach is that the data output may contain idle time. This arises because a whole block of data needs to be compressed before the appropriate tag values can be determined, so a compressor may still be compressing its data when the router becomes available.
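A behavioral Verilog sketch of the boundary-tag stage for this method is given below: once a compressor signals that its block is complete, a length tag is emitted first and the buffered block is then streamed out. The 16-bit tag width and all port names are assumptions.

module tagged_output (
    input  wire        clk,
    input  wire        rst,
    input  wire        block_done,   // a compressor finished its block
    input  wire [15:0] block_len,    // compressed length, in 32-bit words
    input  wire [31:0] buf_word,     // next word read from the output buffer
    output reg         out_valid,
    output reg  [31:0] out_word
);
    reg        sending;
    reg [15:0] remaining;

    always @(posedge clk) begin
        if (rst) begin
            sending   <= 1'b0;
            out_valid <= 1'b0;
        end else begin
            out_valid <= 1'b0;
            if (block_done && !sending) begin
                // The boundary tag precedes the data, zero-extended to
                // the 32-bit output bus width.
                out_word  <= {16'b0, block_len};
                out_valid <= 1'b1;
                remaining <= block_len;
                sending   <= (block_len != 16'd0);
            end else if (sending) begin
                out_word  <= buf_word;   // then stream the buffered block
                out_valid <= 1'b1;
                remaining <= remaining - 16'd1;
                if (remaining == 16'd1)
                    sending <= 1'b0;
            end
        end
    end
endmodule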


MULTIPLE COMPRESSED BLOCK

Figure 2.7 illustrates the format of an output data stream containing multiple blocks. This is similar to the single-block scheme but, instead of sending data as soon as each compressor finishes processing its block, all compressors need to finish compressing their blocks before the data are sent. In this technique, the tag provides information on the length of the compressed data to enable correct decompression. As all compressors need to have completed their operations before an output can be produced, this approach has a greater latency than the single compressed block case but, as fewer tags are needed, the effect on the compression ratio is reduced. The combined tag is shorter than the sum of the individual tags, as the output bus granularity is of fixed width. Output tags are sized in accordance with the output bus width in order to simplify the routing architecture and decoding operations, even though fewer bits would be required to determine block size boundaries.

INTERLEAVED COMPRESSED BLOCK

The figure illustrates the interleaved approach for routing multiple compressed blocks of data to the output stream. Instead of waiting for a whole block to be compressed, a predefined fixed length of compressed data is always sent to the output. If a compressor has not completed its operations, the system must wait until the data block has been produced.


There are two benefits of this approach compared with the previously discussed methods. First, there is a reduction in latency, since data can be sent to the output before the whole block is compressed. Second, since no boundary tags are required, the compression ratio is improved. At the end of a compression sequence, the interleaved approach needs to add dummy data to the output stream: on receipt of the stop signal, output routing continues until all compressors have completed operations on their input blocks. It is likely that the final interleaved block from each compressor will contain insufficient data to fill the required fixed output length, so dummy data are added as required in order to maintain the interleave length.

PROPOSED OUTPUT ROUTING

In this project, the interleaved technique was selected as the output routing method, as it imparts no overhead to maintain compressed data boundaries and so has no detrimental effect on the compression ratio. A further advantage of this method is that the complexity of design and coding is reduced. At the same time, the number of output pins increases, as another set of pins is assigned for the second XMatchPro compressor. The output data size for one 32-bit compressor is either 7 bits (when a match is found) or 33 bits (when no match is found), so another set of 33 bits for the no-match case and 7 bits for the match case is required for the second compressor. To achieve this, the parallel compressor is designed with two sets of 7-bit as well as two sets of 33-bit output pins. Both compressors thus deliver data simultaneously, and the output data are transmitted via the external bus.
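A Verilog sketch of this interleaved output routing for two compressors follows: fixed-length chunks are taken from each engine in turn and, after the stop signal arrives, dummy words pad any incomplete chunk so the interleave length is maintained. The chunk length, the zero dummy value and the port names are all assumptions; a complete design would also track when both engines have drained.

module interleaved_output #(
    parameter CHUNK_WORDS = 4        // fixed interleave length, in words
) (
    input  wire        clk,
    input  wire        rst,
    input  wire        stop,         // end of the compression sequence
    input  wire        c0_avail,     // compressor 0 has a word ready
    input  wire        c1_avail,
    input  wire [31:0] c0_word,
    input  wire [31:0] c1_word,
    output reg         rd0,          // pop one word from compressor 0
    output reg         rd1,
    output reg         out_valid,
    output reg  [31:0] out_word
);
    reg       sel;                   // whose chunk is currently being emitted
    reg [2:0] count;

    wire        avail = sel ? c1_avail : c0_avail;
    wire [31:0] word  = sel ? c1_word  : c0_word;

    always @(posedge clk) begin
        rd0       <= 1'b0;
        rd1       <= 1'b0;
        out_valid <= 1'b0;
        if (rst) begin
            sel   <= 1'b0;
            count <= 3'd0;
        end else if (avail || stop) begin
            // Real data when available; otherwise (after stop) dummy
            // words keep the fixed interleave length. If no data are
            // available before stop, the system simply waits.
            out_word  <= avail ? word : 32'h0;
            out_valid <= 1'b1;
            rd0       <= avail & ~sel;
            rd1       <= avail &  sel;
            if (count == CHUNK_WORDS - 1) begin
                count <= 3'd0;
                sel   <= ~sel;       // switch to the other compressor
            end else
                count <= count + 3'd1;
        end
    end
endmodule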
