OBJECTIVE
With the increase in silicon densities, it is becoming feasible to implement multiple compression systems in parallel on a single chip. A 32-bit system with a distributed memory architecture is based on having multiple data compression and decompression engines working independently on different data at the same time, with the data stored in memory distributed to each processor. The objective of this project is to design a lossless data compression system that operates at high speed and achieves a high compression rate. By using a parallel architecture of compressors, the data compression rate is significantly improved, and the architecture is inherently scalable. The main parts of the system are the XMatchPro-based data compressors and the control blocks providing control signals for the data compressors. Each data compressor can process four bytes of data into and out of a block of data every clock cycle. Data entering the system therefore needs to be clocked in at a rate of 4n bytes per clock cycle, where n is the number of compressors in the system, to ensure that all compressors have adequate data to process rather than sitting idle.
Previous research on the lossless XMatchPro data compressor has focused on optimising and implementing the XMatchPro algorithm in hardware for speed, complexity and compression. The XMatchPro algorithm uses a fixed-width dictionary of previously seen data and attempts to match the current data element with an entry in the dictionary. It works by taking a 4-byte word and trying to match this word against past data. This past data is stored in a dictionary, which is constructed from a content addressable memory (CAM). Initially all entries in the dictionary are empty; if a full match has not occurred, the incoming 4 bytes are added to the front of the dictionary while the remaining entries move one position down. The larger the dictionary, the greater the number of address bits needed to identify each memory location, which reduces compression performance. Because the number of bits needed to code each location address is a function of the current dictionary occupancy, greater compression is obtained than in the case where a fixed-size dictionary uses fixed address codes while only partially full. In the parallel XMatchPro system, the data stream to be compressed enters the compression system and is then partitioned and routed to the compressors. For parallel compression systems, it is important to ensure all compressors are supplied with sufficient data, managing the supply so that neither stall conditions nor data overflow occurs.
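As a rough illustration of the address-coding point above, the sketch below (hypothetical Python, not part of the hardware design) computes how many bits are needed to identify a match location when only part of the dictionary is in use; the function name address_bits is an assumption made for illustration.

```python
# Illustrative only: the address code width tracks how many dictionary entries
# are currently valid, so a partially full dictionary spends fewer bits per
# match location than fixed-width addressing would.
from math import ceil, log2

def address_bits(entries_in_use: int) -> int:
    """Bits needed to identify a match location among `entries_in_use` valid entries."""
    return max(1, ceil(log2(max(entries_in_use, 2))))

# A 16-entry dictionary holding only 3 valid entries needs 2 address bits,
# whereas fixed addressing would always spend 4 bits per location code.
assert address_bits(3) == 2
assert address_bits(16) == 4
```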
A priority encoder is used in systems where only a single match is expected. The overall function of a CAM is to take a search word and return the matching memory location.

MANAGING DICTIONARY ENTRIES
Since the initialization of a compression CAM sets all words to zero, an input word formed entirely of zeros would generate multiple full matches in different locations. The XMatchPro compression system simply selects the full match closest to the top. This operational mode initializes the dictionary to a state where all words with a location address greater than zero are effectively invalid, without the need for extra logic: location x can never generate a match until the contents of location x-1 differ from zero, because locations closer to the top have higher priority when generating matches. This also increases dictionary efficiency, since at most one dictionary position contains repeated information and, in the best case, all dictionary positions contain different data.
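A minimal sketch of the behaviour just described, assuming a plain Python list stands in for the CAM and that the hypothetical helper cam_search returns the highest-priority (lowest-address) full match:

```python
def cam_search(dictionary, word):
    """Return the address of the highest-priority full match, or None if there is none."""
    for address, entry in enumerate(dictionary):   # address 0 = top of the dictionary
        if entry == word:
            return address                         # locations closer to the top win
    return None

# After initialisation every entry is zero, so an all-zero search word matches
# location 0 only; locations below it stay effectively invalid until the data
# above them becomes non-zero.
dictionary = [b"\x00" * 4] * 8
assert cam_search(dictionary, b"\x00" * 4) == 0
```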
FUNCTIONAL DESCRIPTION
The XMatchPro algorithm maintains a dictionary of previously seen data and attempts to match the current data element with an entry in the dictionary, replacing it with a shorter code referencing the match location. Data elements that do not produce a match are transmitted in full (literally), prefixed by a single bit. Each data element is exactly 4 bytes wide and is referred to as a tuple. This feature gives a guaranteed input data rate during compression, and thus also guaranteed data rates during decompression, irrespective of the data mix. The 4-byte tuple size also gives an inherently higher throughput than algorithms that operate on a byte stream. The dictionary is maintained using a move-to-front strategy, whereby the current tuple is placed at the front of the dictionary and the other tuples move down by one location as necessary to make space. The move-to-front strategy aims to exploit locality in the input data. If the dictionary becomes full, the tuple occupying the last location is simply discarded.

A full match occurs when all characters in the incoming tuple exactly match a dictionary entry. A partial match occurs when at least two of the characters in the incoming tuple match exactly with a dictionary entry; the characters that do not match are transmitted literally. The use of partial matching improves the compression ratio compared with allowing only 4-byte matches, while still maintaining high throughput. If neither a full nor a partial match occurs, a miss is registered: a single miss bit of 1 is transmitted, followed by the tuple itself in literal form. The only exception is the first tuple in any compression operation, which always generates a miss because the dictionary begins in an empty state; in this case no miss bit is required to prefix the tuple. At the beginning of each compression operation, the dictionary size is reset to zero. The dictionary then grows by one location for each incoming tuple that is placed at the front of the dictionary, with all other entries moving down by one location. A full match does not grow the dictionary, but the move-to-front rule is still applied. This growth of the dictionary means that code words are short during the early stages of compressing a block. Because the XMatchPro algorithm allows partial matches, a decision must be made about which location provides the best overall match, the selection criterion being the shortest possible number of output bits.
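The following sketch models one compression step of the behaviour described above. It is a simplified software model rather than the hardware implementation: the names xmatchpro_step and MAX_DICT are assumptions, the output is represented as symbolic tokens rather than actual bit codes, and the partial-match rule is reduced to "at least two bytes equal".

```python
MAX_DICT = 16  # assumed dictionary depth for this model

def bytes_matching(tuple_, entry):
    """Number of byte positions (0..4) where tuple_ and entry agree."""
    return sum(1 for a, b in zip(tuple_, entry) if a == b)

def xmatchpro_step(dictionary, tuple_):
    """Process one 4-byte tuple; returns a token and updates the dictionary.

    Tokens: ('full', location), ('partial', location, literal_bytes), ('miss', tuple_).
    A miss or partial match grows the dictionary; a full match only triggers
    the move-to-front update.
    """
    best_loc, best_score = None, 0
    for loc, entry in enumerate(dictionary):      # location 0 = front, highest priority
        score = bytes_matching(tuple_, entry)
        if score > best_score:                    # ties go to the entry closest to the top
            best_loc, best_score = loc, score

    if best_score == 4:                           # full match: dictionary does not grow
        dictionary.insert(0, dictionary.pop(best_loc))
        return ('full', best_loc)

    matched_entry = dictionary[best_loc] if best_loc is not None else None
    dictionary.insert(0, tuple_)                  # incoming tuple moves to the front
    if len(dictionary) > MAX_DICT:
        dictionary.pop()                          # last location simply discarded

    if best_score >= 2:                           # partial match: unmatched bytes sent literally
        literals = bytes(t for t, e in zip(tuple_, matched_entry) if t != e)
        return ('partial', best_loc, literals)
    return ('miss', tuple_)                       # no usable match: whole tuple sent literally
```

For example, feeding the tuples b"ABCD", b"ABCD", b"ABCE" to an initially empty dictionary yields a miss, a full match at location 0, and a partial match at location 0 with the single literal byte b"E".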
As noted earlier, all compressors must be supplied with sufficient data, with the supply managed so that neither stall conditions nor data overflow occurs. There are several ways in which data can be routed into and out of the compressors. The basic input routing method used in this project takes an input twice the width of a single XMatchPro compressor: the lower 32 bits are given to Compressor 0 and the higher 32 bits to Compressor 1. A similar method is used for output routing, with an additional set of output pins assigned to each of Compressor 0 and Compressor 1.
INPUT ROUTING
Since each XMatchPro compressor can process four bytes of data per clock cycle, data must enter the system at a rate of 4n bytes per clock cycle, where n is the number of compressors in the system, to ensure that all compressors are kept busy. This can be achieved by two methods:
1. Interleaved input method
2. Blocked input method
INTERLEAVED INPUT METHOD
In the interleaved input approach, the router divides the input data into 4-byte-wide data streams that are fed into the compressors. This is illustrated in the figure below for two compressors, but the technique can be extended to supply data to any required number of compressors.
[Figure: Interleaved input routing. The input router (IR) sends tuples 1, 3, 5, 7, ... to one XMatchPro compressor and tuples 2, 4, 6, 8, ... to the other.]
The interleaved method avoids the need for input buffering as data are continuously fed to the compressors and acted upon immediately on arrival. This minimization of latency is an important advantage of the approach.
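As a small illustration of this input method, the sketch below deals the incoming stream out to the compressors one 4-byte tuple at a time; route_interleaved is a hypothetical helper, not part of the hardware design.

```python
def route_interleaved(stream: bytes, n_compressors: int):
    """Split `stream` into per-compressor tuple lists, one 4-byte tuple at a time."""
    lanes = [[] for _ in range(n_compressors)]
    tuples = [stream[i:i + 4] for i in range(0, len(stream), 4)]
    for index, tup in enumerate(tuples):
        lanes[index % n_compressors].append(tup)   # tuple 1 -> compressor 0, tuple 2 -> compressor 1, ...
    return lanes

# With two compressors, tuples 1, 3, 5, 7 go to one XMatchPro and 2, 4, 6, 8 to
# the other, as in the figure above; no input buffering is needed because each
# tuple can be consumed in the cycle it arrives.
lanes = route_interleaved(bytes(range(32)), 2)
assert len(lanes[0]) == len(lanes[1]) == 4
```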
BLOCKED INPUT METHOD
In the blocked input approach, a fixed-length block of data is sent from the incoming data stream to each of the compressors in turn, as shown in the following figure. In this scheme, the data has to arrive at the dedicated memory of the compressor at a rate slower than it can be processed, thereby allowing the memory to be filled with data.
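For comparison with the interleaved sketch above, a corresponding sketch of the blocked input method; the helper route_blocked and the block length BLOCK_BYTES are assumptions made for illustration.

```python
BLOCK_BYTES = 1024  # assumed block length; the real value is a design parameter

def route_blocked(stream: bytes, n_compressors: int, block_bytes: int = BLOCK_BYTES):
    """Send consecutive fixed-length blocks to the compressors' dedicated memories in turn."""
    lanes = [[] for _ in range(n_compressors)]
    for index in range(0, len(stream), block_bytes):
        block = stream[index:index + block_bytes]
        lanes[(index // block_bytes) % n_compressors].append(block)
    return lanes
```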
In this project, the blocked input routing method is used to supply data to the compression system, as it is more advantageous than the interleaved input approach: the design and coding complexity is reduced, and it helps achieve a superior compression ratio. At the same time, the number of input pins increases, because a second set of pins is assigned to the second XMatchPro compressor. The input data width for one XMatchPro compressor is 32 bits, so another 32 bits are required for the second compressor. To achieve this, the parallel compressor is designed with a 64-bit input: the lower-order 32 bits are sent to one XMatchPro compressor and the higher-order 32 bits to the second. Both compressors are therefore supplied with data simultaneously, which increases the speed of compression.
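A minimal sketch of this 64-bit split, assuming the input word is held as an integer; split_input_word is a hypothetical helper rather than the actual HDL port declaration.

```python
def split_input_word(word_64: int):
    """Return (compressor0_word, compressor1_word) from one 64-bit input word."""
    lower_32 = word_64 & 0xFFFF_FFFF          # lower-order 32 bits -> Compressor 0
    upper_32 = (word_64 >> 32) & 0xFFFF_FFFF  # higher-order 32 bits -> Compressor 1
    return lower_32, upper_32

assert split_input_word(0x11223344_AABBCCDD) == (0xAABBCCDD, 0x11223344)
```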
OUTPUT ROUTING
The lengths of the compressed data output blocks from an array of parallel compressors will generally not be constant, owing to the variability of redundancy in the data. Since the decompression system would not otherwise know the boundaries of each block, this data cannot be sent directly to the output bus; additional manipulation is needed to guarantee that the original data can be recovered. This can be achieved by the following methods:
1. Single compressed block
2. Multiple compressed block

SINGLE COMPRESSED BLOCK
In this method, it is assumed that the data enters the system using the blocked mode technique and that the compressed data are collected in the compressors' output buffers. The buffer outputs are routed in strict order of compressor number, and a boundary tag containing the block length is added so that it precedes the data. Since the tag enters the decompression system first, the decompressor knows the length of the compressed data belonging to each decompression engine. The introduction of tags is detrimental to the compression ratio, but this effect diminishes as the block length is increased, since the overhead of one tag per block of compressed data is largely constant.
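A sketch of this tagging scheme, under the assumptions that each compressor's output buffer holds one whole compressed block and that the boundary tag is simply a 4-byte block length; both the tag format and the helper name tag_and_concatenate are illustrative, not taken from the design.

```python
def tag_and_concatenate(compressed_blocks):
    """Emit blocks in strict compressor order, each preceded by a length tag."""
    out = bytearray()
    for block in compressed_blocks:              # index = compressor number
        out += len(block).to_bytes(4, "big")     # boundary tag, read first by the decompressor
        out += block
    return bytes(out)

# The fixed tag overhead per block is why longer blocks dilute its cost.
stream = tag_and_concatenate([b"\x01" * 100, b"\x02" * 37])
assert len(stream) == 100 + 37 + 2 * 4
```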
One drawback of this approach is that the data output may contain idle time. This arises because a whole block of data must be compressed before the appropriate tag values can be determined, so a compressor may still be compressing its data when the router becomes available.

MULTIPLE COMPRESSED BLOCK
The figure illustrates the interleaved approach for routing multiple compressed blocks of data to the output stream. Instead of waiting for a whole block to be compressed, a predefined fixed length of compressed data is always sent to the output. If a compressor has not yet produced this fixed amount of data, the system must wait until it has been produced.
There are two benefits of this approach compared with the previously discussed method. First, latency is reduced, since data can be sent to the output before the whole block has been compressed. Second, since no boundary tags are required, the compression ratio is improved. The interleaved approach does, however, need to add dummy data at the end of a compression sequence: on receipt of the stop signal, output routing continues until all compressors have completed operations on their input blocks, and the final interleaved block from each compressor is likely to contain insufficient data to fill the required fixed output length, so dummy data tags are added as required to maintain the interleave length.

PROPOSED OUTPUT ROUTING
In this project, the interleaved technique was selected as the output routing method because it imposes no overhead to maintain compressed data boundaries and therefore has no detrimental effect on the compression ratio. A further advantage of this method is that the design and coding complexity is reduced. At the same time, the number of output pins increases, because another set of pins is assigned to the second XMatchPro compressor. The output of one 32-bit compressor is either 7 bits wide (when a match is found) or 33 bits wide (when no match is found), so a second set of 7-bit and 33-bit outputs is required for the second compressor. Accordingly, the parallel compressor is designed with two sets of 7-bit and two sets of 33-bit output pins. Both compressors can thus output data simultaneously, and the compressed data is transmitted via the external bus.
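A sketch of the selected output routing, assuming a fixed interleave length and a dummy fill byte; the helper interleave_output and both constants CHUNK_BYTES and DUMMY are assumptions made for illustration.

```python
CHUNK_BYTES = 16   # assumed fixed interleave length
DUMMY = b"\x00"    # assumed dummy fill value

def interleave_output(compressed_streams):
    """Send fixed-length chunks from each compressor to the output in turn,
    padding short final chunks with dummy data to keep the interleave length."""
    out = bytearray()
    offsets = [0] * len(compressed_streams)
    while any(off < len(s) for off, s in zip(offsets, compressed_streams)):
        for i, stream in enumerate(compressed_streams):
            chunk = stream[offsets[i]:offsets[i] + CHUNK_BYTES]
            offsets[i] += CHUNK_BYTES
            out += chunk.ljust(CHUNK_BYTES, DUMMY)   # dummy data maintains the interleave length
    return bytes(out)
```

Because every interleaved chunk has the same length, the decompression side can locate each compressor's data by position alone, which is why no boundary tags are needed in this scheme.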