WO2007050018A1 - Method and system for compressing data - Google Patents

Method and system for compressing data

Publication number
WO2007050018A1
Authority
WO
WIPO (PCT)
Prior art keywords
elements
tables
sequence
subsequence
length
Prior art date
Application number
PCT/SE2006/001198
Other languages
French (fr)
Other versions
WO2007050018A8 (en)
Inventor
Anders Holtsberg
Original Assignee
Algo Trim Ab
Priority date
Filing date
Publication date
Application filed by Algo Trim Ab
Priority to EP06799795A (EP1941617A4)
Publication of WO2007050018A1
Publication of WO2007050018A8

Classifications

    • H: ELECTRICITY
    • H03: ELECTRONIC CIRCUITRY
    • H03M: CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00: Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30: Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084: Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
    • H03M7/3086: Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method employing a sliding window, e.g. LZ77
    • H03M7/3088: Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method employing the use of a dictionary, e.g. LZ78
    • H03M7/40: Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
    • H03M7/42: Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code using table look-up for the coding or decoding process, e.g. using read-only memory

Definitions

  • the present invention relates to a lossless compression method which allows for block-wise random access of the compressed data.
  • the memory size of the firmware (pre-installed software) in embedded computer devices is rapidly increasing.
  • the memory unit is one of the most expensive components, greatly affecting the bill of material.
  • it is important to reduce the size of the firmware in such devices.
  • the firmware in embedded devices consists of machine code, that is, the actual microprocessor instructions. Embedded into the code is also constant, literal data, i.e. data that remains unchanged (the value of literal data is written at compile-time and read-only at run time), such as integers and strings, which typically occupies a few percent of the total code size.
  • Firmware is normally stored in non-volatile memory such as NOR or NAND flash. There is a trend to replace NOR flash with NAND flash since the latter memory type is several times cheaper. The drawback with NAND flash is that it is a block-based memory, like a hard disk, which means that it can only be read and written in blocks, typically 512 bytes or 2 kilobytes in size.
  • the microprocessor in an embedded device is usually a RISC (Reduced Instruction Set Computer) processor.
  • This type of processor is characterized by having a relatively small number of instructions of fixed length, typically 16 or 32 bits.
  • CISC (Complex Instruction Set Computer) processors are normally found in personal computers and have a much larger instruction set.
  • An advantage of RISC processors is that they allow for a simpler and cheaper design, but a disadvantage is that more instructions are required to accomplish a given function, compared to CISC processors.
  • the relatively low code density of RISC processors is known by embedded system manufacturers and they have addressed this concern in various ways.
  • IBM has developed the CodePack code compression technique for the 32-bit RISC PowerPC processors.
  • compression is performed by a software utility that creates compressed application code images from standard PowerPC executable files.
  • Decompression is performed by an ASIC core that is placed between the processor and the memory controller. The core decompresses instructions on-the-fly as needed by the processor.
  • the technique is described in the IBM Technical White Paper "CodePack: Code Compression for PowerPC Processors" by M. Game and A. Booker. Compression is done by splitting each 32-bit instruction into two 16-bit halves. Each half is then substituted with a variable-length (entropy) code, representing an index into one of two decode tables, holding 512 16-bit entries each.
  • An element is normally an 8-bit byte but the method easily generalizes to elements of other lengths.
  • An element is either encoded as a literal byte, i.e. data that is represented "as is" in the compressed data, or as a backward reference to a repeated element sequence.
  • the reference is specified by the length, in bytes, of the repeated sequence and by the backward distance, also in bytes, to the most recent repeated sequence.
  • the maximal distance allowed by the encoding scheme defines a "window" which defines how far back, relative to the current element, a reference can be made.
  • a popular and patent-free version of the LZ77 method is the DEFLATE algorithm, specified in the RFC (Request for Comments) 1951 internet standard and implemented in software as the zlib library (www.zlib.net).
  • This algorithm specifies how to encode literal bytes and backward, length-distance references as uniquely decodeable and variable length bit sequences. It further employs Huffman entropy coding of the literals and length references to maximize the compression ratio.
  • the DEFLATE algorithm's encoding method therefore uniquely defines the decompression method. Decompression interprets the compressed data as a bit stream and outputs either literal bytes or a sequence of bytes which is a repetition occurring before the most recent encoded byte.
  • Although the LZ77 method performs well on many types of data, it has the drawback of not allowing random access to different parts of the compressed data. To decompress a certain part, all data before that part has to be decompressed first.
  • One way to achieve random access in LZ77 to individual blocks, in a block based memory, is to compress each block separately, concatenate the compressed data of all blocks, and store pointers to each compressed block in a table. But the problem with this approach is that the compression ratio decreases as the block size gets smaller.
  • Another drawback of the LZ77 method is that it is not optimal for compressing code consisting of 16- and 32-bit RISC instructions since they have rather different characteristics than, for example, text files. This also explains, partly, why IBM has chosen a different approach in their CodePack technique.
  • IBM has also disclosed US Patent 5,001,478, which is a method for encoding compressed data.
  • the compression method operates by transforming the input data into a sequence of history, lexicon, and literal references.
  • the history references are of the same type as the backward length-distance references in the LZ77 method.
  • the lexicon references are also history-type references, but refer to a string in a lexicon which works like a buffer, holding the most recent history references. In this way, the compression ratio is increased since lexicon references require shorter binary codes.
  • the object of the invention is achieved by a novel method for lossless data compression of an input sequence of elements. The method comprises the steps of transforming the input sequence of data elements into a processed sequence of symbols, each symbol representing one of a literal element, a backward reference to a previously occurring subsequence of elements, or a table reference to a subsequence in one or more tables of frequently occurring subsequences of elements.
  • the method further includes the step of encoding the processed sequence of symbols into a uniquely decodeable bit stream for compressing the original sequence.
  • the bit stream is stored in storage means.
  • the one or more tables of frequently occurring subsequences of elements is a pre-defined table or a predefined set of tables. (In the following description and claims a pre-defined table or pre-defined set of tables is defined as a table or set of tables wherein the content of the tables is pre-defined).
  • the method may include a step of building the one or more tables of frequently occurring subsequences of elements as a pre-processing step. This has the advantage that an efficient lossless compression method is obtained in a simple manner.
  • the tables may be adapted for the specific data to be compressed, and the step of building the tables can be carried out every time a new set of data is to be compressed.
  • the tables may be given beforehand, as the same set of tables may be applicable on different sequences of elements.
  • the tables may be built once for a particular type of data sets and then re-used each time a particular set of data of that type is to be compressed.
  • Using pre-built tables has the advantage that, e.g. in case of transmission of the bit stream to a receiver via a wireless interface, the receiver can be provided with the tables beforehand, which reduces the amount of data to be transmitted.
  • the method may further include the step of building a table of bit positions referring to the starts of each compressed block in the bit stream.
  • the compression method of the present invention operates on the original data as a block-partitioned sequence of elements, transforms the elements into a sequence of symbols, and outputs a bit stream, possibly together with a set of tables.
  • the corresponding decompression method operates on the bit stream, decodes it into a sequence of symbols, and outputs the original data as a sequence of elements.
  • the decompression method also needs access to the tables produced during the compression phase in order to identify the beginning of each block and in order to transform the table reference symbols into the corresponding subsequences. Either the tables containing the subsequences are pre-stored in the decoding device, or alternatively, the tables are included in the output bit stream.
  • Compression means that fewer bits, or bytes, are needed to represent the compressed data than the original data, and lossless means that the original data can be reconstructed exactly from the compressed data. Compression is achieved only if the sum of the size of the bit stream and the size of all the tables is less than the size of the original data.
  • the pre-processing step of building one or more tables of commonly occurring subsequences and referencing to them comprises a method to address this limitation and results in a higher compression ratio.
  • Commonly, shorter subsequences are repeated more frequently than longer ones.
  • the tables therefore contain short and frequent subsequences of elements.
  • one or two tables are built in the pre-processing step, one table consisting of the most common pairs, or bigrams, of elements and/or one table consisting of the most common subsequences of three elements, or trigrams, respectively.
  • the present invention is not limited to this particular choice of tables.
  • the compression method processes the original data as a sequence of elements.
  • An element is usually understood to mean an 8-bit byte, but the present invention is not limited to an element being a byte and it would certainly be possible to let an element belong to any finite alphabet. In the sequel, the terms element and byte are therefore used interchangeably.
  • the original data is further assumed to be divided into blocks whose sizes typically range from 512 bytes to 4 kilobytes. These block sizes do not have to be equal to the physical block size of the block-based memory, or storage device. In most applications, the blocks are all of the same size but this is no limitation of the present invention and they could even be of variable length.
  • the method first transforms this sequence into a new representation which consists of a sequence of literal bytes, length-distance backward references, or table-index references.
  • the length-distance references are of the same type as in the LZ77 method, that is, references to sequences of bytes that have occurred previously in the original byte sequence. If there are several repeated subsequences found, the one with the longest length is chosen and, if there are two or more repetitions of the same length, the one with the shortest backward distance is chosen.
  • the lengths are limited to be at least three bytes, with a maximum length defined by the number of bits used to encode the lengths. To allow for random access to individual blocks, the distance references cannot refer beyond the most recent block boundary.
  • the table-index references also refer to sequences of bytes. However, these sequences are stored in one or more tables.
  • the first parameter in the ordered pair table-index is a number which specifies in which table the sequence is found and the second parameter is an index number which uniquely identifies the sequence in this particular table.
  • one table is built containing only two-byte sequences and one table is built containing only three-byte sequences, but this is no limitation of the present invention. If each table only contains sequences of a fixed length, less memory is needed to represent the tables.
  • a fixed set of tables is used that is optimized for a particular type of data, for example, machine code for a particular RISC processor.
  • the firmware, or ROM image, from a number of different embedded devices with the same processor could be used to find the most common two- and three-byte sequences. The same set of tables could then be used when compressing different sets of data without having to go through the pre-processing step.
  • Fig. 1 is a flow chart depicting an exemplary compression method of the present invention.
  • Fig. 2 schematically illustrates the exemplary compression process, resulting in a compressed bitstream and a set of tables, and the decompression process for decompressing an arbitrary block, given the bit stream and the tables.
  • Fig. 3 is a table showing an exemplary encoding of a sample text string.
  • the memory size of the software in computer devices is rapidly increasing.
  • the memory unit is one of the most expensive components, and, accordingly, it is important to reduce the size of the software (in particular firmware) in such devices.
  • firmware is normally stored in non-volatile memory such as NOR or NAND flash.
  • the present invention provides a data compression method that is particularly suitable for use when compressing firmware of such devices.
  • the present invention is applicable on any type of data in any application wherein a lossless compression of the data is desired.
  • A flow chart of an exemplary compression method according to the present invention is depicted in Fig. 1.
  • the data to be compressed will be processed as a sequence of bytes but it should be understood that elements from any finite alphabet could be used as well, and the present invention is not limited to the case of an element being a byte.
  • When the compression process starts, it is assumed that the original data to be compressed and the block boundaries are given. It is further assumed that the table of bit positions, which at the end of the process refers to the starts of each compressed block, and the bitstream, which at the end of the process contains the compressed data, have been initialized.
  • the initial step of the compression method consists of building two- and three-byte tables, as shown in step 100.
  • These tables should contain frequently occurring subsequences in the original data.
  • one table is defined to consist of the 512 most common two-byte sequences found in the original data and one table to consist of the 512 most common three-byte sequences.
  • the size of the two tables adds to 2,560 bytes.
  • the sort and count methods needed to determine the most common subsequences are familiar to those skilled in the art of the invention and will therefore not be discussed more in detail.
  • the choice of the number of tables and their content and sizes is not limited to these particular values and is no limitation of the present invention.
  • the transformation of the original byte sequence, given the tables, into a sequence of literal bytes, table or backward references now proceeds as follows.
  • the original data is processed block-by-block and as a sequence of bytes within each block.
  • the current block is set to the first block, step 101, and then the current byte pointer is set to point to the first byte of the current block to be processed, step 102.
  • the bytes in each block are then transformed as follows. First, a search is made to find the longest repetition within the current block and before the current byte pointer, step 103. If a backward reference of length at least four bytes is found (test condition 104), the corresponding length-distance reference symbol is output, step 105. At the beginning of each block it is, of course, not possible to find a length-distance backward reference, meaning that condition 104 leads to step 106. If a backward reference of length at least four bytes is not found, a search for the next three bytes is also made in the three-byte table, step 106.
  • If both a backward reference and a three-byte table match are available, the symbol with the shortest bit length is output, that is, either a length-distance or a table-index symbol.
  • the binary encoding of these symbols is defined later.
  • the three-byte table symbol will always have the shortest encoding and will therefore always be output. This assumption is made in the flow chart of fig. 1, but this is no limitation of the present invention. Therefore, in this embodiment, if a match is found (test condition 107) in the three-byte table, the corresponding three-byte table symbol is directly output (step 108).
  • Otherwise, a length-distance reference symbol is output (step 110) if the length is equal to three (test condition 109). If no backward reference or three-byte table reference can be found, the next two bytes are searched for in the two-byte table, as shown in step 111. If they are found in the table, the corresponding table-index symbol is output, step 113. Finally, if no backward or table references can be found, a literal byte symbol is output, step 114, and the byte pointer is moved to point to the next byte.
  • In step 115 of fig. 1, the prefix and parameter encoding of each symbol is preferably appended to the bit stream after each symbol has been generated and not after all the symbols have been generated for all of the data to be compressed.
  • the step corresponding to step 115 also incorporates the forward movement of the current byte pointer.
  • Alternatively, the bit stream is generated after all of the symbols have been generated.
  • test condition 116 determines if all bytes in the original data have been compressed. If this is the case, test condition 116 is true and the last entry of the table of bit pointers is set to refer to one bit past the last bit of the bitstream, step 117, and the compression process is completed. If there are more bytes to be compressed, another test (118) is made to determine whether the current byte is at the beginning of a new block. If it is, the table of bit pointers is updated with a new entry referring to the first bit of the next block, and the current block is moved one step forward, step 119. Then step 102 is repeated, that is, the current byte pointer is set to the first byte of the new current block. The remaining bytes within each block are then transformed in the same way until no bytes remain, starting at step 103. The step corresponding to step 103 is also carried out if test condition 118 fails, that is, the current byte is not at the start of the next block.
  • the next step of the method according to the invention consists of encoding the symbol sequence of all the blocks into a bit stream.
  • the encoding step also uniquely defines the decoding step of transforming the bit stream into a sequence of symbols.
  • the whole decompression process is defined by transforming the sequence of symbols into the original sequence of bytes, or elements.
  • the symbol type and table index numbers are encoded together in a two-bit prefix code.
  • a literal byte symbol is encoded as binary "00", a length-distance backward reference symbol as binary "01", a two-byte table symbol as binary "10", and a three-byte table symbol as binary "11".
  • the other parameters of each symbol are encoded into binary form.
  • the binary encoding of a symbol and its parameters are appended to the bit stream after each symbol, and not after the complete element sequence has been transformed into a sequence of symbols, as depicted in the flow chart in fig. 1 and in particular block 115.
  • a literal byte symbol is encoded with another 8 bits. These 8 bits simply constitute the binary representation of the byte.
  • the parameters of the length-distance reference symbols are encoded as follows. First a fixed number of bits following the prefix code are used for encoding the distance.
  • the distance is advantageously encoded with the same number of bits that are needed to represent the current block length. For example, in case the block length is 512 bytes, 9 bits are used to code the distance. It has further been found that the maximum length allowed by encoding the length with 5 bits works well. In a preferred embodiment, the minimum length is 3 bytes, which means that these 5 bits can encode lengths from 3 through 34 bytes. With these choices, the length-distance references are encoded with a total of 16 bits, including the prefix code.
  • the table index number is encoded with a fixed number of bits. In case both the two-and three-byte tables contain 512 entries each, 9 bits are used to encode the table index.
  • Alternatively, the table index numbers are represented with Huffman coding, resulting in a variable number of bits and, on average, fewer bits than with a fixed code length.
  • Huffman codes are preferably built as part of the pre-processing step when building the tables. The count of each subsequence is then used to compute the Huffman code, but all these counts do not have to be stored in the tables. There are more memory efficient methods of representing these Huffman codes, but these methods are familiar to those skilled in the art of the invention and will therefore not be described herein. In this embodiment, the Huffman code must also be made available to the decoder.
  • the step of encoding the symbols into a bit stream is completed by concatenating the bit stream of each compressed block into one single bit stream, representing all the compressed blocks.
  • the generated bit stream is stored in storage means, such as NAND or NOR flash or any other type of volatile or non-volatile memory units. If the bit stream is intended to be transmitted to another device, e.g. by a wireless transmission, the bit stream may be stored in storage means in form of a buffer prior to transmission.
  • a table of bit positions referring to the start of each compressed block in the bit stream is built.
  • this table is built incrementally as soon as the encoding of the current block is completed, as shown in steps 118 and 119.
  • a 32-bit integer number is used to represent the bit position of the start, or first bit, of each compressed block.
  • the table is then represented as an array of integers.
  • the bit position of the first compressed block will always be zero, so this position is not stored in the array, or table.
  • one plus the position of the last bit of the last block is recorded in the last element of the array. In this way the length of the array is the same as the number of blocks, and the first and last bit position of each compressed block can be computed from the array. This information is needed during decompression to determine where to start and end the decoding of each block in the bit stream.
  • Schematic diagrams of the compression and decompression processes are depicted in fig. 2.
  • the upper diagram shows the compression process which takes as input a sequence of n blocks.
  • the result of the compression is the compressed data, or the bitstream, and a set of tables.
  • the set of tables includes the table of bit pointers and, in an exemplary embodiment, a two- and a three-byte table.
  • the lower diagram shows the decompression process. Given the bitstream and the set of tables, it can decompress any block k between 1 and n.
  • the example sequence is the text string "this is miss", that is, the elements are characters which are represented with bytes, encoded with the ASCII standard.
  • It is assumed that this string is contained in one block and that the current byte pointer is located at the initial character "t".
  • the two-byte table contains the strings "th" and "is", with indices 0 and 1, respectively.
  • the three-byte table contains the string "mis" whose index is 5.
  • the first symbol encodes the initial string "th" as a reference to the corresponding string in the two-byte table, under the assumption that the string "thi" is not in the three-byte table.
  • the second symbol encodes the string "is", also with a reference to the two-byte table. Similarly, this is under the assumption that the string "is " is not contained in the three-byte table.
  • the third symbol encodes the following space as a literal byte.
  • the fourth symbol encodes the string "is " as a backward reference of length 3 and distance 3.
  • the fifth symbol encodes the string "mis" as a reference to the three-byte table.
  • the sixth symbol encodes the single letter "s" as a literal byte.
  • the table in fig. 3 displays the encoded symbols and their type, binary encoding, and the binary length, respectively.
  • the spaces in the binary strings are inserted just for illustrative purposes.
  • the two literals, the space and the letter "s", are encoded into binary form using their ASCII codes, decimal 32 and 115, respectively.
  • the length-distance backward reference encodes the distance 3 as the binary form of decimal 2, that is, the distance minus one, since the minimum distance is 1.
  • the corresponding length 3 is encoded as the binary form of decimal 0, that is, the length minus 3, since, in this example, the minimum length is 3.
  • This particular sequence is therefore encoded with a total of 69 bits, which should be compared to the 96 bits (12 times 8 bits) needed to encode the original string.
  • the present invention further relates to a device for compressing data.
  • the device comprises means for transforming the original sequence of data elements into a new sequence of symbols representing one of a literal element, a backward reference to a previously occurring subsequence of elements, or a table reference to a subsequence in one of the tables.
  • the device also comprises means for encoding the new sequence of symbols into a uniquely decodeable bit stream for compressing the original sequence, and means for building a table of bit positions referring to the start of each compressed block in the bit stream.
  • the device may also comprise means for decompressing arbitrary blocks, given the bitstream and tables. Further, the device may comprise means for building one or more tables of frequently occurring subsequences of elements.
  • the present invention further relates to a device for decoding a set of symbols into a sequence of elements given a set of tables of subsequences of elements. This device comprises means for decoding the symbols into one of a literal element, a backward reference to a previously occurring subsequence of elements, or a table reference to a subsequence in one of the given tables, and means for transforming the decoded symbols into the original sequence of elements.
  • the above description of the compression and decompression process may, as stated above, be applied, for example, to compress part of the firmware, such as all applications and/or the operating system, in a mobile device.
  • the firmware might be stored in NAND flash memory and loaded to RAM by the boot code at system startup.
  • decompression is initiated by the boot code.
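The exemplary embodiment described in the definitions above (symbol selection in the order of Fig. 1, two-bit prefixes, 9-bit distance, 5-bit length, 9-bit table index) can be sketched in Python as follows. This is a minimal single-block sketch under stated assumptions, not the patented implementation: the function names, the bit stream held as a text string of "0" and "1", the brute-force match search, and the choice to let a match overlap the current position are all illustrative. The tables are the ones assumed for the sample string "this is miss".

```python
# Field widths follow the exemplary embodiment: 2-bit prefix, 8-bit literal,
# 9-bit distance (512-byte block), 5-bit length, 9-bit table index.
PREFIX = {"lit": "00", "ref": "01", "t2": "10", "t3": "11"}
MAX_LEN = 34  # 5 bits encode lengths 3..34


def longest_match(block, pos):
    """Longest earlier repetition of block[pos:]; the nearest one wins ties."""
    best_len, best_dist = 0, 0
    for start in range(pos - 1, -1, -1):  # nearest candidates first
        n = 0
        while (n < MAX_LEN and pos + n < len(block)
               and block[start + n] == block[pos + n]):
            n += 1
        if n > best_len:
            best_len, best_dist = n, pos - start
    return best_len, best_dist


def tokenize(block, table2, table3):
    """Pick symbols in the order of the Fig. 1 flow chart (assumed reading)."""
    symbols, pos = [], 0
    while pos < len(block):
        length, dist = longest_match(block, pos)
        if length >= 4:
            symbols.append(("ref", dist, length))
            pos += length
        elif block[pos:pos + 3] in table3:
            symbols.append(("t3", table3[block[pos:pos + 3]]))
            pos += 3
        elif length == 3:
            symbols.append(("ref", dist, 3))
            pos += 3
        elif block[pos:pos + 2] in table2:
            symbols.append(("t2", table2[block[pos:pos + 2]]))
            pos += 2
        else:
            symbols.append(("lit", block[pos]))
            pos += 1
    return symbols


def encode(symbols):
    """Append prefix and fixed-width parameters for each symbol."""
    out = []
    for sym in symbols:
        if sym[0] == "lit":
            body = format(sym[1], "08b")
        elif sym[0] == "ref":  # distance minus 1, then length minus 3
            body = format(sym[1] - 1, "09b") + format(sym[2] - 3, "05b")
        else:
            body = format(sym[1], "09b")
        out.append(PREFIX[sym[0]] + body)
    return "".join(out)


def decode(bits, table2, table3):
    """Inverse of encode(): rebuild the block from the bit string."""
    inv2 = {i: s for s, i in table2.items()}
    inv3 = {i: s for s, i in table3.items()}
    out, p = bytearray(), 0
    while p < len(bits):
        prefix, p = bits[p:p + 2], p + 2
        if prefix == "00":
            out.append(int(bits[p:p + 8], 2))
            p += 8
        elif prefix == "01":
            dist = int(bits[p:p + 9], 2) + 1
            length = int(bits[p + 9:p + 14], 2) + 3
            p += 14
            for _ in range(length):  # byte-wise copy allows overlap
                out.append(out[-dist])
        else:
            table = inv2 if prefix == "10" else inv3
            out += table[int(bits[p:p + 9], 2)]
            p += 9
    return bytes(out)


table2 = {b"th": 0, b"is": 1}
table3 = {b"mis": 5}
bits = encode(tokenize(b"this is miss", table2, table3))
print(len(bits))                     # 69
print(decode(bits, table2, table3))  # b'this is miss'
```

Run on the sample string, the sketch produces the six symbols listed in the example (two two-byte table references, a literal space, a backward reference of length 3 and distance 3, a three-byte table reference, and a literal "s") and a 69-bit stream, matching the total stated for Fig. 3.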
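The pre-processing step that builds the tables of the most common two-byte and three-byte sequences can be sketched with a plain n-gram count. The helper name `build_table`, the use of `collections.Counter`, and the decision to count overlapping n-grams across the whole input are assumptions for illustration; the description leaves the sort and count methods to the reader.

```python
from collections import Counter


def build_table(data: bytes, n: int, size: int = 512) -> dict:
    """Map the `size` most frequent n-byte sequences in `data` to indices."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    ranked = counts.most_common(size)  # most frequent first
    return {seq: idx for idx, (seq, _count) in enumerate(ranked)}


sample = b"this is miss, this is miss"
table2 = build_table(sample, 2)  # bigram table
table3 = build_table(sample, 3)  # trigram table

# With two full 512-entry tables the storage is 512*2 + 512*3 = 2,560 bytes.
```

Because each table holds sequences of one fixed length, an index alone identifies an entry and no per-entry length field is needed, which is the memory saving the description mentions.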
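The table of bit positions can be sketched as follows. As described above, the start of the first block (always zero) is not stored and the last entry is one past the final bit, so the array has exactly one entry per block. The function names, the 0-based block numbering, and the use of "0"/"1" strings for the per-block bit streams are illustrative assumptions.

```python
def concat_blocks(block_bits):
    """Join per-block bit strings and record where each following block starts."""
    positions, total = [], 0
    for bits in block_bits:
        total += len(bits)
        positions.append(total)  # start of the next block, or one past the end
    return "".join(block_bits), positions


def block_span(positions, k):
    """Return (first bit, one past last bit) of block k, counting from 0."""
    start = 0 if k == 0 else positions[k - 1]
    return start, positions[k]


stream, positions = concat_blocks(["10110", "0011", "111000"])
start, end = block_span(positions, 1)
print(positions)          # [5, 9, 15]
print(stream[start:end])  # 0011
```

A decoder that wants only block k reads the two neighbouring entries and decodes the bits between them, without touching any other block.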

Abstract

The invention concerns a novel method for lossless data compression of a sequence of elements. The method comprises the steps of transforming the original sequence of data elements into a new sequence of symbols representing one of a literal element, a backward reference to a previously occurring subsequence of elements, or a table reference to a subsequence in one or more tables of frequently occurring subsequences of elements, and encoding the new sequence of symbols into a uniquely decodeable bit stream for compressing the original sequence.

Description

METHOD AND SYSTEM FOR COMPRESSING DATA
FIELD OF THE INVENTION
The present invention relates to a lossless compression method which allows for block-wise random access of the compressed data.
BACKGROUND OF THE INVENTION
The memory size of the firmware (pre-installed software) in embedded computer devices is rapidly increasing. In embedded computer devices, such as mobile phones, the memory unit is one of the most expensive components, greatly affecting the bill of material. Thus, it is important to reduce the size of the firmware in such devices.
The firmware in embedded devices consists of machine code, that is, the actual microprocessor instructions. Embedded into the code is also constant, literal data, i.e. data that remains unchanged (the value of literal data is written at compile-time and read-only at run time), such as integers and strings, which typically occupies a few percent of the total code size. Firmware is normally stored in non-volatile memory such as NOR or NAND flash. There is a trend to replace NOR flash with NAND flash since the latter memory type is several times cheaper. The drawback with NAND flash is that it is a block-based memory, like a hard disk, which means that it can only be read and written in blocks, typically 512 bytes or 2 kilobytes in size. This means that code held in block-based memories has to be loaded on demand into RAM memory before it can be executed. If the firmware is stored in compressed form in a block-based memory, it is important that blocks can be decompressed individually (randomly accessed) to facilitate the integration of decompression into the memory management system of the device.
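One straightforward way to get such block-wise random access with an off-the-shelf compressor is to compress every block independently and keep a table of offsets into the concatenated result. The sketch below uses Python's standard `zlib` module purely as a stand-in compressor; the function names and the 512-byte block size are illustrative assumptions, not part of the invention.

```python
import zlib


def compress_blocks(data: bytes, block_size: int = 512):
    """Compress each block on its own and record byte offsets of the results."""
    chunks = [zlib.compress(data[i:i + block_size])
              for i in range(0, len(data), block_size)]
    offsets, total = [0], 0
    for chunk in chunks:
        total += len(chunk)
        offsets.append(total)  # offsets[k]..offsets[k+1] delimits block k
    return b"".join(chunks), offsets


def read_block(packed: bytes, offsets, k: int) -> bytes:
    """Decompress only block k, without touching the other blocks."""
    return zlib.decompress(packed[offsets[k]:offsets[k + 1]])


data = bytes(range(256)) * 8  # 2048 bytes, four 512-byte blocks
packed, offsets = compress_blocks(data)
assert read_block(packed, offsets, 2) == data[1024:1536]
```

The cost of this approach is that every block starts with no history to refer back to, so the compression ratio tends to drop as the block size gets smaller.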
The microprocessor in an embedded device is usually a RISC (Reduced Instruction Set Computer) processor. This type of processor is characterized by having a relatively small number of instructions of fixed length, typically 16- or 32-bits. CISC (Complex Instruction Set Computer) processors are normally found in personal computers and have a much larger instruction set. An advantage of RISC processors is that they allow for a simpler and cheaper design, but a disadvantage is that more instructions are required to accomplish a given function, compared to CISC processors. The relatively low code density of RISC processors is known by embedded system manufacturers and they have addressed this concern in various ways.
ARM Limited, of Cambridge, England, has developed the 16-bit Thumb instruction set as an alternative, denser variant of its standard 32-bit ARM instruction set. This is a purely hardware-based code compression method and it is disclosed in US Patent 5,568,646.
IBM has developed the CodePack code compression technique for the 32-bit RISC PowerPC processors. Here, compression is performed by a software utility that creates compressed application code images from standard PowerPC executable files. Decompression is performed by an ASIC core that is placed between the processor and the memory controller. The core decompresses instructions on-the-fly as needed by the processor. The technique is described in the IBM Technical White Paper "CodePack: Code Compression for PowerPC Processors" by M. Game and A. Booker. Compression is done by splitting each 32-bit instruction into two 16-bit halves. Each half is then substituted with a variable-length (entropy) code, representing an index into one of two decode tables, holding 512 16-bit entries each.
Microsoft has also developed code, or program binary, compression methods in which both compression and decompression are performed in software. One such method is disclosed in US Patent 6,907,516. This method splits the program binary into several sub-streams that have high local sequential correlation. These sub-streams are then compressed with PPM (Prediction by Partial Matching). The method is targeted at the x86, CISC-type architecture whose instructions have variable length.
One of the most popular lossless compression methods is the LZ77 method based on the 1977 IEEE paper "A Universal Algorithm for Sequential Data Compression" by A. Lempel and J. Ziv. It compresses data by replacing repeated element sequences with backward references to the most recent occurrence of a particular repeated sequence. An element is normally an 8-bit byte but the method easily generalizes to elements of other lengths. During compression the original data is scanned element by element from start to end. An element is either encoded as a literal byte, i.e. data that is represented "as is" in the compressed data, or as a backward reference to a repeated element sequence. The reference is specified by the length, in bytes, of the repeated sequence and by the backward distance, also in bytes, to the most recent repeated sequence. The maximal distance allowed by the encoding scheme defines a "window" which defines how far back, relative to the current element, a reference can be made.
A popular and patent-free version of the LZ77 method is the DEFLATE algorithm, specified in the RFC (Request for Comments) 1951 internet standard and implemented in software as the zlib library (www.zlib.net). This algorithm specifies how to encode literal bytes and backward, length-distance references as uniquely decodeable and variable length bit sequences. It further employs Huffman entropy coding of the literals and length references to maximize the compression ratio. The DEFLATE algorithm's encoding method therefore uniquely defines the decompression method. Decompression interprets the compressed data as a bit stream and outputs either literal bytes or a sequence of bytes which is a repetition occurring before the most recent encoded byte.
Although the LZ77 method performs well on many types of data it has the drawback of not allowing random access to different parts of the compressed data. For decompressing a certain part, all data before that part has to be decompressed first. One way to achieve random access in LZ77 to individual blocks, in a block based memory, is to compress each block separately, concatenate the compressed data of all blocks, and store pointers to each compressed block in a table. But the problem with this approach is that the compression ratio decreases as the block size gets smaller. Another drawback of the LZ77 method is that it is not optimal for compressing code consisting of 16- and 32-bit RISC instructions since they have rather different characteristics than, for example, text files. This also explains, partly, why IBM has chosen a different approach in their CodePack technique.
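The block-wise workaround described above can be illustrated with a minimal Python sketch. It uses the standard zlib module as a stand-in DEFLATE implementation; the function names compress_blocks and read_block, the byte-offset pointer table, and the synthetic input data are assumptions made purely for illustration and are not part of the disclosed method.

```python
import zlib

def compress_blocks(data: bytes, block_size: int = 512):
    """Compress each block independently; return (blob, offsets)."""
    chunks = [zlib.compress(data[i:i + block_size], 9)
              for i in range(0, len(data), block_size)]
    offsets = [0]                      # byte offset of each compressed block
    for chunk in chunks:
        offsets.append(offsets[-1] + len(chunk))
    return b"".join(chunks), offsets

def read_block(blob: bytes, offsets, k: int) -> bytes:
    """Decompress block k (0-based) without touching any other block."""
    return zlib.decompress(blob[offsets[k]:offsets[k + 1]])

# Synthetic, highly repetitive input (hypothetical stand-in for firmware).
data = b"".join(b"mov r%d, r%d\n" % (i % 8, i % 5) for i in range(4000))
whole = len(zlib.compress(data, 9))
for size in (4096, 2048, 512):
    blob, offsets = compress_blocks(data, size)
    # Compare the block-wise size with the size of one single stream.
    print(size, len(blob), whole)
```

On such input the total compressed size grows as the block size shrinks, because every block starts with an empty history window; this is the loss that the tables of the present invention are meant to offset.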
IBM has also disclosed US Patent 5,001,478, which is a method for encoding compressed data. In this patent, the compression method operates by transforming the input data into a sequence of history, lexicon, and literal references. The history references are of the same type as the backward length-distance references in the LZ77 method. The lexicon references are also history-type references, but refer to a string in a lexicon which works like a buffer, holding the most recent history references. In this way, the compression ratio is increased since lexicon references require shorter binary codes.
However, since the memory unit is one of the most expensive components in many computer devices, there exists a need for a lossless compression method that compresses data in a more efficient manner than known solutions.
Further, there exists a need for an efficient lossless compression method that allows for block-wise random access of compressed data without reducing the compression ratio due to block partitioning of the data.
SUMMARY OF THE INVENTION
It is an object of the present invention to provide a lossless compression method that utilizes redundancy in short subsequences of elements to maximize the compression ratio. It is a further object of the present invention to provide a lossless compression method that allows for block-wise random access of compressed data without reducing the compression ratio due to block partitioning of the data.
The object of the invention is achieved by a novel method for lossless data compression of an input sequence of elements. The method comprises the steps of transforming the input sequence of data elements into a processed sequence of symbols, each symbol representing one of a literal element, a backward reference to a previously occurring subsequence of elements, or a table reference to a subsequence in one or more tables of frequently occurring subsequences of elements. The method further includes the step of encoding the processed sequence of symbols into a uniquely decodeable bit stream for compressing the original sequence. The bit stream is stored in storage means. The one or more tables of frequently occurring subsequences of elements is a pre-defined table or a pre-defined set of tables. (In the following description and claims, a pre-defined table or pre-defined set of tables is defined as a table or set of tables wherein the content of the tables is pre-defined.) The method may include a step of building the one or more tables of frequently occurring subsequences of elements as a pre-processing step. This has the advantage that an efficient lossless compression method is obtained in a simple manner.
Further, this has the advantage that the tables may be adapted for the specific data to be compressed, and the step of building the tables can be carried out every time a new set of data is to be compressed. Alternatively, the tables may be given beforehand, as the same set of tables may be applicable on different sequences of elements. Alternatively, the tables may be built once for a particular type of data sets and then re-used each time a particular set of data of that type is to be compressed.
Use of pre-built tables has the advantage that, e.g. in case of transmission of the bit stream to a receiver via a wireless interface, the receiver can be provided with the tables beforehand, which reduces the amount of data to be transmitted.
The method may further include the step of building a table of bit positions referring to the starts of each compressed block in the bit stream. This has the advantage that the method allows for block-wise random access of compressed data without reducing the compression ratio due to block partitioning of the data. In this case, the original sequence of elements is assumed to be partitioned into a sequence of blocks. These blocks can be of fixed length or contain a varying amount of elements.
The compression method of the present invention operates on the original data as a block-partitioned sequence of elements, transforms the elements into a sequence of symbols, and outputs a bit stream, possibly together with a set of tables. The corresponding decompression method operates on the bit stream, decodes it into a sequence of symbols, and outputs the original data as a sequence of elements. However, the decompression method also needs access to the tables produced during the compression phase in order to identify the beginning of each block and in order to transform the table reference symbols into the corresponding subsequences. Either the tables containing the subsequences are pre-stored in the decoding device, or alternatively, the tables are included in the output bit stream.
It is to be understood that compression means that fewer bits, or bytes, are needed to represent the compressed data than the original data, and that lossless means that the original data can be reconstructed exactly from the compressed data. Compression is achieved only if the sum of the size of the bit stream and the size of all the tables is less than the size of the original data.
The compression achieved by backward references to previously occurring subsequences is limited since references cannot be made beyond the most recent block boundary. Thus, if it is possible to reference commonly occurring subsequences the first time they appear within a block, the compression ratio can be increased. According to the invention, the pre-processing step of building one or more tables of commonly occurring subsequences, and referencing them, addresses this limitation and results in a higher compression ratio. Further, shorter subsequences are generally repeated more frequently than longer ones. Preferably, the tables therefore contain short and frequent subsequences of elements. In an embodiment, one or two tables are built in the pre-processing step, one table consisting of the most common pairs, or bigrams, of elements and/or one table consisting of the most common subsequences of three elements, or trigrams, respectively. However, the present invention is not limited to this particular choice of tables.
The compression method processes the original data as a sequence of elements. An element is usually understood to mean an 8-bit byte, but the present invention is not limited to an element being a byte and it would certainly be possible to let an element belong to any finite alphabet. In the sequel, the terms element and byte are therefore used interchangeably. The original data is further assumed to be divided into blocks whose sizes typically range from 512 bytes to 4 kilobytes. These block sizes do not have to be equal to the physical block size of the block-based memory, or storage device. In most applications, the blocks are all of the same size but this is no limitation of the present invention and they could even be of variable length.
The method first transforms this sequence into a new representation which consists of a sequence of literal bytes, length-distance backward references, or table-index references. The length-distance references are of the same type as in the LZ77 method, that is, references to sequences of bytes that have occurred previously in the original byte sequence. If there are several repeated subsequences found, the one with the longest length is chosen and, if there are two or more repetitions of the same length, the one with the shortest backward distance is chosen. Preferably, the lengths are limited to be at least three bytes, with a maximum length defined by the number of bits used to encode the lengths. To allow for random access to individual blocks, the distance references can not refer beyond the most recent block boundary. The table-index references also refer to sequences of bytes. However, these sequences are stored in one or more tables. The first parameter in the ordered pair table-index is a number which specifies in which table the sequence is found and the second parameter is an index number which uniquely identifies the sequence in this particular table. Preferably, one table is built containing only two-byte sequences and one table is built containing only three-byte sequences, but this is no limitation of the present invention. If each table only contains sequences of a fixed length, less memory is needed to represent the tables.
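The selection rule for backward references (longest match first, shortest distance on ties, never reaching outside the current block) can be sketched in Python as follows. The function name longest_match, the brute-force search, the default maximum length of 34 bytes (taken from the 5-bit length field of the preferred embodiment described later), and the decision to let a match overlap the current position as in classic LZ77 are all assumptions of this sketch rather than details given in the text.

```python
def longest_match(block: bytes, pos: int, min_len: int = 3, max_len: int = 34):
    """Return (length, distance) of the best earlier match for block[pos:].

    Only positions inside `block` are searched, so a reference can never
    reach beyond the most recent block boundary. Returns None if no match
    of at least `min_len` bytes exists.
    """
    best = None
    # Scan candidates from nearest to farthest: a farther candidate only
    # replaces the current best if it is strictly longer, so on equal
    # length the shortest backward distance wins.
    for start in range(pos - 1, -1, -1):
        length = 0
        while (length < max_len and pos + length < len(block)
               and block[start + length] == block[pos + length]):
            length += 1
        if length >= min_len and (best is None or length > best[0]):
            best = (length, pos - start)
    return best

print(longest_match(b"this is miss", 5))   # "is " repeats 3 bytes back
print(longest_match(b"abcXabcYabc", 8))    # two equal-length candidates
```

A production encoder would normally replace the quadratic scan with a hash chain or similar index; the brute-force loop is used here only to keep the tie-breaking rule visible.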
In another embodiment, a fixed set of tables that is optimized for a particular type of data, for example machine code for a particular RISC processor, is used. In this embodiment, the firmware, or ROM image, from a number of different embedded devices with the same processor could be used to find the most common two- and three-byte sequences. The same set of tables could then be used when compressing different sets of data without having to go through the pre-processing step.
These tables serve two purposes. First, rather frequently it is the case that there is no repeated sequence of bytes to be found within the current block, but it is possible to find a repeated sequence further back in the original data. This results in a decreased compression ratio. In the LZ77 method, for example, the maximum distance is normally defined as a "window" extending 32 kilobytes backwards, much longer than the typical block size. However, since the tables used in the present invention may contain the most common two- and three-byte sequences, it is often the case that a sequence can be found in the tables instead. Statistically, repetition of short sequences is more common than repetition of longer sequences, which motivates limiting the length of the sequences in the tables to two and three bytes, respectively. The second purpose of the tables is to further increase the compression ratio by entropy coding the index parameter with, for example, Huffman coding.
BRIEF DESCRIPTION OF DRAWINGS
Fig. 1 is a flow chart depicting an exemplary compression method of the present invention.
Fig. 2 schematically illustrates the exemplary compression process, resulting in a compressed bitstream and a set of tables, and the decompression process for decompressing an arbitrary block, given the bit stream and the tables.
Fig. 3 is a table showing an exemplary encoding of a sample text string.
DETAILED DESCRIPTION OF AN EXEMPLARY EMBODIMENT OF THE INVENTION
The present invention will now be described in more detail by referring to the appended drawings.
As stated above, the memory size of the software in computer devices is rapidly increasing. In embedded computer devices, such as mobile phones, the memory unit is one of the most expensive components, and, accordingly, it is important to reduce the size of the software (in particular firmware) in such devices.
As is also stated above, firmware is normally stored in non-volatile memory such as NOR or NAND flash. The present invention provides a data compression method that is particularly suitable for use when compressing firmware of such devices. However, as is apparent to a person skilled in the art, the present invention is applicable to any type of data in any application wherein a lossless compression of the data is desired.
A flow chart of an exemplary compression method according to the present invention is depicted in Fig. 1. In this detailed description, the data to be compressed will be processed as a sequence of bytes but it should be understood that elements from any finite alphabet could be used as well, and the present invention is not limited to the case of an element being a byte. When the compression process starts, it is assumed that the original data to be compressed and the block boundaries are given. It is further assumed that the table of bit positions, which at the end of the process refers to the starts of each compressed block, and the bitstream, which at the end of the process contains the compressed data, have been initialized.
In an exemplary embodiment, the initial step of the compression method is comprised of building two- and three-byte tables, as shown in step 100. These tables should contain frequently occurring subsequences in the original data. Preferably, one table is defined to consist of the 512 most common two-byte sequences found in the original data and one table to consist of the 512 most common three-byte sequences. In this embodiment, the size of the two tables adds to 2,560 bytes. The sort and count methods needed to determine the most common subsequences are familiar to those skilled in the art of the invention and will therefore not be discussed more in detail. The choice of the number of tables and their content and sizes is not limited to these particular values and is no limitation of the present invention.
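The table-building step 100 can be sketched in Python as a simple count-and-sort over all n-byte windows. The function name build_table, the use of overlapping windows, and the counting across block boundaries are assumptions of this sketch; the text leaves the exact counting method to the skilled reader.

```python
from collections import Counter

def build_table(data: bytes, n: int, size: int = 512) -> list:
    """Return the `size` most frequent n-byte subsequences, most frequent first."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return [seq for seq, _ in counts.most_common(size)]

sample = b"this is miss"                 # tiny input, so tables hold < 512 entries
two_byte_table = build_table(sample, 2)
three_byte_table = build_table(sample, 3)
print(two_byte_table[:2])                # [b'is', b's ']

# Storage for two full 512-entry tables with fixed-length entries:
table_bytes = 512 * 2 + 512 * 3
print(table_bytes)                       # 2560
```

Because every entry in a table has the same length, no per-entry length field is needed, which is how the 2,560-byte figure arises.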
According to the invention, the transformation of the original byte sequence, given the tables, into a sequence of literal bytes, table or backward references now proceeds as follows. The original data is processed block-by-block and as a sequence of bytes within each block. First, the current block is set to the first block, step 101, and then the current byte pointer is set to point to the first byte of the current block to be processed, step 102.
According to the invention, the bytes in each block are then transformed as follows. First, a search is made to find the longest repetition within the current block and before the current byte pointer, step 103. If a backward reference of length at least four bytes is found (test condition 104), the corresponding length-distance reference symbol is output, step 105. At the beginning of each block it is, of course, not possible to find a length-distance backward reference, meaning that condition 104 leads to step 106. If a backward reference of length at least four bytes is not found, a search for the next three bytes is also made in the three-byte table, step 106. If these three bytes are found in the three-byte table, and if a backward reference of length three was also found in step 103, the symbol with the shortest bit length is output, that is, either a length-distance or a table-index symbol. The binary encoding of these symbols is defined later. In an exemplary embodiment, the three-byte table symbol will always have the shortest encoding and will therefore always be output. This assumption is made in the flow chart of fig. 1, but this is no limitation of the present invention. Therefore, in this embodiment, if a match is found (test condition 107) in the three-byte table, the corresponding three-byte table symbol is directly output (step 108). Only if this is not so is a length-distance reference symbol output (step 110), provided that the length is equal to three (test condition 109). If no backward reference or three-byte table reference can be found, the next two bytes are searched for in the two-byte table, as shown in step 111. If they are found in the table, the corresponding table-index symbol is output, step 113. Finally, if no backward or table references can be found, a literal byte symbol is output, step 114, and the byte pointer is moved to point to the next byte.
In all of the above steps it is assumed that a reference, backward or table, is never made so that the current byte pointer is moved beyond the first byte of the next block, that is, a symbol always encodes a sequence of one or more bytes within the current block. As shown in step 115 in fig. 1, the prefix and parameter encoding of each symbol is preferably appended to the bit stream after each symbol has been generated and not after all the symbols have been generated for all of the data to be compressed. Preferably, the step corresponding to step 115 also incorporates the forward movement of the current byte pointer. In an alternative embodiment, the bit stream is generated after all of the symbols have been generated.
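The decision order of steps 103 to 114 can be sketched for a single block as follows. This is a minimal Python sketch under stated assumptions: the names transform_block and longest_match, the tuple representation of symbols, the dictionary form of the tables (sequence mapped to index), and the brute-force match search are all hypothetical choices for illustration. The sketch follows the exemplary embodiment in which a three-byte table symbol is always preferred over a backward reference of length three.

```python
def longest_match(block, pos, min_len=3, max_len=34):
    """Best (length, distance) for block[pos:] among earlier positions, else None."""
    best = None
    for start in range(pos - 1, -1, -1):       # nearest candidates first
        n = 0
        while (n < max_len and pos + n < len(block)
               and block[start + n] == block[pos + n]):
            n += 1
        if n >= min_len and (best is None or n > best[0]):
            best = (n, pos - start)
    return best

def transform_block(block, two_table, three_table):
    """Turn one block of bytes into a list of symbols (tuples)."""
    symbols, pos = [], 0
    while pos < len(block):
        match = longest_match(block, pos)       # step 103
        tri, bi = block[pos:pos + 3], block[pos:pos + 2]
        if match and match[0] >= 4:             # condition 104, step 105
            symbols.append(("backref", match[0], match[1]))
            pos += match[0]
        elif tri in three_table:                # steps 106-108
            symbols.append(("table3", three_table[tri]))
            pos += 3
        elif match:                             # length is exactly 3: steps 109-110
            symbols.append(("backref", match[0], match[1]))
            pos += 3
        elif bi in two_table:                   # steps 111-113
            symbols.append(("table2", two_table[bi]))
            pos += 2
        else:                                   # step 114
            symbols.append(("literal", block[pos]))
            pos += 1
    return symbols

# Sample tables (sequence -> index), chosen only for this demonstration.
TWO = {b"th": 0, b"is": 1}
THREE = {b"mis": 5}
for symbol in transform_block(b"this is miss", TWO, THREE):
    print(symbol)
```

Since the function only ever sees the bytes of one block, no symbol can refer to, or extend into, a neighbouring block.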
After the symbol has been encoded in step 115, a test is made to determine if all bytes in the original data have been compressed, condition 116. If this is the case, test condition 116 is true and the last entry of the table of bit pointers is set to refer to one bit past the last bit of the bitstream, step 117, and the compression process is completed. If there are more bytes to be compressed, another test (118) is made to determine whether the current byte is at the beginning of a new block. If it is, the table of bit pointers is updated with a new entry referring to the first bit of the next block, and the current block is moved one step forward, step 119. Then step 102 is repeated, that is, the current byte pointer is set to the first byte of the new current block. The remaining bytes within each block are then transformed in the same way until no bytes remain, starting at step 103. The step corresponding to step 103 is also carried out if test condition 118 fails, that is, the current byte is not at the start of the next block.
The next step of the method according to the invention consists of encoding the symbol sequence of all the blocks into a bit stream. The encoding step also uniquely defines the decoding step of transforming the bit stream into a sequence of symbols. Finally, the whole decompression process is defined by transforming the sequence of symbols into the original sequence of bytes, or elements.
In a preferred embodiment, the symbol type and table index numbers are encoded together in a two-bit prefix code. In this preferred embodiment, a literal byte symbol is encoded as binary "00", a length-distance backward reference symbol as binary "01", a two-byte table symbol as binary "10", and a three-byte table symbol as binary "11". Once the prefix code is determined, the other parameters of each symbol are encoded into binary form. In a preferred embodiment, the binary encoding of a symbol and its parameters are appended to the bit stream after each symbol, and not after the complete element sequence has been transformed into a sequence of symbols, as depicted in the flow chart in fig. 1 and in particular block 115.
In a preferred embodiment, a literal byte symbol is encoded with another 8 bits. These 8 bits simply constitute the binary representation of the byte.
In a preferred embodiment, the parameters of the length-distance reference symbols are encoded as follows. First a fixed number of bits following the prefix code are used for encoding the distance. The distance is advantageously encoded with the same number of bits that are needed to represent the current block length. For example, in case the block length is 512 bytes, 9 bits are used to code the distance. It has further been found that the maximum length allowed by encoding the length with 5 bits works well. In a preferred embodiment, the minimum length is 3 bytes, which means that these 5 bits can encode lengths from 3 through 34 bytes. With these choices, the length-distance references are encoded with a total of 16 bits, including the prefix code.
In a preferred embodiment, the table index number is encoded with a fixed number of bits. In case both the two-and three-byte tables contain 512 entries each, 9 bits are used to encode the table index.
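The fixed-length encodings of the preferred embodiment can be sketched in Python as follows. The function name encode_symbol, the tuple form of the symbols, and the use of a text string of '0' and '1' characters in place of a packed bit stream are assumptions of this sketch. The offsets (distance minus one, length minus three) follow the worked example given further on in the description.

```python
def encode_symbol(sym, block_len=512, index_bits=9, length_bits=5):
    """Encode one symbol tuple as a string of '0'/'1' characters."""
    dist_bits = (block_len - 1).bit_length()    # 9 bits for a 512-byte block
    kind = sym[0]
    if kind == "literal":                       # prefix 00 + 8-bit byte value
        return "00" + format(sym[1], "08b")
    if kind == "backref":                       # prefix 01 + distance + length
        _, length, distance = sym
        return ("01" + format(distance - 1, f"0{dist_bits}b")
                + format(length - 3, f"0{length_bits}b"))
    if kind == "table2":                        # prefix 10 + table index
        return "10" + format(sym[1], f"0{index_bits}b")
    if kind == "table3":                        # prefix 11 + table index
        return "11" + format(sym[1], f"0{index_bits}b")
    raise ValueError(f"unknown symbol type: {kind}")

for sym in [("literal", 32), ("backref", 3, 3), ("table2", 1), ("table3", 5)]:
    bits = encode_symbol(sym)
    print(sym, bits, len(bits))
```

With these parameters a literal costs 10 bits, a table reference 11 bits, and a backward reference 16 bits, so a table reference to a three-byte sequence is always cheaper than a backward reference of length three.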
In another embodiment, the table index numbers are represented with Huffman coding, resulting in a variable number of bits and, on average, fewer bits than with a fixed code length. These Huffman codes are preferably built as part of the pre-processing step when building the tables. The count of each subsequence is then used to compute the Huffman code, but all these counts do not have to be stored in the tables. There are more memory efficient methods of representing these Huffman codes, but these methods are familiar to those skilled in the art of the invention and will therefore not be described herein. In this embodiment, the Huffman code must also be made available to the decoder.
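How the subsequence counts lead to variable-length index codes can be sketched with the textbook Huffman construction. The function name huffman_code_lengths and the sample counts are hypothetical. The sketch returns only code lengths; as in DEFLATE, storing the lengths is sufficient when a canonical code assignment is used, which is one example of a compact representation, though the text does not say which representation the inventors had in mind.

```python
import heapq
from itertools import count

def huffman_code_lengths(counts):
    """Map each table index to its Huffman code length in bits."""
    lengths = {index: 0 for index in counts}
    if len(counts) == 1:                        # degenerate single-entry table
        return {index: 1 for index in counts}
    tie = count()                               # tie-breaker keeps heap entries comparable
    heap = [(c, next(tie), [index]) for index, c in counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        c1, _, group1 = heapq.heappop(heap)     # two least frequent groups
        c2, _, group2 = heapq.heappop(heap)
        for index in group1 + group2:           # merging adds one bit to each member
            lengths[index] += 1
        heapq.heappush(heap, (c1 + c2, next(tie), group1 + group2))
    return lengths

# Hypothetical occurrence counts for a five-entry table.
counts = {0: 8, 1: 4, 2: 2, 3: 1, 4: 1}
lengths = huffman_code_lengths(counts)
average = sum(counts[i] * lengths[i] for i in counts) / sum(counts.values())
print(lengths, average)
```

For these counts the average index cost falls below the 3 bits that a fixed-length code for five entries would need, which illustrates the gain described above.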
The step of encoding the symbols into a bit stream is completed by concatenating the bit stream of each compressed block into one single bit stream, representing all the compressed blocks. The generated bit stream is stored in storage means, such as NAND or NOR flash or any other type of volatile or non-volatile memory units. If the bit stream is intended to be transmitted to another device, e.g. by a wireless transmission, the bit stream may be stored in storage means in form of a buffer prior to transmission.
In the final step of the exemplary method according to the present invention a table of bit positions referring to the start of each compressed block in the bit stream is built. In a preferred embodiment, this table is built incrementally as soon as the encoding of the current block is completed, as shown in step 118 and 119. Preferably, a 32-bit integer number is used to represent the bit position of the start, or first bit, of each compressed block. The table is then represented as an array of integers. The bit position of the first compressed block will always be zero, so this position is not stored in the array, or table. Preferably, one plus the position of the last bit of the last block is recorded in the last element of the array. In this way the length of the array is the same as the number of blocks, and the first and last bit position of each compressed block can be computed from the array. This information is needed during decompression to determine where to start and end the decoding of each block in the bit stream.
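The layout of the bit position table can be sketched in Python as follows. The function names build_bit_position_table and block_bit_range, the 0-based block numbering, and the sample block sizes are assumptions of this sketch. Entry k holds the first bit of block k+1, and the last entry holds one plus the position of the final bit, so the array has exactly one entry per block.

```python
def build_bit_position_table(block_bit_lengths):
    """Running end positions of the compressed blocks, one entry per block."""
    table, position = [], 0
    for bits in block_bit_lengths:
        position += bits
        table.append(position)      # start of the next block (or end of stream)
    return table

def block_bit_range(table, k):
    """Return (first bit, one past last bit) of 0-based block k."""
    start = 0 if k == 0 else table[k - 1]   # block 0 always starts at bit 0
    return start, table[k]

# Hypothetical compressed sizes, in bits, of three blocks.
table = build_bit_position_table([69, 120, 40])
print(table)
print([block_bit_range(table, k) for k in range(len(table))])
```

With 32-bit entries, as preferred in the text, the table can address a bit stream of up to 2^32 bits, that is, 512 megabytes.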
Schematic diagrams of the compression and decompression processes are depicted in fig. 2. The upper diagram shows the compression process which takes as input a sequence of n blocks. The result of the compression is the compressed data, or the bitstream, and a set of tables. The set of tables include the table of bit pointers and, in an exemplary embodiment, a two- and a three-byte table. The lower diagram shows the decompression process. Given the bitstream and the set of tables, it can decompress any block k between 1 and n.
By way of example, the compression process of a particular sequence will now be described. The example sequence is the text string "this is miss", that is, the elements are characters which are represented with bytes, encoded with the ASCII standard. In this particular example, it is assumed that this string is contained in one block and that the current byte pointer is located at the initial character "t". It is also assumed that the two-byte table contains the strings "th" and "is", with indices 0 and 1, respectively. Further, it is assumed that the three-byte table contains the string "mis" whose index is 5.
According to the invention, the first symbol encodes the initial string "th" as a reference to the corresponding string in the two-byte table, under the assumption that the string "thi" is not in the three-byte table. The second symbol encodes the string "is", also with a reference to the two-byte table. Similarly, this is under the assumption that the string "is " is not contained in the three-byte table. The third symbol encodes the following space as a literal byte. The fourth symbol encodes the string "is " as a backward reference of length 3 and distance 3. The fifth symbol encodes the string "mis" as a reference to the three-byte table. Finally, the sixth symbol encodes the single letter "s" as a literal byte.
The table in fig. 3 displays the encoded symbols and their type, binary encoding, and the binary length, respectively. The spaces in the binary strings are inserted just for illustrative purposes. The two literals, the space and letter "s", are encoded into binary form using their decimal ASCII codes 32 and 115, respectively. The length-distance backward reference encodes the distance 3 as the binary form of decimal 2, that is, the distance minus one, since the minimum distance is 1. The corresponding length 3 is encoded as the binary form of decimal 0, that is, the length minus 3, since, in this example, the minimum length is 3.
This particular sequence is therefore encoded with a total of 69 bits, which should be compared to the 96 bits (12 times 8 bits) needed to encode the original string.
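The worked example can be checked by decoding it. The following Python sketch writes out the six symbols as bit strings (11 + 11 + 10 + 16 + 11 + 10 = 69 bits), places two copies of the block in one stream, and decodes only the second copy by using its start and end bit positions. The function name decode_block, the text-string bit stream, and the dictionary form of the tables are assumptions made for illustration.

```python
def decode_block(bits, start, end, two_table, three_table,
                 block_len=512, index_bits=9, length_bits=5):
    """Decode bits[start:end] (a str of '0'/'1') into the bytes of one block."""
    dist_bits = (block_len - 1).bit_length()    # 9 bits for 512-byte blocks
    out = bytearray()
    pos = start

    def take(n):
        """Read the next n bits as an unsigned integer."""
        nonlocal pos
        value = int(bits[pos:pos + n], 2)
        pos += n
        return value

    while pos < end:
        prefix = take(2)
        if prefix == 0b00:                      # literal byte
            out.append(take(8))
        elif prefix == 0b01:                    # backward reference
            distance = take(dist_bits) + 1      # stored as distance - 1
            length = take(length_bits) + 3      # stored as length - 3
            for _ in range(length):
                out.append(out[-distance])      # copy from already decoded bytes
        elif prefix == 0b10:                    # two-byte table reference
            out += two_table[take(index_bits)]
        else:                                   # three-byte table reference
            out += three_table[take(index_bits)]
    return bytes(out)

TWO = {0: b"th", 1: b"is"}
THREE = {5: b"mis"}
block_bits = "".join([
    "10" + "000000000",                 # "th"  : two-byte table, index 0
    "10" + "000000001",                 # "is"  : two-byte table, index 1
    "00" + "00100000",                  # " "   : literal 32
    "01" + "000000010" + "00000",       # "is " : distance 3, length 3
    "11" + "000000101",                 # "mis" : three-byte table, index 5
    "00" + "01110011",                  # "s"   : literal 115
])
stream = block_bits * 2                              # two identical blocks
positions = [len(block_bits), 2 * len(block_bits)]   # table of bit positions
second = decode_block(stream, positions[0], positions[1], TWO, THREE)
print(len(block_bits), second)
```

Decoding the second block never touches the bits of the first, which is the block-wise random access property the method is designed to provide.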
The present invention further relates to a device for compressing data. The device comprises means for transforming the original sequence of data elements into a new sequence of symbols representing one of a literal element, a backward reference to a previously occurring subsequence of elements, or a table reference to a subsequence in one of the tables. The device also comprises means for encoding the new sequence of symbols into a uniquely decodeable bit stream for compressing the original sequence, and means for building a table of bit positions referring to the start of each compressed block in the bit stream.
The device may also comprise means for decompressing arbitrary blocks, given the bitstream and tables. Further, the device may comprise means for building one or more tables of frequently occurring subsequences of elements.
The present invention further relates to a device for decoding a set of symbols into a sequence of elements given a set of tables of subsequences of elements. This device comprises means for decoding the symbols into one of a literal element, a backward reference to a previously occurring subsequence of elements, or a table reference to a subsequence in one of the given tables, and means for transforming the decoded symbols into the original sequence of elements.
The above description of the compression and decompression process may, as stated above, e.g. be applied to compress part of the firmware, for example all applications and/or the operating system, in a mobile device. In such a device the firmware might be stored in NAND flash memory and loaded to RAM by the boot code at system startup. When the firmware is compressed, decompression is initiated by the boot code.
It should be emphasized that while what has been described herein constitutes exemplary embodiments of the invention, it should be recognized that the invention could take numerous other forms.

Claims

1. A method for lossless data compression of an input sequence of elements, the method comprising the steps of:
- transforming the input sequence of data elements into a processed sequence of symbols representing one of a literal element, a backward reference to a previously occurring subsequence of elements, or a table reference to a subsequence in one or more tables of frequently occurring subsequences of elements, and
- encoding the processed sequence of symbols into a uniquely decodeable bit stream for compressing the input sequence, and
- storing the bit stream in storage means, said method being characterised in that the one or more tables of frequently occurring subsequences of elements is a pre-defined table or a pre-defined set of tables.
2. The method according to claim 1, wherein it further includes the step of building one or more tables of frequently occurring subsequences of elements.
3. The method according to claim 1 or 2, wherein said pre-defined set of tables are built once and used for compressing several sets of data.
4. The method according to any of the claims 1-3, wherein the sequence of elements constitute one or more blocks of elements, and wherein the method further includes the step of building a table of bit positions referring to the starts of each block.
5. The method according to any of the claims 1-4, wherein a backward reference to a previously occurring subsequence of elements is only made to a sequence of elements within the current block.
6. The method according to any of the claims 1-5, wherein an element is defined to be an 8-bit byte or any other arbitrary bit length.
7. The method according to any of the claims 1-6, wherein one table is defined to contain the most common pairs of elements contained in the original sequence of data elements.
8. The method according to any of the claims 1-7, wherein one table is defined to contain the most common subsequences of three elements contained in the original sequence of data elements.
9. The method according to any of the claims 1-8, wherein a reference to a previously occurring subsequence is represented as a pair of numbers, wherein one number refers to the length of the subsequence and the other number refers to the distance to the repeating subsequence, wherein both numbers are limited to some minimum and maximum, respectively.
10. The method according to any of the claims 1-9, wherein a table reference is represented as a pair of numbers, where the first number refers to which table the subsequence is found and the second number is an index referring to a certain subsequence in the table given by the first number.
11. The method according to any of the claims 1-10, wherein the type of symbol, literal, backward, or table and the table number are encoded as a binary prefix code, that is, a fixed-length number of bits.
12. The method according to claim 11, wherein the prefix code is two bits long and the resulting binary encoding is one of the four symbols: literal element, two-byte table, three-byte table, or backward reference.
13. The method according to any of the claims 1-12, wherein the number of subsequences in, or the length of, a given table is set to an even power of two and wherein the index is encoded as a fixed-length binary number greater or equal to zero and less than the table length.
14. The method according to any of the claims 1-13, wherein the index is encoded into binary form with Huffman coding according to how frequent the subsequence, corresponding to the index, is in the original sequence of data elements.
15. The method according to claim 9, wherein the length is encoded into binary form using a fixed number of bits and the minimum length is encoded as zero and the other lengths, up to the maximum length, are encoded into binary form in order of increasing lengths.
16. The method according to claim 9, wherein the distance is encoded into binary form using a fixed number of bits and the minimum distance one is encoded as zero and the other distances, up to the maximum distance, are encoded into binary form in order of increasing distances.
17. The method according to claim 4 or 5, wherein the blocks have varying lengths.
18. The method according to claim 4 or 5, wherein the blocks are all of the same length.
19. A method for decoding a uniquely decodeable bit stream into a sequence of elements given a set of tables of subsequences of elements, the method comprising the steps of: retrieving a bit stream from storage means, decoding the uniquely decodeable bit stream into symbols, decoding the symbols into one of a literal element, a backward reference to a previously occurring subsequence of elements, or a table reference to a subsequence in one of the given tables, transforming the decoded symbols into the original sequence of elements, wherein said one or more tables of subsequences of elements is a pre-defined table or a pre-defined set of tables.
20. The method according to claim 19, wherein decoding is performed only for a given block of elements, identified in the symbol sequence by a table of block pointers.
21. The method according to any of the claims 19-20, wherein each symbol type, and table index number, is represented with a unique binary prefix code.
22. The method according to claim 21, wherein the parameters of each symbol are represented as fixed-length binary strings.
23. The method according to claim 21, wherein some of the parameters of the symbols are represented with a variable length entropy code.
24. A device for lossless compression of an input sequence of elements, the device comprising: means for transforming the input sequence of data elements into a processed sequence of symbols representing one of a literal element, a backward reference to a previously occurring subsequence of elements, or a table reference to a subsequence in one or more tables of frequently occurring subsequences of elements, means for encoding the processed sequence of symbols into a uniquely decodeable bit stream for compressing the input sequence, and means for storing the bit stream in storage means, characterised in that the said one or more tables of frequently occurring subsequences of elements is a pre-defined table or a pre-defined set of tables.
25. The device according to claim 24, wherein it further comprises means for building one or more tables of frequently occurring subsequences of elements.
26. The device according to claim 24 or 25, wherein it comprises means for using a pre-defined set of tables, built once and used for compressing several sets of data.
27. The device according to any of the claims 24-26, wherein the device further includes means for building a table of bit positions referring to the start of a given sequence of blocks of elements.
28. A device for decoding a uniquely decodeable bit stream into a sequence of elements given a set of tables of subsequences of elements, the device comprising: means for retrieving a bit stream from storage means, means for decoding the uniquely decodeable bit stream into symbols, means for decoding the symbols into one of a literal element, a backward reference to a previously occurring subsequence of elements, or a table reference to a subsequence in one of the given tables, and means for transforming the decoded symbols into the original sequence of elements, characterised in that said one or more tables of frequently occurring subsequences of elements is a pre-defined table or a pre-defined set of tables.
29. The device according to claim 28, wherein decoding is performed only for a given block of elements, identified in the symbol sequence by a table of block pointers.
30. The device according to claim 28 or 29, wherein the sequence of symbols are represented as a binary stream and each symbol is represented with a unique binary prefix code and the parameters of each symbol are represented with fixed or variable length binary strings.
31. A computer program product for making a computer or processor perform the method according to any of the claims 1-23.
32. A computer readable medium containing a computer program product according to claim 31.
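The bit-level encoding set out in claims 9 to 16 (a two-bit prefix for the symbol type, a fixed-length table index for tables whose length is a power of two, and length and distance stored as offsets from their minimum values) can be illustrated with the following Python sketch. The assignment of prefixes to symbol types, the limits `MIN_LEN`, `LEN_BITS` and `DIST_BITS`, the table names `tab2` and `tab3`, and the use of a '0'/'1' string as the bit stream are all assumptions made for illustration; the claims do not fix these values.

```python
from typing import Dict, List, Sequence, Tuple

KINDS = ["lit", "tab2", "tab3", "back"]   # assumed order; list position = 2-bit prefix
MIN_LEN, LEN_BITS, DIST_BITS = 3, 4, 10   # assumed limits, not from the source


def to_bits(value: int, width: int) -> str:
    """Fixed-length binary string; width 0 yields the empty string."""
    if not 0 <= value < (1 << width):
        raise ValueError(f"{value} does not fit in {width} bits")
    return format(value, "b").zfill(width) if width else ""


def index_width(table: Sequence[bytes]) -> int:
    """Number of index bits for a table whose length is a power of two."""
    n = len(table)
    if n == 0 or n & (n - 1):
        raise ValueError("table length must be a power of two")
    return n.bit_length() - 1


def encode(symbols: Sequence[Tuple], tables: Dict[str, List[bytes]]) -> str:
    out = []
    for sym in symbols:
        kind = sym[0]
        out.append(to_bits(KINDS.index(kind), 2))             # 2-bit type prefix
        if kind == "lit":
            out.append(to_bits(sym[1], 8))                    # 8-bit element
        elif kind == "back":
            _, distance, length = sym
            out.append(to_bits(length - MIN_LEN, LEN_BITS))   # minimum length -> 0
            out.append(to_bits(distance - 1, DIST_BITS))      # distance one -> 0
        else:
            out.append(to_bits(sym[1], index_width(tables[kind])))
    return "".join(out)


def decode(bits: str, tables: Dict[str, List[bytes]]) -> List[Tuple]:
    symbols: List[Tuple] = []
    pos = 0

    def take(width: int) -> int:
        nonlocal pos
        if pos + width > len(bits):
            raise ValueError("truncated bit stream")
        value = int(bits[pos:pos + width], 2) if width else 0
        pos += width
        return value

    while pos < len(bits):
        kind = KINDS[take(2)]
        if kind == "lit":
            symbols.append((kind, take(8)))
        elif kind == "back":
            length = take(LEN_BITS) + MIN_LEN
            distance = take(DIST_BITS) + 1
            symbols.append((kind, distance, length))
        else:
            symbols.append((kind, take(index_width(tables[kind]))))
    return symbols


tables = {"tab2": [b"th", b"he", b"in", b"er"], "tab3": [b"the", b"ing"]}
symbols = [("tab3", 0), ("lit", 32), ("back", 4, 4), ("tab2", 2)]
bits = encode(symbols, tables)
```

Since every field has a known width once the prefix has been read, the resulting bit stream is uniquely decodable without any separators between symbols.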
PCT/SE2006/001198 2005-10-24 2006-10-23 Method and system for compressing data WO2007050018A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP06799795A EP1941617A4 (en) 2005-10-24 2006-10-23 Method and system for compressing data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SE0502351A SE530081C2 (en) 2005-10-24 2005-10-24 Method and system for data compression
SE0502351-0 2005-10-24

Publications (2)

Publication Number Publication Date
WO2007050018A1 (en) 2007-05-03
WO2007050018A8 (en) 2007-10-04

Family

ID=37968054

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SE2006/001198 WO2007050018A1 (en) 2005-10-24 2006-10-23 Method and system for compressing data

Country Status (3)

Country Link
EP (1) EP1941617A4 (en)
SE (1) SE530081C2 (en)
WO (1) WO2007050018A1 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8165597B2 (en) 2009-03-25 2012-04-24 Motorola Mobility, Inc. Method and apparatus to facilitate partitioning use of wireless communication resources amongst base stations
EP2712089A1 (en) * 2012-09-20 2014-03-26 Alcatel-Lucent Method for compressing texts and associated equipment
US8983388B2 (en) 2008-09-30 2015-03-17 Google Technology Holdings LLC Method and apparatus to facilitate preventing interference as between base stations sharing carrier resources
US8996018B2 (en) 2008-10-30 2015-03-31 Google Technology Holdings LLC Method and apparatus to facilitate avoiding control signaling conflicts when using shared wireless carrier resources
US10387377B2 (en) 2017-05-19 2019-08-20 Takashi Suzuki Computerized methods of data compression and analysis
CN113868206A (en) * 2021-10-08 2021-12-31 八十一赞科技发展(重庆)有限公司 Data compression method, decompression method, device and storage medium
CN116665836A (en) * 2023-07-26 2023-08-29 国仪量子(合肥)技术有限公司 Editing and storing method, reading and playing method and electronic equipment for sequence data
US11741121B2 (en) 2019-11-22 2023-08-29 Takashi Suzuki Computerized data compression and analysis using potentially non-adjacent pairs

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4730348A (en) * 1986-09-19 1988-03-08 Adaptive Computer Technologies Adaptive data compression system
EP0438956A1 (en) * 1989-12-28 1991-07-31 International Business Machines Corporation Method of encoding compressed data
EP0438954A1 (en) * 1989-12-28 1991-07-31 International Business Machines Corporation Method of decoding compressed data
EP0673122A2 (en) * 1994-03-01 1995-09-20 Hewlett-Packard Company Coding apparatus
EP0788239A2 (en) * 1996-01-31 1997-08-06 Hitachi, Ltd. Method of and apparatus for compressing and decompressing data and data processing apparatus and network system using the same
WO1998006028A1 (en) 1996-08-06 1998-02-12 Reynar Jeffrey C A lempel-ziv data compression technique utilizing a dictionary pre-filled with frequent letter combinations, words and/or phrases

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5729228A (en) * 1995-07-06 1998-03-17 International Business Machines Corp. Parallel compression and decompression using a cooperative dictionary

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP1941617A4 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8983388B2 (en) 2008-09-30 2015-03-17 Google Technology Holdings LLC Method and apparatus to facilitate preventing interference as between base stations sharing carrier resources
US8996018B2 (en) 2008-10-30 2015-03-31 Google Technology Holdings LLC Method and apparatus to facilitate avoiding control signaling conflicts when using shared wireless carrier resources
US8165597B2 (en) 2009-03-25 2012-04-24 Motorola Mobility, Inc. Method and apparatus to facilitate partitioning use of wireless communication resources amongst base stations
EP2712089A1 (en) * 2012-09-20 2014-03-26 Alcatel-Lucent Method for compressing texts and associated equipment
US10387377B2 (en) 2017-05-19 2019-08-20 Takashi Suzuki Computerized methods of data compression and analysis
US11269810B2 (en) 2017-05-19 2022-03-08 Takashi Suzuki Computerized methods of data compression and analysis
US11741121B2 (en) 2019-11-22 2023-08-29 Takashi Suzuki Computerized data compression and analysis using potentially non-adjacent pairs
CN113868206A (en) * 2021-10-08 2021-12-31 八十一赞科技发展(重庆)有限公司 Data compression method, decompression method, device and storage medium
CN116665836A (en) * 2023-07-26 2023-08-29 国仪量子(合肥)技术有限公司 Editing and storing method, reading and playing method and electronic equipment for sequence data
CN116665836B (en) * 2023-07-26 2023-10-27 国仪量子(合肥)技术有限公司 Editing and storing method, reading and playing method and electronic equipment for sequence data

Also Published As

Publication number Publication date
WO2007050018A8 (en) 2007-10-04
SE0502351L (en) 2007-04-25
EP1941617A1 (en) 2008-07-09
SE530081C2 (en) 2008-02-26
EP1941617A4 (en) 2012-09-19

Similar Documents

Publication Publication Date Title
US5870036A (en) Adaptive multiple dictionary data compression
KR100894002B1 (en) Device and data method for selective compression and decompression and data format for compressed data
US5001478A (en) Method of encoding compressed data
US6597812B1 (en) System and method for lossless data compression and decompression
CA2263453C (en) A lempel-ziv data compression technique utilizing a dictionary pre-filled with frequent letter combinations, words and/or phrases
EP1941617A1 (en) Method and system for compressing data
CA2299902C (en) Method and apparatus for data compression using fingerprinting
US5010345A (en) Data compression method
US20090060047A1 (en) Data compression using an arbitrary-sized dictionary
US5877711A (en) Method and apparatus for performing adaptive data compression
US5874908A (en) Method and apparatus for encoding Lempel-Ziv 1 variants
JPH07104971A (en) Compression method using small-sized dictionary applied to network packet
US5673042A (en) Method of and an apparatus for compressing/decompressing data
EP2455853A2 (en) Data compression method
US6518895B1 (en) Approximate prefix coding for data compression
US5010344A (en) Method of decoding compressed data
EP0435802B1 (en) Method of decompressing compressed data
Bhadade et al. Lossless text compression using dictionaries
Rathore et al. A brief study of data compression algorithms
Tank Implementation of Lempel-ZIV algorithm for lossless compression using VHDL
Kwong et al. A statistical Lempel-Ziv compression algorithm for personal digital assistant (PDA)
Hoang et al. Dictionary selection using partial matching
Klein et al. Parallel Lempel Ziv Coding
Swacha et al. Dynamic, semi-dynamic and static word-based compression: a comparison of effectiveness
Tabus et al. Text compression based on variable-to-fixed codes for Markov sources

Legal Events

Date Code Title Description
DPE2 Request for preliminary examination filed before expiration of 19th month from priority date (pct application filed from 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
REEP Request for entry into the european phase (Ref document number: 2006799795; Country of ref document: EP)
WWE Wipo information: entry into national phase (Ref document number: 2006799795; Country of ref document: EP)
NENP Non-entry into the national phase (Ref country code: DE)