|Publication number||US7026962 B1|
|Application number||US 09/626,551|
|Publication date||Apr 11, 2006|
|Filing date||Jul 27, 2000|
|Priority date||Jul 27, 2000|
|Publication number||09626551, 626551, US 7026962 B1, US 7026962B1, US-B1-7026962, US7026962 B1, US7026962B1|
|Inventors||Shahriar Emami, Julio C. Blandon|
|Original Assignee||Motorola, Inc|
This invention relates in general to data communication methods and devices, and in particular to a method and apparatus for compressing text efficiently for transmission in a communication system.
Static and dynamic dictionary methods for compressing text or data files and messages are well known to one of ordinary skill in the art. A variety of static dictionary compression techniques have been described for use in communication systems in which short and medium length messages are transmitted using a transmission medium, such as radio, in which the signal conveying the message can undergo substantial distortion. Dynamic dictionary techniques have been described and used for compressing large files, such as files that are stored on a hard disk.
The rapidly expanding transmission of data files that has resulted from the widespread use of the Internet emphasizes a need for continued improvement of data compression techniques.
While this invention is susceptible of embodiment in many different forms, there is shown in the drawings and will herein be described in detail specific embodiments, with the understanding that the present disclosure is to be considered as an example of the principles of the invention and not intended to limit the invention to the specific embodiments shown and described. Further, the terms and words used herein are not to be considered limiting, but rather merely descriptive. In the description below, like reference numbers are used to describe the same, similar, or corresponding parts in the several views of the drawings.
The data communication system 100 comprises a data compressor 110 that acquires the data file 105 from a memory such as a random access memory and generates a compressed data file using the unique technique described herein. The compressed data file is coupled to a data transmitter 120 that converts the compressed data file into information that is encoded with error protection and modulated onto a transmission signal 125, such as a radio signal used in a cellular telephone system or an analog voltage such as used in a public switched telephone network (PSTN). The transmission signal 125 is received by a data receiver 130 that demodulates and error decodes the information, generating a compressed file that is equivalent to the compressed data file, except for any uncorrected errors caused by distortion of the transmission signal 125. The compressed file generated by the data receiver 130 is then coupled to a data decompressor 140 that decompresses the compressed file, generating a decompressed data file 145 that is identical to the original data file 105 but for any errors caused by the distortion of the transmission signal 125. It will be appreciated that the components described with reference to
The PDL 220 is preferably a small segment of random access memory (RAM), certain contents of which are pointed to by the processor 210 using one of a plurality of index bytes or pointers, herein called primary tokens, before and during the compression of the data file 105. Referring to
The static portion 310 comprises two sets of items 311, 313, each item being identified (pointed to) by a different primary token. These primary tokens form two sets of primary tokens corresponding to the two sets of items 311, 313. A third set of primary tokens is reserved for a few (3 in the preferred embodiment of the present invention) commands 312 that are used in the encoding process. These command tokens preferably do not have any corresponding items stored in the PDL 220; they are used to indicate that the next two bytes within a compressed data file 240 identify a word from the common word dictionary list and to provide capitalization characteristics as shown below in Table 1. The command token together with the next two bytes is called a common word token.
Table 1. Capitalization characteristics of the common word identified by the common word token, one row per command token value:
|No capital letters|
|The first letter is capitalized|
|All letters are capitalized|
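The three capitalization cases of Table 1 can be illustrated with a minimal Python sketch. The numeric command token values below are hypothetical placeholders (this text does not give the actual values), and the function name is an assumption introduced only for illustration.

```python
from typing import Optional

# Hypothetical command token values; the actual values are not given in this text.
CMD_NO_CAPS = 253
CMD_FIRST_CAP = 254
CMD_ALL_CAPS = 255


def capitalization_command(word: str) -> Optional[int]:
    """Return the command token matching the word's capitalization.

    Returns None for an empty word or for non-standard capitalization
    (for example "hELLo"), which none of the three commands can describe.
    """
    if not word:
        return None
    if word.islower():
        return CMD_NO_CAPS
    if word.isupper():
        return CMD_ALL_CAPS
    if word[0].isupper() and word[1:].islower():
        return CMD_FIRST_CAP
    return None
```

A word for which the function returns None corresponds to the non-standard capitalization case, which the description handles by writing the word as unencoded ASCII values.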
The set of commonly used alphanumeric characters and symbols 311 preferably comprises those alphanumeric characters represented by ASCII (American Standard Code for Information Interchange) values 0 to 127 plus a number of other selected ASCII values in the range 128 to 255 (including, for instance, slanted quotation marks, the copyright symbol, and the trademark symbol). The 128 items having token values from 0 to 127, although forming a part of the dictionary list 220, are not stored in the dictionary list 220 since the ASCII value is the same as the item value.
The set of short static words 313 comprises a predetermined quantity of the most commonly used words of fewer than four characters. In this example, the predetermined quantity is 37. Such words are determined using a set of test files, in a manner well known in the art of formulating static dictionaries for data compression. The test set of files includes, for example, a sampling of a predetermined set of books, magazines, and newspaper articles. The predetermined quantity is preferably 40 characters. Table 2 shows an exemplary list of some of the short static words 313.
The dynamic portion 320 comprises a set of most frequent words found in a data file that are not in the set of short static words, and not in the CWDL 230. Each word in this set of words is identified by a primary token that is called a dynamic word token, and the location of each word in the set is identified by an encoding and decoding pointer in an index of encoding and decoding pointers. The formulation of the dynamic portion 320 of the PDL 220 is described more fully after the following description of the CWDL 230, which is done with reference to
It will be appreciated that the PDL 220 includes some words that are more than one character in length. In accordance with the preferred embodiment of the present invention, three eight bit bytes are reserved for every item that corresponds to the tokens having values 128 to 255, and the location of any one of them is found by using an offset calculation that is well known to one of ordinary skill in the art. Therefore, some of the primary tokens indicate a corresponding word in the primary dictionary by means of intermediate pointers called encoding and decoding pointers that are stored in the index. However, the tokens having values from 0 to 127 do not point to any memory location since they directly represent the ASCII symbol by their value. Thus, the primary word dictionary list comprises a “virtual” list for the tokens having values from 0 to 127.
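The fixed-slot offset calculation described above can be sketched as follows. This is only an illustration under stated assumptions: the names `lookup_item`, `SLOT_SIZE`, and `FIRST_STORED_TOKEN` are hypothetical, padding short items with NUL bytes is an assumption, and the sketch ignores the command tokens and the pointer index used for dynamic words.

```python
SLOT_SIZE = 3             # three eight-bit bytes reserved per stored item (from the text)
FIRST_STORED_TOKEN = 128  # tokens 0 to 127 are "virtual" and occupy no storage


def lookup_item(token: int, storage: bytes) -> bytes:
    """Return the item for a primary token using a fixed-slot offset.

    Tokens 0..127 directly represent their ASCII value, so no memory
    access is needed. Tokens 128..255 index three-byte slots in `storage`.
    Padding short items with NUL bytes is an assumption of this sketch.
    """
    if not 0 <= token <= 255:
        raise ValueError("primary tokens are single bytes")
    if token < FIRST_STORED_TOKEN:
        return bytes([token])
    offset = (token - FIRST_STORED_TOKEN) * SLOT_SIZE
    return storage[offset:offset + SLOT_SIZE].rstrip(b"\x00")
```

For example, with `storage = b"the" + b"of\x00"`, token 128 yields `b"the"`, token 129 yields `b"of"`, and token 65 yields `b"A"` without touching the storage at all.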
Other arrangements could be used for identifying each character, symbol, and word in the PDL 220. For example, the primary tokens having values 128 to 255 could point to symbols or words by means of an intermediate decoding pointer table that stores the address of the symbol or beginning of a word.
The common word dictionary list (CWDL) 230 comprises a list of 65,536 items. In the preferred embodiment of the present invention, the items in the CWDL 230 are 65,426 of the most common words of more than three characters, determined from the set of test files described above, and the 110 ASCII symbols that are not identified by the primary tokens. Each item is identified by a three-word “common word token” comprising two eight-bit words preceded by a predetermined one of the command tokens that identifies the two tokens following it as a 16-bit pointer to one of the items in the CWDL 230. In accordance with the preferred embodiment of the present invention, an intermediate decoding pointer table 490 is used to identify the 110 ASCII symbol locations and the location of the beginning of the 65,426 common words, with each pointer pointing to one such symbol or word.
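The three-byte common word token can be sketched in Python as shown below. The function names are hypothetical, and the byte order of the 16-bit pointer (high byte first) is an assumption, since the text does not specify it.

```python
from typing import Tuple

CWDL_SIZE = 65426 + 110  # common words plus ASCII symbols, i.e. 65,536 items


def encode_common_word_token(command_token: int, index: int) -> bytes:
    """Pack a command token and a 16-bit CWDL index into three bytes.

    High-byte-first ordering is an assumption of this sketch.
    """
    if not 0 <= command_token <= 255:
        raise ValueError("command token must fit in one byte")
    if not 0 <= index < CWDL_SIZE:
        raise ValueError("index must address one of the 65,536 CWDL items")
    return bytes([command_token, index >> 8, index & 0xFF])


def decode_common_word_token(token: bytes) -> Tuple[int, int]:
    """Unpack three bytes into (command_token, cwdl_index)."""
    if len(token) != 3:
        raise ValueError("a common word token is exactly three bytes")
    command_token, high, low = token
    return command_token, (high << 8) | low
```

Because 65,426 + 110 = 65,536 = 2^16, two bytes address every item in the CWDL 230 with no unused pointer values.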
Referring now to
At step 535 each partition is stored at a starting location in memory, as illustrated in
When the word length is three characters or less at step 770, the data compressor searches for the word in the primary dictionary list (PDL) 220 at step 773. When the word is found in the PDL, the data compressor 110 generates a primary token at step 776, applies space compression in the same manner as described for step 760 above, and writes the primary token to the compressed data file 240 at step 783. When the word is not found in the PDL 220 at step 773, nor by the partial match function 755, or is found to have non-standard capitalization at step 735, the data compressor 110 writes intervening spaces to the compressed data file 240 at step 786, and writes the unencoded ASCII values that make up the word to the compressed data file 240 at step 790. At step 796, when a determination is made that there are no more lines in the data file 105, the compression is complete at step 799. When there are more lines in the data file 105, an ASCII end of line (EOL) character is added to the encoded file 240 at step 796, and the process continues at step 720. When the static encoding process is completed at step 799, the remaining unencoded strings of symbols are searched to determine the most frequently used unencoded combinations, in a conventional manner, and the combinations found are added to the PDL 220, as described above. Then the remaining combinations are encoded with dynamic tokens, as described above.
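The selection of the most frequently used unencoded combinations for the dynamic portion can be illustrated with a simple frequency count. This is a minimal sketch, not the method of the preferred embodiment: the function name, the tie-breaking rule, and the choice to skip combinations that occur only once are all assumptions.

```python
from collections import Counter
from typing import Iterable, List


def pick_dynamic_words(unencoded_words: Iterable[str], slots: int) -> List[str]:
    """Choose up to `slots` of the most frequent unencoded words.

    Ties are broken alphabetically so the result is deterministic.
    Words seen only once are skipped (an assumption of this sketch),
    since a dictionary entry for them would save nothing.
    """
    counts = Counter(unencoded_words)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, count in ranked[:slots] if count > 1]
```

Each word returned would be assigned one of the primary tokens left over for dynamic word tokens, and later occurrences of that word would be replaced by the token.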
Referring now to
In a first variation of the preferred embodiment of the present invention, the ASCII symbols that are removed from the PDL 220 and placed into the CWDL 230 are simply dropped and “text only” files can then be compressed by the compressor 110. This alternative works fine in those situations where the rarely used ASCII symbols that are removed from the PDL are never found in files that are to be encoded. In a modification to this first alternative embodiment, the data file 105 can be prefiltered to substitute a predetermined encodable ASCII symbol for any of the unencodable ASCII symbols.
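The prefiltering step of this variation can be sketched as a byte-wise substitution. The function name, the use of a question mark as the substitute symbol, and the example set of encodable values are assumptions made for illustration.

```python
from typing import FrozenSet


def prefilter(data: bytes, encodable: FrozenSet[int],
              substitute: int = ord("?")) -> bytes:
    """Replace every byte not in `encodable` with `substitute`.

    `encodable` stands for the set of ASCII values that still have a
    primary token; the default substitute "?" is a hypothetical choice.
    """
    if substitute not in encodable:
        raise ValueError("the substitute symbol must itself be encodable")
    return bytes(b if b in encodable else substitute for b in data)
```

For instance, with `encodable = frozenset(range(128))`, the input `b"caf\xe9"` becomes `b"caf?"`. The substitution is lossy, so the decompressed data file matches the prefiltered file rather than the original.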
In a second variation of the preferred embodiment of the present invention, there is no generation of, or encoding using, dynamic tokens.
Performance of the above listed compression techniques is illustrated in
It can be seen in
It will be appreciated that the present invention provides compression encoding of a data file that has a combination of speed, compression ratio, and error performance that is better than existing compression techniques.
Performance of the above-listed cascaded compression techniques is illustrated in
The left bar in each group represents the performance of a technique with the Book1 file, the center bar represents the performance of a technique with the Book2 file, and the right bar represents an average performance of a technique with the Book1 and Book2 files.
From the figure, it can be seen that the present invention, unlike most popular compression utilities, can be used in cascade with other compression utilities to advantageously further improve the compression ratio. In all cases, the compression was performed using the present invention first, and the resulting compressed file was further compressed using one of the conventional techniques indicated in
While the preferred and other embodiments of the invention have been illustrated and described, it will be clear that the invention is not so limited. Numerous modifications, changes, variations, substitutions, and equivalents will occur to those of ordinary skill in the art without departing from the spirit and scope of the present invention as defined by the following claims.
|U.S. Classification||341/51, 341/59, 341/106, 341/67, 341/65|
|Jul 27, 2000||AS||Assignment|
Owner name: MOTOROLA, INC., ILLINOIS
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:EMAMI, SHAHRIAR;BLANDON, JULIO C.;REEL/FRAME:010975/0893
Effective date: 20000726
|Nov 16, 2009||REMI||Maintenance fee reminder mailed|
|Apr 11, 2010||LAPS||Lapse for failure to pay maintenance fees|
|Jun 1, 2010||FP||Expired due to failure to pay maintenance fee|
Effective date: 20100411