Data Compression
In many systems, it is desirable to use some form of compression technique in order to handle data. This can be necessary in order to improve upon data requirements for storage, or to facilitate quicker data transfer in case of limited bandwidth. There is no single data compression program that will work for all types of data, nor is there one that will compress all data of a certain type. Data compression techniques and software tend to be quite specialized.
It is important to understand why it is that data compression is not always possible for all files--many hoaxes and unscientific claims notwithstanding. A well-known concept in discrete mathematics called the pigeonhole principle says that if one has more than n items of a certain kind, but only n slots (or pigeonholes) to put them in, then at least one of the slots will have more than one item in it. This intuitive concept has important consequences for data compression. Consider that for any compression to be meaningful, it should be possible to derive the original data from the compressed data, and the compressed data should be smaller than the original data. Now, it should be noted that there are a larger number of strings of data of greater length; for example, there are 90 numbers of length two (10-99) while there are just 10 of length one (0-9). This means that any data compression method that is used for compressing data of larger sizes has to confront the pigeonhole problem--for example, how is it possible to compress all 90 two-place numbers into just the 10 available one-place slots? Obviously, then, no possible algorithm in the world could compress all larger files into smaller files, because of the mathematical restriction imposed by the pigeonhole principle. Any compression mechanism that preserves data integrity (creates a compressed version that can be uncompressed to obtain the original form, or an acceptably close match thereto) necessarily must leave some files little compressed, or even expanded. Unfortunately, bogus claims about compression persist, and more than one program that claims to achieve the impossible has been granted a patent.
Now and then, claims are made for programs that are said to achieve incredible compression ratios: these claims are fake because the programs do not compress at all but just store the uncompressed data in hidden files on the hard disk or keep it in unused clusters. Such programs may be dangerous and should never be used because there is a significant risk of losing all the data.
Compression techniques may be broadly classified into those addressing text compression and image compression. Each has different requirements. Text compression is relatively simpler; a very easy example of text compression is to take out all occurrences of 'e' in the text, since a human can usually restore them from context. Other ways of text compression are better in practice, of course, and text compression is by nature lossless, and methods that often achieve over 35% compression are known. (This is not a conflict with the pigeonhole principle because formatted text typically has a large amount of overhead in the form of blank spaces, lines, formatting, etc.) Image compression can be difficult, but in certain cases, we can tolerate loss in image quality when the final output is insignificantly different in appearance from the original image.
Among data compression techniques, the one that probably deserves the most attention for its widespread use, and for its status as a technique against which to compare others, is Huffman compression. This is a statistical data compression technique which gives a reduction in the average code length used to represent the symbols of an alphabet. The Huffman code is an example of a code which is optimal in the case where all symbols probabilities are integral powers of 1/2. A Huffman code can be built in the following manner:
- Rank all symbols in order of probability of occurrence.
- Successively combine the two symbols of the lowest probability to form a new composite symbol; eventually we will build a binary tree where each node is the probability of all nodes beneath it.
- Trace a path to each leaf, noticing the direction at each node.
For a given frequency distribution, there are many possible Huffman codes, but the total compressed length will be the same. It is possible to define a standard Huffman tree, that is, pick one of these alternative trees. Such a tree can then be represented very compactly, by transmitting only the bit length of each code.
Though Huffman coding is quite practical and easy-to-use, it is not always optimal, in part because it makes the unreasonable assumption that the symbol probabilities are integral powers of 1/2. For this reason, other compression techniques have been invented, of which a few may be briefly mentioned:
- JPEG--the joint photographic experts group works on a format for still pictures that also allows for compression. The JPEG format allows for lossy as well as lossless compression.
- MPEG--the moving picture experts group works on techniques for compressing moving image data. Moving images such as video frames are typically more difficult to compress than still images, because frames are related in time to their predecessors and successors as well, and good compression techniques must note and use this fact.
- Wavelet compression--since the 1980s, the traditional study of digital signals based on Fourier transforms has been replaced by wavelet analysis. The "Haar wavelet" and similar ones are used as a basis for image analysis, compression, and regeneration.
- Scalar quantization--in this technique, lossy compression is achieved by means of approximating the color values of pixels; for instance, in place of 256 shades of gray, which are unnecessary since the human eye can only distinguish 30 to 40, one can approximate the same image with just 32 or 64 shades of gray.
- Vector quantization--this compression technique works on arrays of independent values, rather than on single pixel values.
This is the complete article, containing 973 words
(approx. 3 pages at 300 words per page).