Coding Techniques
Representation of information is a fundamental aspect of all communication from bird songs to human language to modern telecommunications. In the case of digital storage and transmission of information, mathematical analysis has led to principles that drive the design of symbolic representations. For example, it has let a binary code be defined as a mapping from a set of source symbols, or source alphabet, to unique bit strings. A familiar example is the standard American Standard Code for Information Interchange (ASCII) code (see Table 1) in which each character from a standard keyboard is represented as a 7-bit sequence.
ASCII is an example of a block code, where each symbol is represented by a fixed-length "block" of n bits. Given a number of symbols (K) to encode in binary form, a number of bits (n) is chosen such that there are enough binary patterns of that length to encode all K symbols. With n bits, 2n unique strings exist, and so we choose the smallest integer n that satisfies K 2n. Thus a 3-bit code can represent up to eight symbols, a 4-bit code can be used for a set of up to 16 symbols, etc.
Because of its universal use, the ASCII code has great advantages as a means of storing textual data and communicating between machines. On the face of it, the ASCII design seems perfectly reasonable. After all, a common language is central to communication. However, ASCII lacks certain properties desirable in a code. One of these is efficiency and the other is robustness.
Efficiency
Knowledge of symbol probabilities can be exploited to make a code more efficient. Morse code, the system of dots and dashes used for telegraph transmission
Table 1. THE 7-BIT ASCII CODE
| Symbol | Code | Symbol | Code | Symbol | Code | Symbol | Code |
| NUL | 0000000 | | 0100000 | @ | 1000000 | ' | 1100000 |
| SOH | 0000001 | ! | 0100001 | A | 1000001 | a | 1100001 |
| STX | 0000010 | " | 0100010 | B | 1000010 | b | 1100010 |
| ETX | 0000011 | # | 0100011 | C | 1000011 | c | 1100011 |
| EOT | 0000100 | $ | 0100100 | D | 1000100 | d | 1100100 |
| ENQ | 0000101 | % | 0100101 | E | 1000101 | e | 1100101 |
| ACK | 0000110 | & | 0100110 | F | 1000110 | f | 1100110 |
| BEL | 0000111 | ' | 0100111 | G | 1000111 | g | 1100111 |
| BS | 0001000 | ( | 0101000 | H | 1001000 | h | 1101000 |
| TAB | 0001001 | ) | 0101001 | I | 1001001 | i | 1101001 |
| LF | 0001010 | * | 0101010 | J | 1001010 | j | 1101010 |
| VT | 0001011 | + | 0101011 | K | 1001011 | k | 1101011 |
| FF | 0001100 | , | 0101100 | L | 1001100 | l | 1101100 |
| CR | 0001101 | - | 0101101 | M | 1001101 | m | 1101101 |
| SO | 0001110 | . | 0101110 | N | 1001110 | n | 1101110 |
| SI | 0001111 | / | 0101111 | O | 1001111 | o | 1101111 |
| DLE | 0010000 | 0 | 0110000 | P | 1010000 | p | 1110000 |
| DC1 | 0010001 | 1 | 0110001 | Q | 1010001 | q | 1110001 |
| DC2 | 0010010 | 2 | 0110010 | R | 1010010 | r | 1110010 |
| DC3 | 0010011 | 3 | 0110011 | S | 1010011 | s | 1110011 |
| DC4 | 0010100 | 4 | 0110100 | T | 1010100 | t | 1110100 |
| NAK | 0010101 | 5 | 0110101 | U | 1010101 | u | 1110101 |
| SYN | 0010110 | 6 | 0110110 | V | 1010110 | v | 1110110 |
| ETB | 0010111 | 7 | 0110111 | W | 1010111 | w | 1110111 |
| CAN | 0011000 | 8 | 0111000 | X | 1011000 | x | 1111000 |
| EM | 0011001 | 9 | 0111001 | Y | 1011001 | y | 1111001 |
| SUB | 0011010 | : | 0111010 | Z | 1011010 | z | 1111010 |
| ESC | 0011011 | ; | 0111011 | [ | 1011011 | { | 1111011 |
| FS | 0011100 | < | 0111100 | 1011100 | 1111100 |
| GS | 0011101 | = | 0111101 | ] | 1011101 | } | 1111101 |
| RS | 0011110 | > | 0111110 | ^ | 1011110 | ~ | 1111110 |
| US | 0011111 | ? | 0111111 | _ | 1011111 | [.alpha] | 1111111 |
in the early days of electric communication, made use of such knowledge. By representing the more frequent letters in common English with shorter dash-dot sequences, the average time to transmit a character is reduced in a message whose character statistics are consistent with the assumed frequencies (see Table 2).
Consider codes in which the number of bits assigned to each symbol is not fixed, and let l i denote the number of bits in the string for the i th symbol si. In such a variable length code, it makes sense to assign shorter bit strings to symbols that tend to occur more frequently than average in typical use. Hence, an efficient code can be designed by making l i a function of a
Table 2. MORSE CODE
| Symbol | Code | Symbol | Code | Symbol | Code |
| A | . - | N | - . | 0 | - - - - - |
| B | -. . . | O | - - - | 1 | . - - - - |
| C | -. - . | P | . - - . | 2 | . . - - - |
| D | -. . | Q | - -. - | 3 | . .. - - |
| E | . | R | . - . | 4 | . .. . - |
| F | . . - . | S | . . . | 5 | . .. . . |
| G | - - . | T | - | 6 | -. .. . |
| H | . .. . | U | . . - | 7 | - -. . . |
| I | . . | V | . .. - | 8 | - - -. . |
| J | . - - - | W | . - - | 9 | - - - - . |
| K | -. - | X | -. . - | period | . -. -. - |
| L | . -. . | Y | -. - - | comma | - -. . - - |
| M | - - | Z | - -. . |
symbol's probability,
pi. Let the efficiency of a code be measured as the average number of bits per symbol,
Lavg, weighted by the probabilities [Eq. 1].
The example that follows illustrates the increase in efficiency offered by a variable length code. Consider a set of four symbols a, b, c, and d with corresponding probabilities p a 0.5, p b 0.25, p c pd 0.125. Two codes are listed in Table 3, with the average lengths for each computed according to Equation 1. Note that the average length computed for Code II depends on the probability distribution, whereas the average number of bits per symbol for an n -bit block code is obviously n, regardless of the probabilities. Which code would be more efficient if the four symbols all have the same probability? (They would be equally efficient, both two bits long for each symbol.)
A potential problem with variable length codes is that an encoded string of symbols may not be uniquely decodeable; that is, there may be more than one interpretation for a sequence of bits. For example, let the symbol set {a,b,c} be encoded as a 0, b 10, c 01. The sequence 010 could be interpreted as either ab or ca, thus this code is not uniquely decodeable. This problem does not occur with all variable length codes. Huffman Codes are uniquely decodeable codes that are generated based on symbol probabilities.
The entropy of a probability distribution (denoted H and defined in Equation 2) is the lower bound for L avg. That is, for a given set of probabilities p 1, p 2,… p K, the most efficient uniquely decodeable code must satisfy: 
Robustness
A principle of redundancy underlies the design of error correcting codes. By using more bits than are actually required to represent a set of symbols uniquely, a more robust code can be generated. If the code is designed such that any two legal codewords differ in at least 3 bits, then the result of "flipping" the value of any bit (that is, converting a 1 to a 0 or vice versa) will result in a string that remains closer to the original than it is to any other codeword. Similarly, if the minimum distance is 5 bits, double errors can be reliably corrected, with a 7-bit minimum distance, triple errors can be corrected, etc. A very simple illustration of this principle is the case of two symbols. In each of the four codes in Table 4, the symbol A is represented by a set of 0s, and B is represented by a block of 1s. For codes of increasing blocksize, more errors can be tolerated. For example, in the case of the 5-bit double-error correcting code, the received sequence 10100 would be interpreted as an A.
Table 4. SIMPLE ERROR CORRECTING CODES
| Symbol | Unique encoding | Single error correcting | Double error correcting | Triple error correcting |
| A | 0 | 000 | 00000 | 0000000 |
| B | 1 | 111 | 11111 | 1111111 |
Table 3. TWO CODES ON FOUR SYMBOLS
| Symbol | Probability | Code I | Code II |
| a | 0.5 | 00 | 0 |
| b | 0.25 | 01 | 10 |
| c | 0.125 | 10 | 110 |
| d | 0.125 | 11 | 111 |
| Average Length | | 2 | 1.75 |
Table 5. A 7-BIT HAMMING
| CODE |
| Symbol | Codeword | Symbol | Codeword |
| A | 0000000 |
| B | 0001011 |
| C | 0010110 |
| D | 0011101 |
| E | 0100111 |
| F | 0101100 |
| G | 0110001 |
| H | 0111010 |
| I | 1000101 |
| J | 1001110 |
| K | 1010011 |
| L | 1011000 |
| M | 1100010 |
| N | 1101001 |
| O | 1110100 |
| P | 1111111 |
The codes noted in Table 4 are inefficient, in that they require many bits per symbol. Even the single error correcting code in Table 4 uses three times as many bits than are necessary without error correction. Much more efficient error-correcting codes are possible. The code in Table 5 is an example of a family of codes developed by Richard Hamming. It is a representation of a set of 16 symbols using 7 bits per symbol. While 4 bits would be sufficient to represent each of the symbols uniquely, this 7-bit code designed by Hamming guarantees the ability to correct a single bit error in any 7-bit block. The Hamming code is designed such that any two codewords are different in at least 3 bits. Hence, if one bit is altered by a storage or transmission error, the resulting bit string is still closer to the original codeword than it is to any of the other 15 symbols. Thus, this code is robust to single errors. Note that if two errors occur in the same block, the code fails.
Conclusion
The primary framework for symbolic representation of information is human language, which has evolved over a period spanning more than 100,000 years. But only the past century has seen the application of mathematical principles to the design of encoding schemes. In combination with high-speed electronic signal transmission, the result is a system that enables communication with efficiency, reliability, and range that would have been inconceivable a few generations ago. Ongoing improvements in high-density magnetic and optical storage media have brought about a tremendous reduction in the physical space required to store information, thus amplifying the utility of these recently developed encoding techniques.
Paul Munro
Bell Labs; Binary Number System; Codes.
Bibliography
Lucky, Robert W. Silicon Dreams. New York: St. Martin's Press, 1989.
Wells, Richard B. Applied Coding and Information Theory for Engineers. Upper Saddle River, NJ: Prentice Hall, 1999.
This complete Coding Techniques contains 1,292 words. This
article contains 1,312 words (approx. 4 pages at 300
words per page).