Error-Detecting Codes
In certain applications where the integrity of stored or transmitted data may be compromised by security breaches, or errors may arise due to other reasons, it is important to have some method to detect that errors have occurred that need to be fixed.
An analogy can be used to illustrate the problem. If we are not sure whether certain text received--that is supposed to be prose in contemporary English--is actually error-free, then it suffices to check whether any of the words in the received text are misspelled. If we find words that are not found in the dictionary, or else are spelled in a manner different from the dictionary, then we may infer that errors have occurred. Once such an error is detected, recovery may be attempted either by requesting the sender to supply a replacement for the missing or incorrect information, or else by trying to look in the dictionary for a suitable word that would fit in the context. The first form of correction upon detection is called backward error correction, and is appropriate for brief messages transmitted over short distances. The other form is called forward error correction, and is used in case of transmissions from spacecraft visiting other planets, and such.
The most basic method for detecting errors in transmission is the addition of a single bit (called a parity) bit to each word in the message. ('Word' in this context refers to a binary word of the raw transmission.) The value of the parity bit can be either 0 or 1; the value is chosen either to make the total number of occurrences of 1 an even number (called even parity) or an odd number (odd parity). With this method, if a single bit in the word is flipped during transmission, a parity check at the received end would show the error because there would be an odd (or even) number of ones in the message and the parity bit, while the other was expected.
However, it is easy to see that the parity bit alone is not sufficient. It can only detect a change in an odd number of bits of the word; if even number of bits are flipped, the change passes unnoticed. An improvement is obtained by using the CRC (cyclic redundancy check) algorithms, which fall into the class of error detection algorithms that leave the data intact and append a checksum on the end. The basic idea of CRC algorithms is simply to treat the message as an enormous binary number, to divide it by another fixed binary number, and to make the remainder from this division the checksum. Upon receipt of the message, the receiver can perform the same division and compare the remainder with the "checksum" (transmitted remainder).
The transmitter constructs a value (called a checksum) that is a function of the message, and appends it to the message. The receiver can then use the same function to calculate the checksum of the received message and compare it with the appended checksum to see if the message was correctly received. For example, if we choose a checksum function which is simply the sum of the bytes in the message mod 256 (i.e. modulo 256), then it might go something as follows. All numbers are in decimal.
- Message: 6 23 4
- Message with checksum: 6 23 4 33
- Message after transmission: 6 27 4 33
In the above, the second byte of the message was corrupted from 23 to 27 by the communications channel. However, the receiver can detect this by comparing the transmitted checksum (33) with the computer checksum of 37 (6 + 27 + 4). If the checksum itself is corrupted, a correctly transmitted message might be incorrectly identified as a corrupted one. However, this is a safe-side failure. A dangerous-side failure occurs where the message and/or checksum is corrupted in a manner that results in a transmission that is internally consistent. Unfortunately, this possibility is completely unavoidable and the best that can be done is to minimize its probability by increasing the amount of information in the checksum (e.g. widening the checksum from one byte to two bytes).
In the checksum example in the previous section, we saw how a corrupted message was detected using a checksum algorithm that simply sums the bytes in the message mod 256; a problem with this algorithm is that it is too simple. If a number of random corruptions occur, there is a 1 in 256 chance that they will not be detected. For example:
- Message: 6 23 4
- Message with checksum: 6 23 4 33
- Message after transmission: 8 20 5 33
To strengthen the checksum, we could change from an 8-bit register to a 16-bit register (i.e. sum the bytes mod 65536 instead of mod 256) so as to apparently reduce the probability of failure from 1/256 to 1/65536. While basically a good idea, it fails in this case because the formula used is not sufficiently "random"; with a simple summing formula, each incoming byte affects roughly only one byte of the summing register no matter how wide it is. For example, in the second example above, the summing register could be a whole Megabyte wide, and the error would still go undetected. This problem can only be solved by replacing the simple summing formula with a more sophisticated formula that causes each incoming byte to have an effect on the entire checksum register.
Thus, we see that at least two aspects are required to form a strong checksum function:
- Width--a register width wide enough to provide a low enough probability of failure (e.g. 32-bits gives a 1 in 2 raised to 32 chance of failure).
- Chaos--a formula that gives each input byte the potential to change any number of bits in the register.
This is the complete article, containing 950 words
(approx. 3 pages at 300 words per page).