MULTIPLE MEMORY BIT/CHIP FAILURE DETECTION
CROSS REFERENCE TO CO-PENDING APPLICATIONS
The present application is a continuation of U.S. Ser. No. 08/233,811, filed Apr. 26, 1994, and which is related to U.S. patent application Ser. No. 08/225,891, filed Apr. 11, 1994, entitled "Control Store Built-In-Self-Test", and U.S. patent application Ser. No. 07/978,093, filed Nov. 17, 1992, entitled "Continuous Embedded Parity Checking for Error Detection in Memory Structures", and U.S. patent application Ser. No. 08/173,408, filed Dec. 23, 1993, entitled "Micro Engine Dialogue Interface", all assigned to the assignee of the present invention and all incorporated herein by reference.
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention is generally related to general purpose, stored program, digital computers and more particularly relates to efficient means for performing memory error detection.
2. Description of the Prior Art
A key design element of high reliability computer systems is that of error detection and correction. It has long been recognized that the integrity of the data bits within the computer system is critical to ensure the accuracy of operations performed in the data processing system. The alteration of one data bit in a data word can dramatically affect arithmetic calculations or can change the meaning of a data word as interpreted by other sub-systems within the computer system. Often the cause of an altered data bit is traced to a faulty memory element within the computer system and therefore it is critical that error detection be performed on the memory elements.
One method for performing error detection on the memory elements is to associate an additional bit, called a "parity bit", along with the binary bits comprising an addressable word. This method involves summing without carry the bits representing a "one" within a data word and providing an additional "parity bit" so that the total number of "ones" across the data word including the added parity bit is either odd or even. The term "Even Parity" refers to a parity mechanism which provides an even number of ones across the data word including the parity bit. Similarly, the term "Odd Parity" refers to a parity mechanism which provides an odd number of ones across the data word including the parity bit.
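The parity computation described above can be sketched in a few lines. This is an illustration only, not circuitry taken from the present disclosure: the function name `parity_bit`, the default 8-bit word width, and the `even` flag are assumptions made for the example. "Summing without carry" is modeled as a running exclusive-OR of the data bits.

```python
def parity_bit(word: int, width: int = 8, even: bool = True) -> int:
    """Return the parity bit for the low `width` bits of `word`.

    Hypothetical helper: the name, the default width, and the `even`
    flag are illustrative assumptions.
    """
    ones = 0
    for i in range(width):
        # Summing without carry is a running XOR of the data bits.
        ones ^= (word >> i) & 1
    # `ones` is 1 when the data word holds an odd number of ones.
    # Even parity adds a 1 in that case so the total becomes even;
    # odd parity is the complement.
    return ones if even else ones ^ 1
```

For a data word such as 0b10110000, which holds three ones, the even parity bit is 1 and the odd parity bit is 0.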
A typical system which uses parity as an error detection mechanism has a parity generation circuit for generating the parity bit. When the system stores a data word into memory, the parity generation circuit generates a parity bit from the data word and the system stores both the data word and the corresponding parity bit into an address location in a memory. When the system reads the address location where the data word is stored, both the data word and the corresponding parity bit are read from the memory. The parity generation circuit then regenerates the parity bit from the data bits read from the memory device and compares the regenerated parity bit with the parity bit that is stored in memory. If the regenerated parity bit and the original parity bit do not compare, an error is detected and the system is notified.
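The store and read sequence just described can be modeled in software as follows. This is a minimal sketch under stated assumptions: even parity, an 8-bit word, and a dictionary standing in for the memory. The names `ParityMemory`, `ParityError`, and `inject_fault` are hypothetical and exist only so that the mismatch path can be exercised.

```python
class ParityError(Exception):
    """Raised when the regenerated parity bit differs from the stored one."""


class ParityMemory:
    """Toy model of a memory that stores one parity bit with each word."""

    def __init__(self, width: int = 8):
        self.width = width
        self.cells = {}  # address -> (data word, stored parity bit)

    def _generate(self, word: int) -> int:
        # Even parity (assumed): 1 when the word holds an odd number of ones.
        mask = (1 << self.width) - 1
        return bin(word & mask).count("1") & 1

    def store(self, address: int, word: int) -> None:
        # The data word and its parity bit share one address location.
        self.cells[address] = (word, self._generate(word))

    def read(self, address: int) -> int:
        word, stored_parity = self.cells[address]
        # Regenerate parity from the data bits read and compare.
        if self._generate(word) != stored_parity:
            raise ParityError(f"parity mismatch at address {address:#x}")
        return word

    def inject_fault(self, address: int, bit: int) -> None:
        # Hypothetical test hook: flip one stored data bit.
        word, stored_parity = self.cells[address]
        self.cells[address] = (word ^ (1 << bit), stored_parity)
```

Storing a word and reading it back returns the word unchanged; after `inject_fault` flips a single stored bit, the next `read` of that address raises `ParityError`, which corresponds to the system being notified.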
It is readily known that a single parity bit in conjunction with a multiple bit data word can detect a single bit error within the data word. However, it is also readily known that a single parity bit in conjunction with a multiple bit data word can be defeated by multiple errors within the data word. As calculation rates increase, circuit sizes decrease, and voltage levels of internal signals decrease, the likelihood of multiple errors within a data word increases. Therefore, methods to detect multiple errors within a data word are essential.
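The limitation can be checked exhaustively for a small word. The sketch below assumes an 8-bit data word and models an error as an exclusive-OR mask applied to the data bits, with the stored parity bit left intact; both function names are hypothetical. Because the parity of a corrupted word equals the parity of the original word combined with the parity of the mask, whether an error is detected depends only on the mask.

```python
def parity(word: int) -> int:
    """Parity of the ones in `word`: 1 if the count is odd, else 0."""
    return bin(word).count("1") & 1


def undetected_patterns(width: int = 8) -> list:
    """Nonzero error masks that leave a single parity bit unchanged.

    Assumption: the error touches only the data bits, not the stored
    parity bit.
    """
    return [mask for mask in range(1, 1 << width) if parity(mask) == 0]
```

Under these assumptions, 127 of the 255 nonzero error masks for an 8-bit word go undetected, namely every mask that flips an even number of bits, while every mask that flips an odd number of bits, including each single bit error, is detected.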
System designers have developed methods for detecting multiple errors within multiple bit data words by providing multiple parity bits for each multiple bit data word. Although this technique has been successfully used, it significantly increases the overhead required to perform error detection because the parity generation circuit is more complex and the additional parity bits must be stored along with each data word. It can readily be seen that each additional parity bit that is included within a system adds a significant amount of overhead to the system.
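One common arrangement of multiple parity bits, offered here only as an assumed example and not as a scheme identified in this discussion, assigns one parity bit to each fixed-size group of data bits. The sketch below uses a 32-bit word with 8-bit groups; the names `group_parity` and `storage_overhead` are hypothetical.

```python
def group_parity(word: int, width: int = 32, group: int = 8) -> list:
    """Even-parity bit for each `group`-bit slice of `word`, low slice first.

    Assumed layout: slices are contiguous and non-overlapping.
    """
    bits = []
    for start in range(0, width, group):
        chunk = (word >> start) & ((1 << group) - 1)
        bits.append(bin(chunk).count("1") & 1)
    return bits


def storage_overhead(width: int, group: int) -> float:
    """Parity bits stored per data bit."""
    n_parity = -(-width // group)  # ceiling division
    return n_parity / width
```

With these assumptions, two errors that fall in different groups change two parity bits and are detected, whereas two errors within one group still cancel. The storage cost rises from 1/32 of the data bits for a single parity bit to 4/32 for four parity bits, in addition to the wider parity generation logic.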
Parity generation techniques are also used to perform error correction within a data word. Error correction is typically performed by encoding the data word to provide error correction code bits that are stored along with the bits of the data word. Upon readout, the data bits read from the addressable memory location are again subject to the generation of the same error correction code signal pattern. The newly generated pattern is compared to the error correction code signals stored in memory. If a difference is detected, it is determined that the data word is erroneous. Depending on the encoding system utilized it is possible to identify and correct the bit position in the data word indicated as being incorrect. The system overhead for the utilization of error correction code signals is substantial. The overhead includes the time necessary to generate the error correction codes, the memory cells necessary to store the error correction codes for each corresponding data word, and the time required to perform the decode when the data word is read from memory. These represent disadvantages to the error correction code system.
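As one concrete instance of such an encoding, the sketch below uses the well-known Hamming (7,4) code. This code is chosen purely for illustration and is not identified in the present discussion as the code used by any particular system; the function names are hypothetical. Three check bits are generated from four data bits, and on readout the regenerated check bits yield a syndrome that gives the position of a single erroneous bit.

```python
def hamming74_encode(data: list) -> list:
    """Encode 4 data bits as a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code: list) -> tuple:
    """Return (corrected data bits, syndrome); syndrome 0 means no error."""
    c = list(code)
    # Regenerate each check over the same positions it covered on encode.
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        # The syndrome is the 1-based position of the single flipped bit.
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome
```

The overhead noted above is visible even in this small case: three check bits are stored for every four data bits, and both the encode and the decode add logic to the store and read paths.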
As can be seen from the previous discussion, a single parity bit system which stores the parity bit along with the data word requires the least amount of overhead within the system. The disadvantage of a single parity bit system is that only single bit failures can be detected. In prior art systems, a parity bit is typically provided for each data word that is stored within a memory device. Furthermore, the parity bit and the corresponding data word are typically stored at the same address location within the memory device. This is true even if there are multiple memory devices within the computer system. That is, each memory unit is treated independently of the others and therefore each memory unit stores both data words and the corresponding parity bits.
Typically, there are three types of errors that occur within a memory device. The first type of error is an error in any single bit within the memory device. These errors will be detected by a single parity bit being stored along with the corresponding data word as discussed above. The second type of error is an error in multiple bits in a memory device. For example, if bit 0 and bit 1 both failed, they may cancel each other out when it comes to computing the parity bit, and the error may go undetected. This is a result of the two bits being in the same "parity domain". A parity domain is defined as a group of bits from which the parity bit is being calculated. Thus some multiple bit errors in a memory device will not be detected under the single bit parity method as described above. The third type of error is a chip failure.
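The notion of a parity domain, and the reason two failed bits only "may" cancel, can be illustrated with a small model. The sketch assumes a stuck-at-zero fault model, in which a failed bit always reads as 0; that fault model and all three function names are assumptions made for the example and are not taken from the discussion above.

```python
def domain_parity(word: int, domain) -> int:
    """Even-parity bit computed over the bit positions in `domain`."""
    p = 0
    for bit in domain:
        p ^= (word >> bit) & 1
    return p


def read_with_stuck_at_zero(word: int, failed_bits) -> int:
    """Value observed on readout when the failed bits always read as 0."""
    for bit in failed_bits:
        word &= ~(1 << bit)
    return word


def error_detected(word: int, failed_bits, domain) -> bool:
    """Compare parity stored at write time with parity regenerated on read."""
    stored_parity = domain_parity(word, domain)
    observed = read_with_stuck_at_zero(word, failed_bits)
    return domain_parity(observed, domain) != stored_parity
```

With bit 0 and bit 1 failed and both in the same domain, a stored word in which both bits are 1 reads back with two changed bits and matching parity, so the error is missed. A stored word in which only bit 0 is 1 reads back with one changed bit, and the error is detected.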