|Publication number||US7398427 B2|
|Application number||US 10/887,524|
|Publication date||Jul 8, 2008|
|Filing date||Jul 8, 2004|
|Priority date||Jul 8, 2004|
|Also published as||US7681083, US20060010355, US20080189577|
|Inventors||Richard Louis Arndt, Patrick Allen Buckland, Gregory Michael Nordstrom, Steven Mark Thurber|
|Original Assignee||International Business Machines Corporation|
The present application is related to co-pending applications entitled “ISOLATION OF INPUT/OUTPUT ADAPTER DIRECT MEMORY ACCESS ADDRESSING DOMAINS”, Ser. No. 10/887,522, and “ISOLATION OF INPUT/OUTPUT ADAPTER INTERRUPT DOMAINS”, Ser. No. 10/887,522, all filed on even date herewith. All the above related applications are assigned to the same assignee and are incorporated herein by reference.
1. Technical Field
The present invention relates generally to the data processing field and, more particularly, to a method, apparatus and system for isolating input/output adapter error domains in a data processing system.
2. Description of Related Art
In a server environment, it is important to be able to isolate input/output adapters (IOAs) so that an IOA can only obtain access to the resources which are allocated to it. Isolating IOAs from one another is important to create a system that is robust from a reliability and availability standpoint, and is especially important in a logical partitioned (LPAR) data processing system, so that IOAs, or parts of IOAs, can be allocated on an individual basis to different LPAR partitions.
In particular, in an LPAR data processing system, multiple operating systems or multiple copies of a single operating system are run on a single data processing system platform. Each operating system or operating system copy executing within the data processing system is assigned to a different logical partition, and each partition is allocated a non-overlapping subset of the resources of the platform. Thus, each operating system or operating system copy directly controls a distinct set of allocatable resources within the platform.
Among the platform resources that may be allocated to different partitions in an LPAR data processing system are regions of system memory and IOAs or parts of IOAs. Thus, different regions of system memory and different IOAs or parts of IOAs may be assigned to different partitions of the system. In such an environment, it is important that the platform provide a mechanism to enable an error occurring as a result of an operation with an IOA to be isolated to the particular partition to which the IOA is assigned. For example, for peripheral component interconnect (PCI) busses, if one IOA activates the System Error (SERR) signal on the bus, it is impossible to determine which IOA activated the signal, since it is a shared signal. In such a situation, where the error is not isolated, the system hardware must ensure that all partitions see the same error, and this requirement is contrary to the definition and intent of logical partitioning.
One solution that addresses the PCI problem is to assign all IOAs under one PCI Host Bridge (PHB) to the same LPAR partition. However, doing so results in a granularity that is not very usable. Ideally, a user should be able to assign IOAs to different partitions regardless of which PHB the IOA falls under.
Currently, error isolation between IOAs is accomplished by using unique, specially designed bridge chips that are located externally of the PCI Host Bridge (PHB). These external bridge chips include Enhanced Error Handling (EEH) technology (see, for example, commonly assigned application entitled “ISOLATION OF I/O BUS ERRORS TO A SINGLE PARTITION IN AN LPAR ENVIRONMENT”, Ser. No. 09/589,664), which is effective in preventing errors generated by one IOA from affecting partitions other than the partition to which the IOA is assigned. Such unique bridge chips, however, are relatively expensive and preclude the use of less costly, industry standard bridges in the data processing system.
It would, accordingly, be advantageous to provide for isolation of input/output adapter error domains in a data processing system without requiring the use of expensive, unique bridge chips.
The present invention provides a method, apparatus and system for isolating input/output adapter error domains in a data processing system. Errors occurring in one input/output adapter are isolated from other input/output adapters of the data processing system by functionality in a host bridge that connects the input/output adapters to a system bus of the data processing system, thus permitting the use of low cost, industry standard switches and bridges external to the host bridge.
The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
With reference now to the figures,
Data processing system 100 is a logical partitioned (LPAR) data processing system; however, it should be understood that the invention is not limited to an LPAR system but can also be implemented in other data processing systems. LPAR data processing system 100 has multiple heterogeneous operating systems (or multiple copies of a single operating system) running simultaneously. Each of these multiple operating systems may have any number of software programs executing within it. Data processing system 100 is logically partitioned such that different PCI input/output adapters (IOAs) 120, 121, 122, 123 and 124, graphics adapter 148 and hard disk adapter 149, or parts thereof, may be assigned to different logical partitions. In this case, graphics adapter 148 provides a connection for a display device (not shown), while hard disk adapter 149 provides a connection to control hard disk 150.
Thus, for example, suppose data processing system 100 is divided into three logical partitions, P1, P2, and P3. Each of PCI IOAs 120-124, graphics adapter 148, hard disk adapter 149, each of host processors 101-104, and memory from local memories 160-163 is assigned to one of the three partitions. In this example, memories 160-163 may take the form of dual in-line memory modules (DIMMs). DIMMs are not normally assigned on a per DIMM basis to partitions. Instead, a partition will get a portion of the overall memory seen by the platform. For example, processor 101, some portion of memory from local memories 160-163, and PCI IOAs 121, 123 and 124 may be assigned to logical partition P1; processors 102-103, some portion of memory from local memories 160-163, and PCI IOAs 120 and 122 may be assigned to partition P2; and processor 104, some portion of memory from local memories 160-163, graphics adapter 148 and hard disk adapter 149 may be assigned to logical partition P3.
Each operating system executing within a logically partitioned data processing system 100 is assigned to a different logical partition. Thus, each operating system executing within data processing system 100 may access only those IOAs that are within its logical partition. For example, one instance of the Advanced Interactive Executive (AIX) operating system may be executing within partition P1, a second instance (copy) of the AIX operating system may be executing within partition P2, and a Linux or OS/400 operating system may be operating within logical partition P3.
Peripheral component interconnect (PCI) host bridges (PHBs) 130, 131, 132 and 133 are connected to I/O bus 112 and provide interfaces to PCI local busses 140, 141, 142 and 143, respectively. PCI IOAs 120-121 are connected to PCI local bus 140 through I/O fabric 180, which comprises switches and bridges. In a similar manner, PCI IOA 122 is connected to PCI local bus 141 through I/O fabric 181, PCI IOAs 123 and 124 are connected to PCI local bus 142 through I/O fabric 182, and graphics adapter 148 and hard disk adapter 149 are connected to PCI local bus 143 through I/O fabric 183. The I/O fabrics 180-183 provide interfaces to PCI busses 140-143 and will be described in greater detail hereinafter. A typical PCI host bridge will support between four and eight IOAs (for example, expansion slots for add-in connectors). Each PCI IOA 120-124 provides an interface between data processing system 100 and input/output devices such as, for example, other network computers, which are clients to data processing system 100.
PCI host bridge 130 provides an interface for PCI bus 140 to connect to I/O bus 112. This PCI bus also connects PCI host bridge 130 to service processor mailbox interface and ISA bus access pass-through logic 194 and I/O fabric 180. Service processor mailbox interface and ISA bus access pass-through logic 194 forwards PCI accesses destined to the PCI/ISA bridge 193. NVRAM storage 192 is connected to the ISA bus 196. Service processor 135 is coupled to service processor mailbox interface and ISA bus access pass-through logic 194 through its local PCI bus 195. Service processor 135 is also connected to processors 101-104 via a plurality of JTAG/I2C busses 134. JTAG/I2C busses 134 are a combination of JTAG/scan busses (see IEEE 1149.1) and Philips I2C busses. However, alternatively, JTAG/I2C busses 134 may be replaced by only Philips I2C busses or only JTAG/scan busses. All SP-ATTN signals of the host processors 101, 102, 103, and 104 are connected together to an interrupt input signal of the service processor. The service processor 135 has its own local memory 191, and has access to the hardware OP-panel 190.
When data processing system 100 is initially powered up, service processor 135 uses the JTAG/I2C busses 134 to interrogate the system (host) processors 101-104, memory controller/cache 108, and I/O bridge 110. At completion of this step, service processor 135 has an inventory and topology understanding of data processing system 100. Service processor 135 also executes Built-In-Self-Tests (BISTs), Basic Assurance Tests (BATs), and memory tests on all elements found by interrogating the host processors 101-104, memory controller/cache 108, and I/O bridge 110. Any error information for failures detected during the BISTs, BATs, and memory tests is gathered and reported by service processor 135.
If a meaningful/valid configuration of system resources is still possible after taking out the elements found to be faulty during the BISTs, BATs, and memory tests, then data processing system 100 is allowed to proceed to load executable code into local (host) memories 160-163. Service processor 135 then releases host processors 101-104 for execution of the code loaded into local memory 160-163. While host processors 101-104 are executing code from respective operating systems within data processing system 100, service processor 135 enters a mode of monitoring and reporting errors. The types of items monitored by service processor 135 include, for example, the cooling fan speed and operation, thermal sensors, power supply regulators, and recoverable and non-recoverable errors reported by processors 101-104, local memories 160-163, and I/O bridge 110.
Service processor 135 is responsible for saving and reporting error information related to all the monitored items in data processing system 100. Service processor 135 also takes action based on the type of errors and defined thresholds. For example, service processor 135 may take note of excessive recoverable errors on a processor's cache memory and decide that this is predictive of a hard failure. Based on this determination, service processor 135 may mark that resource for deconfiguration during the current running session and future Initial Program Loads (IPLs). IPLs are also sometimes referred to as a “boot” or “bootstrap”.
Data processing system 100 may be implemented using various commercially available computer systems. For example, data processing system 100 may be implemented using an IBM eServer iSeries Model 840 system available from International Business Machines Corporation. Such a system may support logical partitioning using an OS/400 operating system, which is also available from International Business Machines Corporation.
Those of ordinary skill in the art will appreciate that the hardware depicted in
With reference now to
Additionally, these partitions also include partition firmware 211, 213, 215, and 217. Partition firmware 211, 213, 215, and 217 may be implemented using initial boot strap code, IEEE-1275 Standard Open Firmware, and runtime abstraction software (RTAS), which is available from International Business Machines Corporation. When partitions 203, 205, 207, and 209 are instantiated, a copy of boot strap code is loaded onto partitions 203, 205, 207, and 209 by platform firmware 210. Thereafter, control is transferred to the boot strap code with the boot strap code then loading the open firmware and RTAS. The processors associated or assigned to the partitions are then dispatched to the partition's memory to execute the partition firmware.
Partitioned hardware 230 includes a plurality of processors 232-238, a plurality of system memory units 240-246, a plurality of IOAs 248-262, and a storage unit 270. Each of the processors 232-238, memory units 240-246, NVRAM storage 298, and IOAs 248-262, or parts thereof, may be assigned to one of multiple partitions within logical partitioned platform 200, each of which corresponds to one of operating systems 202, 204, 206, and 208.
Partition management firmware 210 performs a number of functions and services for partitions 203, 205, 207, and 209 to create and enforce the partitioning of logical partitioned platform 200. Partition management firmware 210 is a firmware implemented virtual machine identical to the underlying hardware. Thus, partition management firmware 210 allows the simultaneous execution of independent OS images 202, 204, 206, and 208 by virtualizing the hardware resources of logical partitioned platform 200.
Service processor 290 may be used to provide various services, such as processing of platform errors in the partitions. These services also may act as a service agent to report errors back to a vendor, such as International Business Machines Corporation. Operations of the different partitions may be controlled through a hardware management console, such as hardware management console 280. Hardware management console 280 is a separate data processing system from which a system administrator may perform various functions including reallocation of resources to different partitions.
In an LPAR environment, it is not permissible for resources or programs in one partition to affect operations in another partition. Furthermore, to be useful, the assignment of resources needs to be fine-grained. For example, it is often not acceptable to assign all IOAs under a particular PHB to the same partition, as that will restrict configurability of the system, including the ability to dynamically move resources between partitions.
Accordingly, some functionality is needed in the bridges that connect IOAs to the I/O bus so as to be able to assign resources, such as individual IOAs or parts of IOAs to separate partitions; and, at the same time, prevent the assigned resources from affecting other partitions such as by obtaining access to resources of the other partitions.
Unique bridge chip 308 includes a terminal bridge for each IOA. In particular, IOA 302 is connected to terminal bridge 312 by PCI bus 322, and IOA 304 is connected to terminal bridge 314 by PCI bus 324. Terminal bridges 312 and 314 contain endpoint states of IOAs 302 and 304, respectively, and serve to isolate IOAs 302 and 304 from one another.
In resource isolation system 300 illustrated in
As will become apparent hereinafter, a PE as defined herein also comprises an input/output unit that is something more or something less than a single IOA. For example, a PE also comprises a plurality of IOAs that function together and, thus, that should be assigned as a unit to a single partition. A PE can also comprise a portion of a single IOA, for example, two ports of a chip that perform as separately configurable functions. If the two ports provide separate functions, they are capable of being separately assigned to different partitions; and, thus, each port may be defined as a separate PE. In general, a PE is defined by its function rather than by its structure.
The present invention utilizes the concept of a PE to provide a resource isolation system in which the isolation functionality is moved from a unique bridge chip located externally of the PHB, such as in system 300 in
I/O fabric 460 includes PCI bridge 462 and switches 464 and 466, and is connected to PHB 450 by local PCI bus 410 that connects switch 466 to PHB 450, and to PEs 402, 404, 406 and 408 by various secondary busses. As shown in
It should be understood that the specific configuration of I/O fabric 460 illustrated in
PE 402 and PE 406 each comprises a single IOA 412 and 416, respectively, such that IOAs 412 and 416 can each be assigned to a different partition of the data processing system. PE 404 comprises two IOAs 414 and 424 that function together and, thus, must be assigned to the same partition. PE 408 comprises three IOAs 418, 428 and 438 and bridge 448 that function together and must be assigned to the same partition.
In isolation system 400, the endpoint states of each PE, referred to herein as Partitionable Endpoint states, are located in PHB 450 in the illustrated example rather than in a unique bridge chip as in system 300 illustrated in
The ability to move the isolation functionality from a unique bridge chip to the PHB is achieved, in part, by providing a PE Domain Number that associates various domain components to the same PE. The PE Domain Number is an identifier that includes a plurality of fields that can be used to differentiate different IOAs in a PE. These fields include:
The PE Domain Number (Bus/Dev/Func number) allows for division down to the lowest level; that is, use of all of the Bus/Dev/Func fields allows separate functions of a multiple function IOA to be differentiated. In isolation systems that do not require such a fine granularity, the PE Domain Number can be defined by the Bus field alone, allowing differentiation between the PEs connected to the PHB, or by the Bus field together with either the Dev field or the Func field to permit differentiation between IOAs of a PE or differentiation between functions of an IOA in a PE that contains a multiple function IOA.
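The granularity choice described above can be illustrated with a minimal Python sketch. The class names `BusDevFunc` and `DomainMatch`, and the use of `None` to mean "field not compared", are assumptions made for illustration and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BusDevFunc:
    """Bus/Dev/Func number carried by an I/O transaction."""
    bus: int
    dev: int
    func: int


@dataclass(frozen=True)
class DomainMatch:
    """Hypothetical PE Domain Number pattern.

    The Bus field is always compared. A Dev or Func field set to None
    is not compared, which gives the coarser granularities.
    """
    bus: int
    dev: Optional[int] = None
    func: Optional[int] = None

    def matches(self, bdf: BusDevFunc) -> bool:
        if bdf.bus != self.bus:
            return False
        if self.dev is not None and bdf.dev != self.dev:
            return False
        if self.func is not None and bdf.func != self.func:
            return False
        return True


# Bus-only pattern: every device and function on bus 3 is the same PE.
bus_only = DomainMatch(bus=3)
# Full pattern: only function 1 of device 0 on bus 3 belongs to the PE.
full = DomainMatch(bus=3, dev=0, func=1)
```

With these definitions, `bus_only` accepts any transaction from bus 3, while `full` separates the two functions of a multiple function IOA.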
Among the isolation functionalities included in PHB 450 in
More particularly, the PHB should include, for example by utilizing EEH technology, the capability of stopping operations to and from a PE when an error is detected (referred to as the Stopped State). The stopping of operations should be accomplished in such a way that:
In order to achieve these objectives, the error isolation system of the present invention includes mechanisms in the PHB that provide the following isolation functionalities:
The above isolation functionalities are enabled by providing a PE State Array in the PHB. The PE State Array is accessed by the PE Number which is obtained from the PE Number field of an MMIO Domain Entry (MDE), a Translation Validation Entry (TVE) or an MSI Validation Entry (MVE) for MMIO Load/Store, normal DMA and MSI operations, respectively. When a DMA operation (normal or MSI) does not have its Bus/Dev/Func validated in the TVE or MVE, or when an ERR_FATAL or ERR_NONFATAL comes in from the I/O fabric, then a PE Lookup Table (PELT) is used to look up all possible PE Numbers that could be in the same error domain as the Bus/Dev/Func. When the PELT is used, and the Bus/Dev/Func is not found, then the hardware assumes that all PE numbers under the PHB are affected.
Validation and PE correlation for MMIO Load/Store operations begins by using an MDE Index 521, which comprises certain bits of an MMIO Load and Store address 522, to look up the PE Number 525 in MMIO Domain Table 524 in the PHB, as shown by arrow 523. Those skilled in the art will recognize that there are other ways to get an index into a table based on an address, such as base and bounds registers, base and extent registers, and so on. The PE Number is then used to access the PE State Array 516 as shown by arrow 530.
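As a rough illustration of the index lookup just described, the following Python sketch extracts an MDE Index from an MMIO address and uses it to read a PE Number from a table. The bit position and width (`MDE_INDEX_SHIFT`, `MDE_INDEX_BITS`) are assumed values chosen for the example; the specification says only that certain address bits are used.

```python
from typing import Sequence

# Assumed layout: a 4-bit MDE Index taken from address bits 24-27.
MDE_INDEX_SHIFT = 24
MDE_INDEX_BITS = 4


def mde_index(mmio_address: int) -> int:
    """Extract the MDE Index bits from an MMIO Load/Store address."""
    return (mmio_address >> MDE_INDEX_SHIFT) & ((1 << MDE_INDEX_BITS) - 1)


def pe_number_for_mmio(mmio_address: int, mmio_domain_table: Sequence[int]) -> int:
    """Look up the PE Number for an MMIO address in the MMIO Domain Table.

    Each table entry here holds just a PE Number; a real MDE may carry
    more fields.
    """
    return mmio_domain_table[mde_index(mmio_address)]


# A 16-entry table in which index 3 maps to PE Number 7.
example_table = [0, 0, 0, 7] + [0] * 12
```

The returned PE Number would then index the PE State Array, as the text describes.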
If the PE State Array indicates that an MMIO Stopped State for the PE is not set, the MMIO operation is allowed to continue. If there is an error during completion of the MMIO operation, the MMIO Stopped State is set as shown at 514, and the DMA Stopped State for the PE Number is set as shown at 513, and the operation is not allowed to continue.
If the PE State in the PE State Array indicates that the MMIO Stopped State for the PE is set, then on an MMIO Store, the data is discarded (no error signaled); and on an MMIO Load, the operation is completed with all-1's data returned. If EEH is enabled, as indicated at 515 in the PE State Array 516, for the PE Number which is the target of the MMIO, then an error is not signaled, and in that case, it is up to the device driver to recognize that all-1's may mean that an error occurred, and to run a program to determine if the all-1's is good data or not. If the EEH is not enabled for the PE Number which is the target for the MMIO, then a machine check is signaled to the processor that issued the Load.
Validation and PE Correlation for Normal DMA Read/Write operations begins with the DMA address 503 and the Bus/Dev/Func number 501 coming in on the I/O bus. The Bus/Dev/Func number uniquely identifies the PE requesting the operation. A specific TVE is selected by the address provided by the I/O operation, which comprises the PE Domain Number and the bus address. Those skilled in the art will recognize that there are several ways to get from this address provided by the PE to a unique entry in the TVT. For example, certain bits of the I/O bus address bits may be used to index into the TVT as follows: TVE Index bits 502 are used to access TVE 506 in TVT (Translation Validation Table) 507 as shown by arrow 504. The TVE contains a Bus number field and a Bus number Validate field. Optionally, it may also include Device number field and a Device number Validate field and/or a Function number field and a Function number Validate field, all of which are used to determine if the Bus/Dev/Func number 501 coming in with the transaction has valid access to the TVE that it is trying to access.
If the Bus/Dev/Func compares for the operation as shown at 509, then the address 503 is compared against Translation Control Entry (TCE) Table Size (Address Size) field of TVE 506 as shown at 509, to determine if the address is valid. (The TCE is used to translate an I/O address page number to a Real Page Number in system memory.) If it is not valid, PE Number 508 from the TVE is used to look up the PE state in the PE State Array, the Stopped States for the PE (MMIO and DMA) are set, and the operation is aborted. If valid, then PE Number 508 from the TVE is used to look up the DMA Stopped State in the PE State Array to see if the PE Number is already in the DMA Stopped State. While the PE Number is in the DMA Stopped State, all DMA operations for the PE Number are prohibited and will be aborted. While the PE Number is not in the DMA Stopped State, DMA operations for the PE Number will be allowed.
The PELT is used when there is no other valid way to get an association between a failed operation and the PE Number or PE Numbers associated with the failure. That is, it is used if there is no valid TVE or MVE associated with an operation, or if a fatal or non-fatal error message is received by the PHB from the I/O bus.
The PELT lookup is done as follows:
The PELT 520 is scanned for an entry where the Bus/Dev/Func number 518 of the PELT entry compares to the Bus/Dev/Func number 501 from the incoming PE Number (for errors, the Bus/Dev/Func number is in the ERR_FATAL or ERR_NONFATAL message, and for DMA operations that do not verify in the TVT, the Bus/Dev/Func number is in the DMA packet). The scan of the PELT may be performed by any method that performs well enough to prevent side error effects. Specifically, the MMIO and DMA queues/pipelines must be held up momentarily during the scan, so that operations affected by the lookup can be terminated, and thus stalling of pipelines must not cause additional errors in the PHB or other chips. Also, the PELT entries may have validation fields just like the TVEs, allowing the comparison of less than the full Bus/Dev/Func number.
If an entry is found in the PELT which matches the Bus/Dev/Func, then the PE Bit Array 519 field of the PELT entry specifies the PE Number or Numbers that are in the error domain for this Bus/Dev/Func number, and these are used to access the PE State Array 516 and set the appropriate MMIO stopped states 514 and DMA Stopped States 513.
For an ERR_FATAL or ERR_NONFATAL Bus/Dev/Func lookup, both the MMIO Stopped State and DMA Stopped State are set for all the PELT-specified PEs, regardless of their current state (the ERR_FATAL and ERR_NONFATAL may be from any fabric error and any operation, including an MMIO failure).
For the case where the PELT lookup is due to an invalid Bus/Dev/Func validation from the MVE or TVE validation process, for any given PE, if the DMA Stopped State is already set for the PE, then leave the MMIO and DMA Stopped States for the PE as-is, otherwise (DMA Stopped State not set) set the MMIO and DMA Stopped States for the PE.
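The PELT lookup and the two state-setting rules above can be sketched in Python as follows. This is a simplified model under stated assumptions: the PE Bit Array is represented as a set of PE Numbers, PELT entries compare the full Bus/Dev/Func number (no partial validation fields), and the names `PELTEntry`, `pelt_lookup` and `apply_pelt_error` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple

BDF = Tuple[int, int, int]  # (bus, dev, func)


@dataclass
class PEState:
    mmio_stopped: bool = False
    dma_stopped: bool = False


@dataclass
class PELTEntry:
    bdf: BDF
    pe_numbers: Set[int]  # stands in for the PE Bit Array field


def pelt_lookup(pelt: List[PELTEntry], bdf: BDF, all_pes: Iterable[int]) -> Set[int]:
    """Return the PE Numbers in the error domain of this Bus/Dev/Func."""
    for entry in pelt:
        if entry.bdf == bdf:
            return set(entry.pe_numbers)
    # No match: every PE under the PHB is assumed to be affected.
    return set(all_pes)


def apply_pelt_error(states: Dict[int, PEState], pelt: List[PELTEntry],
                     bdf: BDF, from_error_message: bool) -> None:
    """Set Stopped States for the PEs selected by the PELT.

    from_error_message=True models ERR_FATAL / ERR_NONFATAL: both states
    are set regardless of the current state. False models a failed
    Bus/Dev/Func validation: a PE already in the DMA Stopped State is
    left as-is.
    """
    for pe in pelt_lookup(pelt, bdf, states.keys()):
        state = states[pe]
        if from_error_message or not state.dma_stopped:
            state.mmio_stopped = True
            state.dma_stopped = True
```

A linear scan is used here only for brevity; the text allows any scan method that performs well enough.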
A determination is then made as to whether the PE state in the PE State Array indicates that the MMIO Stopped state for the PE is set (step 604). If it is set (Yes output of step 604), continue at error processing (step 608). If it is not set (No output of step 604), the MMIO operation is performed (step 605).
A determination is then made as to whether there was an error in performing the MMIO operation (step 606). If there was no error during completion of the MMIO operation (No output of step 606), the operation ends (step 613). If there was an error (Yes output of step 606), the MMIO Stopped State and the DMA Stopped State for the PE Number are set (step 607), and error processing is continued.
A determination is then made as to whether the operation is an MMIO Load or Store operation (step 608). If the operation is a Store (No output of step 608), then discard the Store data (step 610) and the operation ends (step 613).
If the operation is a Load (Yes output of step 608), a determination is made as to whether or not EEH is enabled in the PE State Table for the PE Number (step 609). If the EEH is not enabled for the PE Number (No output of step 609), then the operation is completed with all-1's data returned and a machine check to the processor that issued the Load (step 611). If EEH is enabled for the PE (Yes output of step 609), then the operation is completed with all-1's data returned with no error signaled (step 612). After completing step 611 or 612, the MMIO operation is complete and ends (step 613).
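Steps 604 through 613 can be summarized in a short Python sketch. It is an illustration only: the 32-bit `ALL_ONES` width, the `FabricError` and `MachineCheck` exception types, and the `do_io` callback that stands in for the actual bus access are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional

ALL_ONES = 0xFFFF_FFFF  # assumes a 32-bit Load


class FabricError(Exception):
    """Raised by do_io when the MMIO operation fails."""


class MachineCheck(Exception):
    """Models a machine check signaled to the issuing processor."""

    def __init__(self, data: int) -> None:
        super().__init__("machine check on MMIO Load")
        self.data = data  # the all-1's data returned with the check


@dataclass
class PEState:
    mmio_stopped: bool = False
    dma_stopped: bool = False
    eeh_enabled: bool = True


def mmio_access(state: PEState, is_load: bool,
                do_io: Callable[[], int]) -> Optional[int]:
    """Perform one MMIO Load or Store against a PE."""
    if not state.mmio_stopped:                 # step 604
        try:
            result = do_io()                   # step 605
            return result if is_load else None  # step 606 No -> 613
        except FabricError:
            state.mmio_stopped = True          # step 607
            state.dma_stopped = True
    if not is_load:                            # step 608
        return None                            # step 610: Store data discarded
    if not state.eeh_enabled:                  # step 609
        raise MachineCheck(ALL_ONES)           # step 611
    return ALL_ONES                            # step 612
```

When EEH is enabled, a caller that receives `ALL_ONES` cannot tell an error from valid data without a further check, which is why the text leaves that determination to the device driver.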
A determination is made as to whether this is a DMA operation or an MSI operation (step 702). This determination is made, for example, by looking at a particular bit in the DMA address. A zero is a normal DMA and a one is an MSI operation. If it is an MSI operation (No output of step 702), it is processed as an MSI operation (step 703).
If it is a DMA operation (Yes output of step 702), it is determined if the TVE index accesses past the end of the TVT (step 704). If so (Yes output of step 704), error processing is performed (step 715). If not (No output of step 704), the TVE Index Field address is used to access the TVE (step 705). A determination is then made if the Bus/Dev/Func number validates with the TVE (step 706). If it does not validate (No output of step 706), error handling is performed (step 715). If it does validate (Yes output of step 706), a determination is made whether the TVE is valid (step 707). If not valid (No output of step 707), error handling is performed (step 715). If the TVE is valid (Yes output of step 707), the address is then checked to see if it exceeds what the TVE says is valid (step 708). This is done by using the TVE Table Size (Address Size) field to determine how many of the high-order bits of the TCE Index field of the DMA address have to be zero. If the address is too large, the access is not valid (Yes output of step 708) and error handling is performed (step 715). If the TCE Table Size is zero, then the address will always be deemed to be invalid, so a value of zero can be used to mark the TVE as invalid with a good Bus/Dev/Func validation. If the access is valid (No output of step 708), the PE Number field from the TVE is used to access the PE State Array (step 709). A determination is then made as to whether or not the PE Number has its Stopped State set (step 710). If not (No output of step 710), continue. If the state is set (Yes output of step 710), error handling is performed (step 715).
The I/O page size field in the TVE is then checked to see if it is zero (step 711). If so (Yes output of step 711), the TCE access and address translation is by-passed using the number of low order address bits from the I/O bus address as specified by the TCE Table Size (Address Size) field, and appending on the appropriate number of TVE TCE Table Address (TTA) field low-order bits as the high-order bits of the real address to create enough bits to address the entire address range supported by the implementation (step 717), and the operation is allowed to continue (step 716).
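One possible reading of the bypass address construction in step 717 is sketched below in Python. The assumptions are significant: the Address Size field is taken to give a bit count directly, the implementation's real address width is taken to be 48 bits, and the function name `bypass_real_address` is hypothetical.

```python
def bypass_real_address(io_address: int, address_size_bits: int,
                        tta: int, real_address_bits: int = 48) -> int:
    """Build a real address without a TCE lookup (I/O Page Size of zero).

    The low-order address_size_bits come from the I/O bus address; the
    remaining high-order bits come from the low-order bits of the TTA field.
    """
    if not 0 < address_size_bits <= real_address_bits:
        raise ValueError("address size must fit in the real address width")
    low = io_address & ((1 << address_size_bits) - 1)
    high_bits = real_address_bits - address_size_bits
    high = tta & ((1 << high_bits) - 1)
    return (high << address_size_bits) | low
```

For example, with a 32-bit real address and a 16-bit address size, an I/O address of `0x12345678` and a TTA of `0xABCD` combine to `0xABCD5678`.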
If the I/O Page Size field in the TVE is not zero (No output of step 711), then the TTA field of the TVE is used along with the TCE Index bits of the DMA address to access the TCE for the operation (step 712).
A comparison is made with the type of DMA operation (read or write) to the TCE Page Mapping and Control field of the TCE (step 713). If the type of operation does not match, or if the Page Mapping and Control field indicates a page fault (Yes output of step 713), error handling is performed (step 715).
If the operation does match (No output of step 713), the Real Page Number field of the TCE is used along with the Page Offset field of the incoming DMA address to construct the physical address to be used to access system memory (step 714), and the operation is allowed to continue (step 716).
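The validation chain of steps 704 through 714 can be sketched as a single Python function. The sketch makes several simplifying assumptions that are not in the specification: 4 KiB I/O pages, validation against the Bus number only, a dictionary standing in for the TCE table reached through the TTA field, and a `DMAError` exception in place of the error handling of step 715 (which would also set Stopped States). The bypass path of step 711 is omitted.

```python
from dataclasses import dataclass
from typing import Dict, List

PAGE_SHIFT = 12  # assumed 4 KiB I/O page size


class DMAError(Exception):
    """Stands in for the error handling of step 715."""


@dataclass
class TCE:
    real_page: int   # Real Page Number in system memory
    readable: bool   # simplified Page Mapping and Control field
    writable: bool


@dataclass
class TVE:
    bus: int               # Bus number field (Dev/Func fields omitted)
    pe_number: int
    table_size_bits: int   # allowed TCE Index bits; 0 marks the TVE invalid
    tces: Dict[int, TCE]   # stands in for the TCE table located by the TTA


def translate_dma(tvt: List[TVE], tve_index: int, bus: int, tce_index: int,
                  page_offset: int, is_write: bool,
                  dma_stopped: Dict[int, bool]) -> int:
    if tve_index >= len(tvt):                        # step 704
        raise DMAError("TVE index past end of TVT")
    tve = tvt[tve_index]                             # step 705
    if bus != tve.bus:                               # step 706
        raise DMAError("Bus number does not validate")
    if tve.table_size_bits == 0:                     # step 707
        raise DMAError("TVE not valid")
    if tce_index >> tve.table_size_bits:             # step 708
        raise DMAError("address exceeds TCE table size")
    if dma_stopped.get(tve.pe_number, False):        # steps 709-710
        raise DMAError("PE is in the DMA Stopped State")
    tce = tve.tces.get(tce_index)                    # step 712
    allowed = tce is not None and (tce.writable if is_write else tce.readable)
    if not allowed:                                  # step 713
        raise DMAError("page fault or access type mismatch")
    offset = page_offset & ((1 << PAGE_SHIFT) - 1)
    return (tce.real_page << PAGE_SHIFT) | offset    # step 714
```

Each check raises before any later table is touched, mirroring the order of the flowchart.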
A determination is made as to whether this is a DMA or an MSI (step 802). This determination is made, for example, by looking at a bit in the DMA address. A zero is a normal DMA operation and a one is an MSI operation. If it is a normal DMA operation (Yes output of step 802), it is processed as a DMA operation as described with reference to
If it is an MSI operation (No output of step 802), a determination is then made as to whether the MVE Index Field from bits in the I/O address will access beyond the end of the MVT (step 804). If it does (Yes output of step 804), error handling is performed (step 814). If not (No output of step 804), the MVE Index is used to access the MVE (step 805), and the Bus/Dev/Func fields of the MVE are used to determine if the PE Number (as specified by the Bus/Dev/Func # in the DMA operation) has access to the MVE (step 806). If the Bus/Dev/Func number does not validate (No output of step 806), error handling is performed (step 814).
If the Bus/Dev/Func number does validate (Yes output of step 806), the MVE is then checked to see if it is valid (step 807). The MVE validity is verified by checking to make sure that the MCE Table Size (Address Size) field is non-zero. If the MVE is not valid (No output of step 807), then error handling is performed (step 814).
If the MVE is valid (Yes output of step 807), the PE Number field from the MVE is used to access the PE State Array (step 808). A determination is made as to whether or not the PE Number has its DMA Stopped State set (step 809). If not (No output of step 809), the method continues. If yes (Yes output of step 809), error handling is performed (step 814).
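The chain of checks in steps 804 through 809 can be summarized in a short sketch. The names `MVE`, `MSIError` and `validate_msi` are hypothetical. Modeling Bus/Dev/Func validation as a simple equality test is an assumption, as is representing the PE State Array by a list of DMA Stopped State flags indexed by PE Number.

```python
from dataclasses import dataclass


@dataclass
class MVE:
    bus_dev_func: int
    pe_number: int
    mce_table_size: int   # zero marks the entry as invalid (step 807)


class MSIError(Exception):
    """Raised where the text says error handling is performed (step 814)."""


def validate_msi(mvt, mve_index, requester_bdf, dma_stopped):
    """Return the MVE for an MSI if every check passes, else raise MSIError."""
    if not 0 <= mve_index < len(mvt):            # step 804
        raise MSIError("MVE index beyond end of MVT")
    mve = mvt[mve_index]                         # step 805
    if mve.bus_dev_func != requester_bdf:        # step 806 (assumed equality check)
        raise MSIError("Bus/Dev/Func does not validate")
    if mve.mce_table_size == 0:                  # step 807
        raise MSIError("MVE is not valid")
    if dma_stopped[mve.pe_number]:               # steps 808 and 809
        raise MSIError("DMA Stopped State is set for this PE")
    return mve
```

Each failing check leads to the same error path, which matches the text: every No or Yes branch that rejects the operation ends at step 814.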
The MSI number Interrupts field of the MVE is used to mask off an appropriate number of high-order DMA data bits (to determine which data bits are valid), and the result is then ORed with the MSI Table Offset field of the MVE (that is, valid bits of the data are appended to the MSI Table Offset) (step 810). The result is then used as the index into the XIVT (external Interrupt Vector Table, containing XIVEs) to get the XIVE (external Interrupt Vector Table Entry, which provides the interrupt priority and server number for routing an interrupt) (step 811).
The interrupt is then presented to the interrupt routing logic using the Server Number and Priority from the XIVE (step 812), and the MSI operation is complete (step 813).
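The index computation of step 810 and the lookup of steps 811 and 812 can be illustrated with a small sketch. It assumes that the number of interrupts is a power of two, so that the valid data bits are the low-order bits selected by `num_interrupts - 1`, and that the MSI Table Offset is aligned so the OR acts as an append. The names `XIVE`, `xivt_index` and `route_msi` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class XIVE:
    server_number: int
    priority: int


def xivt_index(msi_data, num_interrupts, msi_table_offset):
    """Mask off the high-order data bits and OR in the table offset (step 810)."""
    if num_interrupts <= 0 or num_interrupts & (num_interrupts - 1):
        raise ValueError("num_interrupts is assumed to be a power of two")
    mask = num_interrupts - 1          # keeps only the valid low-order bits
    return msi_table_offset | (msi_data & mask)


def route_msi(xivt, msi_data, num_interrupts, msi_table_offset):
    """Fetch the XIVE (step 811) and return what the router needs (step 812)."""
    xive = xivt[xivt_index(msi_data, num_interrupts, msi_table_offset)]
    return xive.server_number, xive.priority
```

For example, with eight interrupts, MSI data 0xFFE5 and a table offset of 0x40, only the low three data bits (value 5) survive the mask, and the resulting XIVT index is 0x45.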
A determination is then made as to whether the Bus/Dev/Func number is validated in the TVE or MVE (step 903). If not (No output of step 903), a PELT lookup is done (step 904). Otherwise (Yes output of step 903), the PE Number from the TVE or MVE is used to look up the DMA Stopped State information for the PE in the PE State Array (step 905).
A determination is then made as to whether the DMA Stopped State is set for the given PE Number (step 906). If it is not set (No output of step 906), both the MMIO Stopped State and the DMA Stopped State are set for the PE Number (step 907), and the error processing is complete (step 908). Otherwise (Yes output of step 906), the Stopped States are not set, and error processing is then complete (step 908).
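The per-PE error handling of steps 905 through 908 amounts to a test-and-set on the PE State Array. A minimal sketch follows, in which the `PEState` record and the function name `stop_pe_on_error` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PEState:
    mmio_stopped: bool = False
    dma_stopped: bool = False


def stop_pe_on_error(pe_state_array, pe_number):
    """Freeze a PE on its first error; return True if the states were set."""
    state = pe_state_array[pe_number]      # step 905
    if state.dma_stopped:                  # step 906: already stopped
        return False                       # step 908, nothing changed
    state.mmio_stopped = True              # step 907
    state.dma_stopped = True
    return True                            # step 908
```

Calling the function twice for the same PE sets both states on the first call and leaves them untouched on the second, so a PE that is already stopped is not modified by later errors.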
If an entry with a matching Bus/Dev/Func number is not found (No output of step 1003), then it is assumed that all error domains for all PEs under the PHB are potentially affected, and the method continues as though an entry were found in the PELT with all the PE bits set in the PE Bit Array (step 1004). If an entry is found (Yes output of step 1003), the state of each PE Number indicated in the PE Bit Array field of the matching PELT entry is looked up in the PE State Array (step 1005). Both the MMIO Stopped State and the DMA Stopped State for the given PE number in the PE State Array are set (step 1007). A check is then made as to whether all PEs have been processed (step 1008). If all PEs have been processed (No output of step 1008), the operation is complete (step 1009). If all PEs have not been processed (Yes output of step 1008), the operation returns to step 1005.
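The PELT lookup can be sketched as a loop over the PE Bit Array. In this illustration the PELT is modeled as a dictionary keyed by Bus/Dev/Func number and the PE Bit Array as an integer bitmask with bit i standing for PE Number i. Both representations, along with the names `PEState` and `stop_affected_pes`, are assumptions. The sketch sets the Stopped States unconditionally, as the text describes for step 1007.

```python
from dataclasses import dataclass


@dataclass
class PEState:
    mmio_stopped: bool = False
    dma_stopped: bool = False


def stop_affected_pes(pelt, bus_dev_func, pe_state_array):
    """Stop every PE named by the matching PELT entry; return their numbers."""
    all_pes = (1 << len(pe_state_array)) - 1
    # Steps 1003 and 1004: with no matching entry, treat all PEs as affected.
    pe_bits = pelt.get(bus_dev_func, all_pes)
    stopped = []
    for pe_number, state in enumerate(pe_state_array):   # steps 1005 and 1008
        if (pe_bits >> pe_number) & 1:
            state.mmio_stopped = True                     # step 1007
            state.dma_stopped = True
            stopped.append(pe_number)
    return stopped                                        # step 1009
```

With a PELT entry whose bit array is 0b0101, only PEs 0 and 2 are stopped. A Bus/Dev/Func number that has no entry stops every PE under the PHB, which is the conservative fallback of step 1004.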
The present invention thus provides a method, apparatus and system for isolating input/output adapter error domains in a data processing system. Errors occurring in one input/output adapter are isolated from other input/output adapters by functionality in a host bridge that connects the plurality of input/output adapters to a system bus of the data processing system, thus permitting the use of low cost, industry standard switches and bridges external to the host bridge.
It is important to note that while the present invention has been described in the context of a fully functioning data processing system, those of ordinary skill in the art will appreciate that the processes of the present invention are capable of being distributed in the form of a computer readable medium of instructions and a variety of forms and that the present invention applies equally regardless of the particular type of signal bearing media actually used to carry out the distribution. Examples of computer readable media include recordable-type media, such as a floppy disk, a hard disk drive, a RAM, CD-ROMs, DVD-ROMs, and transmission-type media, such as digital and analog communications links, wired or wireless communications links using transmission forms, such as, for example, radio frequency and light wave transmissions. The computer readable media may take the form of coded formats that are decoded for actual use in a particular data processing system.
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
|Cited Patent||Filing date||Publication date||Applicant||Title|
|US5640584||Dec 12, 1994||Jun 17, 1997||Ncr Corporation||Virtual processor method and apparatus for enhancing parallelism and availability in computer systems|
|US5771387||Mar 21, 1996||Jun 23, 1998||Intel Corporation||Method and apparatus for interrupting a processor by a PCI peripheral across an hierarchy of PCI buses|
|US6081861||Jun 15, 1998||Jun 27, 2000||International Business Machines Corporation||PCI migration support of ISA adapters|
|US6219743||Sep 30, 1998||Apr 17, 2001||International Business Machines Corporation||Apparatus for dynamic resource mapping for isolating interrupt sources and method therefor|
|US6523140 *||Oct 7, 1999||Feb 18, 2003||International Business Machines Corporation||Computer system error recovery and fault isolation|
|US6629157||Jan 4, 2000||Sep 30, 2003||National Semiconductor Corporation||System and method for virtualizing the configuration space of PCI devices in a processing system|
|US6643727||Jun 8, 2000||Nov 4, 2003||International Business Machines Corporation||Isolation of I/O bus errors to a single partition in an LPAR environment|
|US6691192||Sep 30, 2001||Feb 10, 2004||Intel Corporation||Enhanced general input/output architecture and related methods for establishing virtual channels therein|
|US6823404||Jan 23, 2001||Nov 23, 2004||International Business Machines Corporation||DMA windowing in an LPAR environment using device arbitration level to allow multiple IOAs per terminal bridge|
|US20020010811||Jan 23, 2001||Jan 24, 2002||International Business Machines Corporation||DMA windowing in an LPAR environment using device arbitration level to allow multiple IOAs per terminal bridge|
|US20020152344||Apr 17, 2001||Oct 17, 2002||International Business Machines Corporation||Method for processing PCI interrupt signals in a logically partitioned guest operating system|
|US20030172322 *||Mar 7, 2002||Sep 11, 2003||International Business Machines Corporation||Method and apparatus for analyzing hardware errors in a logical partitioned data processing system|
|US20040153853 *||Feb 25, 2003||Aug 5, 2004||Hitachi, Ltd.||Data processing system for keeping isolation between logical partitions|
|US20040225792 *||Jan 19, 2001||Nov 11, 2004||Paul Garnett||Computer system|
|US20040260981 *||Jun 19, 2003||Dec 23, 2004||International Business Machines Corporation||Method, system, and product for improving isolation of input/output errors in logically partitioned data processing systems|
|US20050081126 *||Oct 9, 2003||Apr 14, 2005||International Business Machines Corporation||Method, system, and product for providing extended error handling capability in host bridges|
|US20060010276||Jul 8, 2004||Jan 12, 2006||International Business Machines Corporation||Isolation of input/output adapter direct memory access addressing domains|
|1||*||"A Highly Available Local Leader Election Service"- http://ieeexplore.ieee.org/search/wrapper.jsp?arnumber=815321.|
|2||Arndt et al., Isolation of Input/Output Adapter Direct Memory Access Addressing Domains.|
|3||Arndt et al., Isolation of Input/Output Adapter Interrupt Domains.|
|Citing Patent||Filing date||Publication date||Applicant||Title|
|US7747809 *||Feb 19, 2008||Jun 29, 2010||International Business Machines Corporation||Managing PCI express devices during recovery operations|
|US7779293 *||Nov 30, 2004||Aug 17, 2010||Fujitsu Limited||Technology to control input/output device bridges|
|US8261128||Aug 4, 2010||Sep 4, 2012||International Business Machines Corporation||Selection of a domain of a configuration access|
|US8386679||Apr 12, 2011||Feb 26, 2013||International Business Machines Corporation||Dynamic allocation of a direct memory address window|
|US8495271||Aug 4, 2010||Jul 23, 2013||International Business Machines Corporation||Injection of I/O messages|
|US8549202||Aug 4, 2010||Oct 1, 2013||International Business Machines Corporation||Interrupt source controller with scalable state structures|
|US9336029||Aug 4, 2010||May 10, 2016||International Business Machines Corporation||Determination via an indexed structure of one or more partitionable endpoints affected by an I/O message|
|US9501308||Feb 20, 2015||Nov 22, 2016||International Business Machines Corporation||Implementing coherent accelerator function isolation for virtualization|
|US20060059389 *||Nov 30, 2004||Mar 16, 2006||Fujitsu Limited||Technology to control input/output device bridges|
|US20080148104 *||Sep 1, 2006||Jun 19, 2008||Brinkman Michael G||Detecting an Agent Generating a Parity Error on a PCI-Compatible Bus|
|US20090210607 *||Feb 19, 2008||Aug 20, 2009||International Business Machines Corporation||Managing pci express devices during recovery operations|
|WO2012016931A1||Jul 29, 2011||Feb 9, 2012||International Business Machines Corporation||Determination of one or more partitionable endpoints affected by an i/o message|
|U.S. Classification||714/43, 714/56|
|Jul 23, 2004||AS||Assignment|
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:ARNDT, RICHARD LOUIS;BUCKLAND, PATRICK ALLEN;NORDSTROM, GREGORY MICHAEL;AND OTHERS;REEL/FRAME:014893/0773;SIGNING DATES FROM 20040628 TO 20040630
|Oct 26, 2011||FPAY||Fee payment|
Year of fee payment: 4
|Sep 30, 2015||FPAY||Fee payment|
Year of fee payment: 8