|Publication number||US7761588 B2|
|Application number||US 12/245,691|
|Publication date||Jul 20, 2010|
|Filing date||Oct 3, 2008|
|Priority date||Jul 16, 2004|
|Also published as||CN1722732A, CN1722732B, US7475153, US8176187, US20060013251, US20090034553, US20100217878|
|Publication number||12245691, 245691, US 7761588 B2, US 7761588B2, US-B2-7761588, US7761588 B2, US7761588B2|
|Inventors||John Lewis Hufferd|
|Original Assignee||International Business Machines Corporation|
This application is a continuation of U.S. patent application Ser. No. 10/893,213, filed on Jul. 16, 2004, which issued as U.S. Pat. No. 7,475,153, and which patent is incorporated herein by reference in its entirety.
1. Field of the Invention
The present invention relates to a method, system, and program for enabling communication between nodes.
2. Description of the Related Art
In storage environments, data access commands are communicated from a host system to a storage controller, which manages access to the disks. The storage controller may be a card inside the host system or a separate device. The Internet Small Computer Systems Interface (iSCSI) protocol is used for storage networks that utilize Ethernet connections, including Ethernet switches and routers. The term “iSCSI” as used herein refers to the syntax and semantics of the iSCSI protocol defined by the IETF (Internet Engineering Task Force) standards body, and any variant of that protocol. In current storage networks where iSCSI is utilized, the packet configuration comprises an Ethernet package encapsulating Internet Protocol (IP) and Transmission Control Protocol (TCP) package layers, which further encapsulate an iSCSI package that includes one or more SCSI commands. The Ethernet protocol provides for link-level error checking as the packets flow from point to point on any network segment (link) to determine whether data has been corrupted while passing on a link. In network data transmission operations, an initiator device transmits data or commands over the network to a target device. The TCP/IP package includes an error detection code to perform end-to-end checking to determine at the opposite end whether the transmitted packet has changed during the transmission as the packet passes through switches and routers. A receiving device detecting an error will send a negative acknowledgment to the sending device to request retransmission of those packets in which errors were detected.
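The layering described above may be pictured with a minimal sketch. The `Layer` class and the `encapsulate` and `layer_names` helpers are hypothetical names introduced only for illustration; the sketch models nesting order alone and does not reproduce any real header format or error detection code.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """One protocol layer; payload is the next inner Layer or raw bytes."""
    name: str
    payload: object


def encapsulate(scsi_command: bytes) -> Layer:
    """Wrap a SCSI command so that Ethernet ends up as the outermost layer."""
    packet = Layer("SCSI", scsi_command)
    # Each iteration wraps the packet built so far in the next outer layer.
    for name in ("iSCSI", "TCP", "IP", "Ethernet"):
        packet = Layer(name, packet)
    return packet


def layer_names(packet: Layer) -> list:
    """List layer names from the outermost layer to the innermost."""
    names = []
    while isinstance(packet, Layer):
        names.append(packet.name)
        packet = packet.payload
    return names


# Illustrative 10-byte command block; the contents are placeholder values.
packet = encapsulate(b"\x28" + bytes(9))
print(" > ".join(layer_names(packet)))
```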
The Remote Direct Memory Access (RDMA) protocol enables one network node to directly place information in another network node's memory with minimal demands on memory bus bandwidth and processor overhead. RDMA over TCP/IP (also known as iWARP) defines the interoperable protocols to support RDMA operations over standard TCP/IP networks. An RDMA Network Interface Card (RNIC) implements the RDMA protocol and performs RDMA operations to transfer data to local and remote memories. Further details of the RDMA protocol are described in the specifications entitled “RDMA Protocol Verbs Specification (Version 1.0)”, published by the RDMA Consortium (April, 2003); “Direct Data Placement over Reliable Transports (Version 1.0)”, published by RDMA Consortium (October 2002); and “Marker PDU Aligned Framing for TCP Specification (Version 1.0)”, published by the RDMA Consortium (October 2002), and which specifications are incorporated herein by reference in their entirety.
One specification entitled “iSCSI Extensions for RDMA Specification (Version 1.0)”, by Michael Ko et al., released by the RDMA Consortium (July, 2003), which specification is incorporated herein in its entirety, defines a protocol for providing the RDMA data transfer capabilities to iSCSI by layering iSCSI on top of RDMA.
Many of the features defined as part of RDMA over TCP/IP, also known as iWARP, were previously defined as operations in an InfiniBand network. The InfiniBand adaptor hardware supports RDMA operations. InfiniBand also defines a set of protocols called Socket Direct Protocols (SDP) that allow a normal TCP/IP socket application to send a message across an InfiniBand network, in the same manner they would if they were operating on a TCP/IP network. Further details of the InfiniBand and SDP protocols are described in the publication “InfiniBand™ Architecture, Specification Volume 1”, Release 1.1 (November, 2002, Copyright InfiniBand™ Trade Association), which publication is incorporated herein in its entirety.
Referring now to the drawings in which like reference numbers represent corresponding parts throughout:
Provided are a method, system, and program performed at a local node to communicate with a remote node. A first communication protocol is used to communicate with the remote node to establish a connection for a second communication protocol. Data structures are created to enable communication with the remote node to establish the connection with the remote node for the second communication protocol. An extension layer is invoked for the second communication protocol. The data structures are passed to the extension layer to use to communicate with the remote node using the second communication protocol.
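The handoff summarized above may be sketched as follows. This is only an illustration under stated assumptions: `ConnectionResources`, `ExtensionLayer`, `establish`, and the `first_protocol_login` callback are hypothetical names, and the negotiated values are placeholders rather than anything defined by the described embodiments.

```python
from dataclasses import dataclass, field


@dataclass
class ConnectionResources:
    """Hypothetical stand-in for the data structures created during setup."""
    remote_node: str
    negotiated: dict = field(default_factory=dict)


class ExtensionLayer:
    """Hypothetical extension layer for the second communication protocol."""

    def __init__(self):
        self.resources = None

    def attach(self, resources: ConnectionResources) -> None:
        # The data structures built during setup are passed in here.
        self.resources = resources

    def send(self, message: str) -> str:
        if self.resources is None:
            raise RuntimeError("no connection resources have been handed over")
        return f"to {self.resources.remote_node}: {message}"


def establish(remote_node: str, first_protocol_login) -> ExtensionLayer:
    """Negotiate with the first protocol, then hand over to the extension layer."""
    negotiated = first_protocol_login(remote_node)
    resources = ConnectionResources(remote_node, negotiated)
    extension = ExtensionLayer()
    extension.attach(resources)
    return extension


# Usage with a placeholder login function standing in for the first protocol.
layer = establish("target-1", lambda node: {"rdma": True})
print(layer.send("hello"))
```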
In the following description, reference is made to the accompanying drawings which form a part hereof and which illustrate several embodiments of the present invention. It is understood that other embodiments may be utilized and structural and operational changes may be made without departing from the scope of the present invention.
The nodes 2 a, 2 b . . . 2 n in
In cases where the adaptor 24 comprises an InfiniBand adaptor, the node 2 may include a Sockets Direct Protocol (SDP) layer 42, such that the socket layer 20 interfaces with the SDP layer 42 and the SDP layer 42 interfaces between the sockets layer 20 and the RDMA layer 26. In InfiniBand embodiments, the SDP layer 42 provides an interface between an application 14 making calls using the socket layer 20 and the RDMA layer 26 in the adaptor 24, by implementing the socket layer 20 calls from the iSCSI layer 18 through RDMA calls to the RDMA layer 26 (either directly or via the IB adapter driver 44). In both InfiniBand and RNIC embodiments, an iSER layer 22 is provided, such that after login, the iSCSI layer 18 would call an iSER layer 22 to make calls to the RNIC 24. The iSER layer 22 may call the RNIC 24 directly through function calls or through the RNIC driver 44 comprising an RDMA verb layer. In embodiments where the adaptor 24 comprises an RNIC adaptor, the node 2 may not include the SDP layer 42, whereas in InfiniBand adaptor embodiments, an SDP layer 42 is included.
The RDMA layer 26 may directly access registered memory locations in the initiator and target nodes (locally or locally and remotely) in a logically contiguous fashion. A defined memory location, such as a memory region or memory window, is identified by a steering tag created by the RDMA layer 26 and used to reference the registered memory location, such as memory regions 32. In RNIC implementations, the steering tag is referred to as an STag and in InfiniBand embodiments, the steering tags are referred to as an R_Key for a Remote steering tag and as an L_Key for a Local steering tag (the generic term that is used here for both is #_Key). In certain embodiments, a memory region or subset of a memory region referred to as a memory window may be registered, where a separate STag/#_key would be associated with each registered memory location, region or window. The RDMA layer 26 uses the STag/#_key to access the referenced memory location. In certain embodiments, the iSER layer 22 would call the adaptor 24 to register the memory regions by calling the RDMA verb layer 44. The RDMA verb layer 44 (RNIC/IB adapter driver) comprises the device driver to interface the operating system 8 with the adaptor 24. In response to the call from the function in the iSER layer 22 or SDP layer 42 to declare and register a memory location, e.g., memory region or window, the adapter driver 44 would call the adaptor 24.
The RDMA layer 26 maintains a memory translation table 34, and when registering a memory region, would add an entry to the memory translation table 34 identifying the registered memory region and the STag/#_key generated to reference that memory region to enable the RDMA layer 26 to associate the STag/#_key with the memory region. The memory translation table 34 may be maintained within buffers in the adapter 24 or within the memory 30. The Stags/#_Keys would be returned to the iSER layer 22 functions requesting the registration to use for I/O operations.
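A minimal sketch of such a translation table follows, assuming a simple counter as the source of steering tags and a dictionary as the table. The class name `RdmaLayerSketch` and its methods are hypothetical; a real adaptor generates tags and validates access in its own way.

```python
import itertools


class RdmaLayerSketch:
    """Toy model of registration and lookup through a translation table."""

    def __init__(self):
        # Assumption: tags are handed out sequentially, starting at 1.
        self._next_tag = itertools.count(1)
        # Maps steering tag -> (base address, length) of the registered region.
        self.translation_table = {}

    def register(self, base: int, length: int) -> int:
        """Register a memory region and return the tag that references it."""
        if length <= 0:
            raise ValueError("region length must be positive")
        tag = next(self._next_tag)
        self.translation_table[tag] = (base, length)
        return tag

    def resolve(self, tag: int, offset: int, size: int) -> int:
        """Translate (tag, offset) to an address, rejecting out-of-range access."""
        base, length = self.translation_table[tag]
        if offset < 0 or size < 0 or offset + size > length:
            raise ValueError("access outside the registered region")
        return base + offset


rdma = RdmaLayerSketch()
tag = rdma.register(base=0x1000, length=4096)
print(hex(rdma.resolve(tag, offset=16, size=64)))
```

The tag returned by `register` plays the role of the value handed back to the requesting layer for use in later I/O operations.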
After the adapter 24 generates and returns Stags/#_Keys to the iSER layer 22, the iSER layer 22 may proceed with the I/O operation. The iSER layer 22 wraps the packet received from the iSCSI layer 18 with header information and the STag/R_Key received from the adapter 24 and passes the packet to the adapter 24 to transfer.
To manage RDMA data transfers, the RDMA layer 26 maintains a send queue 36, a receive queue 38, and a complete queue 40. The send queue 36 and receive queue 38 comprise the work queues that the RDMA layer 26 uses to manage RDMA data transfer requests. The complete queue 40 may comprise a sharable queue containing one or more entries having completion entries to provide a single point of completion notification for multiple work queues. The queues 36, 38, and 40 may have many instances, perhaps for each logical connection, and may be allocated by the adaptor 24 in the memory 30 or within buffers in the adaptor 24.
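The relationship between the work queues and a shared completion queue may be sketched as below. The class names and the string work requests are hypothetical, and processing is simulated in software purely to show that several work queues report into one completion queue.

```python
from collections import deque


class CompletionQueue:
    """Single point of completion notification shared by several work queues."""

    def __init__(self):
        self.entries = deque()

    def poll(self):
        """Return the oldest completion entry, or None if the queue is empty."""
        return self.entries.popleft() if self.entries else None


class WorkQueue:
    """A send or receive queue that reports finished work to a completion queue."""

    def __init__(self, name: str, completion_queue: CompletionQueue):
        self.name = name
        self.pending = deque()
        self.completion_queue = completion_queue

    def post(self, request: str) -> None:
        self.pending.append(request)

    def process_next(self) -> bool:
        """Simulate finishing the oldest request and record its completion."""
        if not self.pending:
            return False
        request = self.pending.popleft()
        self.completion_queue.entries.append((self.name, request))
        return True


completions = CompletionQueue()
send_queue = WorkQueue("send", completions)
receive_queue = WorkQueue("receive", completions)
send_queue.post("transfer A")
receive_queue.post("buffer 0")
send_queue.process_next()
receive_queue.process_next()
print(completions.poll(), completions.poll())
```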
The iSER header 54 would include the Stag/R_Key used with the I/O operation and information indicating whether the remote node receiving the advertised Stag/R_Key is to read or write to the memory region (window) referenced by the Stag/R_Key and the work queues related to the request. The iSER header 54 and iSCSI PDU 52 are further encapsulated in one or more additional network layers 60, such as a TCP layer or InfiniBand network protocol layer. In certain embodiments, the network layers 28 in the adapter 24 would assemble the iSER header 54 and PDU 52 within the network layers 60, such as TCP, IB, etc.
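To make the wrapping step concrete, the sketch below prepends a small header carrying a steering tag and a read/write indication to a PDU. The layout (a 1-byte flag followed by a 4-byte tag) and the names `wrap`, `unwrap`, `READ_FLAG`, and `WRITE_FLAG` are assumptions made for illustration; this is not the wire format defined by the iSER specification.

```python
import struct

# Assumed flag values, chosen only for this illustration.
READ_FLAG = 0x01   # remote node may read the referenced region
WRITE_FLAG = 0x02  # remote node may write the referenced region

# Assumed layout: 1-byte flags, 4-byte steering tag, network byte order.
_HEADER = struct.Struct("!BI")


def wrap(pdu: bytes, stag: int, remote_may_write: bool) -> bytes:
    """Prepend the illustrative header to a PDU."""
    flags = WRITE_FLAG if remote_may_write else READ_FLAG
    return _HEADER.pack(flags, stag) + pdu


def unwrap(message: bytes):
    """Split a wrapped message back into (flags, stag, pdu)."""
    flags, stag = _HEADER.unpack_from(message)
    return flags, stag, message[_HEADER.size:]


message = wrap(b"pdu-bytes", stag=0x1234, remote_may_write=True)
print(unwrap(message))
```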
The iSER layer 22 further maintains an ITT to Stag/#_Key map 70 (
If (at block 110) the initiator node 2 is not willing to establish an RDMA session with the remote node via the Networking Layer in the RNIC or InfiniBand adaptor 24, then the initiator node 2 would break the negotiation connection and attempt (at block 112) to locate other RDMA compatible target nodes. Otherwise, if (at block 110) an RDMA session is acceptable, then (at block 114) the SDP layer 42 (for InfiniBand adaptors 24) or the network layers 28 (for RNIC adaptors) returns the response from the target node, and continues to enable the iSCSI layer to use Socket APIs for triggering either the RNIC networking layer 28 or the SDP layer 42 to send and receive additional Login Request and Login Response messages with the target.
With respect to
With respect to
If (at block 412) the message is an iWARP “send with invalidate message” having an STag, then the protocol converter 314, 364 or 382 creates (at block 416) an InfiniBand “send with (or without) solicited event” message. The protocol converter 314, 364 or 382 adds (at block 418) the STag, referencing a direct reference to a memory location in the target or initiator, from the iWARP message to the immediate data field in the InfiniBand message (alternatively, discard STag and prepare to send without any immediate data). The protocol converter 314, 364 or 382 transmits (at block 420) the converted message to the initiator (or subsequent gateway) over the InfiniBand network. From block 410 or 420, control proceeds to block 422 where if there is a subsequent gateway, then such gateway will convert the iSER/IB message into iSER/iWARP by performing the operations from block 440 in
If (at block 402) the message from the target node was in the InfiniBand protocol, then control proceeds to block 440 in
If (at block 446) the message is an Infiniband “send with immediate data” message, then the protocol converter 362 creates (at block 450) an iWARP send with invalidate (with solicited event) message and adds (at block 452) the R_Key from the immediate data field in the InfiniBand message to the STag field in an iWARP “send with invalidate message” (alternatively, discard R_Key (immediate data) and setup send message without the STag). The protocol converter 362 transmits (at block 454) the converted message to a gateway 354 over an iWARP network, such as shown in
If (at block 440) the InfiniBand transmission from the target node will not continue over an iWARP network to a gateway before going to the initiator node, (i.e., the InfiniBand message will continue through Gateway 322 or 374 on an iWARP network directly to the initiator 324 or 376 as shown in
If (at block 488) the message is an Infiniband send with immediate data message, then the protocol converter 334 or 386 creates (at block 492) an iWARP send with invalidate (with solicited event) message and adds (at block 494) the R_Key from the immediate data field in the InfiniBand message into the STag field in an iWARP send with invalidate message (alternatively, discard R_Key (immediate data) and setup send message without the STag). The protocol converter 334 or 386 transmits (at block 496) the converted message to the initiator node over an iWARP network, such as shown in
Described embodiments provide a technique for allowing a message to be transmitted between networks using different communication protocols by processing and, if necessary, converting the message to a format compatible with the communication protocol used by the receiving node.
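The conversion between an iWARP send with invalidate and an InfiniBand send with immediate data may be sketched as follows. The message classes and the `keep_tag` parameter are hypothetical simplifications: real messages carry far more state, and the option to discard the tag mirrors the alternative mentioned in the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IwarpSend:
    """Simplified iWARP send; invalidate_stag is set for send with invalidate."""
    payload: bytes
    invalidate_stag: Optional[int] = None


@dataclass
class InfinibandSend:
    """Simplified InfiniBand send; immediate_data is set for send with immediate data."""
    payload: bytes
    immediate_data: Optional[int] = None


def iwarp_to_infiniband(message: IwarpSend, keep_tag: bool = True) -> InfinibandSend:
    """Carry the STag in the immediate data field, or drop it if keep_tag is False."""
    immediate = message.invalidate_stag if keep_tag else None
    return InfinibandSend(message.payload, immediate)


def infiniband_to_iwarp(message: InfinibandSend, keep_tag: bool = True) -> IwarpSend:
    """Carry the immediate data (R_Key) in the STag field, or drop it."""
    stag = message.immediate_data if keep_tag else None
    return IwarpSend(message.payload, stag)


original = IwarpSend(b"response", invalidate_stag=0x2A)
converted = iwarp_to_infiniband(original)
print(converted, infiniband_to_iwarp(converted) == original)
```

A gateway sitting between the two network types would apply one of these functions depending on the protocol of the incoming message and the protocol used on the next hop.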
The embodiments described herein may be implemented as a method, apparatus or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof. The term “article of manufacture” as used herein refers to code or logic implemented in hardware logic (e.g., an integrated circuit chip, Programmable Gate Array (PGA), Application Specific Integrated Circuit (ASIC), etc.) or a computer readable medium, such as magnetic storage medium (e.g., hard disk drives, floppy disks, tape, etc.), optical storage (CD-ROMs, optical disks, etc.), volatile and non-volatile memory devices (e.g., EEPROMs, ROMs, PROMs, RAMs, DRAMs, SRAMs, firmware, programmable logic, etc.). Code in the computer readable medium is accessed and executed by a processor. The code in which preferred embodiments are implemented may further be accessible through a transmission media or from a file server over a network. In such cases, the article of manufacture in which the code is implemented may comprise a transmission media, such as a network transmission line, wireless transmission media, signals propagating through space, radio waves, infrared signals, etc. Thus, the “article of manufacture” may comprise the medium in which the code is embodied. Additionally, the “article of manufacture” may comprise a combination of hardware and software components in which the code is embodied, processed, and executed. Of course, those skilled in the art will recognize that many modifications may be made to this configuration without departing from the scope of the present invention, and that the article of manufacture may comprise any information bearing medium known in the art.
The described operations may be performed by circuitry, where “circuitry” refers to either hardware or software or a combination thereof. The circuitry for performing the operations of the described embodiments may comprise a hardware device, such as an integrated circuit chip, Programmable Gate Array (PGA), Application Specific Integrated Circuit (ASIC), etc. The circuitry may also comprise a processor component, such as an integrated circuit, and code in a computer readable medium, such as memory, wherein the code is executed by the processor to perform the operations of the described embodiments.
In the described implementations, the physical layer utilized the Ethernet protocol. In alternative implementations, alternative protocols providing link-to-link checksumming/CRCs (or other data detecting schemes) of the packet may be used instead of Ethernet, such as Serial Advanced Technology Attachment (SATA), InfiniBand, serial attached SCSI cable, etc.
In described implementations, the transport layer comprised the iSCSI protocol. In alternative implementations other protocols known in the art for transmitting I/O commands in packets and providing end-to-end checksumming/CRCs (or other data detecting schemes) may be used.
In the described implementations, the packaged I/O commands comprised SCSI commands. In alternative implementations, the commands may be in different I/O command formats than SCSI, such as Advanced Technology Attachment (ATA).
In described embodiments, the iSCSI layer made calls to the iSER layer to access the RDMA data transfer capabilities. In additional embodiments, data transfer protocol layers other than iSCSI, such as an application or other data transfer protocols, may call the iSER layer to access RDMA data transfer capabilities.
In alternative embodiments, the IP over InfiniBand protocol (with Reliable Connections—RC) may be used instead of SDP to transmit packets encoded using a protocol, such as TCP, across an InfiniBand network. Further details on the IP over InfiniBand protocol (with Reliable Connections—RC) are described in the publication “IP over InfiniBand: Connected Mode”, published by the IETF as “draft-kashyap-ipoib-connected-mode-01.txt” (September 2003), which publication is incorporated herein by reference in its entirety. For instance, the SDP layer can instead be replaced by a TCP stack layered on top of an IPoIB (RC) implementation, and any part of that TCP/IPoIB combination can be placed either within the node 2 software or the adapter 24. In such embodiments, the IPoIB (RC) function may invoke the RDMA layer 26 as needed according to the IPoIB (RC) specification.
In additional embodiments, protocols other than TCP may be used to transmit the packets over an IP capable network, such as the Stream Control Transmission Protocol (SCTP), which protocol is defined in the publication “Stream Control Transmission Protocol”, RFC 2960 (Internet Society, 2000), which publication is incorporated herein by reference in its entirety.
The foregoing description of the implementations has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto. The above specification, examples and data provide a complete description of the manufacture and use of the composition of the invention. Since many implementations of the invention can be made without departing from the spirit and scope of the invention, the invention resides in the claims hereinafter appended.
|Cited Patent||Filing date||Publication date||Applicant||Title|
|US5495614||Dec 14, 1994||Feb 27, 1996||International Business Machines Corporation||Interface control process between using programs and shared hardware facilities|
|US6032224||Dec 3, 1996||Feb 29, 2000||Emc Corporation||Hierarchical performance system for managing a plurality of storage units with different access speeds|
|US6301250||Jul 31, 1997||Oct 9, 2001||Alcatel||Method of operating an interface device as well as interface device and exchange with such an interface device|
|US6360282||Mar 24, 1999||Mar 19, 2002||Network Appliance, Inc.||Protected control of devices by user applications in multiprogramming environments|
|US6374248||Dec 2, 1999||Apr 16, 2002||Sun Microsystems, Inc.||Method and apparatus for providing local path I/O in a distributed file system|
|US7043578||Jan 9, 2003||May 9, 2006||International Business Machines Corporation||Method, system, and program for processing a packet including I/O commands and data|
|US7299266||Sep 5, 2002||Nov 20, 2007||International Business Machines Corporation||Memory management offload for RDMA enabled network adapters|
|US7475153 *||Jul 16, 2004||Jan 6, 2009||International Business Machines Corporation||Method for enabling communication between nodes|
|US20020029281||May 23, 2001||Mar 7, 2002||Sangate Systems Inc.||Method and apparatus for data replication using SCSI over TCP/IP|
|US20020059309||Jun 25, 2001||May 16, 2002||International Business Machines Corporation||Implementing data management application programming interface access rights in a parallel file system|
|US20020059451||Aug 23, 2001||May 16, 2002||Yaron Haviv||System and method for highly scalable high-speed content-based filtering and load balancing in interconnected fabrics|
|US20020095547||Jan 12, 2001||Jul 18, 2002||Naoki Watanabe||Virtual volume storage|
|US20020124137||Jan 29, 2002||Sep 5, 2002||Ulrich Thomas R.||Enhancing disk array performance via variable parity based load balancing|
|US20030014544||Feb 15, 2001||Jan 16, 2003||Banderacom||Infiniband TM work queue to TCP/IP translation|
|US20030041211||Mar 7, 2002||Feb 27, 2003||Merkey Jeffrey Vernon||Dual axis RAID systems for enhanced bandwidth and reliability|
|US20030046396||Apr 5, 2002||Mar 6, 2003||Richter Roger K.||Systems and methods for managing resource utilization in information management environments|
|US20030058870||Sep 6, 2002||Mar 27, 2003||Siliquent Technologies Inc.||ISCSI receiver implementation|
|US20030061402||Sep 26, 2001||Mar 27, 2003||Satyendra Yadav||Method and apparatus enabling both legacy and new applications to access an InfiniBand fabric via a socket API|
|US20030067913||Oct 5, 2001||Apr 10, 2003||International Business Machines Corporation||Programmable storage network protocol handler architecture|
|US20030070043||Mar 7, 2002||Apr 10, 2003||Jeffrey Vernon Merkey||High speed fault tolerant storage systems|
|US20030084209||Oct 31, 2001||May 1, 2003||Chadalapaka Mallikarjun B.||System and method for storage virtualization|
|US20030084243||Oct 24, 2002||May 1, 2003||Kabushiki Kaisha Toshiba||Access method and storage apparatus of network-connected disk array|
|US20030099254||Oct 22, 2002||May 29, 2003||Richter Roger K.||Systems and methods for interfacing asynchronous and non-asynchronous data media|
|US20030101239||Nov 27, 2001||May 29, 2003||Takeshi Ishizaki||Storage device with VLAN support|
|US20030131228||Apr 1, 2002||Jul 10, 2003||Twomey John E.||System on a chip for network storage devices|
|US20030135514||Oct 25, 2002||Jul 17, 2003||Patel Sujal M.||Systems and methods for providing a distributed file system incorporating a virtual hot spare|
|US20030135692||Jan 14, 2002||Jul 17, 2003||Raidcore, Inc.||Method and system for configuring RAID subsystems with block I/O commands and block I/O path|
|US20030165160||Apr 23, 2002||Sep 4, 2003||Minami John Shigeto||Gigabit Ethernet adapter|
|US20030169690||Mar 5, 2002||Sep 11, 2003||James A. Mott||System and method for separating communication traffic|
|US20030172169||Mar 7, 2002||Sep 11, 2003||Cheng Charles T.||Method and apparatus for caching protocol processing data|
|US20040073622||Aug 19, 2003||Apr 15, 2004||Mcdaniel Scott S.||One-shot RDMA|
|US20050240678||Apr 21, 2004||Oct 27, 2005||Hufferd John L||Method, system, and program for communicating data transfer requests between data transfer protocols|
|US20050240941||Apr 21, 2004||Oct 27, 2005||Hufferd John L||Method, system, and program for executing data transfer requests|
|US20060013253||Jul 16, 2004||Jan 19, 2006||Hufferd John L||Method, system, and program for forwarding messages between nodes|
|1||Chu, H.K. Jerry et al. "Transmission of IP over InfiniBand (draft-ietf-ipoib-ip-over-infiniband-06.txt)", 18 pp [online]. Working document of the Internet Engineering Task Force (IETF) [online] Available from http://www.ietf.org/ietf/lid-abstracts.txt.|
|2||Culley, P. U. Elzur, R. Recio, S. Bailey, et al. "Marker PDU Aligned Framing for TCP Specification (Version 1.0)(draft-culley-iwarp-mpa-v1.0)", pp. 1-32. Release Specification of the RDMA Consortium. Available at http://www.rdmaconsortium.org.|
|3||InfiniBand Trade Association, "InfiniBand Architecture Specification Volume 1, Release 1.1", Nov. 6, 2002, Final Title Copyright (pp. 1-2); Table of Contents (pp. 3-34); Chapter 1: Introduction (pp. 51-60); Chapter 2: Glossary (pp. 61-75) and Chapter 3: Architecture Overview (pp. 76-130).|
|4||Kashyap, V. "IP over InfiniBand: Connected Mode (draft-kashyap-ipoib-connected-mode-01.txt)", 9 pages, Working Document of the Internet Engineering Task Force (IETF) [online] Available from http://ietf.org/ietf/lid-abstracts.txt.|
|5||Ko, M. Chadalapaka, U. Elzur, H. Shah, and P. Thaler. "iSCSI Extensions for RDMA Specification (Version 1.0)(draft-ko-iwarp-iser-v1.0)," pp. 1-76. Release Specification of the RDMA Consortium. Available at http://www.rdmaconsortium.org.|
|6||Pinkerton, J. "Sockets Direct Protocol (SDP) for iWARP over TCP (v1.0) (draft-pinkerton-iwarp-sdp-v1.0)", 106pp. Release Specification of the RDMA Consortium. Available at http://rdmaconsortium.org.|
|7||Pinkerton, J. "Sockets Direct Protocol v1.0", Oct. 24, 2003. RDMA Consortium, 32 pages.|
|8||Recio, R. "RDMA enabled NIC (RNIC) Verbs Overview," pp. 1-28, dated Apr. 29, 2003. Available from http://www.rdmaconsortium.org/home/RNIC-Verbs-Overview2.pdf.|
|9||Recio, R. "RDMA enabled NIC (RNIC) Verbs Overview," pp. 1-28, dated Apr. 29, 2003. Available from http://www.rdmaconsortium.org/home/RNIC—Verbs—Overview2.pdf.|
|10||Shah, H., J. Pinkerton, R. Recio and P. Culley, "Direct Data Placement over Reliable Transports (Version 1.0) (draft-shah-iwarp-ddp-v 1.0)" pp. 1-35. Release Specification of the RDMA Consortium. Available at http://www.rdmaconsortium.org.|
|11||Stewart, R. et al. The Internet Society , "Stream Control Transmission Protocol", 114 pp. RFC2960 (Oct. 2000).|
|U.S. Classification||709/230, 709/217, 710/22, 709/223, 709/213, 709/246|
|Feb 28, 2014||REMI||Maintenance fee reminder mailed|
|Jul 20, 2014||LAPS||Lapse for failure to pay maintenance fees|
|Sep 9, 2014||FP||Expired due to failure to pay maintenance fee|
Effective date: 20140720