|Publication number||US7346064 B2|
|Application number||US 10/255,419|
|Publication date||Mar 18, 2008|
|Filing date||Sep 26, 2002|
|Priority date||Sep 26, 2002|
|Also published as||US20040062242|
|Inventors||Percy K. Wadia, Ronald L. Dammann, James A. McConnell|
|Original Assignee||Intel Corporation|
|Patent Citations (8), Non-Patent Citations (1), Referenced by (2), Classifications (14), Legal Events (4)|
This invention relates generally to data communications over interconnect infrastructure, and particularly to routing packets in packet-based input/output data communications.
Using a host of interconnect technologies, data of various forms is typically routed between computers or servers in application processing and enterprise computing environments. As an example, in packet-based input/output communication, a data packet, such as a unicast and/or a multicast packet, may be received at an interconnect device including a switch or a router. While a unicast packet is forwarded to a single predefined destination port, a multicast packet is transmitted to more than one destination port. Although each leg of a multicast transfer may be treated as a unicast transfer, satisfying all the multicast requirements this way may be extremely inefficient and time consuming because of the number of iterations required to deliver the multicast packet to multiple destination ports on a one-to-one transfer basis. Furthermore, at a switch (or any other interconnect device), the multicast packet may remain in an input buffer for a long time, creating a backlog of pending packets behind it.
Regardless of the packet type, another significant bottleneck in routing a packet through an interconnect device is address mapping. While routing a packet, typically some form of address mapping occurs in a central unit, which is accessed by all the ports on the switch. Because a single unit performs the entire mapping for every port, the mapping is a serial process, creating a significant bottleneck and slowdown in the switch, even though the ports may have the required bandwidth to transfer the packet. The address translation process commonly involves two stages: upon receipt of a packet, one stage provides mapping to a specific destination port, while the other stage provides one or more destination paths associated with the one or more destination ports on which the packet is routed. Because the second stage needs the destination port, it must sequentially follow the first stage, resulting in increased latency.
Thus, there is a continuing need for better ways to route packets in packet-based input/output data communications, especially in a switch.
An interconnect device 20 shown in
In one particular example, the interconnect medium includes a fabric, which may connect, via one or more switches and routers, multiple destination ports that may be located in a variety of end node devices, such as input/output (I/O) devices. In many other examples, the interconnect medium may handle input/output data communications and inter-processor communications in a multi-computer environment, which may include data servers networked over the Internet, forming a network.
Examples of the end node devices include single-chip or processor-based devices such as Internet adapters, and host computers such as data servers used in application processing and enterprise computing networks connected through a system area network (SAN). One example of a fabric is a switched input/output fabric capable of connecting multiple end node devices, including processor nodes, redundant array of inexpensive disks (RAID) subsystems, I/O chassis, and storage subsystems. In some embodiments, the I/O chassis may further be connected to the Internet, video devices, graphics, and/or a Fibre Channel hub.
Instead of sending the inbound data packet 40, if determined to be a multicast packet, to each destination port iteratively, in a specific example the packet is broadcast simultaneously to all the destination ports. Multicast refers to the case in which the inbound data packet 40 is to be transmitted to more than one destination port. A unicast packet may simply be forwarded to a predefined destination port, typically referred to as a primary default multicast port. To appropriately route the inbound data packet 40 based on a packet attribute associated therewith, the packet broadcaster 30 may include an address mapping unit 50 and a multicast unit 55, in accordance with one embodiment of the present invention. The address mapping unit 50 may further comprise a memory 60 storing all combinations of a source port (e.g., the input port 25(1)) and one or more destination ports, together with one or more destination paths associated with each of the destination ports. Likewise, the multicast unit 55 may include a set of buffers 65 to hold multicast packets, in one illustrative example.
Using the memory 60, the destination ports and the destination paths may be looked up and all the destination ports for a packet transfer may be requested. In particular, a linear address mapping in the address mapping unit 50 may specify the destination ports to which the inbound data packet 40 may be routed or broadcast depending upon whether the inbound data packet 40 is a unicast or a multicast packet, respectively. For each destination port, a destination path (e.g., a virtual lane) may also be determined on which the inbound data packet 40 is to be transferred.
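The mapping described above can be sketched as a table keyed so that one query returns both the destination-port set and a virtual lane per port. This is a minimal illustration, not the patented circuit; the class and method names (`AddressMap`, `lookup`, etc.) and the dictionary-based storage are assumptions made for the sketch.

```python
# Hypothetical sketch of the address mapping memory (element 60): for a given
# source port, DLID, and service level, a single query yields both the
# destination-port bitmap and the virtual lane to use on each set port.
class AddressMap:
    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.dlid_to_ports = {}   # DLID -> bitmap of destination ports
        self.sl_vl = {}           # (source_port, SL, dest_port) -> virtual lane

    def configure_route(self, dlid, dest_ports):
        bitmap = 0
        for p in dest_ports:
            bitmap |= 1 << p
        self.dlid_to_ports[dlid] = bitmap

    def configure_sl_vl(self, source_port, sl, dest_port, vl):
        self.sl_vl[(source_port, sl, dest_port)] = vl

    def lookup(self, source_port, dlid, sl):
        """One query: destination-port bitmap plus a VL for every set port."""
        bitmap = self.dlid_to_ports.get(dlid, 0)
        vls = {p: self.sl_vl.get((source_port, sl, p))
               for p in range(self.num_ports) if bitmap & (1 << p)}
        return bitmap, vls
```

A unicast DLID sets one bit in the bitmap; a multicast DLID sets several, and the same call returns a virtual lane for every destination port at once.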
When the inbound data packet 40 is identified from the memory 60 to be a multicast packet, all combinations of a source port and one or more destination ports, along with one or more destination paths associated with each of the destination ports, may be determined simultaneously. Thereafter, the inbound data packet 40 may be broadcast to at least one destination path associated with each of the destination ports indicated by a lookup of the memory 60, routing the multicast packet to the desired destinations in some embodiments.
Upon receipt of a grant to transfer from one or more destination ports, the inbound data packet 40 may be transferred to all the associated destination ports on the respective destination paths. If the inbound data packet 40 is determined to be a multicast packet, the set of buffers 65 may hold the packet while it is being transferred to the destination ports, instead of holding the packet in the input port 25(1) queue as conventionally done, providing a substantial speedup in the handling of multicast packets in some embodiments. Moreover, being a multicast packet, the inbound data packet 40 may be broadcast simultaneously to all the desired destination ports over the associated destination paths, obviating the need for a one-to-one iterative transfer to each destination port.
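The buffering idea above can be sketched as follows: the multicast unit copies the packet into its own buffer (the set of buffers 65) so the input port is freed immediately rather than blocked behind a slow multicast transfer. The class and method names are illustrative assumptions, not terms from the patent.

```python
from collections import deque

class MulticastUnit:
    """Sketch of the multicast unit (55): holds multicast packets in its
    own buffers instead of the input-port queue."""
    def __init__(self):
        self.buffers = deque()   # stands in for the set of buffers 65

    def accept(self, packet, dest_ports):
        # Hold the packet here; the input port can take the next packet.
        self.buffers.append((packet, set(dest_ports)))

    def broadcast(self, send):
        """Deliver the oldest buffered packet to every destination port in
        one pass (modeling the simultaneous broadcast, not one-to-one
        iterative unicast transfers)."""
        packet, ports = self.buffers.popleft()
        for p in ports:
            send(p, packet)
        return ports
```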
For each inbound data packet 40, an associated packet attribute may be extracted from a local route header (LRH) in some embodiments. A destination local identifier (DLID) may be obtained from the packet attribute in order to route the packet from a source port to one or more destination ports. The inbound data packet 40 may also include a service level attribute, indicating a service level associated with the data packet, in some situations for instance. Based on the lookup, a mapping of a service level attribute onto one or more destination paths associated with each of the destination ports to which the inbound data packet 40 is to be sent may be performed at the interconnect device 20, in some particular examples consistent with the present invention.
The memory 60 may be configured such that the lookup is independent of a destination port. According to one particular embodiment, configuring the memory 60 entails providing an indication of all the destination paths of the destination ports, forming a combination of the source port and the service level attribute that the interconnect device 20 supports. By associating this combination with a virtual lane indication included in the memory 60 for at least two or each destination path associated with at least two or each destination port on the same link, depending upon an application, one or more differentiable service levels may be implemented in some cases. A service level associated with the inbound data packet 40 may be extracted from the service level attribute. In this way, based on all combinations of the source ports and the destination ports for the inbound data packet 40, mapping of the service level onto at least one destination path of at least two or each destination port over a next link (e.g., the link 35(2) being a continuation of the link 35(1)) may be accomplished, in some embodiments of the present invention.
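The attribute extraction described above can be sketched by parsing the routing fields out of an 8-byte local route header. The field layout below follows the InfiniBand LRH (VL/LVer in byte 0, SL in the upper bits of byte 1, then DLID, packet length, and SLID); treat it as a sketch to be checked against the IBA specification rather than a definitive parser.

```python
import struct

def parse_lrh(lrh: bytes):
    """Extract the fields used for routing from an 8-byte local route
    header (LRH): the service level (SL) and the destination and source
    local identifiers (DLID, SLID)."""
    b0, b1, dlid, _pktlen, slid = struct.unpack(">BBHHH", lrh)
    sl = b1 >> 4   # service level: upper 4 bits of the second byte
    return {"sl": sl, "dlid": dlid, "slid": slid}
```

The `dlid` drives the DLID-to-destination-port mapping, while `sl` feeds the SL-to-VL mapping over the next link.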
Using a source port number (SOURCE PORT #) and the service level designated to the inbound data packet 40, all the destination paths or virtual lanes (VLs) required to route the inbound data packet 40 to any destination port may be obtained. Because both the SL-VL mapping and the DLID-destination port mapping are done independently of the destination port, a single lookup of the memory 60 may provide a specific destination path for a unicast packet or the destination paths for all the destination ports for a multicast packet. The destination port number and the destination paths or virtual lanes for all destination ports may be combinatorially processed by a multiplexer (MUX) 90, selecting one of all the destination paths or virtual lanes available as a result of the SL-VL mapping in the single lookup for the unicast packet or all the destination paths or virtual lanes corresponding to all the destination ports for the multicast packet.
More specifically, in accordance with many examples, a single lookup of the memory 60 may provide both the DLID-destination port mapping, returning a bitmap of all the destination ports, and, in parallel, the SL-VL mapping, returning the destination paths or virtual lanes for each destination port to which the inbound data packet 40 is to be sent. For multicast packets in which the packet is to be broadcast to multiple destination ports, a significant speedup is possible in one scenario. As an example, for a multicast packet that is to be broadcast to all the destination ports of a 32-port switch, a single-lookup implementation may result in a 32× speedup in the address mapping unit 50, substantially reducing the memory 60 lookup latency. Essentially, because both mappings are done in parallel, the latency may be reduced from up to 32 serial lookups of the memory 60 to a single lookup latency, in some situations.
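The 32× figure follows from simple arithmetic: a per-destination serial scheme costs one memory access per destination port, while the combined lookup costs one access total. The access time below is an assumed placeholder purely for illustration.

```python
# Back-of-envelope illustration of the claimed speedup. The 10 ns access
# time is an assumed placeholder; only the ratio matters.
def serial_lookup_cost(num_dest_ports, access_time_ns=10):
    """One memory access per destination port (serial mapping)."""
    return num_dest_ports * access_time_ns

def single_lookup_cost(access_time_ns=10):
    """One memory access total (parallel DLID->ports and SL->VL mapping)."""
    return access_time_ns

# Multicast to all 32 ports of a 32-port switch:
speedup = serial_lookup_cost(32) / single_lookup_cost()   # -> 32.0
```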
A lookup process is illustrated in
A switch 120 shown in
Besides the packet broadcaster 30, the switch 120 may further include interconnect logic 125 capable of controlling the switch 120 and providing one or more configuration options to suitably reconfigure the switch 120, depending upon the particular application or environment in which it may be deployed. In addition to the interconnect logic 125, the switch 120 may comprise an interconnect memory 130 storing a request-acknowledge (REQ-ACK) protocol 135 and a packet relay routine 140. The packet broadcaster 30 may include the address mapping unit 50 and the multicast unit 55. In one embodiment, the address mapping unit 50 may further include a service level-virtual lane (SL-VL) random access memory 145, while the multicast unit 55 may include a multicast unit (MU) random access memory 150.
Using the REQ-ACK protocol 135, the packet relay routine 140 may control the relaying of packets from the packet broadcaster 30. For a requester, the REQ-ACK protocol 135 between the multicast unit 55 and one or more destination ports may employ a set of control signals to route a unicast data packet from any one of the input ports 25(1) through 25(N) to a specific destination port. Likewise, a multicast packet may be broadcast to all the desired destination ports. In particular, any destination port to which the multicast packet is sent may, upon receipt thereof, drop an associated acknowledge signal to the requester, while broadcasting of the multicast packet continues until all the destination ports drop their respective acknowledge signals, in one embodiment. By this approach, the multicast packet may require relatively few iterations compared with the one-to-one transfer case, substantially increasing the throughput of the multicast packet transfer.
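The handshake just described can be sketched as a retry loop: the requester keeps broadcasting while any destination port still asserts its acknowledge, and each port drops out of the pending set once it has received the packet. The function signature, the `deliver` callback, and the round limit are assumptions for illustration, not the patented signaling.

```python
def relay_multicast(packet, ports, deliver, max_rounds=8):
    """Sketch of the REQ-ACK relay: broadcast until every destination port
    drops its acknowledge. `deliver(port, packet)` returns True once that
    port has received the packet (i.e., its ACK is dropped)."""
    pending = set(ports)        # ports whose acknowledge is still asserted
    rounds = 0
    while pending and rounds < max_rounds:
        rounds += 1
        for p in list(pending):           # broadcast to all pending ports
            if deliver(p, packet):
                pending.discard(p)        # port drops its acknowledge
    return pending, rounds      # empty set => all ports received the packet
```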
Each port 25 may include an input buffer (IB) 155 to receive the inbound data packet and an output buffer (OB) 160 to hold the outbound data packet 45, in some embodiments of the present invention. Additionally, each port 25 may interface with the packet broadcaster 30 using a virtual interface 170. For each port 25, the virtual interface 170 may include a plurality of destination paths or virtual lanes (VLs) 175(1) through 175(M). Each virtual lane 175 may provide an independent data stream on a same link 35 in many different instances consistent with the present invention.
A memory map is shown in
At block 180, an initial mapping may be performed between a destination local identifier (DLID) to a specific destination port or ports. A check at diamond 182 may determine whether the inbound data packet 40 is a multicast packet. If the inbound data packet 40 is determined to be a unicast packet, the packet relay routine 140 may forward the packet to the specific destination port for appropriate routing at block 184. Conversely, at block 186, the multicast packet may be transferred to the multicast RAM 150 in the packet broadcaster 30 of the switch 120, as an example. Any source information, such as the source port number may also be transferred to the multicast RAM 150 for further processing by the multicast unit 55.
A lookup of the SL-VL RAM 145 may be performed for the service level (SL) and all the destination ports in parallel at block 188. At block 190, to transfer the multicast packet, a request may be asserted to each of the destination ports. A check at diamond 192 may ascertain whether an acknowledge is received from any of the destination ports. If no acknowledge is received, the request to all the destination ports may be asserted again. Otherwise, the multicast packet may be broadcast to all the destination ports at block 194. Another check at diamond 196 may determine whether or not the multicast packet is received at any of the destination ports.
A retry attempt may be carried out at block 198 in case the multicast packet was not received at any of the destination ports; that is, the multicast packet may be broadcast again to all the destination ports. However, if the multicast packet is indicated to be received at any one of the destination ports, the acknowledge signal corresponding to that port may be dropped at block 200, indicating that the multicast packet was indeed received in the desired manner. Another check at diamond 202 may determine whether the multicast packet has been received by all the destination ports for which it is intended, in one case without departing from the scope of the present invention. If the multicast packet has not been received at all the destination ports to which it is broadcast, it may be broadcast again to all the destination ports. Alternatively, after a predetermined timeout is reached for the multicast packet, the broadcasting is terminated and the flow ends. When the multicast packet is determined not to have reached all the destination ports to which it is to be routed, and the timeout has not yet occurred, broadcasting of the multicast packet may be performed again, in some embodiments.
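The flow through blocks 180-202 can be condensed into one routine: map the DLID, branch on unicast versus multicast, look up the SL-VL mapping for all destination ports in parallel, then broadcast with retry until every port acknowledges or a timeout expires. All function names, the `address_map` interface, and the callback shapes are assumptions made for this sketch.

```python
import time

def route_packet(packet, address_map, forward, broadcast, timeout_s=0.01):
    """Sketch of the relay flow: `forward(port, packet)` sends a unicast;
    `broadcast(pending, packet, vls)` sends to all pending ports and
    returns the subset that acknowledged this round."""
    ports = address_map.dest_ports(packet.dlid)        # block 180: DLID map
    if len(ports) == 1:                                # diamond 182: unicast
        forward(next(iter(ports)), packet)             # block 184
        return "unicast"
    vls = {p: address_map.vl(packet.sl, p) for p in ports}   # block 188
    pending = set(ports)
    deadline = time.monotonic() + timeout_s
    while pending and time.monotonic() < deadline:     # blocks 190-202
        pending -= broadcast(pending, packet, vls)     # ACKed ports drop out
    return "multicast" if not pending else "timeout"
```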
A system area network (SAN) 310 as shown in
The switch 120 may use the packet relay routine 140 (
Illustrative timing charts for the REQ-ACK protocol 135 are shown in
Although not so limited, in some examples of the present invention, using a request-acknowledge handshake mechanism, upon receipt of a low-high transition 265 on an acknowledge signal ACK_A 270 from either of the two destination ports 230(1), 230(2), the multicast packet may be broadcast to both destination ports. If either of the destination ports 230(1), 230(2) receives the multicast packet in the desired manner, the corresponding acknowledge signal 270 may be dropped, i.e., a high-low transition 275 may occur.
Another acknowledge signal ACK_B 280 as shown in
Consistent with some embodiments, the InfiniBand architecture (IBA) specification, which describes a mechanism for interconnecting processor nodes and I/O nodes to form a system area network, may be deployed to form the SAN 210. The InfiniBand architecture specification is set forth by the InfiniBand Trade Association in a specification entitled "InfiniBand Architecture, Vol. 1: General Specifications," Release 1.0, published in June 2001. Using this architecture, the SAN 210 may be independent of the host operating system (OS) and processor platform. Because IBA is designed around a point-to-point architecture, a switched I/O fabric may interconnect end node devices 215 connected by switches 120. The end node devices 215 may range from relatively inexpensive I/O devices, such as a single chip or Internet adapters, to very complex host computers. The IBA-based switched I/O fabric may provide a transport mechanism for messages and queues for delivery between end node devices 215. The interconnect infrastructure called a switched I/O fabric may be based on the way input and output connections are constructed between multiple hosts and targets in many examples. Following the channel-based I/O model of mainframe computers, channels may be created by attaching host channel adapters and target channel adapters through switches 120. The host channel adapters may serve as I/O engines located within a server, while the target channel adapters may enable remote storage and network connectivity into the switched I/O fabric.
While the present invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of this present invention.
|Cited Patent||Filing date||Publication date||Applicant||Title|
|US6839794 *||Oct 12, 2001||Jan 4, 2005||Agilent Technologies, Inc.||Method and system to map a service level associated with a packet to one of a number of data streams at an interconnect device|
|US6950394 *||Sep 7, 2001||Sep 27, 2005||Agilent Technologies, Inc.||Methods and systems to transfer information using an alternative routing associated with a communication network|
|US7010607 *||Sep 14, 2000||Mar 7, 2006||Hewlett-Packard Development Company, L.P.||Method for training a communication link between ports to correct for errors|
|US7111101 *||May 7, 2003||Sep 19, 2006||Ayago Technologies General Ip (Singapore) Ptd. Ltd.||Method and system for port numbering in an interconnect device|
|US20020143981 *||Apr 3, 2001||Oct 3, 2002||International Business Machines Corporation||Quality of service improvements for network transactions|
|US20030200315 *||Apr 23, 2002||Oct 23, 2003||Mellanox Technologies Ltd.||Sharing a network interface card among multiple hosts|
|US20040013088 *||Jul 19, 2002||Jan 22, 2004||International Business Machines Corporation||Long distance repeater for digital information|
|US20040024903 *||Jul 30, 2002||Feb 5, 2004||Brocade Communications Systems, Inc.||Combining separate infiniband subnets into virtual subnets|
|1||Intel(R) InfiniBand* Architecture "Solution for Developers and IT Managers" Brochure, 2002.|
|Citing Patent||Filing date||Publication date||Applicant||Title|
|US7599998 *||Jul 7, 2004||Oct 6, 2009||Arm Limited||Message handling communication between a source processor core and destination processor cores|
|US20050138249 *||Jul 7, 2004||Jun 23, 2005||Galbraith Mark J.||Inter-process communication mechanism|
|U.S. Classification||370/401, 370/229|
|International Classification||G06F11/00, H04L12/56, G08C15/00, H04L12/28, H04J3/14, H04J1/16|
|Cooperative Classification||H04L45/00, H04L45/24, H04L45/308|
|European Classification||H04L45/00, H04L45/308, H04L45/24|
|Sep 26, 2002||AS||Assignment|
Owner name: INTEL CORPORATION, CALIFORNIA
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:WADIA, PERCY K.;DAMMANN, RONALD L.;MCCONNELL, JAMES A.;REEL/FRAME:013341/0696;SIGNING DATES FROM 20020925 TO 20020926
|Jul 8, 2008||CC||Certificate of correction|
|Sep 14, 2011||FPAY||Fee payment|
Year of fee payment: 4
|Sep 2, 2015||FPAY||Fee payment|
Year of fee payment: 8