Publication number: US 8189603 B2
Publication type: Grant
Application number: US 11/242,463
Publication date: May 29, 2012
Filing date: Oct 4, 2005
Priority date: Oct 4, 2005
Also published as: US20070098001, US20120226835
Inventor: Mammen Thomas
Original Assignee: Mammen Thomas
External links: USPTO, USPTO Assignment, Espacenet
PCI express to PCI express based low latency interconnect scheme for clustering systems
US 8189603 B2
Abstract
PCI Express is a bus or I/O interconnect standard for use inside a computer or embedded system, enabling faster data transfers to and from peripheral devices. The standard is still evolving, but has achieved a degree of stability such that other applications can be implemented using PCIE as a basis. A PCIE-based interconnect scheme is proposed to enable switching and interconnection between external systems, such that data transport between the connected systems can be scaled as they form a cluster of systems. These connected systems can be any computing or embedded system. The scalability of the interconnect will allow the cluster to grow the bandwidth between the systems as it becomes necessary, without changing to a different connection architecture.
Claims(13)
1. A network switch enabled for communication between a plurality of computers in a cluster wherein each computer is not a PCI-Express peripheral device;
said network switch comprises a plurality of ports,
wherein each port in a first set of ports of said plurality of ports is connected to a computer of said plurality of computers using a PCI Express link, forming said cluster;
a data transfer mechanism to transfer data to and from each of said first set of ports to the computer connected to that port, wherein said data transfer is done using a PCI-Express protocol;
a switching mechanism in said network switch enabled to transfer data at least between a first port of said first set of ports, on said network switch and any of a rest of said first set of ports, on the network switch;
wherein said switching mechanism and said data transfer mechanism together allow data received from the computer connected to said first port to be sent to any of said rest of said plurality of computers; and
wherein said switching mechanism and said data transfer mechanism together allow data received from any of said rest of said plurality of computers to be sent to said computer connected to said first port;
enabling interconnection and communication within said cluster, between said computer connected to said first port and any of said rest of said plurality of computers, over said PCI Express links using PCI Express protocol.
2. The network switch of claim 1, wherein said data transfer mechanism and switching mechanism in said network switch further enable data transfer between any of said plurality of computers connected to said plurality of ports on said network switch enabling communication between said plurality of computers over said PCI Express Links using PCI Express Protocol.
3. The network switch of claim 1, wherein at least one of said plurality of ports is an interconnection port enabled to connect to a second interconnection port on a second network switch, similar in all respects to the network switch of claim 1.
4. The connection of claim 3, wherein said interconnection port on said network switch and said second interconnection port on said second network switch, is using PCI Express links.
5. The network switch of claim 3, wherein said switching mechanism is further enabled to transfer data between said interconnection port on said network switch and any other of said plurality of ports, also on said network switch.
6. The network switch of claim 3, wherein said data transfer mechanism is further enabled to transfer data between said interconnection port on said network switch and said second interconnection port on said second network switch connected to it.
7. The data transfer mechanism of claim 6, wherein said transfer of data between said network switch and said second network switch over PCI Express links is using PCI Express Protocol.
8. A system for communication between a plurality of computers using network switches wherein each computer is not a PCI-Express peripheral device, and wherein the system comprises:
a plurality of network switches;
each of said plurality of network switches having a plurality of ports;
at least one of said plurality of ports on each of said plurality of network switches being an interconnection port;
each of said interconnection ports on each of said plurality of network switches enabled to connect to another interconnection port on any other of said plurality of network switches, forming interconnected network switches, using PCI Express links;
a data transfer mechanism enabled to transfer data between any of said connected interconnection ports, wherein said data transfer is done using a PCI-Express protocol;
each of said plurality of computers connected to one of said plurality of ports, excluding said interconnection ports, of said plurality of network switches using PCI Express links;
said plurality of network switches with said plurality of computers connected to it forming a cluster of connected computers;
said data transfer mechanism further enabled to transfer data to and from each one of said plurality of ports to said one of said plurality of computers connected to it, wherein said data transfer is done using a PCI-Express protocol; and
a switching mechanism, on each of said plurality of network switches, enabled to transfer data between at least a first of said plurality of ports, on each of said plurality of network switches, and any other of said plurality of ports, on said network switch;
such that said data transfer mechanism and said switching mechanism together enable data communication within and between said clusters of connected computers through said interconnected network switches.
9. The system of claim 8, wherein the system, by enabling interconnection and data communication between computers connected in multiple clusters, creates a super cluster of connected computers.
10. The system of claim 9, wherein said data communication between computers connected in said super cluster is over PCI Express links, using PCI Express protocols.
11. The system of claim 8, wherein the data transfer mechanism is enabled to transfer data between a third interconnection port on a first network switch to a fourth interconnect port on a second network switch.
12. The system of claim 8, wherein said switching mechanism, on each of said plurality of network switches, is enabled to transfer data between any of said plurality of ports, on each of said plurality of network switches, and any other of said plurality of ports, on said network switch, including said interconnection port.
13. A network switch used for interconnecting a plurality of computers in a cluster,
wherein each computer is not a PCI-Express peripheral device, and wherein said cluster comprising: said plurality of interconnected computers and said network switch, wherein said network switch enables communication between said interconnected computers;
said network switch comprising:
at least a first port wherein said first port is connected to a first computer using a PCI-Express link;
at least a second port wherein said second port is connected to a second computer using a PCI-Express link;
a data transfer mechanism enabled to transfer data to and from said first computer to said first port wherein said data transfer is done using PCI-Express protocol;
said data transfer mechanism further enabled to transfer data to and from said second computer to said second port, wherein said data transfer is done using PCI-Express protocol;
a switching mechanism enabled to transfer data between said first port and said second port on said network switch;
such that said switching mechanism together with said data transfer mechanism allow data transfer and hence communication between the plurality of computers in the cluster.
Description
FIELD OF INVENTION

This invention relates to a cluster interconnect architecture for high-speed, low-latency information and data transfer between the systems in the configuration.

BACKGROUND AND PRIOR ART

The need for a high-speed, low-latency cluster interconnect scheme for data and information transport between systems has been recognized as one needing attention in recent times. The growth of interconnected and distributed processing schemes has made it essential that high-speed interconnect schemes be defined and established to speed up processing and data sharing between these systems.

There are interconnect schemes that allow data transfer at high speeds; the most common and fastest one existing today is the Ethernet connection, allowing transport speeds from 10 Mb/s to as high as 10 Gb/s. The TCP/IP protocols used with Ethernet have high overhead, with inherent latency that makes them unsuitable for some distributed applications. Effort is under way in different areas of data transport to reduce the latency of the interconnect, as this is a limitation on the growth of distributed computing power.

What is Proposed

PCI Express (PCIE) is an emerging I/O interconnect standard for use inside computers or embedded systems that allows serial high-speed data transfer to and from peripheral devices. A typical PCIE link provides a 2.5 Gb/s transfer rate (this may change as the standard and data rates change). Since the PCIE standard is starting to become firm and is already used within systems, what is disclosed is the use of direct data links between PCIE-standard-based peripherals as an interconnect between individual stand-alone systems, typically through an interconnect module or a network switch. By using only PCIE-based protocols for data transfer over direct physical connection links between the PCIE-based peripheral devices (see FIG. 1), without any intermediate conversion of the transmitted data stream to other data transmission protocols, or encapsulation of the transmitted data stream within other data transmission protocols, this interconnect scheme reduces the latencies of communication in a cluster. The PCIE-based peripheral at a peripheral end point of the system, by connecting directly over PCIE data links to the PCIE-based peripheral at the switch, provides for an increase in the number of links per connection as bandwidth needs increase, and thereby allows scaling of the bandwidth available within any single interconnect or the system of interconnects.
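The latency argument above can be illustrated with a toy model. The following sketch is not part of the patent: the function names and latency figures are illustrative assumptions, chosen only to show that removing the protocol-conversion step removes a fixed cost from every transfer.

```python
# Hypothetical latency model (illustrative figures, not measured values).
PCIE_HOP_US = 1.0   # assumed per-hop PCIe link latency, microseconds
CONVERT_US = 10.0   # assumed cost of converting/encapsulating to Ethernet/TCP

def direct_pcie_latency(hops: int) -> float:
    """System -> switch -> system over PCIe links only: no conversion step."""
    return hops * PCIE_HOP_US

def ethernet_latency(hops: int) -> float:
    """Same path, but the stream is converted to another protocol at each end."""
    return hops * PCIE_HOP_US + 2 * CONVERT_US

# Two hops: source system -> switch, then switch -> destination system.
assert direct_pcie_latency(2) < ethernet_latency(2)
```

Whatever the real numbers are, the conversion term is a constant added to every transfer, which is exactly what the direct PCIE-to-PCIE scheme avoids.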

Some Advantages of the Proposed Connection Scheme:

    • 1. Reduced latency of data transfer, as conversion from PCIE to other protocols such as Ethernet is avoided during transfer.
    • 2. The number of links per connection can scale from x1 to larger numbers, x32 or even x64, based on the bandwidth needed.
    • 3. Minimal change in interconnect architecture is needed with increased bandwidth, enabling easy scaling with need.
    • 4. Standardization of the PCIE based peripheral will make components easily available from multiple vendors, making the implementation of the interconnect scheme easier and cheaper.
    • 5. The PCIE peripheral to PCIE peripheral links in the connections allow ease of software control and provide reliable bandwidth.
DESCRIPTION OF FIGURES

FIG. 1 Typical Interconnected (multi-system) cluster (shown with eight systems connected in a star architecture using direct connected data links between PCIE standard based peripheral to PCIE standard based peripheral)

FIG. 2 A cluster using multiple interconnect modules or switches to interconnect smaller clusters.

EXPLANATION OF NUMBERING AND LETTERING IN THE FIG. 1

  • (1) to (8): Systems interconnected in FIG. 1
  • (9): Switch sub-system.
  • (10): Software configuration and control input for the switch.
  • (1a) to (8a): PCI Express based peripheral modules (PCIE Modules) attached to the systems.
  • (1b) to (8b): PCI Express based peripheral modules (PCIE Modules) at the switch.
  • (1L) to (8L): PCIE peripheral module to PCIE peripheral module connections having n links (n data links)
EXPLANATION OF NUMBERING AND LETTERING IN THE FIG. 2

  • (12-1) and (12-2): clusters
  • (9-1) and (9-2): interconnect modules or switch sub-systems.
  • (10-1) and (10-2): Software configuration inputs
  • (11-1) and (11-2): Switch to switch interconnect module in the cluster
  • (11L): Switch to switch interconnection
DESCRIPTION OF THE INVENTION

PCI Express is a bus or I/O interconnect standard for use inside a computer or embedded system, enabling faster data transfers to and from peripheral devices. The standard is still evolving, but has achieved a degree of stability such that other applications can be implemented using PCIE as a basis. A PCIE-based interconnect scheme is proposed to enable switching and interconnection between external systems, such that data transport between the connected systems can be scaled as they form a cluster of systems. These connected systems can be any computing or embedded system. The scalability of the interconnect will allow the cluster to grow the bandwidth between the systems as it becomes necessary, without changing to a different connection architecture.

FIG. 1 shows a typical cluster interconnect. The multi-system cluster shown consists of eight units or systems {(1) to (8)} that are to be interconnected. Each system has a PCI Express (PCIE) based peripheral module {(1a) to (8a)} as an IO module at its interconnect port, with n links, built into or attached to the system. (9) is an interconnect module or switch sub-system, which has a number of PCIE-based interconnect modules equal to or greater than the number of systems to be interconnected, in the case of FIG. 1 eight {(1b) to (8b)}, that can be interconnected for data transfer through the switch. A software-based control input (10) is provided to configure and/or control the operation of the switch. Link connections {(1L) to (8L)} attach the PCIE-based peripheral modules on the respective systems to those on the switch with n links each. The value of n can vary depending on the connection bandwidth required by the system.
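The star topology of FIG. 1 can be sketched as a simple data structure. This is a hypothetical illustration, not the patent's implementation: the `build_cluster` function and its defaults are assumptions, but the module naming mirrors the figure's numbering.

```python
# Sketch of the FIG. 1 star topology: each system-side PCIE module
# ("1a".."8a") is linked to a matching switch-side module ("1b".."8b")
# by an n-lane link (1L..8L).
def build_cluster(num_systems: int = 8, lanes: int = 4) -> dict:
    """Return a mapping: system-side module -> (switch-side module, lane count)."""
    cluster = {}
    for i in range(1, num_systems + 1):
        system_module = f"{i}a"   # PCIE module at the system's interconnect port
        switch_module = f"{i}b"   # matching PCIE module at the switch IO port
        cluster[system_module] = (switch_module, lanes)
    return cluster

cluster = build_cluster()
assert cluster["1a"] == ("1b", 4)   # link 1L: system 1 to the switch
assert len(cluster) == 8
```

Varying `lanes` per link corresponds to the patent's point that n can differ with each system's bandwidth requirement.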

When data has to be transferred between, say, system 1 and system 5, in the simple case the control input is used to establish an internal link between PCIE-based peripheral modules 1b and 5b inside the switch. A handshake is established between outbound PCIE-based peripheral module (PCIE module) 1a and inbound PCIE module 1b, and between outbound PCIE module 5a and inbound PCIE module 5b. This provides a through connection from PCIE module 1a to 5b across the switch, allowing data transfer. Data can then be transferred at speed between the modules, and hence between the systems. In more complex cases, data can also be transferred into and queued in storage implemented in the switch, and then, when links are free, transferred out to the right systems at speed.
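The switching step just described can be sketched in code. This is an illustrative model, not the patent's design: the `Switch` class, its method names, and the `link_free` flag are assumptions, standing in for the control input that pairs modules 1b and 5b and for the switch's queueing storage.

```python
from collections import deque

class Switch:
    """Toy model of the switch sub-system (9): internal links plus queueing."""
    def __init__(self):
        self.internal_links = {}   # switch-side module -> paired switch-side module
        self.queues = {}           # per-destination-module storage in the switch

    def connect(self, src_module: str, dst_module: str) -> None:
        """Control input establishes an internal link (e.g. 1b <-> 5b)."""
        self.internal_links[src_module] = dst_module
        self.internal_links[dst_module] = src_module

    def transfer(self, src_module: str, data: bytes, link_free: bool = True):
        dst = self.internal_links[src_module]
        if link_free:
            return dst, data       # through connection: delivered at speed
        # Outbound link busy: queue in switch storage until it is free.
        self.queues.setdefault(dst, deque()).append(data)
        return dst, None

switch = Switch()
switch.connect("1b", "5b")         # pair system 1's and system 5's switch modules
assert switch.transfer("1b", b"payload") == ("5b", b"payload")
```

The queueing branch corresponds to the "more complex cases" above, where data waits in switch storage until links free up.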

Multiple systems can be interconnected at one time to form a multi-system that allows data and information transfer and sharing through the switch. It is also possible to connect smaller clusters together, to take advantage of growth in system volume, by using an available connection scheme that interconnects the switches that form the nodes of the clusters.

If the need for higher-bandwidth, low-latency data transfers between systems increases, the connections can grow by increasing the number of links connecting the PCIE modules between the systems in the cluster and the switch, without completely changing the architecture of the interconnect. This scalability is of great importance in retaining flexibility for growth and scaling of the cluster.
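The scaling property is linear in the lane count, which a two-line sketch makes concrete. The per-lane rate below is an assumption taken from the 2.5 Gb/s first-generation figure cited earlier; the function name is illustrative.

```python
PER_LANE_GBPS = 2.5   # assumed first-generation PCIe per-lane raw rate

def link_bandwidth_gbps(lanes: int) -> float:
    """Raw bandwidth of one system-to-switch link with the given lane count."""
    return lanes * PER_LANE_GBPS

# Growing a connection from x1 to x4 quadruples its bandwidth,
# with no change to the interconnect architecture.
assert link_bandwidth_gbps(4) == 4 * link_bandwidth_gbps(1)
```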

It should be understood that the systems may consist of peripheral devices, storage devices, processors, and any other communication devices. The interconnect is agnostic to the type of device, as long as it has a PCIE module at the port to enable the connection to the switch. This feature reduces the cost of expanding the system, since only the switch interconnect density needs to change as the multi-system grows.

PCIE is currently being standardized, which will enable existing PCIE modules from different vendors to be used, reducing the overall cost of the system. In addition, using a standardized module in the system as well as in the switch will reduce the cost of software development and, in the long run, allow available software to be used to configure and run the systems.

As expansion of the cluster in terms of number of systems connected, bandwidth usage, and control will all be cost effective, it is expected that the overall system cost can be reduced and overall performance improved by standardized PCIE module use with standardized software control.

A typical connect operation may be explained with reference to two of the systems, for example system (1) and system (5). System (1) has a PCIE module (1a) at its interconnect port, and that is connected by the connection link, or data link, (1L) to a PCIE module (1b) at the IO port of the switch (9). System (5) is similarly connected to the switch, through the PCIE module (5a) at its interconnect port to the PCIE module (5b) at the switch (9) IO port, by link (5L). Each PCIE module operates for transfer of data to and from it by standard PCI Express protocols, provided by the configuration software loaded into the PCIE modules and switch. The switch operates under the software control and configuration loaded in through the software configuration input (10).

FIG. 2 shows a multi-switch cluster. As the need to interconnect a larger number of systems increases, it will be optimal to interconnect multiple switches of the clusters to form a new, larger cluster. Such a connection is shown in FIG. 2: two smaller clusters (12-1 and 12-2) are interconnected using PCIE modules that can be connected together through any low-latency switch-to-switch interconnect modules (11-1 and 11-2), joined by interconnect links (11L) that provide sufficient bandwidth for the connection. The switch-to-switch connection transmits and receives data and information using any suitable protocol, and the switches provide the interconnection internally through the software configuration loaded into them.
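The FIG. 2 arrangement can be sketched as two cluster switches joined through their interconnection ports. This is an illustrative model under stated assumptions: the `ClusterSwitch` class and its routing logic are hypothetical, but the names follow the figure's numbering (switches 9-1 and 9-2, link 11L).

```python
class ClusterSwitch:
    """Toy model of one cluster's switch in the FIG. 2 multi-switch cluster."""
    def __init__(self, name: str, local_systems: set):
        self.name = name
        self.local_systems = local_systems
        self.peer = None               # switch on the far end of link 11L

    def interconnect(self, other: "ClusterSwitch") -> None:
        """Join two cluster switches through their interconnection ports."""
        self.peer, other.peer = other, self

    def route(self, dst_system: str) -> list:
        """Return the switch hops needed to reach dst_system."""
        if dst_system in self.local_systems:
            return [self.name]                 # transfer stays inside this cluster
        return [self.name, self.peer.name]     # transfer crosses link 11L

sw1 = ClusterSwitch("9-1", {"1", "2", "3", "4"})   # cluster 12-1
sw2 = ClusterSwitch("9-2", {"5", "6", "7", "8"})   # cluster 12-2
sw1.interconnect(sw2)
assert sw1.route("2") == ["9-1"]           # intra-cluster: one switch
assert sw1.route("7") == ["9-1", "9-2"]    # inter-cluster: across 11L
```

Intra-cluster traffic never touches the switch-to-switch link, so only inter-cluster transfers pay the extra hop.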

The following are some of the advantages of the disclosed interconnect scheme:

    • 1. Provides a low-latency interconnect for the cluster.
    • 2. Uses PCI Express based protocols for data and information transfer within the cluster.
    • 3. Eases growth in bandwidth, as system requirements increase, by increasing the number of links within the cluster.
    • 4. Standardized PCIE component use in the cluster reduces initial cost.
    • 5. Lower cost of growth due to standardization of hardware and software.
    • 6. A path of expansion from a small cluster to larger clusters as need grows.
    • 7. A future-proofed system architecture.

In fact, the disclosed interconnect scheme provides advantages for low-latency multi-system cluster growth that are not available from any other source.

Classifications
U.S. Classification: 370/401
International Classification: H04L12/56
Cooperative Classification: H04L49/40
European Classification: H04L49/40