CROSS REFERENCE TO RELATED CO-PENDING APPLICATIONS
This application claims the benefit of U.S. provisional application Ser. No. 61/156,069 filed on Feb. 27, 2009 and entitled METHOD AND SYSTEM FOR COMPUTER CLOUD MANAGEMENT, which is commonly assigned and the contents of which are expressly incorporated herein by reference.
This application claims the benefit of U.S. provisional application Ser. No. 61/165,250 filed on Mar. 31, 2009 and entitled CLOUD ROUTING NETWORK FOR BETTER INTERNET PERFORMANCE, RELIABILITY AND SECURITY, which is commonly assigned and the contents of which are expressly incorporated herein by reference.
FIELD OF THE INVENTION
The present invention relates to network design and management and in particular to a system and a method for an adaptive network with automatic capacity scaling in response to load demand changes.
BACKGROUND OF THE INVENTION
Networking changed the information technology industry by enabling different computing systems to communicate, collaborate and interact. There are many types of networks. The Internet is probably the biggest network on earth. It connects millions of computers all over the world. Wide Area Networks (WAN) are networks that are typically used to connect the computer systems of a corporation located in different geographies. Local Area Networks (LAN) are networks that typically provide connectivity in an office environment.
The purpose of a network is to enable communications between the systems that are connected to the network by delivering information from the source of the information to its destination. In such a mission, the network itself needs to have sufficient processing capacity and bandwidth capacity in order to perform traffic delivery and various processing tasks including figuring out an appropriate route for the traffic to travel through, handling of errors and accidents and ensuring the necessary security measures, among others.
A typical network includes two types of components: traffic processing components and connectivity components. Traffic processing components include the various types of networking devices such as routers, switches and hubs, among others. The connectivity components are typically called “links” and interconnect two processing components or end points. There are many ways to classify network links. Physical network links include those via Ethernet cable, wireless connectivity, satellite connectivity, optic fiber connections, dial-up phone line and so on. Virtual network links refer to logical links formed between two entities and may include many physical links as well as various processing components along the way. The combined processing capacity of the traffic processing components of a network determines the network's processing capacity. The bandwidth capacity of the various links together ultimately determines the bandwidth capacity of a network.
FIG. 1 shows a typical network 90 with many traffic processing components 105, 115, 125, 135 labeled as “router” as well as many links 101, 111, 121, 131, 141, 151. Through this network 90, traffic is sent from source 100 to destination 150. When designing and managing a network, it is crucial to provision sufficient capacity. When there is not enough capacity, problems ranging from slowness and congestion to packet loss and malfunction occur.
In the prior art, network design and management are based on a fixed amount of capacity provisioned beforehand. One would acquire all the necessary hardware and software components, configure them, and then build connectivity between them. This fixed infrastructure provides a fixed amount of capacity. The problems of such approaches include high acquisition cost and over-provisioning or under-provisioning of capacity. Acquiring all the traffic processing components and setting up the links upfront can be very expensive for a large-scale network. The cost to build a large-scale network can range from millions of dollars to even higher. An example is the Internet itself, which cost billions of dollars to build and in which millions of dollars are still being invested to improve its capacity. An important aspect of the network is the fact that network traffic demand varies. Peak demands can be a few hundred percent of the average demand or even higher. In order to meet the needs of peak demand, the capacity of the network has to be over-provisioned. For example, a rule of thumb in designing a network is to provision 3-5 times the capacity of its normal demand. Such over-provisioning is necessary in order for the network to function properly and to meet its service agreements. However, normal bandwidth demand and processing demand are significantly lower than peak demands. It is not unusual for a typical network's utilization rate to be only 20%. Thus a significant portion of capacity is wasted. For large-scale networks, such waste is significant and ranges from thousands of dollars to millions of dollars or even higher. Further, such over-provisioning creates a significant carbon footprint. Today's telecommunication networks are responsible for 1% to 5% of the global carbon footprint, and this percentage has been rising rapidly due to the rapid growth and adoption of information technology.
FIG. 1A shows the discrepancy for typical networks between the provisioned capacity and actual capacity demand. Because prior art networks are based on fixed capacity, service suffers when capacity demand overwhelms the fixed capacity and waste occurs when demand is below the provisioned capacity.
Thus there is an unfulfilled need for new approaches to build and manage a network that can eliminate the expensive upfront costs, reduce capacity waste, and improve utilization efficiency.
SUMMARY OF THE INVENTION
In general, in one aspect, the invention features a method for automatically scaling the processing capacity and bandwidth capacity of a network. The method includes providing a network comprising a plurality of traffic processing units and a plurality of network links. Next, providing monitoring means for monitoring processing capacity demand and bandwidth capacity demand of the network. Next, providing managing means for adding traffic processing units to the network, removing traffic processing units from the network, connecting links to the network and disconnecting links from the network. Next, monitoring processing capacity demand and bandwidth capacity demand of the network via the monitoring means and then dynamically adjusting processing capacity of the network by selectively adding or removing traffic processing units in the network via the managing means upon observation of processing capacity demand increase or processing capacity demand decrease, respectively. The method also includes dynamically adjusting bandwidth capacity of the network by selectively connecting or disconnecting links in the network via the managing means upon observation of bandwidth capacity demand increase or bandwidth capacity demand decrease, respectively.
Implementations of this aspect of the invention may include one or more of the following. The traffic processing units include specially designed traffic processing hardware, such as routers, switches, and hubs, among others. The traffic processing units also include general purpose computers running specially designed traffic processing software. The traffic processing units utilize virtual machines and physical machines. The virtual machines are based on virtualization technology including VMware, Xen and Microsoft Virtualization. The virtual machines are virtual computing instances provided by commercial cloud computing providers. The cloud computing providers include Amazon.com's EC2, RackSpace, SoftLayer, AT&T, GoGrid, Verizon, Fujitsu, Voxel, Google, Microsoft, FlexiScale, among others. The network is an overlay network superimposed over an underlying network. The network links are virtual network links of the underlying network. The underlying network may be the Internet, WAN, Wireless Network or a private network. The traffic processing units are distributed at different geographic locations. The traffic processing units are added or removed via an Application Programming Interface (API).
In general, in another aspect, the invention features a system for automatic scaling of the processing capacity and bandwidth capacity of a network. The system includes a network comprising a plurality of traffic processing units and a plurality of network links, monitoring means for monitoring processing capacity demand and bandwidth capacity demand of the network and managing means for adding traffic processing units to the network, removing traffic processing units from the network, connecting links to the network and disconnecting links from the network. The monitoring means monitor processing capacity demand and bandwidth capacity demand of the network and provide processing capacity demand information and bandwidth capacity demand information to the managing means. The managing means dynamically adjust the processing capacity of the network by selectively adding or removing traffic processing units in the network upon receiving information of processing capacity demand increase or processing capacity demand decrease, respectively. The managing means also dynamically adjust bandwidth capacity of the network by selectively connecting or disconnecting links in the network upon receiving information of bandwidth capacity demand increase or bandwidth capacity demand decrease, respectively.
Among the advantages of the invention may be one or more of the following. The network system is adaptive so that it always “provisions” optimal capacity in response to the demand, eliminating capacity waste without sacrificing service quality, as shown in FIG. 2B. The network system is horizontally scalable. Its capacity increases linearly by just adding more traffic processing nodes to the system. It is also fault-tolerant. Failure of individual components within the system does not cause system failure. In fact, the system assumes component failures as common occurrences and is able to run on commodity hardware to deliver high performance and high availability services.
BRIEF DESCRIPTION OF THE DRAWINGS
The details of one or more embodiments of the invention are set forth in the accompanying drawings and description below. Other features, objects and advantages of the invention will be apparent from the following description of the preferred embodiments, the drawings and from the claims.
FIG. 1 shows the current Internet routing (prior art);
FIG. 1A is a graph of the network capacity demand versus time in a prior art network with fixed capacity;
FIG. 2 shows a cloud routing network of the present invention;
FIG. 2A shows the global locations of a geographically distributed network;
FIG. 2B is a graph of the network capacity demand versus time in an adaptive network that changes its capacity based on demand;
FIG. 3 shows the functional blocks of the cloud routing system of FIG. 2;
FIG. 4 shows the traffic processing pipeline in the cloud routing network of FIG. 2;
FIG. 5 shows the cloud routing workflow of the present invention;
FIG. 6 shows the process of network capacity auto-scaling and route convergence of the present invention;
FIG. 7 shows the node management workflow of the present invention;
FIG. 8 shows various components in a cloud routing network;
FIG. 9 shows a traffic management unit (TMU); and
FIG. 10 shows the various sub-components of a traffic processing unit (TPU).
DETAILED DESCRIPTION OF THE INVENTION
Cloud Routing Network
The present invention describes a cloud routing network that is implemented as an overlay virtual network or as a physical network. By way of background, we use the term “cloud routing network” to refer to a network (virtual or physical) that includes traffic processing nodes (TPUs) deployed at various locations inter-connected by network links, through which client traffic travels to destinations. A cloud routing network can be a virtual overlay network superimposed on an underlying physical network, a physical network or a combination of both. Referring to FIG. 2, the cloud routing network 300 includes router clouds 340, 350 and 360, which are superimposed over a physical network 370, which in this case is the Internet. Cloud 340 includes TPUs 342, 344, 346. Cloud 350 includes TPUs 352, 354 and cloud 360 includes TPUs 362, 364. Each TPU has a certain amount of processing capacity. The TPUs are connected to each other via network links. Each link possesses a certain amount of bandwidth. The processing capacity of the cloud network is the combined processing capacities of all the TPUs. The bandwidth capacity of the cloud network is the combined bandwidth capacity of all the links.
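By way of illustration only, the capacity definitions above can be sketched in Python. The class names, field names and units (requests per second, megabits per second) are assumptions made for this example and are not part of the specification; the node identifiers simply reuse the reference numerals of FIG. 2.

```python
from dataclasses import dataclass, field


@dataclass
class TPU:
    node_id: int            # reference numeral used as an identifier, e.g. 342
    cloud_id: int           # router cloud the TPU belongs to, e.g. 340
    processing_rps: float   # assumed unit: requests per second


@dataclass
class Link:
    a: int                  # node_id of one end point
    b: int                  # node_id of the other end point
    bandwidth_mbps: float   # assumed unit: megabits per second


@dataclass
class CloudRoutingNetwork:
    tpus: dict = field(default_factory=dict)
    links: list = field(default_factory=list)

    def add_tpu(self, tpu):
        self.tpus[tpu.node_id] = tpu

    def add_link(self, link):
        self.links.append(link)

    def processing_capacity(self):
        # Combined processing capacity of all TPUs.
        return sum(t.processing_rps for t in self.tpus.values())

    def bandwidth_capacity(self):
        # Combined bandwidth capacity of all links.
        return sum(l.bandwidth_mbps for l in self.links)


# Hypothetical capacities for three of the TPUs of FIG. 2.
network = CloudRoutingNetwork()
for node_id, cloud_id in [(342, 340), (344, 340), (352, 350)]:
    network.add_tpu(TPU(node_id, cloud_id, processing_rps=1000.0))
network.add_link(Link(342, 344, bandwidth_mbps=100.0))
network.add_link(Link(342, 352, bandwidth_mbps=50.0))
```

In this sketch, adding a TPU or a link raises the corresponding total, which is the property the capacity scaling described below relies on.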
Cloud network 300 also includes a traffic management system 330, a traffic processing system 334, a data processing system 332 and a monitoring system 336. These systems are specialized software that the traffic processing nodes run in order to perform functions such as traffic monitoring, TPU node management, traffic re-direction, traffic splitting, load balancing, traffic inspection, traffic cleansing, traffic optimization, route selection, route optimization, among others. In one example, cloud network 300 is implemented as a virtual network that includes virtual machines at various commercially available cloud computing data centers, such as Amazon.com's Elastic Compute Cloud (EC2), SoftLayer, RackSpace, GoGrid, FlexiScale, AT&T, Verizon, Fujitsu, Voxel, among others. These cloud computing data centers provide the physical infrastructure to add or remove TPU nodes dynamically, which further enables the virtual network to scale both its processing capacity and network bandwidth capacity. When traffic grows to a certain level, the network starts up more TPUs, adds links to these new TPU nodes and thus increases the network's processing power as well as bandwidth capacity. When traffic level decreases to a certain threshold, the network shuts down certain TPUs to reduce its processing and bandwidth capacity.
The traffic management system 330 directs network traffic to its traffic processing units (TPU). The monitoring system 336 monitors the network traffic, the traffic processing system 334 inspects and processes the network traffic and the data processing system 332 gathers data from different sources and provides global decision support and means to configure and manage the system. Referring to FIG. 3, the functional components of the cloud routing system 300 include a traffic management interface unit 410, a traffic redirection unit 420, a traffic routing unit 430, a node management unit 440, a monitoring unit 450 and a data repository 460. The traffic management interface unit 410 includes a management user interface (UI) 412 and a management API 414.
For a virtual overlay network based cloud routing network, most TPU nodes are virtual machines running specialized traffic handling software. Various TPU nodes may belong to different clouds. Each cloud itself is a collection of nodes located in the same data center (or the same geographic location). Some nodes perform traffic management. Some nodes perform traffic processing. Some nodes perform monitoring and data processing. Some nodes perform management functions to adjust the network's capacity. Some nodes perform access management and security control. These nodes are connected to each other via the underlying network 370. The connection between two nodes may contain many physical links and hops in the underlying network, but these links and hops together form a conceptual “virtual link” that conceptually connects these two nodes directly. All these virtual links together with the TPU nodes form a virtual network. Each node has only a fixed amount of bandwidth and processing capacity. The capacity of the network is the sum of the capacity of all nodes, and thus a cloud routing network has only a fixed amount of processing and network capacity at any given moment. This fixed amount of capacity may be insufficient or excessive for the traffic demand. By adjusting the capacity of individual nodes or by adding or removing nodes, the network is able to adjust its processing power as well as bandwidth capacity.
In the case when a cloud routing network is primarily a physical network, most TPU nodes are physical machines running specialized traffic handling software, including general purpose computers as well as specially designed hardware appliances. Again, various TPU nodes may belong to different clouds. In each cloud, some nodes perform traffic management. Some nodes perform traffic processing. Some nodes perform monitoring and data processing. Some nodes perform management functions to adjust the network's capacity. Some nodes perform access management and security control. These nodes are connected to each other via network links. These links together with the TPU nodes form a network. Each node has only a fixed amount of bandwidth and processing capacity. The capacity of this network is the sum of the capacity of all nodes, and thus a cloud routing network has only a fixed amount of processing and network capacity at any given moment. This fixed amount of capacity may be insufficient or excessive for the traffic demand. By adjusting the capacity of individual nodes or by adding or removing nodes, the network is able to adjust its processing power as well as bandwidth capacity.
Traffic Processing
The invention uses a cloud routing network service to process traffic and thus delivers “conditioned” traffic from source to destination according to delivery requirements. FIG. 2 shows a typical traffic processing service. When a client 305 issues a request to a network service running on servers 550, the cloud routing network 300 processes the request by doing the following steps:
- 1. Traffic management service 330 intercepts the request and routes it to a TPU node;
- 2. The TPU node checks the service's specific policy and performs the pipeline processing shown in FIG. 4;
- 3. If necessary, a global data repository 332 is used for data collection and data analysis for decision support;
- 4. If necessary, the client request is routed to the next TPU node, e.g., from TPU 342 to TPU 352; and then
- 5. The request is sent to an “optimal” server 550 for processing.
More specifically, when a client issues a request to a server (for example, a consumer enters a web URL into a web browser to access a web site), the default Internet routing mechanism would route the request through the network hops along a certain network path from the client to the target server (“default path”). Using a cloud routing network, if there are multiple server nodes, the cloud routing network first selects an “optimal” server node from the multiple server nodes as the target server node to serve the request. This server node selection process takes into consideration factors including load balancing, performance, cost, and geographic proximity, among others. Secondly, instead of going through the default path, the traffic management service redirects the request to an “optimal” TPU within the overlay network (“optimal” is defined by the system's routing policy, such as being geographically nearest, most cost effective, or a combination of a few factors). This “optimal” TPU further routes the request to a second “optimal” TPU within the cloud routing network if necessary. For performance and reliability reasons, these two TPU nodes communicate with each other using either the best available or an optimized transport mechanism. Then the second “optimal” node may route the request to a third “optimal” node and so on. This process can be repeated within the cloud routing network until the request finally arrives at the target. The set of “optimal” TPU nodes together form a “virtual” path along which traffic travels. This virtual path is chosen in such a way that a certain routing measure (such as performance, cost, carbon footprint, or a combination of a few factors) is optimized. When the server responds, the response goes through a similar pipeline process within the cloud routing network until it reaches the client.
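As a non-limiting sketch of how a routing policy might define “optimal”, the following Python example ranks candidate TPUs by a weighted sum of routing measures, where a lower score is better. The measure names (`latency_ms`, `cost`), the weights and the sample values are hypothetical and chosen only for illustration; the specification does not prescribe a particular scoring formula.

```python
def score(candidate, weights):
    """Weighted sum of the candidate's routing measures; lower is better."""
    return sum(weights[name] * candidate[name] for name in weights)


def select_optimal(candidates, weights):
    """Return the candidate with the lowest weighted score."""
    return min(candidates, key=lambda c: score(c, weights))


# Hypothetical measurements for the three TPUs of cloud 340.
candidates = [
    {"tpu": 342, "latency_ms": 40.0, "cost": 1.0},
    {"tpu": 344, "latency_ms": 25.0, "cost": 3.0},
    {"tpu": 346, "latency_ms": 60.0, "cost": 0.5},
]

# A policy that optimizes performance only.
performance_policy = {"latency_ms": 1.0}
# A policy that combines performance and cost (assumed weights).
blended_policy = {"latency_ms": 1.0, "cost": 10.0}

fastest = select_optimal(candidates, performance_policy)
blended = select_optimal(candidates, blended_policy)
```

Applying the same selection at each hop would yield the sequence of “optimal” nodes that forms the virtual path; changing the weights changes the routing measure being optimized without changing the selection code.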
FIG. 5 shows a typical network routing process. In this embodiment, the traffic management service utilizes a Domain Name System (DNS) mechanism. The customer 801 configures the DNS record for an application so that DNS queries are processed by the cloud routing network 800, as shown in FIG. 8. Typical ways of configuring DNS records include setting the DNS server, the CNAME record or the “A” record of the application to a DNS server provided by the cloud routing network. When a client wants to access the application (e.g. www.somesite.com), the client needs to resolve the hostname to an IP address. The cloud routing network receives the DNS query. Based on the current routing policy, the network 800 first selects an “optimal” server node among the plurality of server nodes that the application is running on, and then selects an entry router 803. The IP address of the entry router node 803 is returned as a result of the DNS query. When the entry router 803 receives a message from the client 801, it selects an optimal exit router node 804, optimal path 805 as well as an optimal transport mechanism to deliver the message. The exit router node 804 receives the message, and further delivers it to the target server node 820. In this process, client IP, path information and performance metrics data are collected and logged in data processing unit (DPU) 806, which can be used for future path selection and node selection.
Process Capacity Scaling and Bandwidth Capacity Scaling
The invention enables a network to adjust its process capacity and bandwidth in response to traffic demand variations. The cloud routing network 300 monitors traffic demand, load conditions, network performance and various other factors via its monitoring service 336. When certain conditions are met, it dynamically launches new nodes at appropriate locations, activates links to these new nodes and spreads traffic to these new nodes in response to increased demand, or shuts down some existing nodes in response to decreased traffic demand. The net result is that the cloud routing network dynamically adjusts its processing and network capacity to deliver optimal results while eliminating unnecessary capacity waste and carbon footprint.
A cloud routing network utilizes an Application Programming Interface (API) from individual nodes to add or remove nodes from the network. Cloud computing providers typically provide APIs that allow a third party to manage machine instances. For example, Amazon.com's EC2 provides Amazon Web Services (AWS) based APIs through which a third party can send web services messages to interact with and manage virtual machine instances, such as starting a new node, shutting down an existing node, checking the status of a node, etc. The managing means of the cloud routing network typically utilizes such APIs to add or remove traffic processing nodes and links, thus adjusting the network's capacity.
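The following Python sketch illustrates, under stated assumptions, how managing means might add and remove nodes through a provider API. The `CloudProvider` interface and its method names are hypothetical and do not correspond to the API of any actual cloud computing provider; `InMemoryProvider` is a stand-in used only so that the example is self-contained.

```python
from typing import Protocol


class CloudProvider(Protocol):
    """Hypothetical provider interface; real provider APIs differ."""

    def start_instance(self, region: str) -> str: ...
    def stop_instance(self, instance_id: str) -> None: ...
    def instance_status(self, instance_id: str) -> str: ...


class InMemoryProvider:
    """Stand-in provider used only for illustration."""

    def __init__(self):
        self._instances = {}
        self._next_id = 0

    def start_instance(self, region):
        self._next_id += 1
        instance_id = f"{region}-{self._next_id}"
        self._instances[instance_id] = "running"
        return instance_id

    def stop_instance(self, instance_id):
        self._instances[instance_id] = "stopped"

    def instance_status(self, instance_id):
        return self._instances.get(instance_id, "unknown")


class NodeManager:
    """Adds and removes TPU nodes through whatever provider it is given."""

    def __init__(self, provider: CloudProvider):
        self.provider = provider
        self.active = []          # identifiers of nodes currently in the network

    def add_node(self, region):
        instance_id = self.provider.start_instance(region)
        self.active.append(instance_id)
        return instance_id

    def remove_node(self, instance_id):
        self.provider.stop_instance(instance_id)
        self.active.remove(instance_id)


manager = NodeManager(InMemoryProvider())
first = manager.add_node("region-a")
second = manager.add_node("region-b")
manager.remove_node(first)
```

Keeping the provider behind a narrow interface is one way to let the same managing logic operate across several cloud computing providers, each wrapped by its own adapter.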
FIG. 6 depicts two important aspects of the cloud routing network: adaptive scaling and path convergence. Based on the continuously collected metrics data from monitor nodes and logs, the node management module 440 (shown in FIG. 3) checks the current capacity and takes actions. When it detects that capacity is “insufficient” according to a certain measure, it starts new router nodes. The router table is updated to include the new routers and thus spreads traffic to the new routers. When too much capacity is detected, the node management module selectively shuts down some of the router nodes after traffic to these nodes has been drained. The router tables are updated by removing these router nodes from the tables. At any time, when an event such as router failure or path condition change occurs, the router table is updated to reflect the change. The updated router table is used for subsequent routing.
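A minimal sketch of the adaptive scaling decision is given below in Python. It assumes that capacity is judged by utilization, that every router node has the same capacity, and that 80% and 30% utilization mark “insufficient” and “too much” capacity respectively; these thresholds, the function names and the integer router identifiers are assumptions made for this example. Draining of traffic before shutdown is omitted.

```python
def utilization(node_count, demand, per_node_capacity):
    """Fraction of total capacity in use; infinite when there are no nodes."""
    if node_count == 0:
        return float("inf")
    return demand / (node_count * per_node_capacity)


def rebalance(router_table, demand, per_node_capacity,
              high_water=0.8, low_water=0.3):
    """Return an updated router table (a list of router identifiers)."""
    table = list(router_table)
    next_id = max(table, default=0) + 1

    # Insufficient capacity: start new router nodes and add them to the table.
    while utilization(len(table), demand, per_node_capacity) > high_water:
        table.append(next_id)
        next_id += 1

    # Too much capacity: remove router nodes, keeping at least one and never
    # removing a node if doing so would push utilization above high_water.
    while (len(table) > 1
           and utilization(len(table), demand, per_node_capacity) < low_water
           and utilization(len(table) - 1, demand, per_node_capacity) <= high_water):
        table.pop()

    return table


scaled_up = rebalance([1, 2], demand=250, per_node_capacity=100)
scaled_down = rebalance(scaled_up, demand=50, per_node_capacity=100)
```

The gap between the two thresholds acts as hysteresis, so that small fluctuations in demand do not cause nodes to be started and shut down repeatedly.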
Further, the cloud routing network can quickly recover from “fault”. When a fault such as node failure and link failure occurs, the system detects the problem and recovers from it by either starting a new node or selecting an alternative route. As a result, though individual components may not be reliable, the overall system is highly reliable.
Traffic Processing Unit Node Management
Node management module 440 provides services for managing the TPU nodes, such as starting a virtual machine (VM) instance, stopping a VM instance and recovering from a node failure, among others. In accordance with the node management policies in the system, this service launches new nodes when the traffic demand is high and it shuts down some nodes when it detects these nodes are not necessary any more.
The node monitoring module 450 monitors the TPU nodes over the network, collects performance and availability data, and provides feedback to the cloud routing system 300. This feedback is then used to make decisions such as when to scale up and when to scale down. Data repository 460 contains data for the cloud routing system, such as Virtual Machine Image (VMI), application artifacts (files, scripts, and configuration data), routing policy data, and node management policy data, among others.
FIG. 7 shows the node management workflow. When the system receives a node status change event from its monitoring agents, it first checks whether the event signals a node down. If so, the node is removed from the system. If the system policy says “re-launch failed nodes”, the node controller will try to launch a new node. Then the system checks whether the event indicates that the current set of server nodes is becoming overloaded. If so, at a certain threshold, and if the system's policy permits, a node manager will launch new nodes and notify the traffic management service to spread load to the new nodes. Finally, the system checks to see whether it is in the state of “having too much capacity”. If so and the node management policy permits, a node controller will try to shut down a certain number of nodes to eliminate capacity waste.
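The workflow above can be sketched in Python as a single event handler that applies the three checks in order. The event fields (`node_down`, `utilization`), the policy flags and the threshold values are hypothetical names and numbers introduced for this example only.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    relaunch_failed_nodes: bool = True
    allow_scale_up: bool = True
    allow_scale_down: bool = True
    overload_threshold: float = 0.8   # assumed utilization threshold
    excess_threshold: float = 0.3     # assumed utilization threshold


@dataclass
class NodeController:
    policy: Policy
    nodes: set = field(default_factory=set)
    _counter: int = 0

    def launch(self):
        self._counter += 1
        node = f"node-{self._counter}"
        self.nodes.add(node)
        return node

    def handle_event(self, event):
        """Apply the checks in order and return the list of actions taken."""
        actions = []

        # 1. Node down: remove it and, if the policy says so, re-launch.
        failed = event.get("node_down")
        if failed in self.nodes:
            self.nodes.discard(failed)
            actions.append("removed")
            if self.policy.relaunch_failed_nodes:
                self.launch()
                actions.append("relaunched")

        load = event.get("utilization")
        if load is not None:
            # 2. Overload: launch a new node if the policy permits.
            if load > self.policy.overload_threshold and self.policy.allow_scale_up:
                self.launch()
                actions.append("scaled_up")
            # 3. Too much capacity: shut one node down, keeping at least one.
            elif (load < self.policy.excess_threshold
                  and self.policy.allow_scale_down
                  and len(self.nodes) > 1):
                self.nodes.remove(sorted(self.nodes)[-1])
                actions.append("scaled_down")

        return actions


controller = NodeController(Policy())
node_a = controller.launch()
node_b = controller.launch()
after_failure = controller.handle_event({"node_down": node_a})
after_overload = controller.handle_event({"utilization": 0.95})
after_excess = controller.handle_event({"utilization": 0.1})
```

Notification of the traffic management service after a scale-up is left out of the sketch; in practice it would follow the `launch` call.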
In launching new nodes, the system picks the best geographic region to launch the new node. Globally distributed cloud environments such as Amazon.com's EC2 cover several continents, as shown in FIG. 2A. Launching new nodes at appropriate geographic locations helps spread application load globally, reduce network traffic and improve application performance. In shutting down nodes to reduce capacity waste, the system checks whether session stickiness is required for the application. If so, shutdown is delayed until all current sessions on these nodes have expired.
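For illustration, the two decisions described above can be sketched in Python as follows. The criterion used for the “best” region (highest load per running node) is an assumption of this example, since the specification leaves the criterion to system policy; the function names and time units are likewise hypothetical.

```python
def pick_region(load_by_region, nodes_by_region):
    """Pick the region with the highest load per running node (assumed rule)."""
    def load_per_node(region):
        # Treat a region with no nodes as having one, to avoid dividing by zero.
        return load_by_region[region] / max(nodes_by_region.get(region, 0), 1)
    return max(load_by_region, key=load_per_node)


def earliest_shutdown_time(now, session_expiries, sticky):
    """Return the earliest time at which a node may be shut down.

    Without session stickiness the node may be shut down immediately;
    with it, shutdown waits until the last current session has expired.
    """
    if not sticky or not session_expiries:
        return now
    return max(now, max(session_expiries))


# Hypothetical load figures and node counts per region.
region = pick_region(
    {"us": 900, "eu": 400, "asia": 300},
    {"us": 3, "eu": 1, "asia": 2},
)
shutdown_at = earliest_shutdown_time(100, [120, 180, 90], sticky=True)
```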
The cloud routing network contains a monitoring service 336 (that includes monitoring module 450) that provides the necessary data to the cloud routing network 300 as the basis for its decisions. Various embodiments implement a variety of techniques for monitoring. The following lists a few examples of monitoring techniques:
- 1. Internet Control Message Protocol (ICMP) Ping: A small IP packet that is sent over the network to detect route and node status;
- 2. traceroute: a technique commonly used to check network route conditions;
- 3. Host agent: an embedded agent running on host computers that collects data about the host;
- 5. Security monitoring: a monitor node periodically scans a target system for security vulnerabilities, using techniques such as network port scanning and network service scanning to determine which ports are publicly accessible and which network services are running, and further determines whether there are vulnerabilities;
- 6. Content security monitoring: a monitor node periodically crawls a web site and scans its content to detect infected or undesirable content, such as malware, spyware, undesirable adult content, or viruses, among others.
The above examples are for illustration purposes. The present invention is agnostic and accommodates a wide variety of ways of monitoring. An embodiment of the present invention employs all of the above techniques for monitoring different target systems: it uses ICMP, traceroute and host agents to monitor the cloud routing network itself, and uses web performance monitoring, network security monitoring and content security monitoring to monitor the availability, performance and security of target network services such as web applications. A data processing system (DPS) aggregates data from such monitoring services and provides all other services with global visibility into such data.
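As a hedged sketch of the aggregation role of the DPS, the following Python example collects samples reported by different monitoring techniques and exposes a per-target summary to other services. The class name, method names, metric names and sample values are assumptions made for this example; the specification does not define the DPS data model.

```python
from collections import defaultdict
from statistics import mean


class DataProcessingSystem:
    """Aggregates monitoring samples and offers a global view of them."""

    def __init__(self):
        # Maps (target, metric) to a list of (source, value) samples.
        self._samples = defaultdict(list)

    def record(self, source, target, metric, value):
        self._samples[(target, metric)].append((source, value))

    def summary(self, target, metric):
        """Return count, mean and max for a metric, or None if no data."""
        values = [v for _, v in self._samples.get((target, metric), [])]
        if not values:
            return None
        return {"count": len(values), "mean": mean(values), "max": max(values)}


dps = DataProcessingSystem()
dps.record("icmp_ping", "tpu-342", "rtt_ms", 20.0)
dps.record("icmp_ping", "tpu-342", "rtt_ms", 30.0)
dps.record("host_agent", "tpu-342", "cpu_utilization", 0.5)

rtt_summary = dps.summary("tpu-342", "rtt_ms")
```

A node management service could consult such summaries when deciding whether to scale up or scale down.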
Several embodiments of the present invention have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. Accordingly, other embodiments are within the scope of the following claims.