METHOD AND APPARATUS FOR WFQ SCHEDULING USING A PLURALITY OF SCHEDULING QUEUES TO PROVIDE FAIRNESS, HIGH SCALABILITY, AND LOW
COMPUTATION COMPLEXITY
Technical Field
The present invention relates to a scheduler which implements a scheduling algorithm for fixed length cells or variable length packets. The present invention provides a fair allocation of bandwidth among multiple connections based on each connection's weight, and can be scaled without any granularity restrictions. The scheduler can perform its computations in constant time, regardless of the number of connections it supports. Further, the scheduler can interleave packets from different connections to decrease the burstiness of the traffic.
Background of the Invention
There are different types of applications in an integrated service packet network such as an Asynchronous Transfer Mode (ATM) network. Some applications, such as voice or real-time media stream, need to be transmitted with little delay and little delay jitter. Similarly, applications, such as remote log-in or on-line trading, can tolerate only small amounts of delay. Other applications, such as e-mail or FTP, can tolerate longer delays, and therefore do not
need to be transmitted within a strict delay or delay jitter constraint. Because of the diverse range of acceptable delays for various applications, it is very important to support different levels of qualities of service (QoS) for these various applications. Also, it is important to allocate bandwidth among the connections in a fair manner so that a connection which is sending high rate or bursty traffic cannot occupy more bandwidth than it should be occupying. Thus, a scheduler should fairly allocate bandwidth among multiple connections according to their weights, so a connection sending traffic having an excessively high rate or bursty traffic will not adversely affect other connections .
In a system where a plurality of virtual connections (VCs) are competing to share a common resource (such as the same input port, output port, etc.), one way to control delay and fairly allocate that resource among the connections is to assign different weights to each VC and configure the system to serve the VCs according to their weights. Weight-based scheduling schemes have received a lot of attention, and many weight-based scheduling algorithms have been proposed, such as:
• The Generalized Processor Sharing (GPS) approach (See "A Generalized Processor Sharing Approach to Flow Control in Integrated Services Networks: The Single-Node Case," by Abhay K. Parekh and Robert G. Gallager, published in IEEE/ACM Transactions on Networking, Vol. 1, No. 3, June 1993, p. 344-357),
• The Weighted Fair Queuing (WFQ) approach (See the above-mentioned Parekh article),
• The Worst Case Weighted Fair Queuing (WF2Q) approach (See "WF2Q: Worst-case Fair Weighted Fair Queuing," by J.C. Bennett and H. Zhang, published in Proceedings of IEEE INFOCOM '96, p. 120-128),
• The VirtualClock approach (See "VirtualClock: A New Traffic Control Algorithm for Packet Switching Networks," by L. Zhang, published in Proceedings of ACM SIGCOMM 1990, p. 19-29 and "Leap Forward Virtual Clock: A new Fair Queuing Scheme with Guaranteed Delays and Throughput Fairness," by Subhash Suri, G. Varghese and Girish Chandranmenon, published in IEEE INFOCOM '97),
• The Self-Clocked Fair Queuing (SCFQ) approach (See "A Self-Clocked Fair Queuing Scheme for Broadband Applications," by S. J. Golestani, published in Proceedings of IEEE INFOCOM 1994, p. 636-646),
• The Delay Earliest Due Date (Delay-EDD) and Jitter Earliest Due Date (Jitter-EDD) approaches (See "Comparison of Rate-Based Service Disciplines," by H. Zhang and S. Keshav, published in Proceedings of ACM SIGCOMM 1991, p. 113-121), and
• The Head-of-the-Line Earliest Due Date (HOL-EDD) approach (See "HOL-EDD: A Flexible Service Scheduling Scheme for ATM Networks," by M. Vishnu and J. W. Mark, published in Proceedings of IEEE INFOCOM 1996, p. 647-654).
The entire disclosures of each of the above-cited publications are incorporated herein by reference.
The commonality of these algorithms is that each one is based on time stamps that are assigned to each incoming packet or to each packet queue. The packets are then sent according to a time stamp sorting result.
While the GPS approach is very flexible and provides perfect fairness (which thereby would give users widely different performance guarantees), GPS is an idealized approach which is not practical to implement because GPS assumes that the scheduler can serve multiple sessions simultaneously, and further assumes that packet traffic is infinitely divisible. The WFQ approach (which is also called packet-by-packet GPS, or PGPS) and the WF2Q approach are packet-by-packet disciplines which closely approximate the GPS approach. Both WFQ and WF2Q have attractive characteristics with respect to delay and fairness, but are impractical to implement because of the intensive computations that these approaches require. SCFQ and VirtualClock reduce the computation complexity of WFQ and WF2Q using an approximation algorithm to simplify the calculation of the virtual time. Because these two algorithms use the internally generated virtual time to reflect the progress of work in the system (rather than using virtual time generated in the hypothetical GPS system), the performance of these two algorithms is not as good as that of WFQ and WF2Q. Therefore, a need has existed within the art for a scheduling algorithm that can provide high performance with minimal computation complexity.
A major problem with these time stamp-based algorithms is the computation cost. Figure 1 shows the basic configuration used in implementing time stamp-based algorithms. As packets arrive, a time stamp is attached thereto. Then, the packets are stored in the buffer. The algorithm needs to calculate virtual time for each packet or for each connection. Then, the algorithm requires that each packet or connection be sorted according to their time stamps in order to serve the packet or connection with the minimum time stamp. If no approximation method is employed, the computation cost for the time stamp-based algorithm is O(log2 N), where N is the number of backlogged queues. Because typical systems have thousands of VCs multiplexed into one link, the resultant high computation cost makes these algorithms impractical to implement in a high-speed system.
Another traditional scheduling algorithm that is often implemented is the Weighted Round Robin (WRR) approach. Figure 2(a) depicts a conventional WRR algorithm. The backlogged connections 100 (labeled A-F) are connected in a ring. As can be seen, each connection has a weight, W, assigned thereto. A connection with weight 2 will get served twice as often as a connection with weight 1. In the example of Figure 2(a), the scheduler can serve connection A three times (e.g. dequeue 3 packets from connection A) before moving on to connection B, which gets served twice. Thereafter, connections C, D, E, and F get served once, three times, once, and twice respectively as the scheduler moves from connection to connection. With this conventional WRR algorithm, a high weight connection may block a low weight connection, thereby increasing the traffic delay.
An improved WRR algorithm was proposed in "Weighted Round-Robin Cell Multiplexing in a General-Purpose ATM Switch Chip," authored by M. Katevenis et al. and published in IEEE J. Select. Areas Commun., Vol. 9, No. 8, p. 1265-1279, Oct. 1991, the entire disclosure of which is incorporated herein by reference.
Figure 2(b) depicts the improved WRR algorithm. Rather than successively serving each connection by a number of packets equal to that connection's weight before moving on to the next connection, the improved WRR algorithm moves to the next connection after serving a previous connection by one packet. To serve the connections according to their weights, multiple "trips" around the connections may have to be made. As shown in Figure 2(b), connection A (whose weight is 3) will get served by one packet. After such service, connection A's residual weight will be decremented by one. Next, connection B (whose weight is 2) will get served by one packet, and then its residual weight will be decremented by one. This pattern will repeat itself as the scheduler moves from connection to connection, and makes a number of "roundtrips" equal to the highest weight assigned to a connection. Service to a given connection will occur in a roundtrip so long as that connection's residual weight is greater than zero. Once all connections have a residual weight equal to zero, the serving cycle will be reset with each connection being accorded its maximum weight. In Figure 2(b), there will be 3 roundtrips. In the first roundtrip, the serving order will be ABCDEF. In the second roundtrip, the serving order will be ABDF. In the third roundtrip, the serving order will be AD. Thus, the overall serving pattern for a full serving cycle will be ABCDEF-ABDF-AD.
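The two round robin variants of Figures 2(a) and 2(b) can be contrasted with a short sketch. The weights (A=3, B=2, C=1, D=3, E=1, F=2) are read off the discussion above; the function names are hypothetical and chosen only for illustration.

```python
# Weights as described for Figures 2(a) and 2(b).
WEIGHTS = {"A": 3, "B": 2, "C": 1, "D": 3, "E": 1, "F": 2}


def conventional_wrr(weights):
    """Serve each connection W times in a row before moving to the next one."""
    order = []
    for name, w in weights.items():
        order.extend([name] * w)
    return "".join(order)


def interleaved_wrr(weights):
    """Serve one packet per visit; repeat roundtrips while residual weight remains."""
    residual = dict(weights)
    roundtrips = []
    while any(r > 0 for r in residual.values()):
        trip = []
        for name in weights:
            if residual[name] > 0:
                trip.append(name)
                residual[name] -= 1
        roundtrips.append("".join(trip))
    return roundtrips


print(conventional_wrr(WEIGHTS))           # AAABBCDDDEFF
print("-".join(interleaved_wrr(WEIGHTS)))  # ABCDEF-ABDF-AD
```

Both variants serve each connection the same number of times per cycle; only the spacing of the service differs.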
The improved WRR algorithm serves connections more evenly in time, but still maintains the same serving frequencies. This algorithm can be easily implemented in hardware, and the computation cost for the serving position has only O(1) complexity. However, because it takes O(log2 N) time to search through the binary tree to find the next VC entry to serve, this scheme is not scalable to a high speed system. Also, the round robin techniques of Figures 2(a) and 2(b) are unable to guarantee fairness among different flows if variable length packets are served because the scheduler fails to take into account packet length when serving packets.
The Deficit Round Robin (DRR) technique shown in Figure 3 addresses the unfairness using a deficit counter. As shown in Figure 3, a weight W is assigned to each FlowID. Also, a deficit counter (D) is maintained for each FlowID. Initially, D is set equal to W. When a given FlowID is selected for service, the scheduler checks whether the D associated with that FlowID is larger than or equal to the length of the packet at the head of the FlowID's queue. If D is larger than or equal to that packet's length, the packet is served. Thereafter D is decremented by the packet's length, and the scheduler checks the next packet in the FlowID's queue. As long as the packet at the head of the FlowID's queue has a length less than or equal to D, the FlowID will continue to be served. If D drops below the packet's length, then D is incremented by W and the scheduler requeues the current FlowID at the end of the serving queue before proceeding to the next FlowID. The DRR technique shown in Figure 3 can guarantee fairness in terms of throughput, but it allows burstiness, especially for FlowIDs having a large weight.
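The DRR steps described above can be sketched as follows. This is a minimal illustration, not the implementation of Figure 3: the function name, the data layout, and the choice to drop a flow from the serving queue once its packet queue empties are all assumptions.

```python
from collections import deque


def drr(flows, weights):
    """flows: FlowID -> list of packet lengths; weights: FlowID -> W.

    Returns the service order as (FlowID, packet length) pairs.
    """
    queues = {f: deque(pkts) for f, pkts in flows.items()}
    deficit = dict(weights)                      # D is initially set equal to W
    serving = deque(f for f in flows if queues[f])
    served = []
    while serving:
        f = serving.popleft()
        q = queues[f]
        # Keep serving while the head packet fits within the deficit counter.
        while q and q[0] <= deficit[f]:
            length = q.popleft()
            deficit[f] -= length
            served.append((f, length))
        if q:
            # Head packet no longer fits: add W and requeue the flow at the end.
            deficit[f] += weights[f]
            serving.append(f)
        # Assumption: a flow whose queue is empty simply leaves the serving queue.
    return served


order = drr({"X": [3, 3, 3], "Y": [2, 2, 2, 2]}, {"X": 6, "Y": 2})
print("".join(f for f, _ in order))  # XXYXYYY
```

The back-to-back service of X at the start of the run illustrates the burstiness that a large weight permits.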
U.S. Patent No. 6,101,193 issued to Ohba, the entire disclosure of which is incorporated herein by reference, discloses a modified version of DRR that seeks to improve short-term fairness. Figure 4 generally discloses the modified DRR technique of the '193 patent. As shown in Figure 4, the scheduler maintains two queues: a serving queue and a waiting queue. Backlogged flows are initially placed in the serving queue. The scheduler selects each flow in round robin fashion. The packet at the head of the selected flow is served if the flow's D value is not less than the length of the packet at the head of the flow. When a packet from a flow is served, that flow's D value is decremented by the length of the served packet. Thereafter, that flow is moved to the end of the serving queue. If the packet at the head of the selected flow has a length greater than D, then that flow's D value is incremented by W and the flow is moved to the end of the waiting queue. No packet is served at this time. Once the serving queue becomes empty, the scheduler treats the waiting queue as the serving queue and the serving queue as the waiting queue, and the process continues. This modified DRR scheme provides fairness through interleaved service, but it requires the scheduler to constantly check whether a flow's allocated weight has been exhausted before deciding whether to serve a packet.
Time-Wheel is another scheduling approach used in ATM systems. As shown in Figure 5, several queues 102 are organized in a round topology. Each queue 102 contains zero or more backlogged connections (denoted as A, B, C, D, F, and G). Two pointers, the time pointer 101 and the transmission pointer 103, are employed to control the scheduler. The time pointer 101 moves clockwise from one queue to the next for each cell time. All connections behind time pointer 101 are eligible to be served. The transmission pointer 103, which also moves clockwise, either lags or steps together with time pointer 101. The connections in the queues pointed to by the transmission pointer 103 get served one by one. External controller 104 places the connections into one of the queues.
U.S. Patent No. 6,041,059 issued to Joffe et al., the entire disclosure of which is incorporated herein by reference, describes an improved Time-Wheel algorithm. The computation complexity of the Time-Wheel algorithm is independent of the number of VCs, and the scheduler can precisely pace the assigned bandwidth described in [i, m] (i cells and m cell time). However, several limitations hinder this approach. The first limitation is that the Time-Wheel scheme cannot serve more than m/i VCs. The second limitation is that fairness cannot be guaranteed when the bandwidth is oversubscribed. Therefore, the Time-Wheel approach also does not provide a scheduling algorithm suitable for high speed systems.
Because WFQ has several analytically attractive characteristics relating to delay and fairness, many simplified WFQ algorithms have been implemented. For example, the article, "A Queue-Length Based Weighted Fair Queuing Algorithm in ATM Networks, " authored by Yoshihiro Ohba, and published in Proceedings of INFOCOM '97, 16th Annual Joint Conference of IEEE Computer & Communications Societies, the entire disclosure of which is incorporated herein by reference, discloses a Queue-Length Based WFQ algorithm (QLWFQ) which maintains one common scheduling queue to hold credits for the VCs .
Figure 6 depicts this QLWFQ scheme. One scheduling queue 108 queues credits for the connection queues 106. Each connection queue 106 has a weight assigned thereto. Each connection queue also has a length measured by the number of cells stored therein. By virtue of having a credit (j) queued in the scheduling queue 108, connection queue (j) will have a cell dequeued therefrom when credit (j) reaches the head of scheduling queue 108. Thus, each cell time, the scheduler takes the first credit from scheduling queue 108, and serves one cell from the connection queue corresponding to that credit.
Credits are updated in two ways. First, when an incoming cell arrives, a connection(i) is determined for that cell. A credit for connection (i) is added to the scheduling queue 108 if the length of connection queue (i) is less than or equal to the weight assigned to connection queue (i) . Second, when an outgoing cell is taken from connection queue (j), the credit for connection queue (j) that is at the head of the scheduling queue is removed. If the length of connection queue (j) after dequeuing a cell therefrom is greater than or equal to the weight assigned to connection queue (j), then a credit for connection queue (j) is added to the back of the scheduling queue.
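The two credit update rules can be sketched as below. The class and method names are hypothetical, and the sketch assumes that the length test on arrival is applied after the arriving cell has been queued, which the text does not state explicitly.

```python
from collections import deque


class QLWFQ:
    """Minimal sketch of one common scheduling queue holding credits."""

    def __init__(self, weights):
        self.weights = weights
        self.length = {j: 0 for j in weights}  # cells queued per connection
        self.credits = deque()                 # scheduling queue 108

    def arrive(self, i):
        self.length[i] += 1
        # Assumption: compare the length including the cell that just arrived.
        if self.length[i] <= self.weights[i]:
            self.credits.append(i)

    def serve(self):
        if not self.credits:
            return None
        j = self.credits.popleft()             # credit at the head is removed
        self.length[j] -= 1                    # one cell is dequeued from CQ(j)
        if self.length[j] >= self.weights[j]:
            self.credits.append(j)             # re-add a credit at the back
        return j


s = QLWFQ({"a": 2, "b": 1})
for conn in "aaaabb":
    s.arrive(conn)
print("".join(s.serve() for _ in range(6)))  # aabaab
```

Under these rules each connection holds min(queue length, weight) credits at any time, so a connection with weight 2 is served twice for every single service to a connection with weight 1.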
While the QLWFQ scheme reduces the computation complexity to O(1), it may cause head-of-line blocking due to bursty traffic, wherein a large number of credits for the same connection queue are added continuously to the scheduling queue. The other limitation of the QLWFQ scheme is that the scheduling queue must hold credits for all backlogged connections. For each connection, a number of credits up to the value of that connection's weight may be stored in the scheduling queue. When the weight granularity is doubled, the memory to hold the credits for the scheduling queue is doubled. Therefore, QLWFQ is not sufficiently flexible to support a wide range of granularities.
U.S. Patent No. 6,014,367, issued to Joffe, the entire disclosure of which is incorporated herein by reference, discloses a method that can provide a minimum service rate to each VC on a small selected time scale, with no limitation on the number of VCs it can support. There are two kinds of queues in the scheme, a waiting queue and a serving queue. The waiting queue becomes the serving queue when the present serving queue is empty. Both queues are FIFOs. The connections buffered in the serving queue are served one by one, and the scheduler decides on the basis of credits whether the VCs should reenter the serving queue or be moved to the waiting queue. The credit value is increased by the weight when the connection is moved to the waiting queue, and is decreased when the connection gets served by the scheduler. Under this scheme, the interval time between repeat service to a connection does not depend on its weight, but depends on the backlogged connections in the serving queue. In a high speed ATM network, where there are thousands of connections sharing one link, the latency under this scheme is quite high.
Another rate-based WFQ scheme is disclosed in U.S. Patent No. 6,130,878 issued to Charny, the entire disclosure of which is incorporated herein by reference. This scheme, which operates in the frequency domain, maintains a counter called "relative error" for each VC. Each cell time, the scheduler searches for the VC with the largest relative error, and serves that VC. Because this method involves sorting, there is a limitation on the number of VCs that can be supported in a high-speed ATM switch.
Yet another WFQ algorithm is disclosed by Chen et al. in "Design of a Weighted Fair Queuing Cell Scheduler for ATM Networks", Proceedings of Globecom 1998, the disclosure of which is incorporated herein by reference. However, under this scheme for scheduling cells, each connection can only be assigned a weight that is a power of two. As such, the granularity of service provided to the connections is poor.
Therefore, a need exists within the art for a scheduler which can fairly allocate bandwidth among numerous connections without an
excessive computation cost or a limit on how many connections the scheduler can support.
Summary of the Invention
To satisfy this need and to solve the above-described shortcomings in the art, disclosed herein is a WFQ scheduler that is suitable for use with high-speed systems. The computation complexity of the scheduler is O(1): it does not depend upon the number of VCs supported by the scheduler, and the computations can be finished in constant time. Each VC has a connection queue in which to store incoming objects, and the present invention allocates bandwidth among the different connections based on their weights without any granularity restrictions. Also, because of the present invention's technique of interleaving service to the various connections, the present invention represents an improvement in latency and fairness with respect to the prior art. Because of the interleaved service among the connections, burstiness is avoided, and traffic is spread more evenly in time. Furthermore, the scheduler can be configured to dynamically adjust the step size of its serving counter according to the traffic received by the scheduler, thereby maximizing efficiency.
Accordingly, disclosed herein is a method of scheduling access to a common resource for a plurality of objects queued in a plurality of connection queues during a plurality of serving cycles, each serving cycle comprising a plurality of serving values, the method comprising: maintaining a serving counter configured to cycle through the serving values of the serving cycle; maintaining a scheduling queue set, the scheduling queue set comprising a plurality N of scheduling queues Q(N-1) through Q(0); assigning a scheduling weight to each scheduling queue Q(k) of the scheduling queue set; associating each serving value with a scheduling queue in a manner proportional to the scheduling weights assigned to the scheduling queues; for each connection queue CQ(i), (1) assigning a connection weight W^i thereto that is representative of a desired amount of service to be allotted to connection queue CQ(i) during each serving cycle; for each CQ(i) that is backlogged, maintaining a token T(i) associated therewith; storing each T(i) in a queue of the scheduling queue set at least partially as a function of (i) W^i, and (ii) the scheduling weights assigned to the scheduling queues; for each serving value S of the serving counter, (1) selecting the scheduling queue associated therewith, and (2) for each T(i) stored in the selected scheduling queue, (i) serving CQ(i) by dequeuing an object therefrom and providing the dequeued object to the common resource.
A connection queue becomes backlogged when a threshold number of objects are queued therein. Preferably, this threshold number is one; in which case, a backlogged connection queue is the equivalent of a nonempty connection queue. Preferably, the serving counter is an N bit binary serving counter, and each serving value S is an N bit value S = S_{N-1} S_{N-2} ... S_0. To associate each serving value with a scheduling queue in a manner proportional to the scheduling weights assigned to the scheduling queues, a scheduling queue Q(k) is associated with a serving value S if S_{N-1-k} is the least significant bit of S that equals a predetermined number. Preferably, this predetermined number is one. Thus, for a scheduling queue Q(k) having a scheduling weight of 2^k assigned thereto, there will be 2^k serving values associated therewith. Each serving value S associated with Q(k) will possess a unique bit pattern wherein bit S_{N-1-k} of each serving value S is the least significant bit of S equal to the predetermined number. By associating serving values with scheduling queues in this manner, a fair system of interleaved service can be provided to the connection queues. Further, each connection weight W^i is preferably an N bit value W^i = W^i_{N-1} W^i_{N-2} ... W^i_0.
As stated, the present invention moves each token T(i) among the scheduling queues of the scheduling queue set at least partially as a function of both W^i and the scheduling weights assigned to the scheduling queues. Preferably, a token T(i) may only be stored in a scheduling queue eligible to store that token. Scheduling queue eligibility for each token T(i) is also at least partially a function of W^i and the scheduling weights assigned to the scheduling queues. Preferably, a token is eligible to be stored in scheduling queue Q(k) for each k where bit W^i_k of W^i equals a predetermined number. The predetermined number is preferably one.
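The association between serving values and scheduling queues can be sketched as follows. The sketch assumes that Q(k) pairs with bit N-1-k of the serving value, which is the reading that gives Q(k) exactly 2^k serving values per cycle; the function name is hypothetical.

```python
def selected_queue(s, n):
    """Return k such that Q(k) is selected for serving value s, or None if s == 0.

    Assumption: the least significant 1 bit of s, at bit position p,
    selects Q(n - 1 - p).
    """
    if s == 0:
        return None                    # no bit equals one, so no queue is selected
    p = (s & -s).bit_length() - 1      # position of the least significant 1 bit
    return n - 1 - p


N = 4
for s in range(2 ** N):
    k = selected_queue(s, N)
    label = "none" if k is None else f"Q({k})"
    print(format(s, f"0{N}b"), label)

# Number of serving values associated with each scheduling queue.
counts = {k: sum(1 for s in range(2 ** N) if selected_queue(s, N) == k)
          for k in range(N)}
print(counts)
```

Because the selection depends only on the lowest set bit of the counter, it can be computed with a fixed number of bit operations regardless of how many connections are backlogged.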
To implement arbitrary weights for the connections between 1 and 2^N - 1, a single token is preferably maintained for each backlogged connection queue and that single token is preferably moved among eligible scheduling queues to emulate multiple tokens. It is preferred that a single token be used for each backlogged connection because difficulty arises when managing multiple tokens for a single connection. For example, when a connection queue CQ(i) is no longer backlogged, it is much simpler to remove a single token for CQ(i) from one of the scheduling queues than it would be to remove multiple tokens from several scheduling queues. However, the use of a single token for each backlogged connection results in the need to develop a viable working plan for moving tokens for backlogged connections among scheduling queues such that backlogged connections receive the full amount of service made available thereto.
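The relationship between a connection weight and the service it can receive per cycle can be checked with a small sketch. It assumes that a token is eligible for Q(k) when bit k of the connection weight is one, and that Q(k) is selected when bit N-1-k is the least significant 1 bit of the serving value; the helper names are hypothetical.

```python
def selected_queue(s, n):
    """Assumed mapping: least significant 1 bit at position p selects Q(n-1-p)."""
    if s == 0:
        return None
    return n - 1 - ((s & -s).bit_length() - 1)


def eligible_queues(w, n):
    """Indices k of the scheduling queues eligible to store a token of weight w."""
    return [k for k in range(n) if (w >> k) & 1]


def max_servings_per_cycle(w, n):
    """Count the serving values in one cycle that select an eligible queue."""
    eligible = set(eligible_queues(w, n))
    return sum(1 for s in range(2 ** n) if selected_queue(s, n) in eligible)


N = 4
for w in range(1, 2 ** N):
    print(format(w, f"0{N}b"), eligible_queues(w, N), max_servings_per_cycle(w, N))
```

Under these assumptions the count of service opportunities equals the connection weight for every weight from 1 to 2^N - 1, which is why moving one token among the eligible queues can stand in for W separate tokens.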
The specific token storage scheme employed by the present invention will depend upon the objects queued in the connection queues. To ensure that a backlogged connection queue receives the full amount of service available thereto (as determined from W^i) when the objects are cells (which have a fixed length), the present invention preferably handles each token T(i) by (1) initially storing T(i) in the scheduling queue having the highest scheduling weight of each scheduling queue that is eligible to store T(i) (which will be Q(k) where bit W^i_k is the most significant bit of W^i that equals the predetermined number), and (2) when CQ(i) is served and remains backlogged after service thereto, redistributing T(i) to the scheduling queue eligible to store T(i) that is next to be selected of each scheduling queue eligible to store T(i).
When the objects scheduled by the present invention are variable length packets having a length comprised of one or more cells, the present invention also maintains a residual weight value RW^i for each connection queue CQ(i). Each RW^i value is a running representation of how much of the desired service amount remains available to CQ(i) during the current serving cycle. Presuming that CQ(i) is backlogged, RW^i will vary throughout the serving cycle. Token storage for each T(i) in the variable length packet scheduling embodiment further depends upon RW^i. Each RW^i value is initially set equal to W^i and is adjusted downward when a packet is served from CQ(i), the downward adjustment being made in a manner proportional to the length of the served packet. If a RW^i value ever becomes negative during a given serving cycle, CQ(i) is no longer eligible to be served during the remainder of that serving cycle. Once service to a connection queue is complete for a serving cycle, RW^i is adjusted upward by W^i. Thus, any over-service that a connection queue receives during a serving cycle (which may result when a relatively long packet is served), is carried over to the next serving cycle to thereby keep service to each connection queue CQ(i) within the bounds set by W^i.
As an example, if RW^i is 2 and W^i is 4, and a packet having a length of 3 cells is served from CQ(i), then the adjusted RW^i value becomes -1. In such a case, service to CQ(i) is complete for the remainder of the current serving cycle and RW^i is increased by W^i to a new value of 3. Thus, during the next serving cycle, CQ(i) is allotted service in the amount of three cells to account for the fact that it received service in the amount of 5 cells during the previous serving cycle. The present invention seeks to limit CQ(i) to 4 cells per serving cycle (its W^i value) on average throughout a plurality of serving cycles.
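The residual weight arithmetic of this example can be sketched as follows. The sketch is a simplification under stated assumptions: the connection queue is always backlogged, every packet has the same length, and the scheduling queues are ignored so that only the residual weight bookkeeping is shown. The function names are hypothetical.

```python
def settle_after_packet(rw, w, packet_cells):
    """Return (new residual weight, done_for_cycle) after serving one packet."""
    rw -= packet_cells
    if rw < 0:
        # Service is complete for this cycle; carry the over-service forward.
        return rw + w, True
    # The text names only the negative case as ending service for the cycle.
    return rw, False


def cells_per_cycle(w, packet_cells, cycles):
    """Cells served in each of several serving cycles for one backlogged queue."""
    rw = w                                  # the residual weight starts at W
    served = []
    for _ in range(cycles):
        total, done = 0, False
        while not done:
            rw, done = settle_after_packet(rw, w, packet_cells)
            total += packet_cells
        served.append(total)
    return served


print(settle_after_packet(2, 4, 3))  # (3, True)
print(cells_per_cycle(4, 3, 7))      # [6, 3, 6, 3, 3, 6, 3]
```

In this run a cycle that over-serves the queue is followed by cycles with a smaller allotment, so the total service stays within a bounded distance of W cells per cycle.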
Preferably, in the variable length packet embodiment, the scheduling queue set comprises an active scheduling queue subset and an inactive scheduling queue subset. Each scheduling queue subset comprises a plurality N of scheduling queues Q(N-1) through Q(0) and a collection queue. Each serving cycle, the scheduler toggles which subset is active and which subset is inactive. Each token T(i) associated with a connection queue that is no longer eligible to be served during the current serving cycle is preferably moved to a queue of the inactive subset. Each token T(i) associated with a connection queue that remains eligible to be served during the current serving cycle is preferably moved to a scheduling queue of the scheduling queue set that is eligible to store T(i) and that will be selected the soonest of each scheduling queue eligible to store T(i).
Further, in the variable length packet scheduling embodiment, each serving cycle preferably comprises a plurality N of subrounds. Each subround SR comprises at least one serving value. During a given subround SR, the step size between successive serving values will be 2^(SR-1). As a result of step size being a function of subround, each successive subround will result in a decrease in the number of scheduling queues eligible to be selected. For example, when N=4, Q(3) will be selected each odd serving value during subround 1. However, once S wraps around and the subround increases to subround 2, the serving values will be 0010, 0100, 0110, 1000, 1010, 1100, and 1110, none of which result in the selection of Q(3). To allow each CQ(i) to receive the full amount of service allotted thereto, the eligibility of scheduling queues to store tokens further depends upon subround number when the serving cycle includes subrounds.
Another feature of the present invention which may be implemented in either the fixed length case or the variable length case is the ability to dynamically update the serving value of the serving counter to avoid the waste in bandwidth that occurs when empty scheduling queues are selected. If the scheduler allows the serving counter to blindly update its serving value in a single step incremental fashion (S_NEXT = S_CURRENT + 1), then bandwidth may be wasted as the serving counter takes on a serving value specifying the selection of a scheduling queue having no tokens stored therein. When a scheduling queue is empty, it is preferable that the serving counter skip a serving value that is associated with an empty scheduling queue and move quickly to a serving value that is associated with a nonempty scheduling queue. Toward this end, the serving counter may jump from a current serving value to a next serving value such that a serving value associated with an empty scheduling queue is skipped.
As shown in Figure 10 below, the granularity of connection weights supported by the scheduler of the present invention will be 2^N - 1. To effectively double the number of connection weights supported by the scheduler, few modifications need to be made (the addition of a single scheduling queue).
When N=3, there will be 3 scheduling queues, and the connection weight values will be N bits long. The number of different levels of service (or connection weights) assigned to each connection queue will be 7 (or 2^3 - 1). By adding a single scheduling queue, N will increase from 3 to 4, meaning that each connection weight value will go from being 3 bits long to 4 bits long, and the number of different connection weights that can be supported by the scheduler will jump from 7 to 15 (or 2^4 - 1). This feature of the present invention allows the scheduler to be scaled to support a large number of connection weights (a wide range of QoS levels) with a minimal amount of modification, and without increasing the computational complexity.
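The subround stepping described earlier in this summary can be illustrated with a short sketch. It assumes that subround SR starts at serving value 2^(SR-1) and stops before the counter wraps, which reproduces the subround 2 values listed for N=4, and that Q(k) is selected when bit N-1-k is the least significant 1 bit of the serving value. The function names are hypothetical.

```python
def selected_queue(s, n):
    """Assumed mapping: least significant 1 bit at position p selects Q(n-1-p)."""
    if s == 0:
        return None
    return n - 1 - ((s & -s).bit_length() - 1)


def subround_values(sr, n):
    """Serving values visited in subround sr with step size 2^(sr-1)."""
    step = 2 ** (sr - 1)
    # Assumption: the subround begins at its step size and ends before wrap-around.
    return list(range(step, 2 ** n, step))


def selectable_queues(sr, n):
    """Scheduling queue indices that can be selected during subround sr."""
    return sorted({selected_queue(s, n) for s in subround_values(sr, n)})


N = 4
for sr in range(1, N + 1):
    values = [format(s, f"0{N}b") for s in subround_values(sr, N)]
    print(sr, values, selectable_queues(sr, N))
```

Each later subround loses the highest-weight queue that was still selectable in the previous one, which is why token eligibility has to take the subround number into account.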
It is thus an object of the present invention to provide a scheduler which can determine which connection queue should be served using only a minimal number of calculations. Because the calculations made by the present invention involve simple binary arithmetic, the scheduler is easily implementable in hardware. Also, the calculations made by the present invention can be finished in constant time, independent of the number of connection queues or scheduling queues supported by the scheduler.
It is a further object of the present invention to decrease the burstiness of traffic and improve latency and fairness performance by interleaving traffic from different connections . This fair allocation of bandwidth is cost-efficient and provides a finer granularity of fairness in scheduling traffic than conventional methods. Further, the bandwidth can be fairly allocated among the VCs based on their weights in any condition, whether the bandwidth is oversubscribed or undersubscribed.
It is a further object of the present invention to provide a highly scalable scheduler which can be scaled to support large numbers of connection weights without increasing the computation complexity and without requiring large modifications to the scheduler. By simply increasing the number of scheduling queues by one, a person can effectively double the number of connection weights supported by the scheduler.
It is yet a further object of the present invention to efficiently serve backlogged connection queues by dynamically adjusting the step size of the serving counter in accordance with the occupancy of each scheduling queue. These and other features and advantages of the present invention will be in part apparent, and in part pointed out, hereinafter.
Brief Description of the Drawings
Figure 1 shows a conceptual model of a time stamp-based scheduling algorithm;
Figures 2 (a) and 2 (b) show examples of a Weighted Round Robin (WRR) scheduling algorithm;
Figure 3 shows an example of a deficit round robin (DRR) scheduling algorithm;
Figure 4 shows an example of a modified DRR scheduling algorithm;
Figure 5 shows an example of a conventional Time-Wheel scheduling algorithm;
Figure 6 shows an example of a Queue Length-Based WFQ (QLWFQ) scheduling algorithm;
Figure 7 is an overview of the fixed length packet scheduler of the present invention;
Figure 8 is a table showing how scheduling queues are selected as a function of the N bit serving value;
Figure 9 is a table showing a 4 bit example of scheduling queue selection as a function of serving value;
Figure 10 is a table showing a 4 bit example of connection weights, eligible scheduling queues for token storage as a function of connection weight, and maximum servings per cycle as a function of connection weight;
Figure 11 is a flowchart depicting how the controller creates and initially stores tokens in the case of scheduling fixed length packets;
Figure 12 is a flowchart illustrating how the controller selects scheduling queues according to the serving counter value, serves connection queues, and redistributes tokens in the case of scheduling fixed length packets;
Figure 13 shows how tokens can be stored in the scheduling queues;
Figure 14 illustrates the dual scheduling queue sets for the case of scheduling variable length packets;
Figure 15 is a table illustrating scheduling queues that are eligible for selection as a function of subround number for the case of scheduling variable length packets;
Figure 16 is a table showing a 3 bit example of serving values and scheduling queue selection during a given serving cycle for the case of scheduling variable length packets;
Figure 17 is an overview of the variable length packet scheduler;
Figure 18 is a flowchart depicting how the controller creates and initially stores tokens in the case of scheduling variable length packets ;
Figure 19 is a flowchart illustrating how the controller selects scheduling queues according to serving value, serves connection queues, and redistributes tokens in the case of scheduling variable length packets;
Figure 20 is a flowchart illustrating the token redistribution process in the case of scheduling variable length packets; and
Figure 21 is a flowchart illustrating how the controller processes tokens stored in the collection queue of the active scheduling queue set.
Detailed Description of the Preferred Embodiment
A. Overview: Scheduler for Cells
Figure 7 shows the basic connection queue and scheduling queue structure of the cell scheduler of the present invention. Each cell has a fixed length such as the cells of ATM traffic. As can be seen in Figure 7, there are a plurality of connection queues 110. Each connection queue 110 is associated with a different connection, and incoming cells are stored therein by the cell distributor 112 on the basis of connection information contained within the cells. That is, if an incoming cell has information contained therein that specifies it should be passed through connection A, then the cell distributor 112 will buffer that cell in the connection queue associated with connection A.
The scheduler of the present invention also has a scheduling queue set comprising a plurality of scheduling queues 114, labeled Q(N-1) through Q(0). Each scheduling queue Q(K) has an assigned scheduling weight of 2^K. The value of N can be chosen depending upon the weight granularity and maximum weight that one wants the scheduler to support. In the preferred embodiment, wherein there are N scheduling queues, the scheduler can support connection weights ranging from 1 to 2^N - 1. Each scheduling queue 114 stores tokens (tokens are denoted as the circles resident in the scheduling queues 114 in Figure 7) that are associated with particular connection queues. Each scheduling queue 114 will have a waiting pool 118 and a mature pool 116, which may be implemented as FIFOs. Tokens are initially stored in the waiting pools 118, and are transferred to the
mature pools 116 upon selection of the scheduling queue. Preferably, the scheduling queues 114 only store tokens for backlogged connection queues . A connection queue becomes backlogged when a threshold number of cells are stored therein. In the preferred embodiment, where this threshold number is one, a backlogged connection queue is the equivalent of a non-empty connection queue. By storing only tokens for backlogged connection queues, the scheduler can avoid the wasteful situation of serving an empty connection queue.
A connection queue is serviced by the scheduler when the token associated therewith is dequeued from the scheduling queue in which it is stored. The selection of scheduling queues is done according to the serving value of a serving counter 120. Serving counter 120 maintains an N bit serving value. Each scheduling queue 114 is associated with a unique bit pattern of the serving value such that the serving values are associated with the scheduling queues in a manner proportional to the scheduling weights assigned thereto. Accordingly, depending upon which bit pattern is present in the serving value, the controller 122 can select the appropriate scheduling queue. The scheduling weight of a scheduling queue is then representative of how common that scheduling queue's bit pattern is within the serving value cycle. Preferably, each scheduling queue is selected according to the bit pattern of the serving value as follows: Q(N-1-a) will be selected if S_a is the least significant bit of S = S_{N-1}S_{N-2}...S_0 equal to the predetermined value of 1. The application of this rule is shown in Figure 8 (as well as Figure 9). Essentially, each scheduling queue is associated with a bit in the serving value and will be selected whenever that bit is the least significant bit in the serving value equal to a predetermined value (preferably 1). In Figure 8, each "x" represents a don't care bit value.
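By way of illustration only, the selection rule can be sketched in a few lines of Python. This is a minimal software sketch, not the hardware logic of the preferred embodiment; the function name `select_queue` is hypothetical, and a serving value of zero is assumed here to select no scheduling queue.

```python
from collections import Counter

def select_queue(s, n):
    """Return k such that Q(k) is selected for the N-bit serving value s.

    Q(N-1-a) is selected when S_a is the least significant bit of s equal to 1.
    Returns None for s == 0 (assumption: zero selects no scheduling queue).
    """
    if s == 0:
        return None
    a = (s & -s).bit_length() - 1  # index of the least significant 1 bit
    return n - 1 - a

# Count how often each queue is selected over one 4-bit cycle (S = 0001..1111).
N = 4
selections = Counter(select_queue(s, N) for s in range(1, 2 ** N))
for k in range(N - 1, -1, -1):
    print(f"Q({k}) selected {selections[k]} times")
```

For N=4 the counts are 8, 4, 2, and 1 for Q(3) through Q(0), which agrees with the scheduling weight of 2^K assigned to each Q(K).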
When a scheduling queue 114 is selected, all tokens stored in the waiting pool 118 of that scheduling queue are moved to the mature pool 116 of that scheduling queue. Each connection queue associated with a token stored in the mature pool 116 of the selected scheduling queue will then be served. Service to a connection queue comprises providing a cell stored in that connection queue (preferably only one) to the common resource. Further, it is preferable that each connection queue be served by the same number of cells when being served.
The controller 122 is responsible for controlling the main scheduling operations of the scheduler. As will be explained in more detail with reference to Figures 11 and 12, the main tasks of controller 122 are as follows:
• selecting a scheduling queue according to the serving value of the serving counter 120;
• generating new tokens for connection queues that have become backlogged;
• initially distributing each new token T(i) into the waiting pool of the scheduling queue associated with the most significant bit in the connection weight value W^i assigned to CQ(i) that is equal to 1 (as explained below in connection with the formula for h^i);
• serving each CQ identified by a token stored in the selected scheduling queue;
• for each CQ being served, providing a cell therefrom to the common resource;
• redistributing the tokens stored in the selected scheduling queue upon completion of service to the CQs with tokens stored in the selected scheduling queue; and
• updating the serving value such that the selection of an empty scheduling queue will be avoided (as explained below in connection with the formula for S_NEXT).
The controller 122 achieves fair allocation of bandwidth among the connections by distributing tokens to the appropriate scheduling queues as a function of the connection weight values assigned to the connection queues. As stated, each connection queue CQ(i) has an N bit connection weight value W^i = W^i_{N-1}W^i_{N-2}...W^i_0 assigned thereto. Derived from each W^i will be an incremental value I^i, and a value h^i, as will be explained in more detail below. These parameters (W^i, I^i, and h^i) can be stored in a table 124 (however, if desired the controller can calculate these values on the fly), and are used in determining how tokens should be stored in the scheduling queues.
A different scheduling queue will be associated with each bit in W^i. Preferably, the scheduling queue having the highest scheduling weight will be associated with the most significant bit in W^i, the scheduling queue having the second-highest scheduling weight will be associated with the second-most significant bit in W^i, and so on, down to the least significant bit of W^i, which will be associated with the scheduling queue having the lowest scheduling weight. From this correlation, the eligible scheduling queues into which a token T(i) associated with connection queue CQ(i) may be stored are those scheduling queues associated with a bit in W^i that is equal to a predetermined value (preferably 1). That is, Q(k) is eligible to store T(i) for each k where bit W^i_k of W^i equals one. Figures 9 and 10 show the inter-relationship among serving values, selected scheduling queues, connection weight values, and eligible scheduling queues for token placement.
Because of the interleaved pattern of scheduling queue selection shown in Figure 9, wherein Q(3) gets served every other serving value, Q(2) gets served every fourth serving value, Q(1) gets served every eighth serving value, and Q(0) gets served every sixteenth serving value, the scheduler of the present invention improves latency and fairness performance. Burstiness (where a single connection monopolizes access to the common resource for an extended period of time) can therefore be avoided due to the interleaved selection of scheduling queues. A complete serving cycle for the scheduler in the case of scheduling cells is the progression of serving values from 0001 to 1111, which will be 15 values in the 4 bit example (S=0000 is preferably not used). As can be determined from counting how many times each scheduling queue appears as the selected scheduling queue, the scheduling weight of Q(3) is 8, the scheduling weight of Q(2) is 4, the scheduling weight of Q(1) is 2, and the scheduling weight of Q(0) is 1. For the connection weight value W^i = W^i_3W^i_2W^i_1W^i_0, Q(3) will be associated with W^i_3, Q(2) will be associated with W^i_2, Q(1) will be associated with W^i_1, and Q(0) will be associated with W^i_0. In Figure 10, it can be seen how the connection weight assigned to a connection queue will affect how often that connection queue is served during a cycle. In heavy traffic conditions, where a connection queue remains backlogged for an entire cycle, its token will remain in the scheduling queues for the entire cycle. At any given time, a token T(i) for a backlogged connection queue CQ(i) can be stored in any of its eligible scheduling queues. As stated above, the eligible scheduling queues for T(i) are found by determining which scheduling queues are associated with bits in W^i that equal a predetermined value (preferably 1). That is, a token T(i) for CQ(i) is eligible to be stored in any Q(j) where W^i_j = 1.
Thus, as can be seen from Figure 10, when W^i = 1001, the eligible scheduling queues for storing T(i) are Q(3) (because W^i_3 = 1) and Q(0) (because W^i_0 = 1). Because of the token distribution method explained in more detail below, once token T(i) is initially stored in a scheduling queue, it will always be stored in the eligible scheduling queue that will be selected the soonest. For the example where W^i = 1001, the maximum number of servings in a cycle for CQ(i) will be 9. The maximum serving number assumes that CQ(i) remains backlogged for an entire cycle, and that the token distribution method will optimally move T(i) between Q(3) and Q(0) to reap the maximum amount of service. Because Q(3) will be selected 8 times per cycle and Q(0) once per cycle (as seen in Table 2), CQ(i) will be served 9 times per cycle as T(i) is optimally moved between Q(3) and Q(0). Thus, from Figure 10, it should quickly be discernible that the following expression represents the maximum number of servings for CQ(i) in a serving cycle:
$$\mathrm{MaxServings}_{CQ(i)} = W^i_{N-1}2^{N-1} + W^i_{N-2}2^{N-2} + \dots + W^i_1 2^1 + W^i_0 2^0$$
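As a quick check of the expression above, the following sketch derives the eligible scheduling queues from a connection weight and sums their scheduling weights. The helper names `eligible_queues` and `max_servings` are hypothetical and are introduced only for illustration.

```python
def eligible_queues(w, n):
    """List k (highest first) for every Q(k) whose bit W_k in the weight w is 1."""
    return [k for k in range(n - 1, -1, -1) if (w >> k) & 1]

def max_servings(w, n):
    """Sum the scheduling weights 2^k of all eligible scheduling queues."""
    return sum(2 ** k for k in eligible_queues(w, n))

N = 4
w = 0b1001
print(eligible_queues(w, N))  # eligible queues for W = 1001
print(max_servings(w, N))     # maximum servings per cycle for W = 1001
```

Because each eligible Q(k) contributes 2^k servings, the sum is simply the numeric value of the connection weight, which is 9 for W^i = 1001.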
From Figures 9 and 10, the high scalability of the scheduler of the present invention can also be seen. Simply by adding another scheduling queue (and thereby increasing N by one), the maximum weight that the scheduler can support is doubled. For example, in the examples of Figures 9 and 10, the number of scheduling queues is 4 (N=4), and the number of actual connection weights supported is 15 (the connection weight value of 0000 translates to no servings per cycle and can thus be disregarded). If the number of scheduling queues is increased from 4 to 5 (N=5), it can easily be seen that the number of connection weights supported will increase from 15 to 31. However, the computation complexity will remain the same, with only the length of the cycle increasing. By increasing the maximum weight supported by the scheduler, finer granularity of service can be provided to the connections. A mathematical analysis of how the present invention achieves WFQ scheduling is disclosed in the paper entitled "A Highly Scalable Dynamic Binary Based WFQ Scheduler". This paper, which is authored by the inventors herein, is included as Appendix A.
B. The Scheduling Algorithm for Cells
Figure 11 is a flowchart depicting how the controller 122 creates new tokens for the connection queues and then stores those new tokens in the scheduling queues. At step 1000, the controller waits for a new cell to arrive. At step 1002, the controller determines which connection queue CQ(i) is to receive the cell. This determination can be made in response to a signal provided by the cell distributor 112, some status indicator received from CQ(i), a status check performed by the controller on CQ(i), or any other means of determining when a new cell has been stored in a CQ. Next, at step 1004, the controller checks whether QL_i for CQ(i) is equal to zero. QL_i represents the number of cells stored in CQ(i). If the number of cells stored in CQ(i) is zero, then the controller will need to generate a token T(i) for CQ(i) (step 1006) to account for the fact that CQ(i) has transitioned from being nonbacklogged to being backlogged. Then, the controller will store the newly-generated token T(i) in scheduling queue Q(h^i) (step 1008). Thereafter, at step 1010, the controller will increment QL_i. The controller will also reach step 1010 directly from step 1004 if it determines that QL_i > 0 (meaning that steps 1006 and 1008 can be bypassed because T(i) already exists).
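The flow of Figure 11 can be sketched as follows. This is an illustrative software model only: the class `CellScheduler`, its attribute names, and the use of the connection index i as the token are assumptions, and the length of each connection queue stands in for QL_i.

```python
from collections import deque

class CellScheduler:
    """Toy model of token creation on cell arrival (steps 1000 to 1010)."""

    def __init__(self, n, weights):
        # Assumption: every connection weight is between 1 and 2^N - 1.
        assert all(1 <= w < 2 ** n for w in weights)
        self.n = n
        self.weights = weights                       # weights[i] is W^i
        self.cq = [deque() for _ in weights]         # connection queues
        self.waiting = [deque() for _ in range(n)]   # waiting pool of each Q(k)
        self.mature = [deque() for _ in range(n)]    # mature pool of each Q(k)

    def on_cell_arrival(self, i, cell):
        if len(self.cq[i]) == 0:                     # step 1004: QL_i == 0?
            h = self.weights[i].bit_length() - 1     # h^i, most significant 1 bit
            self.waiting[h].append(i)                # steps 1006 and 1008
        self.cq[i].append(cell)                      # step 1010: QL_i grows by one

sched = CellScheduler(4, [0b0111, 0b1010])
sched.on_cell_arrival(0, "cell-a")
sched.on_cell_arrival(0, "cell-b")   # no second token: CQ(0) is already backlogged
sched.on_cell_arrival(1, "cell-c")
```

After these three arrivals, the token for CQ(0) sits in the waiting pool of Q(2) and the token for CQ(1) sits in the waiting pool of Q(3), with exactly one token per backlogged connection queue.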
As shown in step 1008 of Figure 11, once T(i) is created, the controller stores T(i) in the waiting pool of scheduling queue Q(h^i). The value of h^i, which is a function of W^i, can be either stored in table 124 or calculated on the fly according to W^i. The formula below represents how h^i is calculated:
$$h^i = \max\{\, r,\ \text{for all } r \text{ where } W^i_r = 1 \,\} \tag{1}$$
wherein W^i is the connection weight value assigned to CQ(i).
According to this formula, h^i represents the most significant bit in W^i that is equal to 1. Thus, the controller creates a token T(i) only when CQ(i) goes from being empty to backlogged (when a first cell is stored therein), and stores new tokens T(i) in the scheduling queue associated with the most significant bit in W^i that is equal to the predetermined value of 1 (basically, the eligible scheduling queue for T(i) having the highest scheduling weight).
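Formula 1 amounts to finding the position of the highest set bit. The sketch below compares a literal transcription of the formula with a shortcut based on Python's `int.bit_length`; both function names are hypothetical, and a nonzero weight is assumed.

```python
def h_explicit(w, n):
    """Formula 1 as written: the largest r for which bit W_r of w equals 1."""
    return max(r for r in range(n) if (w >> r) & 1)

def h_fast(w):
    """Equivalent shortcut: index of the most significant 1 bit (w must be nonzero)."""
    return w.bit_length() - 1

print(h_explicit(0b0111, 4), h_fast(0b0111))  # weight 0111
print(h_explicit(0b1010, 4), h_fast(0b1010))  # weight 1010
```

For weights 0111 and 1010 both versions give 2 and 3 respectively, so new tokens for those connections start in Q(2) and Q(3).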
Figure 12 is a flowchart depicting how the controller 122 serves connection queues and redistributes tokens. At step 1020, the
controller updates the serving value of the serving counter 120. This update can be a fixed single step incrementation (i.e., S_NEXT = S_CURRENT + 1). However, because the scheduling queue that will be selected according to such a fixed incremental update may be empty (have no tokens stored therein), it is preferable to dynamically update the serving value S in a manner such that empty scheduling queues are skipped, as will be explained in more detail below.
After updating S (either incrementally or dynamically), the controller, at step 1022, selects a scheduling queue according to S (as described above in connection with Figures 9 and 10). Then, at step 1024, the controller moves all of the tokens stored in the waiting pool of the selected scheduling queue to the mature pool. At step 1026, the controller determines whether the mature pool of the selected scheduling queue is empty. If it is empty, the controller loops back to step 1020 and updates S. If it is not empty, the controller proceeds to step 1028 where it reads the first token T(i) stored in the mature pool. Thereafter, at step 1030, the controller 122 will serve CQ(i) by providing one cell stored in CQ(i) to the common resource. Once the controller has served CQ(i), it should decide how to redistribute T(i). T(i) will either be dropped (if CQ(i) is empty after having one cell dequeued therefrom), or fed back to the waiting pool of the same scheduling queue or redistributed to another eligible scheduling queue according to the value of D^i, which will be explained below (if CQ(i) remains backlogged after having one cell dequeued therefrom). At step 1032, the controller will decrement QL_i by one because one cell was dequeued from CQ(i) at step 1030. Then, at step 1034, the controller will determine whether CQ(i) is still backlogged by checking whether QL_i > 0. If QL_i = 0, then the controller jumps to step 1046 and drops T(i). CQ(i) will then have to wait for a new cell to be stored therein before a token is created therefor (see Figure 11). If QL_i > 0, then the controller will proceed to step 1036 and begin deciding how to redistribute T(i). At step 1036, the controller will calculate a distribution value D^i for T(i). The distribution value is calculated according to the formula below:
$$D^i = S + I^i \tag{2}$$
wherein S is the serving value, and wherein I^i is an N bit incremental value, I^i_{N-1}...I^i_0, for CQ(i). Like h^i, I^i is a function of W^i, and can be either stored in table 124 or calculated on the fly according to W^i. I^i can be calculated according to the formula below:
$$I^i_{j^i} = 1;\quad I^i_m = 0 \ (\text{for all } m \neq j^i) \tag{3}$$
wherein j^i is calculated according to the formula:
$$j^i = N - 1 - \max\{\, k,\ \text{for all } k \text{ where } W^i_k = 1 \,\} \tag{4}$$
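To make the incremental value concrete, the sketch below computes j^i and I^i for a given connection weight. The function name `increment_value` is hypothetical, and a nonzero weight is assumed.

```python
def increment_value(w, n):
    """Return the N-bit incremental value I for connection weight w.

    j = N - 1 - h, where h is the most significant 1 bit of w (formula 4);
    I has bit j set and every other bit cleared (formula 3).
    """
    h = w.bit_length() - 1
    j = n - 1 - h
    return 1 << j

for w in (0b0111, 0b1010):
    print(format(w, "04b"), "->", format(increment_value(w, 4), "04b"))
```

With N=4, the weight 0111 yields I^i = 0010 and the weight 1010 yields I^i = 0001. A larger weight therefore produces a smaller increment, which places the token in a sooner-selected queue.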
Once D^i is calculated at step 1036 according to formula 2, the controller proceeds to step 1038 and calculates a value for p. The value for p is calculated according to the formula:
$$p = N - 1 - \min\{\, t,\ \text{for all } t \text{ where } D^i_t = 1 \,\} \tag{5}$$
Once p has been calculated, the controller checks whether W^i_p = 1 (step 1040). If W^i_p = 1, then the controller proceeds to step 1042 where it stores T(i) in the waiting pool of Q(p). If W^i_p ≠ 1, then the controller proceeds to step 1044 where it stores T(i) in the waiting pool of Q(h^i).
The sum total of steps 1036 through 1044 is to redistribute the token T(i) for backlogged connection queue CQ(i) into an eligible scheduling queue that will be selected by the controller the soonest of each eligible scheduling queue.
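Steps 1036 through 1044 can be condensed into a single function, sketched below. The function name `redistribute` is hypothetical. Two behaviours are assumptions not stated in the text: D^i is truncated to N bits, and a D^i of zero falls back to Q(h^i).

```python
def redistribute(s, w, n):
    """Return k such that token T(i) is stored in the waiting pool of Q(k).

    s is the current serving value, w the connection weight W, n the bit width N.
    """
    h = w.bit_length() - 1            # formula 1
    i_val = 1 << (n - 1 - h)          # formulas 3 and 4
    d = (s + i_val) & ((1 << n) - 1)  # formula 2; N-bit truncation is an assumption
    if d == 0:
        return h                      # assumption: no 1 bit in D, so use Q(h)
    t = (d & -d).bit_length() - 1     # least significant 1 bit of D
    p = n - 1 - t                     # formula 5
    return p if (w >> p) & 1 else h   # step 1040, then step 1042 or step 1044

print(redistribute(0b0110, 0b0111, 4))  # weight 0111, served at S=0110
print(redistribute(0b0101, 0b1010, 4))  # weight 1010, served at S=0101
```

The first call returns 0 because Q(0) is eligible for weight 0111. The second call computes p=2, finds that bit 2 of 1010 is not set, and therefore falls back to h^i = 3.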
From either step 1042, 1044, or 1046, the controller will loop back to step 1026 and check whether another token is stored in the selected scheduling queue. If so, the controller will read the next token stored therein (step 1028) and perform steps 1030 through 1046 accordingly. That process will continue until no more tokens are stored in the mature pool of the selected scheduling queue, at which time the controller will loop back to step 1020 and update the serving value. The division of each scheduling queue into a waiting pool and a mature pool is useful because the token redistribution process may often result in a token being fed back to the same scheduling queue.
To prevent the feedback token from being read out repeatedly during the same serving value, tokens are initially stored in the waiting pools, and only tokens stored in the mature pool will be read out for service.
Example 1 : Token Storage/Redistribution
An example will help illustrate how the token redistribution process is performed using formulas 1 through 5. In this example, we will set N=4 (as shown in Figures 9 and 10). CQ(i) will have 4 cells stored therein. When the first cell arrives in CQ(i), a token T(i) will be created therefor and stored in the waiting pool of Q(h^i), as described in connection with Figure 11. The connection weight value W^i for CQ(i) will be W^i = W^i_3W^i_2W^i_1W^i_0 = 0111. Thus, as explained in connection with Figure 10, the eligible scheduling queues for T(i) will be Q(2), Q(1) and Q(0). Under formula 1, h^i will calculate out as h^i = max{2,1,0} = 2 (because W^i_2, W^i_1, and W^i_0 all equal 1). Thus, T(i) will initially be stored in Q(h^i) = Q(2). If we assume T(i) was first placed in Q(2) when S = S_3S_2S_1S_0 = 0101, then at the next update of S to S=0110, Q(2) will be selected (see Figure 9). When Q(2) is selected, T(i) will be read out of Q(2), and one cell in CQ(i) will be served. Once one cell in CQ(i) is served, CQ(i) will remain backlogged because there are still three cells stored therein. Thus, T(i) will need to be redistributed.
While the value of I^i is constant and is stored in memory 124, it is helpful to see how I^i is computed. As explained above, j^i must first be calculated according to formula 4 as follows: j^i = 4-1-max{2,1,0} = 4-1-2 = 1. Having calculated j^i, I^i can be calculated according to formula 3 as follows: I^i_1 = 1; I^i_3 = I^i_2 = I^i_0 = 0 → I^i = 0010.
Now that I^i is known, D^i is calculated according to formula 2 as follows:
D^i = S + I^i = 0110 + 0010 = 1000. From D^i, a value for p can be calculated according to formula 5 as follows: p = 4-1-min{3} = 4-1-3 = 0. Then, as follows from step 1040 in Figure 12, because W^i_p = W^i_0 = 1, T(i) is moved to the waiting pool of Q(p) = Q(0).
Because T(i) is moved to Q(0), T(i) will be located in the scheduling queue that will be the next-selected of the eligible scheduling queues for T(i). When S is updated to S=0111, Q(3) will be selected, and the service process of Figure 12 will proceed for Q(3). Then, S will be updated to S=1000 and Q(0) will be selected. When Q(0) is selected, T(i) will once again be read out, and one cell will be dequeued from CQ(i). However, CQ(i) will remain backlogged because two cells will remain stored therein. Thus, T(i) will need to be redistributed once again. I^i is 0010 (as explained above), and D^i will once again be calculated according to formula 2 as
D^i = S + I^i = 1000 + 0010 = 1010. Then, using formula 5 to calculate p, p will be determined as: p = 4-1-min{3,1} = 4-1-1 = 2. Because W^i_p = W^i_2 = 1, T(i) will be stored in the waiting pool of Q(p) = Q(2).
S will next be updated to S=1001, which results in Q(3) being selected. Upon the next S update, S will be 1010, which will result in Q(2) being selected. When Q(2) is selected, T(i) will once again be read out and one cell will be dequeued from CQ(i) . However, CQ(i) will remain backlogged because one cell will remain stored therein. Thus, T(i) will once again need to be redistributed.
During this redistribution, I^i will still be 0010. D^i will calculate out as D^i = S + I^i = 1010 + 0010 = 1100. From D^i, p will calculate out as p = 4-1-min{3,2} = 1. Because W^i_p = W^i_1 = 1, T(i) will be stored in the waiting pool of Q(p) = Q(1).
S will next be updated to S=1011, which results in Q(3) being selected. Upon the next S update, S will be 1100, which results in
Q(1) being selected. When Q(1) is selected, T(i) will once again be read out, and one cell will be dequeued from CQ(i). CQ(i) will now be empty because no more cells are stored therein. Thus, T(i) will be dropped, and CQ(i) will not have an associated token until another cell is stored therein.
Example 2 : Token Storage/Redistribution
In this second example, N will be 4, CQ(i) will have 4 cells stored therein, W^i = W^i_3W^i_2W^i_1W^i_0 = 1010, and T(i) will initially be stored in Q(h^i) (as shown in Figure 11). The value for h^i can be calculated according to formula 1 as h^i = max{3,1} = 3. Thus, T(i) is initially stored in Q(3). The serving value will be 0011. When S = 0011, Q(3) is selected, and T(i) will be read therefrom. One cell will be dequeued from CQ(i), leaving CQ(i) backlogged because three cells remain stored therein. Thus, T(i) will be redistributed according to D^i. The value for j^i can be calculated according to formula 4 as: j^i = 4-1-max{3,1} = 0. Thus, under formula 3, I^i will be I^i_3I^i_2I^i_1I^i_0 = 0001. D^i will be calculated according to formula 2 as D^i = S + I^i = 0011 + 0001 = 0100. Then, p will be calculated according to formula 5 as: p = 4-1-min{2} = 1.
Because W^i_p = W^i_1 = 1, T(i) will be stored in the waiting pool of Q(p) = Q(1).
S will next be updated to 0100, which results in Q(1) being selected. Because Q(1) is selected, T(i) will be read out, and one cell will be dequeued from CQ(i). CQ(i) will remain backlogged because two cells remain stored therein. Thus, T(i) will need to be redistributed. I^i is 0001, and D^i will be S + I^i = 0100 + 0001 = 0101. Having D^i known, p = 4-1-min{2,0} = 3. Because W^i_p = W^i_3 = 1, T(i) will be stored in Q(p) = Q(3). S will next be updated to 0101, which results in Q(3) being selected. Because Q(3) is selected, T(i) will once again be read out, and one cell will be dequeued from CQ(i). CQ(i) will remain backlogged because one cell remains stored therein. Thus, T(i) will need to be redistributed. D^i will be S + I^i = 0101 + 0001 = 0110. From D^i, p = 4-1-min{2,1} = 2.
For this redistribution, W^i_p = W^i_2 ≠ 1. Thus, as shown in Figure 12, the controller will proceed from step 1040 to step 1044, and store T(i) in Q(h^i) = Q(3). Q(3) is chosen as the appropriate scheduling queue to hold T(i) because it is the next eligible scheduling queue that will be served as the serving counter cycles through serving values.
S will next be updated to S=0110, which results in Q(2) being selected. After all connection queues having a token stored in Q(2) are served, S will be updated to S=0111, which results in Q(3) being selected. When Q(3) is selected, T(i) will once again be read out, and one cell will be dequeued from CQ(i) . CQ(i) will now be empty because no cells are left therein. Because CQ(i) is empty, T(i) will be dropped, and a new cell must arrive in CQ(i) for a new T(i) to be created.
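Examples 1 and 2 can be replayed with a short simulation of a single connection under single step updating of the serving value. The sketch below is illustrative only: the names `select_queue`, `redistribute`, and `trace` are hypothetical, D^i is assumed to be truncated to N bits, and other connections are ignored.

```python
def select_queue(s, n):
    """Q(N-1-a) is selected, where a is the least significant 1 bit of s."""
    return n - 1 - ((s & -s).bit_length() - 1)

def redistribute(s, w, n):
    """Formulas 1 through 5: return the queue index that receives the token."""
    h = w.bit_length() - 1
    d = (s + (1 << (n - 1 - h))) & ((1 << n) - 1)  # assumed N-bit truncation
    if d == 0:
        return h                                   # assumed fallback
    p = n - 1 - ((d & -d).bit_length() - 1)
    return p if (w >> p) & 1 else h

def trace(w, n, cells, first_s, token_queue):
    """Return (serving value, queue) for each cell served from one connection."""
    served = []
    s = first_s
    while cells > 0:
        if select_queue(s, n) == token_queue:
            cells -= 1                             # one cell is dequeued
            served.append((format(s, f"0{n}b"), token_queue))
            if cells > 0:                          # still backlogged
                token_queue = redistribute(s, w, n)
        s = (s % (2 ** n - 1)) + 1                 # single step, skipping S=0
    return served

example_1 = trace(0b0111, 4, cells=4, first_s=0b0110, token_queue=2)
example_2 = trace(0b1010, 4, cells=4, first_s=0b0011, token_queue=3)
print(example_1)
print(example_2)
```

The first trace serves the connection at S = 0110, 1000, 1010, and 1100 from Q(2), Q(0), Q(2), and Q(1). The second serves it at S = 0011, 0100, 0101, and 0111 from Q(3), Q(1), Q(3), and Q(3). Both sequences agree with the walk-throughs in Examples 1 and 2.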
Dynamic Updating of the Serving Counter Value
As briefly explained above, one feature of the present invention is how the serving counter can be dynamically updated according to whether the scheduling queues have tokens stored therein. In one operating mode, the serving value is incremented by one each service time (single step mode) . While the single step mode has an advantage in that it is easily implemented, it also has a drawback in that the controller will continue to cyclically select scheduling queues regardless of whether the scheduling queues have tokens stored therein. This wasteful result can be particularly troublesome if a high weight scheduling queue (which will be selected often) is empty.
To decrease the number of times that the controller selects an empty scheduling queue, a second operating mode for updating the serving counter is disclosed wherein the controller dynamically updates the serving value such that some of the empty scheduling queues are skipped. Under the dynamic mode, the serving value is updated according to the following formula:
$$S_{NEXT} = S_{CURRENT} + 2^{\,N-1-\max\{x_c,\ q\}} \tag{6}$$
wherein S_NEXT represents the next serving value, S_CURRENT represents the current serving value, x_c represents the scheduling queue that has just been served, Q(x_c), and q represents the scheduling queue, Q(q), having the highest scheduling weight among non-empty scheduling queues.
An example will help illustrate the operation of formula 6. Referring to the 4 bit serving value example of Figure 9, the current serving value, S_CURRENT, will be 0010, meaning that x_c will be 2 because Q(2) is selected when S_CURRENT = 0010.
If the nonempty scheduling queue having the highest scheduling weight is Q(1), meaning that q=1, then it is desirable to update the serving counter such that the controller will skip selection of Q(3), which is empty (but is next-in-line for selection under the single step mode). By dynamically updating the serving value using formula 6, S_NEXT = 0010 + 2^{4-1-max{2,1}} = 0010 + 2^1 = 0010 + 0010 = 0100. By jumping from a serving value of 0010 to a serving value of 0100, the controller will select Q(1), and will skip the selection of Q(3).
While formula 6 does not guarantee that the serving value will be updated such that the serving value will always jump directly to the non-empty scheduling queue having the highest scheduling weight, formula 6 will always result in a decrease in the number of steps required to reach the nonempty scheduling queue by skipping at least one empty scheduling queue.
For example, using Figure 8 as the reference, if SCURRENT = 0010 (meaning xc=2) , and q=0 (meaning Q(0) is the non-empty scheduling queue having the highest scheduling weight) , then the serving counter will first be updated under formula 6 as follows:
S_NEXT = 0010 + 2^{4-1-max{2,0}} = 0010 + 2^1 = 0010 + 0010 = 0100. When S_NEXT becomes S_CURRENT, Q(1) will be selected (see Figure 9). However, because Q(1) is empty, the serving counter will once again need to be updated (see step 1026 of Figure 12). This time, S_NEXT will be calculated as follows (x_c is now 1):
S_NEXT = 0100 + 2^{4-1-max{1,0}} = 0100 + 2^2 = 0100 + 0100 = 1000. When S_NEXT becomes S_CURRENT, Q(0) will be selected (see Figure 9). Because Q(0) has tokens stored therein, connection queues can be served.
Under this example, one intermediate serving value (S=0100) was required to jump from S=0010 to S=1000 so that a non-empty scheduling queue (Q(0)) could be selected. However, if a single step mode was used to update the serving counter, 5 intermediate steps would be required to go from S=0010 to S=1000 (see Figure 9). Thus, even though one serving value was wasted on an empty scheduling queue, formula 6 nevertheless represents a vast improvement over the rigid single step mode.
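The dynamic update can be sketched as a single function. The name `next_serving_value` is hypothetical, and wrap-around of the N bit counter at the end of a serving cycle is deliberately not modeled here.

```python
def next_serving_value(s_current, x_c, q, n):
    """Dynamic update of the serving value.

    x_c is the index of the scheduling queue just selected, q the index of the
    non-empty scheduling queue with the highest scheduling weight, n the bit width N.
    """
    return s_current + (1 << (n - 1 - max(x_c, q)))

# Q(2) was just selected at S=0010 and Q(1) is the heaviest non-empty queue.
s_next = next_serving_value(0b0010, x_c=2, q=1, n=4)
print(format(s_next, "04b"))
```

Here the update jumps from 0010 directly to 0100, so the serving value 0011 that would have selected the empty Q(3) is never visited.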
C . Tokens
Each token T(i) associated with a connection queue CQ(i) is some sort of connection queue identifier that allows the controller to determine which connection queue needs servicing. In the preferred embodiment, each token T(i) is a pointer to the memory location in CQ(i) where the next cell to be dequeued therefrom is stored.
Figure 13 depicts a possible token arrangement in a scheduling queue 114. It should be easily understood that the tokens of the present invention can be implemented in a variety of ways, and that the example of Figure 13 is illustrative rather than limiting. Scheduling queue 114 has a plurality of tokens 140 stored therein. Each token 140 has a cell pointer 142 and a next token pointer 144. Each cell pointer 142 identifies the memory location in the connection queue associated with that token where the next cell to be dequeued therefrom is stored. The cells in each connection queue can be stored together using linked lists as is known in the art. Each next token pointer 144 will identify the location in the scheduling queue 114 where the next token 140 is stored.
D. Scheduling Algorithm for Variable Length Packets
The present invention may also perform scheduling for variable length packets; variable length packets being packets that have a length made up of a variable number of cells. The length of the packet is determined from the number of cells included therein. To support scheduling for variable length packets, the algorithm and scheduler described above in connection with fixed length packets are preferably altered as explained below.
First, referring to Figure 14, the scheduling queue set of the fixed length case is replaced with a scheduling queue set 150. Scheduling queue set 150 comprises a scheduling queue subset A and a scheduling queue subset B. Each subset includes N scheduling queues Q(N-1) through Q(0), as noted above in connection with the fixed length scheduling algorithm, and a collection queue.
Also, the concept of subrounds is preferably introduced to the serving cycle. In the fixed length case described above, the serving cycle comprised the full roundtrip of the N bit serving counter from serving counter values of 1 to 2^N-1. In the variable length case, a subround SR comprises the full roundtrip of the N bit serving counter from serving counter values of 1 to 2^N-1. There will be N subrounds in a serving cycle. The default step size between serving counter values will be 2^{SR-1} (S_{NEXT} = S_{CURRENT} + 2^{SR-1}). Thus, the pool of scheduling queues eligible to be selected will decrease as the subround increases. Figures 15 and 16 illustrate this concept.
As shown in Figure 15, during subround 1 (SR=1), all scheduling queues will be eligible for selection. The selection of scheduling queues as a function of serving counter value will be the same as described above in connection with the fixed length embodiment. However, when SR=2, the scheduling queue having the highest scheduling weight, Q(N-1), is no longer eligible to be selected. This is because the step size between serving values is equal to 2^{SR-1}. Thus, the serving counter will not reach values that dictate the selection of Q(N-1) during subrounds subsequent to subround 1. This pattern repeats itself until subround N (SR=N), where only the scheduling queue with the lowest scheduling weight, Q(0), is eligible for selection.
Figure 16 illustrates the serving values for a serving cycle and the scheduling queue selected therefrom for an example where N=3. During the first subround (SR=1), S will reach all values and each scheduling queue, Q(2), Q(1), and Q(0), is selected accordingly. As can be seen, the step size between successive S values is equal to 1 (or 2^{SR-1} = 2^{1-1} = 2^0). Once S wraps around, the next subround begins (SR=2). This time, the step size is 2 (or 2^{SR-1} = 2^{2-1} = 2^1). Therefore, the serving values during subround 2 will be 010, 100, and 110. As can be seen, Q(2) will not be selected during subround 2. As S once again wraps around, the subround is increased to 3, which in this case is the final subround of the serving cycle because N=3. During subround 3, the step size is 4 (or 2^{SR-1} = 2^{3-1} = 2^2). During subround 3, the only scheduling queue to be selected is Q(0).
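The subround pattern described in connection with Figure 16 can be reproduced with a short sketch. The following Python fragment is illustrative only. The function names select_queue and serving_cycle are hypothetical, only the default step size of 2^{SR-1} is modeled, and the mapping from the least significant 1 bit at position k to Q(N-1-k) is inferred from the examples given above rather than quoted from the figures.

```python
def select_queue(s: int, n: int) -> int:
    """Return g such that Q(g) is selected for serving value s (1 <= s <= 2**n - 1).

    Assumed mapping: the least significant bit of s equal to 1, at position k,
    selects Q(n - 1 - k), so the lowest bit selects the highest-weight queue.
    """
    k = (s & -s).bit_length() - 1   # position of the least significant 1 bit
    return n - 1 - k


def serving_cycle(n: int):
    """List (subround, serving value as a bit string, selected queue index) for one cycle."""
    entries = []
    for sr in range(1, n + 1):
        step = 1 << (sr - 1)        # default step size 2^(SR-1)
        s = step                    # first serving value after the counter wraps to 0
        while s < (1 << n):         # stop once S would wrap around
            entries.append((sr, format(s, f"0{n}b"), select_queue(s, n)))
            s += step
    return entries


if __name__ == "__main__":
    for sr, bits, g in serving_cycle(3):
        print(f"SR={sr} S={bits} -> Q({g})")
```

For N=3 the sketch visits seven serving values in subround 1, three in subround 2, and one in subround 3, which matches the narrowing pool of eligible scheduling queues described above.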
During a given serving cycle, only one scheduling queue subset is active. By extension, the other scheduling queue subset is inactive. Tokens will be read only from active scheduling queues. Tokens are preferably distributed to either active scheduling queues, inactive scheduling queues, or the inactive collection queue. Upon the completion of each serving cycle, the active set of scheduling queues switches. Thus, at the completion of the serving cycle, the active scheduling queue subset becomes inactive and the inactive scheduling queue subset becomes active.
Another addition to the algorithm for scheduling variable length packets is the concept of residual weight. Each connection will have a residual weight associated therewith. Thus, each connection queue CQ(i) will not only have a connection weight W^i assigned thereto, but also a residual weight RW^i. While the connection weight W^i is a constant value, the residual weight will initially be set equal to W^i but thereafter will be adjusted to reflect the amount of service given to CQ(i) in terms of the length of packets served from CQ(i). Each RW^i preferably comprises N+1 bits: RW^i_N RW^i_{N-1} ... RW^i_1 RW^i_0. The (N+1)th bit (the most significant bit) will indicate whether RW^i is positive or negative (a 0 meaning positive and a 1 meaning negative), and the lower N bits will represent the residual weight.
Figure 17 illustrates the scheduler for the variable length packet embodiment. Incoming packets are received by the packet distributor 160. The packet distributor segments the received packets into cells and distributes those cells to their corresponding connection queue. The cells in each connection queue 110 make up packets. Each packet in a connection queue 110 is shown in Figure 17 by cells that abut each other. Thus, in the topmost connection queue 110, the packet at the head of the queue is 4 cells in length and the second packet is one cell in length. In the bottommost connection queue 110, the first two packets are each one cell in length while the third packet is two cells in length. As can be seen, the controller 122 will also maintain a residual weight RW for each connection queue 110.
The initial distribution of tokens into scheduling queues will proceed in an almost identical fashion as that of the fixed length embodiment with some adjustments to account for the use of subrounds in the serving cycle and the use of residual weight. Figure 18 is a flowchart illustrating how new tokens are created and initially distributed.
Steps 1050 through 1056 and step 1068 parallel steps 1000 through 1006 and step 1010 in Figure 11 except the focus is on packets rather than cells. If a packet's arrival in a connection queue CQ(i) causes that connection queue to transition from a nonbacklogged state to a backlogged state, then a new token T(i) is created for CQ(i). Steps 1058 through 1064 address determining the proper scheduling queue into which T(i) should be queued. In the fixed length embodiment, the appropriate scheduling queue is the scheduling queue having the highest scheduling weight among scheduling queues that are eligible for the token (eligibility being determined from W^i). However, in the variable length embodiment, RW^i is the appropriate measure of where a token should be queued because the residual weight value is adjusted to reflect past service given to a connection queue. Also, because of the subround implementation, the appropriate scheduling queue into which to queue T(i) will also depend on SR given that it is possible that a particular scheduling queue will not be selected during the current subround.
Thus, at step 1058, the algorithm checks whether the serving cycle is in its final subround. If it is (SR=N), then it is appropriate to queue T(i) in one of the scheduling queues in the inactive subset because that subset is soon to become the active subset (only the scheduling queue having the lowest scheduling weight will be selected when SR=N). Therefore, at step 1070, the algorithm identifies the particular scheduling queue having the highest scheduling weight from the scheduling queues that are eligible to store the token (Q(m^i)). In this case, a scheduling queue Q(g) is eligible to store T(i) if RW^i_g = 1. The scheduling queue Q(g) of the inactive subset having the largest value for g (m^i being the largest of these g's) is selected as the scheduling queue into which to store T(i) (step 1072).
If step 1058 results in a determination that the final subround of the serving cycle has not been reached, then the algorithm next checks whether the serving value is on its last value for the particular subround in which the serving cycle exists (step 1060). If the most significant N+1-SR bits of S all equal 1 and the least significant SR-1 bits of S all equal 0, then the current value for S is the last value for S of the subround. For example, as can be seen from Figure 16, the last serving value for subround 1 is 111 (N+1-SR=3, SR-1=0), and the last serving value for subround 2 is 110 (N+1-SR=2, SR-1=1). It should be noted that step 1060 will not be reached when SR=N. If S is on its last value for the subround, the algorithm preferably ensures that T(i) will not be queued in a scheduling queue not eligible for selection during the next subround (see Figure 15 for scheduling queue selection eligibility as a function of subround). Thus, at steps 1062 and 1064, after the scheduling queue having the highest scheduling weight among eligible scheduling queues as determined from RW^i is identified (Q(m^i), where m^i = max{g, for all g's where RW^i_g = 1}), the algorithm either stores T(i) in Q(m^i), if Q(m^i) is eligible for selection during the coming subround, or stores T(i) in Q(N-(SR+1)), if Q(m^i) is not eligible for selection during the coming subround. Q(N-(SR+1)) represents the scheduling queue having the highest scheduling weight that will be selected during the coming subround. The value h^i, where h^i = min{m^i, N-(SR+1)}, represents the result of this decision-making process. T(i) is stored in the waiting pool of Q(h^i) in the active subset at step 1064.
If S is not on its last serving value for the subround, then the algorithm proceeds to step 1066 from step 1060. At steps 1066 and 1064, after the scheduling queue having the highest scheduling weight among eligible scheduling queues as determined from RW^i is identified (Q(m^i)), the algorithm either stores T(i) in Q(m^i), if Q(m^i) is eligible for selection during the current subround, or stores T(i) in Q(N-SR), if Q(m^i) is not eligible for selection during the current subround. Q(N-SR) represents the scheduling queue having the highest scheduling weight that will be selected during the current subround. The value h^i, where h^i = min{m^i, N-SR}, represents the result of this decision-making process. T(i) is stored in the waiting pool of Q(h^i) in the active subset at step 1064. From steps 1054, 1064, and 1072, the algorithm proceeds to step 1068 where QL_i is incremented. QL_i represents the number of packets queued in CQ(i).
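The placement decision of steps 1058 through 1072 reduces to a few bit operations. The following Python fragment is a sketch under stated assumptions, not a definitive implementation: the function names are hypothetical, the residual weight is modeled as a positive Python integer between 1 and 2^N-1 (as it would be for a newly created token whose residual weight equals its connection weight), and the return value simply names the subset and the queue index rather than manipulating real queues.

```python
def highest_eligible_queue(rw: int) -> int:
    """m^i: the largest g such that bit g of the residual weight equals 1 (rw > 0)."""
    return rw.bit_length() - 1


def is_last_value_of_subround(s: int, n: int, sr: int) -> bool:
    """True if the top N+1-SR bits of s are all 1 and the low SR-1 bits are all 0."""
    all_ones = (1 << n) - 1
    low_bits = (1 << (sr - 1)) - 1
    return s == (all_ones & ~low_bits)


def place_new_token(rw: int, s: int, sr: int, n: int):
    """Return (subset, g): the subset and queue Q(g) whose waiting pool receives T(i)."""
    m = highest_eligible_queue(rw)
    if sr == n:
        # Final subround: queue into the inactive subset (steps 1070 and 1072).
        return ("inactive", m)
    if is_last_value_of_subround(s, n, sr):
        # Cap by the best queue selectable in the coming subround (step 1062).
        return ("active", min(m, n - (sr + 1)))
    # Cap by the best queue selectable in the current subround (step 1066).
    return ("active", min(m, n - sr))
```

For example, with N=3 and a residual weight of 101, a token arriving while S=111 in subround 1 is capped to Q(1), because Q(2) will not be selected again during the current serving cycle.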
Figure 19 is a flowchart illustrating how the scheduling algorithm processes tokens stored in the scheduling queues to serve packets. At step 1080, the serving value is initialized to 0, the subround is initialized to 1, and scheduling queue subset A is rendered active. At step 1082, the serving value is incremented. This incrementation may be done using either the fixed step size of 2^{SR-1}, or the dynamic step size described above in connection with formula 6. At step 1084, the algorithm checks whether S has wrapped around. If it has, then the subround is incremented (step 1086).
Step 1088 is reached from either step 1084 (if S has not wrapped around) , or step 1086 (if S has wrapped around) . At step 1088, a scheduling queue is selected according to the serving value. Scheduling queue selection as a function of serving value proceeds as described above in connection with the fixed length embodiment (see Figure 8) .
After a scheduling queue is selected, all tokens stored in the waiting pool of the selected scheduling queue are transferred to the mature pool (step 1090). If there were no tokens to move to the mature pool (as determined at step 1092), then the algorithm proceeds to step 1094.
At step 1094, the subround number is checked to determine whether SR is equal to N (indicating that the serving cycle is complete). If the serving cycle is complete, then at step 1096, the active scheduling queue subset becomes inactive and the inactive scheduling queue subset becomes active. Also, the subround number is reset to 1. From step 1096, the algorithm loops back to step 1082 to update the serving value. Also, if SR is not the final subround of the serving cycle (as determined by step 1094), the algorithm will also loop back to step 1082.
If step 1092 determines that the mature pool is not empty, then at step 1100, a token T(i) is dequeued from the selected scheduling queue. Thereafter, the packet queued at the head of CQ(i) is served (step 1102) and QL_i is decremented by one (step 1104). The algorithm next redistributes T(i) according to the algorithm of Figure 20 (step 1106). Steps 1098 through 1106 are repeated until all tokens in the mature pool of the selected scheduling queue are processed. When all such tokens have been processed, the algorithm loops back to step 1082 and updates the serving value so that another scheduling queue may be selected.
Figure 20 details the token redistribution process for the variable length packet embodiment. Generally, the algorithm of Figure 20 follows these rules:
• If a connection queue is no longer backlogged, that connection queue's token is dropped;
• If a packet to be served from CQ(i) has a length L_i larger than RW^i, then CQ(i) must borrow service that has been allocated thereto for the next serving cycle, and CQ(i) is thus no longer eligible for service during the current serving cycle;
• If CQ(i) borrows only a portion of the next serving cycle's allocated service (which means that RW^i - L_i + W^i > 0), then CQ(i) remains eligible for service during the next serving cycle, albeit with less service allocated thereto because of the loan made to accommodate the served packet. Since CQ(i) is eligible for service during the next serving cycle, T(i) is moved to a scheduling queue in the inactive subset (the inactive subset will be the active subset during the next serving cycle);
• If CQ(i) borrows all of the next serving cycle's allocated service (or perhaps also another subsequent serving cycle's service) (which means that RW^i - L_i + W^i ≤ 0), then CQ(i) is not eligible for service during the next serving cycle. As such, T(i) is moved to the collection queue of the inactive subset where it will remain until enough serving cycles pass to make up for the loan made to accommodate the packet with length L_i. The collection queue serves as a repository for the tokens of connection queues that are ineligible for service.
At step 1110, the token redistribution process for T(i) begins. First, the algorithm checks whether CQ(i) is empty after serving a packet therefrom (is QL_i > 0?). If CQ(i) is empty, T(i) is dropped (step 1112) and the process is complete. If CQ(i) remains backlogged, a new location for T(i) must be identified.
At step 1114, the residual weight for CQ(i) is compared with the length of the packet just served from CQ(i) (L_i represents the length of that packet in terms of the number of cells that constitute the packet and is determinable from the packet's header information). First, if RW^i is greater than L_i, this means that CQ(i) has not overused its allotted service amount during the current serving cycle. Second, if RW^i is less than L_i, this means that service to CQ(i) has exceeded its allotted service amount during the current serving cycle. Third, if RW^i = L_i, this means that the full allotted service has been given to CQ(i) during the current serving cycle. Steps 1116 through 1132 address token redistribution for the first condition. Steps 1136 through 1140 and step 1132 address token redistribution for the second and third conditions.
Step 1136 is reached when service to the packet queued at the head of CQ(i) results in service to CQ(i) equaling or exceeding the amount of service allotted thereto during the current serving cycle. When a connection queue is overserved, it will not be served again until a subsequent serving cycle; which subsequent serving cycle will depend upon the degree of overservice that the connection queue has received. At step 1136, the residual weight value for CQ(i) is adjusted to reflect the packet served from CQ(i) (the residual weight is decreased by the packet length) and the fact that CQ(i) will not be served until a subsequent serving cycle (the residual weight is also increased by the connection weight W^i). Thus, at step 1136, RW^i is set equal to RW^i - L_i + W^i, where L_i is the served packet's length in cells.
Step 1138 evaluates the degree of overservice given to CQ(i). If the newly updated residual weight value for CQ(i) is a negative value or zero, this indicates that CQ(i) has been greatly overserved. As such, T(i) is moved to the collection queue of the inactive subset (step 1140) and the token redistribution process for T(i) is complete. T(i) will remain in the collection queue until enough serving cycles pass where RW^i is no longer negative.
If the newly updated residual weight value is a positive value, this means that CQ(i) may be served during the next serving cycle. At step 1130, the appropriate scheduling queue in the inactive subset for storing T(i) is identified by computing m^i, wherein m^i = max{g, for all g's where RW^i_g = 1}. As noted above in connection with Figure 18, m^i represents the scheduling queue Q(m^i) that has the highest scheduling weight among scheduling queues that are eligible to store T(i). At step 1132, T(i) is moved to the waiting pool of Q(m^i) in the inactive subset, and the token redistribution process for T(i) is complete.
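The overserved branch just described (steps 1136 through 1140, followed by steps 1130 and 1132) can be sketched as follows. The Python fragment is illustrative only. The function name redistribute_overserved is hypothetical, the residual weight is modeled as an ordinary signed integer instead of the N+1 bit representation described above, and the caller is assumed to have already confirmed that CQ(i) remains backlogged and that RW^i does not exceed L_i.

```python
def redistribute_overserved(rw: int, length: int, weight: int):
    """Handle a backlogged connection queue whose residual weight rw <= length.

    Returns (new_rw, destination), where destination is either
    ("inactive", "collection") or ("inactive", g) for the waiting pool of Q(g).
    """
    # Step 1136: charge the served packet and credit the next cycle's allocation.
    new_rw = rw - length + weight
    if new_rw <= 0:
        # Steps 1138 and 1140: the whole next allocation has been borrowed.
        return new_rw, ("inactive", "collection")
    # Steps 1130 and 1132: m^i is the highest bit of the new residual weight set to 1.
    m = new_rw.bit_length() - 1
    return new_rw, ("inactive", m)
```

As a worked example, a connection with W^i = 5 and RW^i = 2 that serves a 4 cell packet ends with RW^i = 2 - 4 + 5 = 3 (binary 011), so its token would go to Q(1) of the inactive subset. The same connection serving a 9 cell packet ends with RW^i = -2, so its token would go to the collection queue.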
Step 1116 is reached when service to the packet queued at the head of CQ(i) did not result in CQ(i) overusing the amount of service allotted thereto during the current serving cycle. When CQ(i) has not been overserved, the algorithm seeks to move T(i) to the eligible scheduling queue (as determined from RW^i) that will be selected the soonest according to a subsequent serving value.
At step 1116, the subround number is compared with N. If SR=N, this means that the active subset of scheduling queues is about to toggle between Subset A and Subset B. Therefore, it is preferred that T(i) be moved to the waiting pool of the scheduling queue in the inactive subset (the inactive subset will soon be the active subset) that has the highest scheduling weight among eligible scheduling queues for the token. First, at step 1128, the value of RW^i is reset to W^i. RW^i is reset because CQ(i) will not be served again until the next serving cycle given that T(i) has been moved to the inactive subset. Any allotted amount of service left unused by a connection queue during a serving cycle (i.e., credit) will not be passed to the next serving cycle. However, as can be seen from step 1136 described above, deficits will be passed to the next serving cycle.
Next, at step 1130, the algorithm computes m^i. As noted above, m^i represents the scheduling queue Q(m^i) that has the highest scheduling weight among scheduling queues that are eligible to store T(i). At step 1132, T(i) is moved to the waiting pool of Q(m^i) in the inactive subset, and the redistribution process for T(i) is complete.
If step 1116 results in a determination that the current subround is not the final subround of the serving cycle, this means that T(i) should be moved to one of the scheduling queues in the currently active subset. At step 1118, RW^i is adjusted to reflect the service given to the packet that was queued at the head of CQ(i) (RW^i is decremented by L_i). Next, at step 1120, the algorithm begins the process of determining the appropriate scheduling queue for storing T(i) by determining whether the current serving value is the last serving value of the current subround. If S is the last serving value of the current subround, then the most significant N+1-SR bits will equal 1 and the least significant SR-1 bits will equal 0. If step 1120 results in a determination that S is on its last value for the subround, the algorithm preferably ensures that T(i) will not be queued in a scheduling queue not eligible for selection during the next subround (see Figure 15 for scheduling queue selection eligibility as a function of subround). Thus, at steps 1122 and 1126, after the scheduling queue having the highest scheduling weight among eligible scheduling queues as determined from RW^i is identified (Q(m^i)), the algorithm either stores T(i) in Q(m^i) (if Q(m^i) is eligible for selection during the coming subround), or stores T(i) in Q(N-(SR+1)) (if Q(m^i) is not eligible for selection during the coming subround). Q(N-(SR+1)) represents the scheduling queue having the highest scheduling weight that will be selected during the coming subround. The value h^i, where h^i = min{m^i, N-(SR+1)}, represents the result of this decision-making process. T(i) is stored in the waiting pool of Q(h^i) in the active set at step 1126.
If step 1120 results in a determination that S is not on its last serving value for the subround, then the algorithm proceeds to step 1124. At steps 1124 and 1126, after the scheduling queue having the highest scheduling weight among eligible scheduling queues as determined from RW^i is identified (Q(m^i)), the algorithm either stores T(i) in Q(m^i) (if Q(m^i) is eligible for selection during the current subround) or stores T(i) in Q(N-SR) (if Q(m^i) is not eligible for selection during the current subround). Q(N-SR) represents the scheduling queue having the highest scheduling weight that will be selected during the current subround. The value h^i, where h^i = min{m^i, N-SR}, represents the result of this decision-making process. T(i) is stored in the waiting pool of Q(h^i) in the active set at step 1126.
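The branch of Figure 20 for a connection queue that has not been overserved (steps 1116 through 1132) can be summarized in code. The following Python fragment is a sketch under stated assumptions: the function name redistribute_not_overserved is hypothetical, residual and connection weights are modeled as positive integers no larger than 2^N-1, and the caller is assumed to have verified that CQ(i) is still backlogged and that RW^i is greater than L_i.

```python
def redistribute_not_overserved(rw: int, length: int, weight: int,
                                s: int, sr: int, n: int):
    """Return (new_rw, (subset, g)) for a token whose queue still has credit (rw > length)."""
    if sr == n:
        # Step 1128: unused credit is not carried over; reset to the connection weight.
        new_rw = weight
        # Steps 1130 and 1132: highest eligible queue of the inactive subset.
        return new_rw, ("inactive", new_rw.bit_length() - 1)

    # Step 1118: charge the packet just served.
    new_rw = rw - length
    m = new_rw.bit_length() - 1            # m^i, highest bit of new_rw equal to 1

    # Step 1120: is S the last serving value of the current subround?
    all_ones = (1 << n) - 1
    low_bits = (1 << (sr - 1)) - 1
    last_value = s == (all_ones & ~low_bits)

    # Steps 1122 or 1124: cap by the best queue still selectable.
    cap = n - (sr + 1) if last_value else n - sr
    return new_rw, ("active", min(m, cap))   # step 1126
```

The only difference between the two active-subset cases in this sketch is the cap applied to m^i, which reflects whether the coming subround or the current subround governs which scheduling queues can still be selected.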
Figure 21 illustrates the algorithm for processing tokens that are stored in the collection queue of the active subset. The process of Figure 21 can run in parallel with the steps of Figures 18-20. As can be seen from step 1150, the process of Figure 21 continues until no more tokens are stored in the collection queue. At step 1152, the token T(i) at the head of the collection queue is dequeued. At step 1154, the residual weight value for CQ(i) is increased by W^i to reflect the fact that no service will be given to CQ(i) during the next serving cycle. At step 1156, the algorithm determines whether the newly updated residual weight value for CQ(i) is greater than zero. If RW^i is greater than zero, then the algorithm moves T(i) to the waiting pool of Q(m^i) in the inactive subset, wherein m^i is computed as noted above (steps 1158 and 1160). If RW^i is less than or equal to zero, then the algorithm moves T(i) to the collection queue of the inactive subset (step 1162), where the token must again wait for the process of Figure 21 to be revisited before the possibility of distribution to the waiting pool of a scheduling queue exists. From either step 1160 or step 1162, the algorithm loops back to step 1150 to check whether the collection queue is empty.
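The collection queue processing of Figure 21 amounts to a simple loop. The following Python fragment is illustrative only. The function name process_collection_queue and the container layout are hypothetical: tokens are represented by their connection index i, residual and connection weights are held in dictionaries keyed by i, and the waiting pools of the inactive subset are modeled as a dictionary of lists keyed by queue index g.

```python
from collections import deque


def process_collection_queue(active_collection: deque, rw: dict, weight: dict,
                             inactive_waiting: dict, inactive_collection: deque) -> None:
    """Drain the active subset's collection queue (steps 1150 through 1162)."""
    while active_collection:                      # step 1150: any tokens left?
        i = active_collection.popleft()           # step 1152: dequeue T(i)
        rw[i] += weight[i]                        # step 1154: repay one cycle of service
        if rw[i] > 0:                             # step 1156
            m = rw[i].bit_length() - 1            # step 1158: m^i
            inactive_waiting[m].append(i)         # step 1160
        else:
            inactive_collection.append(i)         # step 1162: still in deficit
```

In this sketch a connection with RW^i = -4 and W^i = 3 remains in a collection queue for another serving cycle (its residual weight becomes -1), while a connection with RW^i = -1 and W^i = 5 is returned to the waiting pool of Q(2) with RW^i = 4.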
E. Other Considerations
The scheduling method of the present invention has been described in the context of scheduling objects such as cells or variable length packets traveling through a plurality of connections and competing for access to a common resource, such as a switch fabric input, a switch fabric output, or a multiplexer. However, it should be understood that the scheduling method of the present invention is not limited in its applications to data traffic over networks, and can be implemented as a general type scheduler where the common resource can be any type of device or link for which a plurality of connections compete for access, including but not limited to a CPU, shared database, etc.
Also, the relationship between scheduling queue selection and serving values has been described using 1's as the bit value from which bit patterns are set. However, it should be understood that a person could use a 0 as the predetermined value from which bit patterns are set. For example, after each scheduling queue is associated with a bit in the serving value, the controller can be configured to select a scheduling queue when the least significant bit in the serving value equal to 0 is the bit associated with the selected scheduling queue. In such a configuration, the unused state would be S=111...111. Similarly, the eligible scheduling queues for the placement of token T(i) associated with CQ(i) can be determined from the bits in 2^N-1-W^i or 2^N-1-RW^i that are equal to the predetermined value of 0 rather than 1.
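The equivalence between the two conventions can be checked with a brief sketch. The following Python fragment is illustrative only: the function names are hypothetical, and the association of bit position k with Q(N-1-k) is an assumption carried over from the examples given earlier in this description.

```python
def select_queue_ones(s: int, n: int) -> int:
    """Queue index chosen from the least significant 1 bit of s (unused state: s == 0)."""
    k = (s & -s).bit_length() - 1
    return n - 1 - k


def select_queue_zeros(s: int, n: int) -> int:
    """Queue index chosen from the least significant 0 bit of s (unused state: all 1's)."""
    mask = (1 << n) - 1
    inverted = ~s & mask                  # 1 wherever s has a 0 bit
    k = (inverted & -inverted).bit_length() - 1
    return n - 1 - k


def eligible_queues_from_ones(w: int, n: int):
    """Queues Q(g) for which bit g of the weight equals 1."""
    return [g for g in range(n) if (w >> g) & 1]


def eligible_queues_from_zeros(complement: int, n: int):
    """Queues Q(g) for which bit g of 2^N-1-W equals 0."""
    return [g for g in range(n) if not (complement >> g) & 1]
```

Under these assumptions, a serving value s in the 1's convention and the value 2^N-1-s in the 0's convention select the same scheduling queue, and a weight W^i yields the same eligible queues as its complement 2^N-1-W^i read for 0 bits.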
Furthermore, the residual weight values maintained for each connection queue have been described wherein RW^i tracks how much service remains available to CQ(i) by initially being set equal to W^i and thereafter being decremented by the length of each packet served from CQ(i). It should be readily understood by those of ordinary skill in the art that RW^i can also be maintained such that it is initially set to zero and is increased when CQ(i) is served by adding the length of the served packet thereto. In such cases, the decision-making process for token redistribution will partly depend on whether RW^i has reached W^i rather than whether RW^i has reached 0.
It is easily understood that one having ordinary skill in the art would recognize many modifications and variations within the spirit of the invention that may be made to the particular embodiments described herein. Thus, the scope of the invention is not intended to be limited solely to the particular embodiments described herein, which are intended to be illustrative. Instead, the scope of the invention should be determined by reference to the claims below in view of the specification and figures, together with the full scope of equivalents allowed under applicable law.