|Publication number||US7409316 B1|
|Application number||US 11/613,281|
|Publication date||Aug 5, 2008|
|Filing date||Dec 20, 2006|
|Priority date||Nov 12, 2003|
|Also published as||US6973415, US7107187, US7243049, US7437281|
|Inventors||Dean Lee Saghier, Brian Washburn, Scott Schroder|
|Original Assignee||Sprint Communications Company L.P.|
This application is a continuation of U.S. application Ser. No. 11/344,502 filed Jan. 31, 2006, now pending, entitled METHOD FOR PERFORMANCE MONITORING AND MODELING, which is a continuation of U.S. application Ser. No. 11/056,075 filed Feb. 11, 2005, now pending, entitled METHOD FOR PERFORMANCE MONITORING AND MODELING, which is a divisional of U.S. application Ser. No. 10/914,405, now U.S. Pat. No. 6,973,415, filed Aug. 9, 2004, now pending, entitled SYSTEM AND METHOD FOR MONITORING AND MODELING SYSTEM PERFORMANCE, which is a continuation of U.S. application Ser. No. 10/706,707 filed Nov. 12, 2003, now pending, entitled METHOD FOR MODELING SYSTEM PERFORMANCE.
The present invention relates to the modeling of systems comprising computer software operating on computer hardware. More particularly, the present invention relates to real-time collection of system metrics and the systems and methods for the modeling of system performance parameters as non-linear functions for use in predicting system performance, identifying circumstances at which system performance will become unacceptable, and issuing alarms when system performance is near or beyond unacceptable conditions.
Computing systems have become an integral part of business, government, and most other aspects of modern life. Most people are likely regrettably familiar with poorly performing computer systems. A poorly performing computer system may be simply poorly designed and, therefore, fundamentally incapable of performing well. Even well-designed systems will perform poorly, however, if adequate resources to meet the demands placed upon the system are not available. Properly matching the resources available to a system with the demand placed upon the system requires both accurate capacity planning and adequate system testing to predict the resources that will be necessary for the system to function properly at the loads expected for the system.
Predicting the load that will be placed upon a system may involve a number of issues, and this prediction may be performed in a variety of ways. For example, future load on a system may be predicted using data describing the historical change in the demand for the system. Such data may be collected by monitoring a system or its predecessor, although such historical data may not always be available, particularly for an entirely new system. Other methods, such as incorporating planned marketing efforts or other future events known to be likely to occur, may also be used. The way in which system load is predicted is immaterial to the present invention.
Regardless of how a prediction of future system load is made, a system must have adequate resources to meet that demand if the system is to perform properly. Determining what amount of resources is required to meet a given system demand may also be a complex problem. Those skilled in the art will realize that system testing may be performed, often before a system is deployed, to determine how the system will perform under a variety of loads. System testing may allow system managers to identify the load at which system performance becomes unacceptable, which may coincide with a load at which system performance becomes highly nonlinear. One skilled in the art will also appreciate that such testing can be an enormously complex and expensive proposition, and will further realize that such testing often does not provide accurate information as to the load at which a system's performance will deteriorate. One reason for the expense and difficulty of testing is the large number of tests necessary to obtain a reasonably accurate model of system performance.
One skilled in the art will likely be familiar with the modeling of a system's performance as a linear function of load. One skilled in the art will further realize, however, that a linear model of system performance as a function of load is often a sufficiently accurate depiction of system performance within only a certain range of loads, with the range of loads within which system performance is substantially linear varying for different systems. System performance often becomes non-linear at some point as the load on the system increases. The point at which system performance becomes non-linear may be referred to as the point at which the linear model breaks down. The load at which a system's performance begins to degrade in a non-linear fashion may be referred to as the knee. At the knee, system throughput increases more slowly while response time increases more quickly. At this point system performance suffers severely, but identifying the knee in testing can be difficult. Accordingly, while a basic linear model theoretically can be obtained with as few as two data points, additional data points are necessary to determine when a linear model of system performance will break down. Obtaining sufficient data points to determine when a linear model of system performance breaks down often requires extensive testing. At the same time, such testing may not yield an accurate model of system performance, particularly as the system moves beyond a load range in which its performance is substantially linear.
The collection of system metrics in a production environment may be used to monitor system performance. System metrics collected in a production environment may also be used to model system performance. However, linear modeling of system performance using system metrics collected in a production environment will not be likely to yield a better prediction of the system's knee unless the system operates at or beyond that point. Of course, one skilled in the art will appreciate that the purpose of system testing and system modeling is to avoid system operation at and beyond the knee, meaning that if such data is available the modeling and monitoring has already substantially failed.
A further challenge to using system metrics collected in a production environment is the burden of collecting the metrics. Simply put, collecting system metrics consumes resources. The system to be monitored, and/or associated systems operating with it, must measure, record, and process metrics. Particularly when a system is already facing a shortage of resources, the increased cost of monitoring the system's metrics must occur in an efficient fashion and provide significant benefit to be justified.
The present invention provides systems and methods to collect metrics from a system operating in a production environment. The collected metrics may be used as a plurality of data points to model system performance by fitting a non-linear curve to the data points. The use of a non-linear curve may better identify the load at which a system's operation will become unacceptable. Systems and methods in accordance with the present invention may also identify correlations between measured system metrics, which may be used to develop further models of system performance. The present invention may also be utilized to identify a point of interest in system performance for use in monitoring a system in a production environment so that an alarm may issue if system performance exceeds predetermined parameters around the point of interest.
The present invention is described in detail below with reference to the attached drawing figures, wherein:
The present invention provides systems and methods for monitoring system performance, identifying correlations between system metrics, modeling system performance, identifying acceptable operating parameters, and issuing alarms if acceptable operating parameters are exceeded, wherein a system comprises computer software operating on computer hardware. The present invention may be used in conjunction with a system comprising any computer software operating on any computer hardware. One example of such a system is an order processing system that receives orders input into a user interface, processes that information, and then provides pertinent information to persons or systems responsible for filling the orders. However, any system comprising software operating on hardware may be monitored and/or modeled using systems and methods in accordance with the present invention. In systems and methods in accordance with the present invention, system metrics may be measured and used to model system performance. For example, data points for system throughput may be obtained at a plurality of loads, and system performance may then be modeled by fitting a non-linear curve to the data points to obtain a non-linear model of system throughput as a function of load. One skilled in the art will appreciate that the data points used in accordance with the present invention to model system performance may be obtained in a variety of ways. By way of example, data points may be obtained through system testing or through monitoring system performance while the system is in use. A system that is in use may be described as being in a production environment. One skilled in the art will appreciate that numerous methods, procedures, techniques, and protocols exist or may be developed for system testing, and that any of these may be used in system testing to obtain data points for use in accordance with the present invention. 
Likewise, one skilled in the art will appreciate that a variety of methods, procedures, techniques, and protocols exist or may be developed for system monitoring in addition to those described herein, and that any of these may be used to monitor a system operating in its production environment to obtain data points for use in accordance with the system modeling aspects of the present invention.
In the examples described herein for
Referring now to
Referring now to
Referring now to
One skilled in the art will appreciate that a relationship between units of response time and units of throughput were defined to enable response time and throughput to be illustrated on a single graph in
Graphs such as the graph illustrated in
Referring now to
In step 514 the system's response time may be tested using a plurality of loads to obtain a plurality of response time data points. Any testing procedure may be used in step 514. Step 514 may be performed in conjunction with step 512, although these steps may also be performed separately. The response time data points obtained in step 514 are utilized in step 518 to model the system's response time as a function of load by fitting a non-linear curve to the plurality of response time data points. It should be appreciated that the type of non-linear curve fit to the response time data points may vary, but may appropriately comprise an exponential curve. It should be further appreciated that a variety of methods may be used to fit a non-linear curve to the response time data points. Alternatively, a larger number of data points, such as five, or seven as illustrated in
In step 520 a maximum acceptable response time may be defined. For example, users of the system may determine that a response time greater than a given predetermined amount, such as five seconds, is unacceptable. Thus, in this example, the maximum acceptable response time would be five seconds. Using the non-linear curve modeling the system's response time as a function of load, step 522 may determine the maximum system load as the load at which the maximum acceptable response time is reached. Step 524 may then determine the maximum system throughput using the model of the system's throughput to determine the throughput for the system at the maximum load. Step 520, step 522, and step 524 allow a system's operator to obtain information regarding the maximum capabilities of the system in operation. However, one or more of step 520, step 522, and step 524 may be omitted from methods in accordance with the present invention.
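The response time fit of step 518 and the calculations of steps 520 through 524 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation described by the specification: it assumes an exponential response time curve fitted by ordinary least squares on the logarithm of the measurements, and it assumes a logarithmic shape for the throughput curve, which the text above does not specify. The function names and the sample data are hypothetical.

```python
import math

def linear_fit(xs, ys):
    """Ordinary least-squares line ys ~ slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_exponential(loads, values):
    """Fit values ~ a * exp(b * load) with a linear fit on log(values)."""
    b, log_a = linear_fit(loads, [math.log(v) for v in values])
    return math.exp(log_a), b

def fit_logarithmic(loads, values):
    """Fit values ~ c * ln(load) + d (assumed throughput curve shape)."""
    return linear_fit([math.log(x) for x in loads], values)

# Hypothetical test data, generated from known curves for illustration only.
loads = [10, 20, 30, 40, 50]
response_times = [0.5 * math.exp(0.04 * x) for x in loads]  # seconds
throughputs = [20.0 * math.log(x) + 5.0 for x in loads]     # transactions/s

a, b = fit_exponential(loads, response_times)   # step 518: response time model
c, d = fit_logarithmic(loads, throughputs)      # throughput model (shape assumed)

max_response_time = 5.0                         # step 520: defined by users
max_load = math.log(max_response_time / a) / b  # step 522: invert the model
max_throughput = c * math.log(max_load) + d     # step 524: throughput at max load
```

Fitting on the logarithm keeps the sketch dependency-free; a production implementation might prefer a dedicated non-linear least-squares routine, since the log transform weights small response times more heavily.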
In step 530 a linear relationship may be defined between response time and throughput. This definition in step 530 may be used in step 532 of displaying a graph of the throughput model curve and the response time model curve in a single graph. If step 530 is omitted, step 532, if performed, may display multiple graphs.
Step 530 may further permit step 534 of calculating the distance between the throughput model curve and the response time model curve. This distance may be monitored in system operation and an alarm may be issued if the distance falls below a threshold amount. The distance calculated in step 534 may be used in step 536 to determine the optimal load for the system. For example, the optimal load may be the load at which the distance between the curves is maximized. Optimal load may be defined in other ways, as well, such as the load at which a desired throughput or response time is attained or the load at which a given system utilization is reached. An optimal range may be defined around the optimal load for use in monitoring system performance and issuing an alarm should system performance exceed the optimal range.
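The distance calculation of step 534 and the optimal load of step 536 might look like the following sketch. It assumes the linear relationship of step 530 is a simple scale factor (throughput units per second of response time), searches a list of candidate loads rather than solving analytically, and defines the optimal range as a fixed fraction around the optimal load. All three choices, along with the model coefficients, are hypothetical.

```python
import math

def curve_distance(load, throughput_model, response_model, units_per_second):
    """Distance between the curves once response time is scaled to throughput units."""
    return throughput_model(load) - units_per_second * response_model(load)

def find_optimal_load(loads, throughput_model, response_model, units_per_second):
    """Return the candidate load at which the distance between the curves is largest."""
    return max(
        loads,
        key=lambda x: curve_distance(x, throughput_model, response_model,
                                     units_per_second),
    )

def optimal_range(optimal_load, tolerance=0.1):
    """Hypothetical range: plus or minus a fraction of the optimal load."""
    return optimal_load * (1 - tolerance), optimal_load * (1 + tolerance)

# Hypothetical fitted models of throughput and response time versus load.
def throughput(load):
    return 20.0 * math.log(load)

def response_time(load):
    return 0.5 * math.exp(0.04 * load)

best = find_optimal_load(range(1, 101), throughput, response_time, 10.0)
low, high = optimal_range(best)
in_range = low <= 42 <= high  # check whether a measured load of 42 is in range
```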
Referring now to
In step 614 the system's response time is measured at predetermined times to obtain response time at the loads existing at the measurement times. As with step 612, step 614 may be performed at a variety of times, such as on a daily, hourly, weekly, or other basis as appropriate for the system in question. In step 615 the response time data points may be stored. Step 615 may store the response time data points to a hard drive, in computer memory, or in any other fashion. In step 618 the response time data points may be used to model the system's response time as a function of load by fitting a non-linear curve to the stored system response data points. It should be noted that a variety of non-linear curves may be used, such as an exponential curve. One skilled in the art will realize that a variety of curve fitting methodologies may be used. It should be further noted that step 618 may be performed at a variety of times. For example, step 618 may be performed at predetermined times, such as on a daily or weekly basis. Alternatively, step 618 may be performed every time a predetermined number of new response time data points have been stored in step 615. For example, step 618 may be performed when one, ten, one hundred, or some other number of new data points have been stored in step 615. Whatever timing is used to perform step 618, it may be expected that as additional response time data points are added the curve modeled in step 618 will reflect system response time as a function of load with increasing accuracy. Step 618 may use every stored response time data point, or it may use a subset of stored response time data points, such as the data points for the last week of operation.
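One way to picture the interplay of steps 615 and 618 is a small store that refits the curve each time a predetermined number of new data points arrives. The sketch below is illustrative only: the class and attribute names are hypothetical, the subset of points is limited by count rather than by calendar time, and the exponential curve is fitted by least squares on the logarithm of the response times.

```python
import math
from collections import deque

class ResponseTimeModel:
    """Stores (load, response time) points and refits an exponential curve."""

    def __init__(self, refit_every=10, window=1000):
        self.points = deque(maxlen=window)  # keep only the most recent points
        self.refit_every = refit_every
        self.new_points = 0
        self.params = None                  # (a, b) for a * exp(b * load)

    def record(self, load, response_time):
        """Store one point (step 615); refit when enough are new (step 618)."""
        self.points.append((load, response_time))
        self.new_points += 1
        if self.new_points >= self.refit_every:
            self.refit()

    def refit(self):
        xs = [p[0] for p in self.points]
        ys = [math.log(p[1]) for p in self.points]
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        if sxx == 0:
            return                          # need at least two distinct loads
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        b = sxy / sxx
        self.params = (math.exp(mean_y - b * mean_x), b)
        self.new_points = 0

# Hypothetical measurements generated from a known curve.
model = ResponseTimeModel(refit_every=5)
for load in [10, 20, 30, 40]:
    model.record(load, 0.5 * math.exp(0.04 * load))
fitted_early = model.params  # no fit yet: only four points have been stored
model.record(50, 0.5 * math.exp(0.04 * 50))
```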
In step 620 a maximum acceptable response time may be defined. The maximum acceptable response time may be a predetermined amount of time within which a response must be made by the system for system performance to be deemed acceptable. For example, a maximum acceptable response time of five seconds may be used. If system response time is being monitored, step 621 may issue an alarm if the maximum acceptable response time is reached or exceeded. Such an alarm may indicate that the system requires additional resources or adjustments to function properly. Alternatively, step 621 may issue an alarm when response time reaches a predetermined percentage of the maximum acceptable response time, such as, for example, eighty percent.
Based upon the maximum acceptable response time defined in step 620 and the model of the system's response time as a function of load created in step 618, step 622 may determine the maximum system load as the load at which the maximum acceptable response time is reached. In step 623 an alarm may be issued if the maximum system load is reached. Alternatively, step 623 may issue an alarm if a predetermined percentage of the maximum system load is reached, such as, for example, eighty percent. In step 624 the maximum system load determined in step 622 and the model of the system's throughput as a function of load created in step 616 may be used to determine the maximum system throughput as the throughput at the maximum system load. In step 625 an alarm may be issued if the maximum system throughput is reached. Alternatively, step 625 may issue an alarm if a predetermined percentage of the maximum throughput is reached, for example eighty percent.
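The alarm checks of steps 621, 623, and 625 share one pattern: compare a current value against a maximum, or against a predetermined percentage of it. A minimal sketch follows. The text presents the percentage check as an alternative to the check against the full maximum; combining the two into a "warning" and an "alarm" level is an assumption made here for illustration, as are the metric names and numbers.

```python
def check_limits(current, maximums, warn_fraction=0.8):
    """Return {metric: "alarm" or "warning"} for metrics at or near their maximum."""
    alerts = {}
    for name, maximum in maximums.items():
        value = current[name]
        if value >= maximum:
            alerts[name] = "alarm"          # maximum reached or exceeded
        elif value >= warn_fraction * maximum:
            alerts[name] = "warning"        # predetermined percentage reached
    return alerts

# Hypothetical maximums (steps 620, 622, 624) and current measurements.
maximums = {"response_time": 5.0, "load": 57.0, "throughput": 86.0}
current = {"response_time": 4.5, "load": 30.0, "throughput": 90.0}
alerts = check_limits(current, maximums)
```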
In step 630 a relationship may be defined between response time and throughput. The relationship defined in step 630 may be a linear relationship. In step 632 a graph may be displayed of the throughput model curve and the response time model curve. Step 632 may display both curves in a single graph through a graphical user interface. If step 630 is omitted, step 632 may display the curves in multiple graphs.
The relationship between throughput and response time defined in step 630 may also be used to calculate a distance between the throughput model curve and the response time model curve in step 634. Using distance as calculated in step 634, step 636 may determine an optimal load as the load at which the distance between the curves is maximized. Optimal load may be defined in other ways, as well, such as the load at which a desired throughput or response time is attained or the load at which a given system utilization is reached. Step 640 may define an optimal load range around the optimal load. In step 641 an alarm may be issued if the optimal load range defined in step 640 is exceeded.
Of course, methods in accordance with the present invention, such as method 500 and method 600, may be used to model network parameters other than system throughput and system response time. Methods in accordance with the present invention may be used to measure a first network parameter, model the first network parameter as a non-linear curve, measure a second network parameter, and model the second network parameter as a non-linear curve. Measuring the first network parameter and the second network parameter may comprise testing the system, measuring the network parameters during system operation, or a combination thereof. A relationship may be defined between the first network parameter and the second network parameter. Such a relationship may allow a distance to be determined between the curve modeling the first network parameter and the curve modeling the second network parameter. Such a relationship may also allow the display of the curve modeling the first network parameter and the curve modeling the second network parameter on a single graph.
It should be appreciated that example method 500 and example method 600 are exemplary methods in accordance with the present invention, and that many steps discussed therein may be omitted, while additional steps may be added. The methods in accordance with the present invention are not limited to any particular way of obtaining data points, whether through testing or monitoring a system in a production environment, nor are they limited to any particular method for fitting a non-linear curve to the data points obtained.
Referring now to
Referring now to
The first step in the invention is to build a logadaptor process 820. The logadaptor process 820 creates a set of files to reside on the plurality of systems. These files are used to collect metrics and hold threshold data. The logadaptor process 820 may be built through a set of initialization steps and may run on the system 710. Subsequently, it is replicated onto the rest of the systems throughout the network where it runs independently on those systems. Essentially, one can have one logadaptor process 820 running on each system 710 in the network. The logadaptor process 820 can run on all of the systems or can run on a select number of systems. The logadaptor process 820 can be replicated onto other systems via devices such as a cron process or a daemon process.
The logadaptor process 820 takes inputs 830 from a user to set up the initial conditions for the types of applications or systems to be monitored and the types of metrics to be collected. As the logadaptor process 820 begins to run on the plurality of systems, it will automatically generate configuration files 840 pertaining to threshold data and alarm data to run on a particular system. The logadaptor process 820 may perform this function in two ways: it can monitor applications on the system on which it is running, or it can monitor applications on remote systems. Once the configuration files 840 are generated automatically, the configuration files 840 initiate the function for collecting metrics data. The logadaptor process 820 stores the metric data in files 850, which subsequently get transferred to a database 880 (same as 795).
It should be further realized that a variety of actions may be taken if an alarm is issued in accordance with the present invention for a system in a production environment. Additional computing resources may be added, such as additional servers or additional system memory, the software of the system may be improved and modified to enhance efficiency, or some combination of the two may be taken. Alternatively, steps may be taken to reduce load on the system to facilitate better system performance. The steps taken by a system operator due to information obtained in practicing methods in accordance with the present invention are immaterial.
Referring now to
The system metric data may be analyzed to identify correlations between the system metrics in identification step 720. Step 720 may be performed at a central processing hub. Identification step 720 may be particularly advantageous when a large number of metrics are measured, not all of which have known correlations between them. In identification step 720 various metrics data may be analyzed over a given period of time to determine whether a mathematical relationship exists between a pair of metrics, such as system load and processor utilization. The identification of correlations between system metrics may then be used to provide more accurate models of system performance. Step 720 may identify pairs of metrics having an identified correlation between them. For example, a primary pair of system metrics may comprise a first system metric and a second system metric having a correlation between them. By way of further example, a secondary pair of system metrics may comprise a third system metric and a fourth system metric. There may be metrics common between, for example, a primary pair of system metrics and a secondary pair of system metrics, such that the primary pair of system metrics comprises a first system metric and a second system metric and the secondary pair of system metrics comprises a third system metric and the second system metric. Also, a correlation may be identified, for example, between the first system metric and the third system metric in this example. One skilled in the art will appreciate that any number of pairs of any number of system metrics may have correlations between them, and that the designation of a pair as primary, secondary, tertiary, etc. and the designation of a metric as first, second, third, etc. are immaterial to the present invention.
Step 720 may alternatively identify system metrics having a correlation with a predetermined system metric. In many uses, a system metric such as system load may be the predetermined system metric with which other system metrics are analyzed for correlations, although any system metric may be used for this purpose.
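Both forms of identification step 720 can be illustrated with a short sketch. The specification does not name a correlation measure; the Pearson correlation coefficient is used here purely as an assumption, and it detects only linear relationships. The metric names, sample values, and the 0.9 threshold are likewise hypothetical.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def correlated_pairs(metrics, threshold=0.9):
    """Every pair of metrics whose absolute correlation meets the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(metrics.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, r))
    return pairs

def correlated_with(metrics, target, threshold=0.9):
    """Metrics correlated with one predetermined metric, such as system load."""
    return [name for name, series in metrics.items()
            if name != target
            and abs(pearson(metrics[target], series)) >= threshold]

# Hypothetical samples collected at the same measurement times.
metrics = {
    "load": [10, 20, 30, 40, 50],
    "cpu_utilization": [12, 22, 31, 41, 52],
    "open_files": [5, 1, 4, 2, 3],
}
pairs = correlated_pairs(metrics)
related_to_load = correlated_with(metrics, "load")
```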
In step 725 system performance is modeled as a non-linear relationship between system metrics. The model constructed in modeling step 725 may utilize correlations identified in identification step 720 or may use predetermined metrics identified by a system administrator or others through prior experience.
In step 730 unacceptable system operation ranges may be identified. For example, a model constructed in modeling step 725 may indicate that a certain monitored system metric, such as system response time, may reach an unacceptable range when another system metric, such as system load, reaches a given point. Step 730 may identify a variety of unacceptable system operation ranges for each pair of metrics modeled, and may further identify unacceptable operation ranges for more than one pair of metrics. For example, varying degrees of unacceptable system response time may be identified. The degree to which each identified range is unacceptable may increase, from a moderately unacceptable level that requires prompt attention to correct to a fully unacceptable response time which requires immediate corrective action.
In step 735 an optimal system operation range may be identified using the model constructed in modeling step 725. Methods such as those described above that maximize the distance between curves modeling two different metrics as a function of load may be used to identify an optimal system operation range in step 735.
Alarm thresholds may be defined in step 740. The alarm thresholds defined in step 740 may be based upon one or more unacceptable system operation ranges identified in step 730 and/or an optimal system operation range identified in step 735. The alarms defined in step 740 may constitute varying degrees and may be based upon different metrics. For example, an alarm may be defined to trigger if system metrics leave the optimal system operation range defined in step 735. Such an alarm may be of a low level, given that the performance of the monitored system may be non-optimal but may remain entirely acceptable to users. A higher level of alarm may then issue if one or more system metrics enter into an unacceptable system operation range. If varying degrees of unacceptable system operation ranges were identified in step 730, correspondingly differing alarms may be defined in step 740.
In step 745 alarms may be issued if alarm thresholds are met. Step 745 may operate based upon the alarm thresholds defined in step 740 and the system metrics collected in step 715. Alternatively, a system may periodically receive alarm thresholds defined in step 740 and may issue an alarm if the system's metrics recorded in step 710 meet or exceed an alarm threshold.
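The graded alarms of steps 730 through 745 can be sketched as a single classification function. The sketch assumes a metric for which larger values are worse, such as response time, and the level names and threshold values are hypothetical.

```python
def classify(value, optimal_range, unacceptable_levels):
    """Return the alarm level for a metric value, or None inside the optimal range.

    unacceptable_levels maps a level name to the value at which that level begins.
    """
    level = None
    low, high = optimal_range
    if not (low <= value <= high):
        level = "low"  # non-optimal, but possibly still acceptable to users
    # Walk the unacceptable ranges from mildest to most severe.
    for name, threshold in sorted(unacceptable_levels.items(),
                                  key=lambda item: item[1]):
        if value >= threshold:
            level = name
    return level

# Hypothetical response time thresholds, in seconds.
response_levels = {"moderate": 5.0, "severe": 8.0}
levels = [classify(v, (0.0, 2.0), response_levels) for v in (1.5, 3.0, 6.0, 9.0)]
```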
Referring now to
Referring now to
System 900 includes component 910. Component 910 includes a log adaptor 912. Log adaptor 912 may operate on a server or other computing device and may execute processes in accordance with software implementing the methods of the present invention. Log adaptor 912 may rely upon manually generated configuration files 915 in operation. Manually generated files 915 may include a DSI configuration file 916. The DSI configuration file 916 may comprise lines describing the type-delimited metrics to collect, the type-delimited metric format, the path to the log set, the value at which to trigger an alarm for a metric (which may be left blank to turn off alarming), the application name, the application designation, the OpenView OPC message group, whether DSI logging is on or off, and settings for specific files such as the maximum indexes per record per hour per summarization or per retention. Manually generated files 915 may further comprise an IDS configuration file 917 to set two initial type-delimited values, the first being the starting transaction ID number and the second being the starting metric ID number to use when generating new specification files. Manually generated files 915 may further include the OBCA client configuration file 918.
Automatically generated files 920 may also be part of component 910. Automatically generated files 920 may include a class configuration file 921 that contains one line per transaction with the short transaction name from the transaction configuration file 922. Transaction configuration file 922 may be a translation file to accommodate the 18-character limit in MeasureWare log file set names. The transaction configuration file 922 may contain one line per transaction, each line having two type-delimited values. The first value may be a potentially shortened form of the full transaction name that is within the 18-character maximum, and the second value may be the full transaction name. The long name of a transaction may be the name actually used for a log file, with the short name being used in the class configuration file 921 for use by MeasureWare. The time threshold configuration file 923 may hold the average service time threshold violation values per transaction for levels of warnings such as minor, major, critical, or other levels desired by a user. An error threshold configuration file 924 may also be used, but may be omitted. A percent threshold configuration file 925 also may be optionally included. A previous alarm configuration file 926 may be included to maintain historical alarm information.
Log adaptor 912 may receive manually generated files 915 and may operate in the generation of automatically generated files 920.
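The translation performed by the transaction configuration file 922 can be illustrated as follows. The specification states only that the short name must fit within 18 characters; the truncation rule, the numeric suffix used to keep short names unique, the `|` delimiter, and the transaction names below are all assumptions made for this sketch.

```python
MAX_SHORT_NAME = 18  # character limit stated in the text

def build_transaction_lines(full_names, delimiter="|"):
    """Return one 'short<delimiter>full' line per transaction."""
    used = set()
    lines = []
    for full in full_names:
        short = full[:MAX_SHORT_NAME]
        counter = 1
        # Hypothetical collision rule: replace the tail with a counter.
        while short in used:
            suffix = str(counter)
            short = full[:MAX_SHORT_NAME - len(suffix)] + suffix
            counter += 1
        used.add(short)
        lines.append(f"{short}{delimiter}{full}")
    return lines

# Hypothetical transaction names.
lines = build_transaction_lines([
    "submit_order",
    "customer_account_lookup_primary",
    "customer_account_lookup_secondary",
])
short_names = [line.split("|")[0] for line in lines]
```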
One skilled in the art will appreciate that methods in accordance with the present invention may be implemented using computer software. Such software may take the form of computer readable code embodied on one or more computer readable media. Software implementing the present invention may operate independently, but may also be incorporated with system testing software or system monitoring software. Any software language may be used to implement methods in accordance with the present invention.
|U.S. Classification||702/182, 702/186, 702/179, 702/184, 702/183, 709/223, 703/22, 709/224, 703/21|
|International Classification||G06F15/00, G05B23/02, G06F19/00|
|Cooperative Classification||G06Q10/0639, G06Q10/06, G05B23/0294, G05B23/024|