WO2008122459A2 - Monitoring reliability of a digital system - Google Patents

Monitoring reliability of a digital system

Publication number
WO2008122459A2
Authority
WO
WIPO (PCT)
Prior art keywords
digital system
frequency
maximum frequency
rate
change
Prior art date
Application number
PCT/EP2008/052021
Other languages
French (fr)
Other versions
WO2008122459A3 (en)
Inventor
Dae Ik Kim
Jonghae Kim
Moon Ju Kim
James Randal Moulic
Hong Hua Song
Original Assignee
International Business Machines Corporation
Ibm United Kingdom Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corporation, Ibm United Kingdom Limited
Priority to KR1020097015211A (patent KR101114054B1)
Priority to JP2010502470A (patent JP5181312B2)
Publication of WO2008122459A2
Publication of WO2008122459A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/008 Reliability or availability analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring

Definitions

  • the present invention relates in general to the field of failure prediction and, more specifically, to a reliability measurement and warning method, system and computer program product for a digital system.
  • Failure rates of individual components making up a digital system are fundamentally related to various parameters, including operating temperatures, as well as scaling of the digital system and interconnect geometries.
  • Although burn-in testing of digital systems attempts to predict a lifecycle for a given type of digital system, it does not provide aging information for each specific digital system of the type being manufactured.
  • a customer or user may uncover a problem with a digital system only after a catastrophic system failure. While catastrophic failure of a digital system is readily recognizable, a "soft" failure (where there may be significant degradation in digital system performance or reliability) may go unnoticed, which implies that such aging of the digital system may cause undetected errors in computation and data, from which it is difficult to recover.
  • Presented herein is an approach for actively monitoring or measuring aging, and hence reliability, of a specific digital system and for issuing a warning signal if, for example, degradation of operation of the system exceeds a specified threshold.
  • a method of monitoring the reliability of a digital system includes: periodically determining a maximum frequency of operation of the digital system; and generating a warning signal indicative of reliability degradation of the digital system if at least one of: (i) a measured or estimated maximum frequency of operation of the digital system is below a warning threshold frequency of operation for the digital system, wherein the warning threshold frequency is greater than or equal to a manufacturer specified minimum frequency of operation for the digital system; or (ii) a rate of change in a difference between measured maximum frequencies of operation of the digital system exceeds an acceptable rate of change threshold for the digital system.
  • a system of monitoring reliability of a digital system includes control logic adapted to periodically determine a maximum frequency of operation of the digital system, and to generate a warning signal indicative of reliability degradation of the digital system if at least one of: (i) a measured or estimated maximum frequency of operation of the digital system is below a warning threshold frequency of operation for the digital system, wherein the warning threshold frequency is greater than or equal to a manufacturer specified minimum frequency of operation for the digital system; or (ii) a rate of change in a difference between measured maximum frequencies of operation of the digital system exceeds an acceptable rate of change threshold for the digital system.
  • an article of manufacture which includes at least one computer-usable medium having computer-readable program code logic to facilitate monitoring of reliability of a digital system.
  • the computer-readable program code logic, when executing, performs the following: periodically determining a maximum frequency of operation of the digital system; and generating a warning signal indicative of reliability degradation of the digital system if at least one of: (i) a measured or estimated maximum frequency of operation of the digital system is below a warning threshold frequency of operation for the digital system, wherein the warning threshold frequency is greater than or equal to a manufacturer specified minimum frequency of operation for the digital system; or
  • the present invention provides a method of monitoring reliability of a digital system, the method comprising: periodically determining a maximum frequency of operation of the digital system; and generating a warning signal indicative of reliability degradation of the digital system if at least one of: (i) a measured or estimated maximum frequency of operation of the digital system is below a warning threshold frequency of operation for the digital system, wherein the warning threshold frequency is greater than or equal to a manufacturer specified minimum frequency of operation for the digital system; or (ii) a rate of change in a difference between measured maximum frequencies of operation of the digital system exceeds an acceptable rate of change threshold for the digital system.
  • the present invention provides a method further comprising periodically determining a rate of change in the difference between measured maximum frequencies of operation of the digital system, and wherein generating the warning signal comprises generating the warning signal if a current rate of change in the difference between measured maximum frequencies of operation of the digital system exceeds the acceptable rate of change threshold for the digital system.
  • the present invention provides a method further comprising periodically determining a rate of change in the difference between measured maximum frequencies of operation of the digital system, and employing multiple determined rates of change in the difference between measured maximum frequencies of operation of the digital system in estimating a next rate of change employing linear model estimation, and employing the estimated next rate of change to estimate a next maximum frequency of operation of the digital system at a next sample time determined by the period of the periodically determining of the maximum frequency of operation of the digital system, and wherein generating the warning signal comprises generating the warning signal if the estimated next maximum frequency of operation of the digital system is below the warning threshold frequency of operation of the digital system.
  • the present invention provides a method further comprising employing multiple measured maximum frequencies of operation to estimate a next maximum frequency of operation utilizing linear model estimation, and wherein the generating of the warning signal comprises generating the warning signal if the estimated maximum frequency of operation of the digital system is below the warning threshold frequency of operation.
  • the present invention provides a method further comprising employing multiple determined rates of change in the difference between measured maximum frequencies of operation of the digital system and historical aging data for the digital system type to estimate a next rate of change in the difference between maximum frequencies of operation of the digital system, and estimating the maximum frequency of operation of the digital system from the estimated rate of change in the difference between maximum frequencies of operation of the digital system and a sampling period employed by the periodically determining of the maximum frequency of operation of the digital system, and wherein generating the warning signal comprises generating the warning signal if the estimated maximum frequency of operation of the digital system is below the warning threshold frequency of operation.
  • the present invention provides a method further comprising employing multiple measured maximum frequencies of operation of the digital system to estimate a next maximum frequency of operation of the digital system employing historical aging data for the digital system type, and wherein generating the warning signal comprises generating the warning signal if the estimated maximum frequency of operation of the digital system is below the warning threshold frequency of operation for the digital system.
  • the present invention provides a method further comprising dynamically adjusting a sampling period employed by the periodically determining of the maximum frequency of operation of the digital system, the dynamically adjusting comprising: determining whether a maximum frequency of operation of the digital system is below a first predefined threshold, and if so, adjusting the sampling period of the periodically determining to increase the sampling rate of the periodically determining.
  • the present invention provides a method wherein the first predefined threshold is a first predefined threshold frequency which is greater than the warning threshold frequency, and wherein the dynamically adjusting of the sampling period further comprises determining whether the maximum frequency of operation of the digital system is less than or equal to the warning threshold frequency of operation, and if so, further adjusting the sampling period to further increase the sampling rate of the periodically determining of the maximum frequency of operation of the digital system.
  • the present invention provides a method further comprising dynamically adjusting a sampling period employed by the periodically determining of the maximum frequency of operation of the digital system, the dynamically adjusting comprising: determining whether a rate of change in the difference between measured maximum frequencies of operation of the digital system is above a first predefined threshold, and if so, adjusting the sampling period of the periodically determining to increase the sampling rate of the periodically determining.
  • the present invention provides a method wherein the dynamically adjusting of the sampling period further comprises determining whether the rate of change in the difference between measured maximum frequencies of operation of the digital system is greater than a second predefined threshold rate, and if so, further adjusting the sampling period to further increase the sampling rate of the periodically determining of the maximum frequency of operation of the digital system, wherein the second predefined threshold is greater than the first predefined threshold.
  • the present invention provides a method further comprising controlling a sampling period for the periodically determining of the maximum frequency of operation of the digital system, wherein controlling the sampling period comprises: estimating a time interval from a most recent determination of maximum frequency of operation of the digital system to the digital system reaching a maximum frequency of operation equal to the warning threshold frequency; employing the estimated time interval in setting a next sampling period for determining a maximum frequency of operation of the digital system; determining whether the next sampling period is less than a previous sampling period employed in the periodically determining of the maximum frequency of operation of the digital system; and if so, increasing the sampling period to increase the sampling rate of the periodically determining of the maximum frequency of operation of the digital system.
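  • A minimal sketch of the two-threshold, frequency-based variant of sampling period adjustment described above. The function name, the three period parameters and the choice of units are illustrative assumptions and do not come from the source; the source only states that the sampling rate increases once F MAX drops below a first threshold, and increases further once F MAX is at or below the warning threshold frequency.

```python
def next_sampling_period(f_max: float, f_first: float, f_warn: float,
                         normal_period: float, faster_period: float,
                         fastest_period: float) -> float:
    """Return the time to wait before the next F MAX measurement.

    All names and period values are hypothetical; any consistent unit works.
    """
    if not f_first > f_warn:
        raise ValueError("first threshold must be above the warning threshold")
    if not normal_period >= faster_period >= fastest_period > 0:
        raise ValueError("periods must shrink as the sampling rate increases")
    if f_max <= f_warn:
        # At or below the warning threshold: sample most often.
        return fastest_period
    if f_max < f_first:
        # Below the first threshold but above the warning threshold.
        return faster_period
    return normal_period
```

  • For example, with thresholds of 3100 and 3050 (MHz, assumed) and periods of 168, 24 and 1 (hours, assumed), a measured F MAX of 3080 would select the 24-hour period.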
  • the present invention provides a system of monitoring reliability of a digital system, the system comprising: control logic adapted to periodically determine a maximum frequency of operation of the digital system, and generate a warning signal indicative of reliability degradation of the digital system if at least one of: (i) a measured or estimated maximum frequency of operation of the digital system is below a warning threshold frequency of operation for the digital system, wherein the warning threshold frequency is greater than or equal to a manufacturer specified minimum frequency of operation for the digital system; or (ii) a rate of change in a difference between measured maximum frequencies of operation of the digital system exceeds an acceptable rate of change threshold for the digital system.
  • the present invention provides a system wherein the control logic is further adapted to periodically determine a rate of change in the difference between measured maximum frequencies of operation of the digital system, and wherein generating the warning signal comprises generating the warning signal if a current rate of change in the difference between measured maximum frequencies of operation of the digital system exceeds the acceptable rate of change threshold for the digital system.
  • the present invention provides a system wherein the control logic is further adapted to periodically determine a rate of change in the difference between measured maximum frequencies of operation of the digital system, and employ multiple determined rates of change in the difference between measured maximum frequencies of operation of the digital system in estimating a next rate of change employing linear model estimation, and employ the estimated next rate of change to estimate a next maximum frequency of operation of the digital system at a next sample time determined by the period of the periodically determining of the maximum frequency of operation of the digital system, and wherein generating the warning signal comprises generating the warning signal if the estimated next maximum frequency of operation of the digital system is below the warning threshold frequency of operation of the digital system.
  • the present invention provides a system wherein the control logic is further adapted to employ multiple measured maximum frequencies of operation to estimate a next maximum frequency of operation utilizing linear model estimation, and wherein the generating of the warning signal comprises generating the warning signal if the estimated maximum frequency of operation of the digital system is below the warning threshold frequency of operation.
  • the present invention provides a system wherein the control logic is further adapted to employ multiple determined rates of change in the difference between measured maximum frequencies of operation of the digital system and historical aging data for the digital system type to estimate a next rate of change in the difference between maximum frequencies of operation of the digital system, and to estimate the maximum frequency of operation of the digital system from the estimated rate of change in the difference between maximum frequencies of operation of the digital system and a sampling period employed by the periodically determining of the maximum frequency of operation of the digital system, and wherein generating the warning signal comprises generating the warning signal if the estimated maximum frequency of operation of the digital system is below the warning threshold frequency of operation.
  • the present invention provides a system wherein the control logic is further adapted to employ multiple measured maximum frequencies of operation of the digital system to estimate a next maximum frequency of operation of the digital system employing historical aging data for the digital system type, and wherein generating the warning signal comprises generating the warning signal if the estimated maximum frequency of operation of the digital system is below the warning threshold frequency of operation for the digital system.
  • the present invention provides for a system wherein the control logic is further adapted to dynamically adjust a sampling period employed by the periodically determining of the maximum frequency of operation of the digital system, the dynamically adjusting comprising: determining whether a maximum frequency of operation of the digital system is below a first predefined threshold, and if so, adjusting the sampling period of the periodically determining to increase the sampling rate of the periodically determining.
  • the present invention provides a system wherein the first predefined threshold is a first predefined threshold frequency which is greater than the predefined warning threshold frequency, and wherein the dynamically adjusting of the sampling period further comprises determining whether the maximum frequency of operation of the digital system is less than or equal to the predefined warning threshold frequency of operation, and if so, further adjusting the sampling period to further increase the sampling rate of the periodically determining of the maximum frequency of operation of the digital system.
  • the present invention provides a computer program product loadable into the internal memory of a digital computer, comprising software code portions for carrying out the invention as described above when said product is run on a computer.
  • FIG. 1 is a graph of a typical digital system life cycle, illustrating hard failure and soft aging, both of which can be identified in accordance with an aspect of a preferred embodiment of the present invention
  • FIG. 2 depicts one embodiment of a digital system and control logic implementing reliability monitoring and warning signal generation, in accordance with an aspect of a preferred embodiment of the present invention
  • FIG. 2A is an alternate embodiment of a digital system and control logic implementing reliability monitoring and warning signal generation, in accordance with an aspect of a preferred embodiment of the present invention
  • FIG. 3 is a flowchart of one embodiment of logic for periodically determining a maximum frequency of operation of a digital system, in accordance with an aspect of a preferred embodiment of the present invention
  • FIG. 4 is a flowchart of one embodiment of logic for tracking maximum frequency of operation of a digital system, in accordance with an aspect of a preferred embodiment of the present invention
  • FIG. 5 graphically depicts periodically determining the maximum frequency of operation of a digital system, and signaling a warning when the maximum frequency of operation falls below a predefined warning threshold frequency of operation, in accordance with an aspect of a preferred embodiment of the present invention
  • FIG. 6 is a flowchart of one embodiment of logic for trend analysis of a digital system and for generating a warning signal based thereon, in accordance with an aspect of a preferred embodiment of the present invention
  • FIG. 7 is a flowchart of an alternate embodiment of logic for performing trend analysis and for generating a warning signal, in accordance with an aspect of a preferred embodiment of the present invention.
  • FIG. 8 is a flowchart of an alternate embodiment of logic for trend analysis of a digital system and for generating a warning signal based thereon, in accordance with an aspect of a preferred embodiment of the present invention
  • FIG. 9 is a flowchart of a further embodiment of logic for trend analysis of a digital system and for generating a warning signal based thereon, in accordance with an aspect of a preferred embodiment of the present invention.
  • FIG. 10 is a flowchart of an alternate embodiment of logic for trend analysis of a digital system and for generating a warning signal based thereon, in accordance with an aspect of a preferred embodiment of the present invention
  • FIG. 11 is a flowchart of one embodiment of logic implementing a variable sampling period for determining maximum frequency of operation of a digital system, in accordance with an aspect of a preferred embodiment of the present invention
  • FIG. 12 is a flowchart of an alternate embodiment of logic implementing a variable sampling period for determining a maximum frequency of operation of a digital system, in accordance with an aspect of a preferred embodiment of the present invention
  • FIG. 13 is a flowchart of an alternate embodiment of logic implementing sampling period analysis for determining a sample time for next maximum frequency of operation measurement of a digital system, in accordance with an aspect of a preferred embodiment of the present invention.
  • FIG. 14 depicts one embodiment of a computer program product to incorporate one or more aspects of preferred embodiments of the present invention.
  • the "digital system" refers to any digital system or circuit, and includes, for example, a processor, as well as simple or complex non-processor based digital logic, memory, etc.
  • the digital system is a microprocessor
  • the specified threshold is a predefined acceptable level for the maximum frequency of operation of the digital system.
  • a technique for periodically determining a maximum frequency of operation of a digital system and generating a warning signal indicative of reliability degradation of the digital system if at least one of: (i) a measured or estimated maximum frequency of operation of the digital system is below a warning threshold frequency of operation for the digital system, wherein the warning threshold frequency is greater than or equal to a manufacturer specified minimum frequency of operation for the digital system; or (ii) a rate of change in a difference between measured maximum frequencies of operation of the digital system exceeds an acceptable rate of change threshold for the digital system.
  • FIG. 1 depicts a typical digital system lifecycle model. This diagram illustrates that the digital system has a higher maximum frequency of operation (F MAX ) than a specified (i.e., required) maximum frequency of operation for the digital system (F SPEC ) when manufactured and beginning its lifecycle. As the digital system ages, several factors may degrade system performance, and hence decrease maximum operating frequency as a result. Factors which degrade digital system performance depend upon the particular system at issue and the environment within which the system is used. For example, if the digital system comprises a processor, aging can be caused by a variety of factors including hot-electron effects, electromigration and thermal expansion of the digital system.
  • Two failure modes are illustrated in FIG. 1.
  • a hard failure, illustrated by the dashed lines, is representative of an abrupt failure of the digital system (e.g., resulting from aging of the digital system).
  • Soft aging is also shown wherein operation of the digital system gradually decreases to a level at or below the manufacturer specified maximum frequency of operation (F SPEC ). Due to the gradual nature of this aging, the soft aging failure may go unnoticed, which implies that such aging may cause undetected errors in computation and data.
  • F SPEC manufacturer specified maximum frequency of operation
  • FIGS. 2 & 2A depict embodiments of a digital system and control logic implementing reliability monitoring and warning signal generation, in accordance with an aspect of the present invention.
  • a digital system 200 is shown to be driven by an (optionally) adjustable supply voltage 220 and an adjustable clock rate 210.
  • Control logic 230 senses and controls the adjustable supply voltage and clock rate in order to periodically measure the digital system's maximum operating frequency.
  • F MAX maximum operating frequency
  • Degradation of F MAX is an indication of processor soft aging.
  • the system's age and lifetime can be measured by, for example, sweeping the supply voltage (VDD) and clock frequency (F CLK ), sending worst-case test instructions, and finding a maximum successful frequency at a given voltage (or over several voltages). These measurements can be saved to facilitate the control logic performing trend analysis on the maximum frequency of operation of the digital system for proactively issuing a warning signal prior to failure of the digital system.
  • the collected data can also be used to estimate a digital system's status or age, and the rate of aging.
  • the digital system 200' is assumed to comprise a processor and to be capable of incorporating therein control logic 230' for reliability monitoring and warning signal generation of the digital system.
  • An adjustable clock frequency 210 and, in one embodiment, an adjustable supply voltage VDD 220 are again provided, which are sensed and controlled by control logic 230' during reliability monitoring, as described herein.
  • FIG. 3 depicts one embodiment of a testing protocol of control logic 230/230' for periodically determining a maximum frequency of operation of a digital system, in accordance with an aspect of the present invention.
  • the adjustable clock frequency (F CLK ) is set to the manufacturer specified maximum frequency of operation (F SPEC ) for the digital system 305.
  • a test instruction vector is sent to the digital system and results are analyzed 310.
  • Logic determines whether the test passed 315. If so, then the clock frequency (F CLK ) is raised 320, and a test instruction vector is again sent to the digital system and analyzed 325 to determine whether the test passed at the raised clock frequency 330.
  • the clock frequency is lowered 345, and the test is re-executed to determine if the digital system passes 355. This process continues until the clock frequency is low enough that the digital system passes the test, and the maximum frequency of operation of the digital system (F MAX ) is again recorded as the highest passing clock frequency (F CLK ) 335 and returned as the F MAX 340 for the digital system.
  • the gradient by which F CLK is raised when the test passes 320 may be the same as or different from the gradient by which the clock frequency (F CLK ) is lowered when failing the test 345.
  • the gradient(s) selected may depend upon the age of the digital system and the accuracy desired. As explained further herein, this accuracy can be noted and changed as the digital system ages.
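  • The FIG. 3 testing protocol can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: `passes_at` is a hypothetical callback standing in for "send a test instruction vector at this clock frequency and analyze the results", and the `f_ceiling` and `f_floor` bounds are added here only so that the loops terminate.

```python
from typing import Callable


def find_f_max(passes_at: Callable[[float], bool], f_spec: float,
               step_up: float, step_down: float,
               f_ceiling: float, f_floor: float = 0.0) -> float:
    """Return the highest clock frequency at which the test passed.

    passes_at, f_ceiling and f_floor are assumptions made for this sketch.
    step_up and step_down play the role of the raise/lower gradients.
    """
    f_clk = f_spec                      # start at F SPEC
    if passes_at(f_clk):
        # Test passed: keep raising F CLK until the next step would fail.
        while f_clk + step_up <= f_ceiling and passes_at(f_clk + step_up):
            f_clk += step_up
        return f_clk                    # highest passing F CLK is F MAX
    # Test failed at F SPEC: lower F CLK until the test passes.
    while f_clk - step_down >= f_floor:
        f_clk -= step_down
        if passes_at(f_clk):
            return f_clk
    raise RuntimeError("no passing clock frequency found above f_floor")
```

  • A smaller step gives a more accurate F MAX at the cost of more test runs, which matches the remark that the gradient may be chosen according to the accuracy desired.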
  • FIG. 4 depicts one embodiment of logic for tracking maximum frequency of operation of a digital system, in accordance with an aspect of the present invention.
  • the current maximum frequency of operation (F MAX ) is read at a time T K
  • reading of the current maximum frequency of operation is synonymous with measuring the current maximum frequency of operation using the protocol described above in connection with FIG. 3.
  • the logic also fetches the previous maximum frequency of operation of the digital system (F MAX (F K-1 )) at time T K-1 420. This previous maximum frequency of operation can be retrieved from trend database 455, which is accessible by the control logic.
  • the difference (D K ) between the previous maximum frequency of operation and the current maximum frequency of operation is determined, and the rate of change (R K ) in the difference is calculated 440.
  • the measured maximum frequency of operation of the digital system (F K ), at time T K is then recorded in the trend database, along with the rate of change (R K ) in the difference (D K ) between measured maximum frequencies of operation of the digital system 450.
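  • A rough sketch of the FIG. 4 tracking step, using an in-memory list in place of the trend database. The class and field names are hypothetical. The sign convention is also an assumption: D K is taken as F K-1 minus F K, so a positive value means F MAX has dropped, and R K is taken as D K divided by the elapsed time T K minus T K-1. The source does not spell out either formula.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sample:
    t: float                  # sample time T K
    f_max: float              # measured F K
    rate: Optional[float]     # R K, or None for the first sample


@dataclass
class TrendDatabase:
    """Hypothetical in-memory stand-in for the trend database."""
    samples: List[Sample] = field(default_factory=list)

    def record(self, t_k: float, f_k: float) -> Optional[float]:
        """Store F K at time T K and return R K (None if no prior sample)."""
        rate = None
        if self.samples:
            prev = self.samples[-1]
            if t_k <= prev.t:
                raise ValueError("sample times must be strictly increasing")
            d_k = prev.f_max - f_k           # assumed: positive = degradation
            rate = d_k / (t_k - prev.t)      # assumed: difference per unit time
        self.samples.append(Sample(t_k, f_k, rate))
        return rate
```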
  • trend analysis 460 may be performed, either commensurate with each periodic determination of the maximum frequency of operation of the digital system, or at some other specified interval.
  • the currently measured maximum frequency of operation of the digital system (F MAX ) is compared against a predefined warning threshold frequency of operation (F WARN ) for the digital system.
  • the predefined warning threshold frequency (F WARN ) may be greater than or equal to the manufacturer specified maximum frequency of operation for the digital system (F SPEC ).
  • the warning threshold frequency of operation is above the manufacturer specified minimum frequency of operation, and when the maximum frequency of operation of the digital system drops to or below the warning threshold frequency of operation, a warning signal is generated by the control logic and sent, for example, to an operating system of the digital system.
  • the warning signal indicates that the maximum frequency of operation of the digital system may be slower than the manufacturer specified maximum frequency of operation (F SPEC ) in the near future (e.g., due to continued soft aging of the system, or resulting from a hard failure due to aging of the system).
  • the warning signal may also be provided to a user of the digital system so that an appropriate action, such as shutdown, can be taken.
  • the sampling rate for determining the maximum frequency of operation for the digital system may also be increased to more accurately monitor the digital system's status.
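  • The threshold comparison described above can be sketched as below. The function name and the `notify` callback (standing in for delivery of the warning signal to an operating system or user) are assumptions for illustration only.

```python
from typing import Callable


def check_warning_threshold(f_max: float, f_warn: float, f_spec: float,
                            notify: Callable[[str], None]) -> bool:
    """Raise a warning when F MAX has dropped to or below F WARN.

    notify is a hypothetical hook; returns True if a warning was issued.
    """
    if f_warn < f_spec:
        raise ValueError("F WARN must be greater than or equal to F SPEC")
    if f_max <= f_warn:
        notify(f"F MAX {f_max} is at or below warning threshold {f_warn}")
        return True
    return False
```

  • Keeping F WARN above F SPEC is what gives the operating system or user time to react before the system can no longer run at its specified frequency.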
  • FIGS. 6-10 depict various further embodiments for performing trend analysis on the maximum frequency of operation of the digital system, in accordance with aspects of the present invention.
  • FIG. 6 presents a trend analysis approach wherein instead of employing the maximum frequency of operation of the digital system in comparison with a prespecified threshold to generate the warning signal, the rate of change (R K ) in the difference between measured maximum frequencies of operation of the digital system is employed, in accordance with an aspect of the present invention.
  • a most recent rate of change (R K ) in the difference between measured maximum frequencies of operation of the digital system is retrieved 600 from the trend database 455.
  • This recent rate of change (R K ) is then compared against a specified threshold rate of change (R TH ) in the difference between measured maximum frequencies of operation 610.
  • the specified threshold rate of change (R TH ) may be chosen based on historical aging information for the type of digital system being monitored.
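The FIG. 6 comparison can be illustrated with a short sketch. The text does not give a formula for R K, so this example assumes it is the change in measured maximum frequency divided by the time between the two measurements, and that "exceeds the threshold" refers to the magnitude of the decline. All names are hypothetical.

```python
def rate_of_change(f_prev: float, f_curr: float, t_prev: float, t_curr: float) -> float:
    """Signed rate of change of F_MAX between two samples (assumed definition).

    Negative values mean the maximum frequency is falling.
    """
    if t_curr <= t_prev:
        raise ValueError("samples must be in increasing time order")
    return (f_curr - f_prev) / (t_curr - t_prev)


def rate_warning(r_k: float, r_th: float) -> bool:
    """Warn when the decline rate (-r_k) exceeds the threshold rate r_th.

    r_th is assumed to be a positive decline rate, for example one chosen
    from historical aging information for the system type.
    """
    return -r_k > r_th
```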
  • FIG. 7 depicts an alternate trend analysis embodiment wherein N most recent rates of change in the difference between measured maximum frequencies of operation are fetched 700 from the trend database 455. From these values, a next rate of change (R' K+1 ) in the difference between measured maximum frequencies of operation is estimated 710. This estimated next rate of change (R' K+1 ) in the difference between measured maximum frequencies of operation can be determined employing conventional linear model estimation, such as a linear N-order model, wherein a linear prediction is made from the previous N rate of change determinations.
  • the next maximum frequency of operation of the digital system (F' K+1 ) is estimated 730, after which the logic determines whether the estimated maximum frequency of operation of the digital system (F' K+1 ) is less than the predefined warning threshold frequency (F WARN ) 740. If so, then a warning signal is generated 750, which completes trend analysis 760. Assuming that the estimated next frequency of operation of the digital system (F' K+1 ) is greater than the predefined warning threshold frequency (F WARN ), then no warning signal is generated, and trend analysis is finished 760.
  • the logic of FIG. 8 is similar to the logic of FIG. 7, with the exception that the next maximum frequency of operation of the digital system (F' K+1 ) is estimated directly from N prior saved measured maximum frequencies of operation of the digital system. Specifically, the most recent N measured maximum frequencies of operation of the digital system are fetched 800 from the trend database 455, and from these values, the next maximum frequency of operation of the digital system (F' K+1 ) is estimated 810 using, for example, linear N-order model analysis 720. If the estimated next maximum frequency of operation of the digital system (F' K+1 ) is less than the predefined warning threshold frequency (F WARN ) 820, then a warning signal is generated 830, thereby completing trend analysis 840. No warning signal is generated if the estimated next maximum frequency of operation of the digital system is above the warning threshold frequency.
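The FIG. 7 and FIG. 8 flows can be sketched together, since both extrapolate from the N most recent stored values. The text only calls for "conventional linear model estimation"; the sketch below substitutes an ordinary least-squares line fit over equally spaced samples as one possible stand-in, and it assumes the next frequency is the latest frequency plus the estimated rate multiplied by the sampling period. Function names are hypothetical.

```python
from typing import Sequence


def predict_next_linear(values: Sequence[float]) -> float:
    """Fit a least-squares line to equally spaced samples and extrapolate one step."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to extrapolate")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    slope = sxy / sxx
    # Evaluate the fitted line at index n, the next sample position.
    return mean_y + slope * (n - mean_x)


def warn_from_rates(f_k: float, rates: Sequence[float], period: float, f_warn: float) -> bool:
    """FIG. 7 style: estimate the next rate, then the next frequency, then compare."""
    f_next = f_k + predict_next_linear(rates) * period
    return f_next < f_warn


def warn_from_frequencies(freqs: Sequence[float], f_warn: float) -> bool:
    """FIG. 8 style: estimate the next frequency directly from stored frequencies."""
    return predict_next_linear(freqs) < f_warn
```

With rates of -1, -2 and -3 MHz per hour, the fit extrapolates to -4 MHz per hour, so a system at 3200 MHz sampled every 10 hours would be estimated at 3160 MHz at the next sample.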
  • FIG. 9 depicts an alternate approach wherein historical aging information saved in a historical aging database (or memory) 915 is employed in estimating a next rate of change in the difference between measured maximum frequencies of operation of the digital system in place of a linear N-order model, such as employed in the processing of FIG. 7.
  • the historical aging information may comprise a database of aging information gathered through conventional burn-in testing on the type of digital system being monitored.
  • the historical aging information could be derived from measuring aging of other digital systems of the particular type as the current digital system being monitored.
  • this historical aging information may provide a more accurate estimate of a next maximum frequency of operation and/or a next rate of change in the difference between measured maximum frequencies of operation than a linear progression model.
  • the most recent N determined rates of change in the difference between measured maximum frequencies of operation of the digital system are retrieved 900 from the trend database 455 and employed with the historical aging information or nominal aging model (from database 915) to estimate a next rate of change (R' K+1 ) in the difference between measured maximum frequencies of operation of the digital system 910.
  • a next maximum frequency of operation of the digital system is estimated 920, for example, by adding to the prior maximum frequency of operation of the digital system the estimated rate of change in the difference between the measured maximum frequencies of operation multiplied by the difference in time between measurements.
  • the estimated next maximum frequency of operation (F' K+1 ) is then compared against the predefined warning threshold frequency (F WARN ) 925, and a warning signal is generated if the estimated next maximum frequency of operation is below the warning threshold 930. Otherwise, no warning signal is generated and trend analysis is complete 935.
  • FIG. 10 depicts an alternate analysis approach wherein N recently measured maximum frequencies of operation are retrieved 1000 from the trend database 455 and employed to directly estimate a next maximum frequency of operation of the digital system (F' K+1 ) 1010 employing historical aging information from the historical aging database 915.
  • the estimated next maximum frequency of operation of the digital system is then compared against the predefined warning threshold frequency 1020, and if less, a warning signal is generated 1030. Otherwise, no warning signal is generated and trend analysis is complete 1040, that is, until a next analysis interval.
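The text does not specify how the historical aging information is combined with the stored measurements. One possible reading of the FIG. 10 flow is sketched below purely as an illustration: the nominal aging curve is assumed to be a table of (age, nominal maximum frequency) points, and the curve is shifted by the average offset observed between this particular system and the nominal curve. The curve format, the offset method, and all names are assumptions.

```python
from bisect import bisect_right
from typing import List, Tuple

Point = Tuple[float, float]  # (age in hours, frequency in MHz), assumed units


def nominal_f_max(curve: List[Point], age: float) -> float:
    """Linearly interpolate the nominal aging curve; clamp outside its range."""
    ages = [a for a, _ in curve]
    if age <= ages[0]:
        return curve[0][1]
    if age >= ages[-1]:
        return curve[-1][1]
    i = bisect_right(ages, age)
    (a0, f0), (a1, f1) = curve[i - 1], curve[i]
    return f0 + (f1 - f0) * (age - a0) / (a1 - a0)


def estimate_next_from_history(curve: List[Point], samples: List[Point], next_age: float) -> float:
    """Shift the nominal curve by the mean measured-minus-nominal offset of this system."""
    if not samples:
        raise ValueError("need at least one measured sample")
    offset = sum(f - nominal_f_max(curve, a) for a, f in samples) / len(samples)
    return nominal_f_max(curve, next_age) + offset


def warn_from_history(curve: List[Point], samples: List[Point], next_age: float, f_warn: float) -> bool:
    """Generate a warning if the estimated next F_MAX is below F_WARN."""
    return estimate_next_from_history(curve, samples, next_age) < f_warn
```

A nonlinear nominal curve lets the estimate follow an accelerating decline that a straight-line extrapolation would miss, which is the advantage the text attributes to historical aging information.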
  • FIGS. 11-13 depict alternate embodiments for analyzing and dynamically adjusting the sampling period employed by the control logic in periodically determining the maximum frequency of operation of the digital system.
  • the most recent measured maximum frequency of operation of the digital system 1100 is, for example, fetched from the trend database 455 and compared against a first predefined threshold (F TH1 ) 1110.
  • the first predefined threshold is a frequency threshold that is greater than the predefined warning threshold frequency. If the maximum frequency of operation is above the first predefined threshold, no action is necessary and sampling period analysis is complete 1130. However, if the most recently determined maximum frequency of operation is below the first predefined threshold, then the sampling period is adjusted to a new sampling period P 1 1120. Sampling period analysis might then be complete (not shown), or alternatively, the most recent measured maximum frequency of operation of the digital system may be compared against a second predefined threshold (F TH2 ) 1140.
  • This second predefined threshold is a frequency that may be, in one embodiment, equal to the predefined warning threshold frequency (F WARN ). If the most recent measured maximum frequency of operation is again above the second threshold frequency, then sampling period analysis is finished 1130 and the new sampling period P 1 is employed. However, if the most recent measured maximum frequency of operation (F MAX (F K )) is also less than the second predefined threshold (F TH2 ), then the sampling period employed by the control logic in periodically determining the maximum frequency of operation is set to sample period P 2 1150. This example assumes that sample period P 2 is less than sample period P 1 , thereby providing a greater sampling rate for the periodically determining of the maximum frequency of operation of the digital system.
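The two-threshold period selection of FIG. 11 can be sketched as a small function. The text leaves the behaviour at exact equality with a threshold open; this sketch treats a frequency equal to a threshold as not having dropped below it, which is an assumption. Names are hypothetical.

```python
def select_sampling_period(
    f_max: float,
    f_th1: float,
    f_th2: float,
    current_period: float,
    p1: float,
    p2: float,
) -> float:
    """Pick the sampling period from the most recent measured F_MAX.

    Expects f_th1 > f_th2 and p2 < p1, as in the described example.
    """
    if f_th1 <= f_th2 or p2 >= p1:
        raise ValueError("expected f_th1 > f_th2 and p2 < p1")
    if f_max >= f_th1:
        return current_period  # no action necessary
    if f_max >= f_th2:
        return p1              # below the first threshold only
    return p2                  # below both thresholds: sample most often
```

The same shape would serve for the rate-of-change variant by swapping the comparisons, since a larger decline rate rather than a lower frequency then triggers the shorter period.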
  • FIG. 12 depicts an alternate embodiment to the protocol of FIG. 11.
  • the rate of change in the difference between measured maximum frequencies of operation of the digital system is employed in adjusting the sample period for the periodically determining of the maximum frequency of operation.
  • a most recent rate of change R K is, for example, retrieved 1200 from the trend database 455 and is compared to a first predefined rate of change threshold (R TH1 ) 1210. If the most recent rate of change (R K ) is less than the first predefined rate of change threshold (R TH1 ), then no action is taken and sampling period analysis is complete 1230. However, if it is greater than or equal to the first threshold, then the sampling period is adjusted to sample period P 1 1220, which (in one embodiment) completes adjustment of the sample period 1230.
  • the most recent determined rate of change R K is further compared against a second predefined rate of change threshold (R TH2 ) 1240. If the most recent rate of change R K is less than the second predefined rate of change threshold, then the sampling period remains at period P 1 and analysis is finished 1230. Otherwise, the sampling period is set to a second sampling period
  • sample period P 2 is smaller than sample period P 1 , meaning that the sampling rate of the periodically determined maximum frequency of operation of the digital system is greater.
  • FIG. 13 depicts an alternate processing approach for determining a next time in which to sample the maximum frequency of operation of the digital system.
  • the most recently determined rate of change in the difference between measured maximum frequencies of operation, as well as the most recently measured maximum frequency of operation of the digital system are retrieved 1300 from the trend database 455 and used to estimate a time interval (T' K+1 ) for when an estimated maximum frequency of operation of the digital system (F' K+1 ) will be equal to the warning threshold frequency of operation 1310.
  • This estimate can be obtained using either historical aging information on the digital system type, for example, retrieved from a historical aging database 915, or by linear progression analysis using a linear N-order model 720.
  • the estimated sampling time at which the estimated maximum frequency of operation of the digital system will be at the predefined warning threshold frequency is then used to determine an estimated sampling period to arrive at that predefined warning threshold frequency 1320. This estimated sampling period (P' K+1 ) is compared against the previously employed sampling period P K used in measuring the most recent maximum frequency of operation of the digital system. If the previously employed sampling period is greater than the estimated sampling period to arrive at the predefined warning threshold frequency, then the sampling time employed for the next measurement of the maximum frequency of operation of the digital system is the prior sampling time plus the estimated sampling period until the maximum frequency of operation reaches the predefined warning threshold frequency 1340. Alternatively, if the previously employed sampling period is less than the estimated sampling period until the maximum frequency of operation reaches the predefined warning threshold frequency (P' K+1 ), then the next sampling time is the prior sampling time plus the previously employed sampling period
  • sampling period analysis is finished 1350.
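The FIG. 13 scheduling rule amounts to sampling at the earlier of the usual next sample time and the estimated time at which the warning threshold will be reached. The sketch below uses the linear progression option with a constant rate of change; the historical aging alternative is not shown. It assumes a signed rate (negative while the frequency is falling), and the names are hypothetical.

```python
import math


def estimate_time_to_warning(f_k: float, r_k: float, f_warn: float) -> float:
    """Estimated interval until F_MAX reaches F_WARN under a constant rate r_k."""
    if f_k <= f_warn:
        return 0.0        # already at or below the warning threshold
    if r_k >= 0:
        return math.inf   # not declining, so the threshold is never reached
    return (f_warn - f_k) / r_k


def next_sample_time(t_k: float, p_k: float, f_k: float, r_k: float, f_warn: float) -> float:
    """Prior sample time plus the smaller of the previous and estimated periods."""
    p_est = estimate_time_to_warning(f_k, r_k, f_warn)
    return t_k + min(p_k, p_est)
```

For example, at 3200 MHz, falling 2 MHz per hour toward a 3100 MHz threshold, the estimated interval is 50 hours, so a previous period of 100 hours is shortened to 50 while a previous period of 20 hours is kept.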
  • the sampling period for determining the maximum frequency of operation of the digital system may be changed with aging of the digital system.
  • either the measured maximum frequency of operation of the digital system or the estimated maximum frequency of operation of the digital system, or the actual rate of change in the difference between measured maximum frequencies of operation of the digital system may be employed in evaluating whether to issue a warning signal.
  • more than one warning threshold frequency and/or more than one rate of change threshold frequency may be employed, for example, in either generating different levels of warning signals, or dynamically adjusting the sampling period employed in the periodically monitoring of the maximum frequency of operation of the digital system.
  • the approach presented herein does not require burn-in testing of the digital system, and is based on measurements derived from the actual digital system itself, rather than historical data for the particular type of digital system.
  • the protocols presented are an in situ aging prediction and warning signal generation technique.
  • the approach may be utilized for a wide variety of digital systems, including processor based systems, as well as non-processor based systems.
  • One or more aspects of the present invention can be included in an article of manufacture (e.g., one or more computer program products) having, for instance, computer usable media.
  • the media has therein, for instance, computer readable program code means or logic (e.g., instructions, code, commands, etc.) to provide and facilitate the capabilities of the present invention.
  • the article of manufacture can be included as a part of a computer system or sold separately.
  • a computer program product 1400 includes, for instance, one or more computer usable media 1402 to store computer readable program code means or logic 1404 thereon to provide and facilitate one or more aspects of the present invention.
  • the medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium.
  • Examples of a computer readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk.
  • Examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
  • a sequence of program instructions or a logical assembly of one or more interrelated modules defined by one or more computer readable program code means or logic direct the performance of one or more aspects of the present invention.
  • a data structure of readily accessible units of memory is provided.
  • the data structure includes designations (e.g., addresses) of one or more units of memory (e.g., pages) that while in the data structure do not need address translation or any other test to be performed in order to access the unit of memory.
  • This data structure can be used in any type of processing environment including emulated environments.
  • one or more aspects of the present invention can be included in environments that are not emulated environments. Further, one or more aspects of the present invention can be used in emulated environments that have a native architecture that is different than the one described above and/or emulates an architecture other than the z/Architecture®.
  • Various emulators can be used. Emulators are commercially available and offered by various companies. Additional details relating to emulation are described in Virtual Machines: Versatile Platforms For Systems and Processes (The Morgan Kaufmann Series in Computer Architecture and Design), Jim Smith and Ravi Nair, June 3, 2005, which is hereby incorporated herein by reference in its entirety.
  • I/O devices can be coupled to the system either directly or through intervening I/O controllers.
  • Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the available types of network adapters.
  • the capabilities of one or more aspects of the present invention can be implemented in software, firmware, hardware, or some combination thereof.
  • At least one program storage device readable by a machine embodying at least one program of instructions executable by the machine to perform the capabilities of the present invention can be provided.

Abstract

Method, system and computer program are provided for continually monitoring reliability of a digital system and for issuing a warning signal if digital system operation degrades to or past a specified threshold. The technique includes periodically determining a maximum frequency of operation of the digital system, and generating a warning signal indicative of a reliability degradation of the digital system if at least one of: (i) a measured or estimated maximum frequency of operation of the digital system is below a warning threshold frequency of operation of the digital system, wherein the warning threshold frequency is greater than or equal to a manufacturer specified minimum frequency of operation for the digital system; or (ii) a rate of change in the difference between measured maximum frequencies of operation of the digital system exceeds an acceptable rate of change threshold for the digital system.

Description

MONITORING RELIABILITY OF A DIGITAL SYSTEM
Technical Field
The present invention relates in general to the field of failure prediction and, more specifically, to a reliability measurement and warning method, system and computer program product for a digital system.
Background of the Invention
Failure rates of individual components making up a digital system such as an integrated circuit (or larger system) are fundamentally related to various parameters, including operating temperatures, as well as scaling of the digital system and interconnect geometries. Although burn-in testing of digital systems attempts to predict a lifecycle for a given type of digital system, it does not provide aging information for each specific digital system of the type being manufactured. Currently, a customer or user may uncover a problem with a digital system only after a catastrophic system failure. While catastrophic failure of a digital system is readily recognizable, a "soft" failure (where there may be significant degradation in digital system performance or reliability) may go unnoticed, which implies that such aging of the digital system may cause undetected errors in computation and data, from which it is difficult to recover.
Summary of the Invention
Presented herein is an approach for actively monitoring or measuring aging, and hence reliability, of a specific digital system and for issuing a warning signal if, for example, degradation of operation of the system exceeds a specified threshold.
In one aspect, a method of monitoring the reliability of a digital system is provided. This method includes: periodically determining a maximum frequency of operation of the digital system; and generating a warning signal indicative of reliability degradation of the digital system if at least one of: (i) a measured or estimated maximum frequency of operation of the digital system is below a warning threshold frequency of operation for the digital system, wherein the warning threshold frequency is greater than or equal to a manufacturer specified minimum frequency of operation for the digital system; or (ii) a rate of change in a difference between measured maximum frequencies of operation of the digital system exceeds an acceptable rate of change threshold for the digital system.
In another aspect, a system of monitoring reliability of a digital system is provided. The system includes control logic adapted to periodically determine a maximum frequency of operation of the digital system, and to generate a warning signal indicative of reliability degradation of the digital system if at least one of: (i) a measured or estimated maximum frequency of operation of the digital system is below a warning threshold frequency of operation for the digital system, wherein the warning threshold frequency is greater than or equal to a manufacturer specified minimum frequency of operation for the digital system; or (ii) a rate of change in a difference between measured maximum frequencies of operation of the digital system exceeds an acceptable rate of change threshold for the digital system.
In a further aspect, an article of manufacture is provided which includes at least one computer-usable medium having computer-readable program code logic to facilitate monitoring of reliability of a digital system. The computer-readable program code logic, when executing, performs the following: periodically determining a maximum frequency of operation of the digital system; and generating a warning signal indicative of reliability degradation of the digital system if at least one of: (i) a measured or estimated maximum frequency of operation of the digital system is below a warning threshold frequency of operation for the digital system, wherein the warning threshold frequency is greater than or equal to a manufacturer specified minimum frequency of operation for the digital system; or (ii) a rate of change in a difference between measured maximum frequencies of operation of the digital system exceeds an acceptable rate of change threshold for the digital system.
Further, additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention.
Viewed from a first aspect, the present invention provides a method of monitoring reliability of a digital system, the method comprising: periodically determining a maximum frequency of operation of the digital system; and generating a warning signal indicative of reliability degradation of the digital system if at least one of: (i) a measured or estimated maximum frequency of operation of the digital system is below a warning threshold frequency of operation for the digital system, wherein the warning threshold frequency is greater than or equal to a manufacturer specified minimum frequency of operation for the digital system; or (ii) a rate of change in a difference between measured maximum frequencies of operation of the digital system exceeds an acceptable rate of change threshold for the digital system.
Preferably, the present invention provides a method further comprising periodically determining a rate of change in the difference between measured maximum frequencies of operation of the digital system, and wherein generating the warning signal comprises generating the warning signal if a current rate of change in the difference between measured maximum frequencies of operation of the digital system exceeds the acceptable rate of change threshold for the digital system.
Preferably, the present invention provides a method further comprising periodically determining a rate of change in the difference between measured maximum frequencies of operation of the digital system, and employing multiple determined rates of change in the difference between measured maximum frequencies of operation of the digital system in estimating a next rate of change employing linear model estimation, and employing the estimated next rate of change to estimate a next maximum frequency of operation of the digital system at a next sample time determined by the period of the periodically determining of the maximum frequency of operation of the digital system, and wherein generating the warning signal comprises generating the warning signal if the estimated next maximum frequency of operation of the digital system is below the warning threshold frequency of operation of the digital system.
Preferably, the present invention provides a method further comprising employing multiple measured maximum frequencies of operation to estimate a next maximum frequency of operation utilizing linear model estimation, and wherein the generating of the warning signal comprises generating the warning signal if the estimated maximum frequency of operation of the digital system is below the warning threshold frequency of operation.
Preferably, the present invention provides a method further comprising employing multiple determined rates of change in the difference between measured maximum frequencies of operation of the digital system and historical aging data for the digital system type to estimate a next rate of change in the difference between maximum frequencies of operation of the digital system, and estimating the maximum frequency of operation of the digital system from the estimated rate of change in the difference between maximum frequencies of operation of the digital system and a sampling period employed by the periodically determining of the maximum frequency of operation of the digital system, and wherein generating the warning signal comprises generating the warning signal if the estimated maximum frequency of operation of the digital system is below the warning threshold frequency of operation.
Preferably, the present invention provides a method further comprising employing multiple measured maximum frequencies of operation of the digital system to estimate a next maximum frequency of operation of the digital system employing historical aging data for the digital system type, and wherein generating the warning signal comprises generating the warning signal if the estimated maximum frequency of operation of the digital system is below the warning threshold frequency of operation for the digital system.
Preferably, the present invention provides a method further comprising dynamically adjusting a sampling period employed by the periodically determining of the maximum frequency of operation of the digital system, the dynamically adjusting comprising: determining whether a maximum frequency of operation of the digital system is below a first predefined threshold, and if so, adjusting the sampling period of the periodically determining to increase the sampling rate of the periodically determining.
Preferably, the present invention provides a method wherein the first predefined threshold is a first predefined threshold frequency which is greater than the warning threshold frequency, and wherein the dynamically adjusting of the sampling period further comprises determining whether the maximum frequency of operation of the digital system is less than or equal to the warning threshold frequency of operation, and if so, further adjusting the sampling period to further increase the sampling rate of the periodically determining of the maximum frequency of operation of the digital system.
Preferably, the present invention provides a method further comprising dynamically adjusting a sampling period employed by the periodically determining of the maximum frequency of operation of the digital system, the dynamically adjusting comprising: determining whether a rate of change in the difference between measured maximum frequencies of operation of the digital system is above a first predefined threshold, and if so, adjusting the sampling period of the periodically determining to increase the sampling rate of the periodically determining.
Preferably, the present invention provides a method wherein the dynamically adjusting of the sampling period further comprises determining whether the rate of change in the difference between measured maximum frequencies of operation of the digital system is greater than a second predefined threshold rate, and if so, further adjusting the sampling period to further increase the sampling rate of the periodically determining of the maximum frequency of operation of the digital system, wherein the second predefined threshold is greater than the first predefined threshold.
Preferably, the present invention provides a method further comprising controlling a sampling period for the periodically determining of the maximum frequency of operation of the digital system, wherein controlling the sampling period comprises: estimating a time interval from a most recent determination of maximum frequency of operation of the digital system to the digital system reaching a maximum frequency of operation equal to the warning threshold frequency; employing the estimated time interval in setting a next sampling period for determining a maximum frequency of operation of the digital system; determining whether the next sampling period is less than a previous sampling period employed in the periodically determining of the maximum frequency of operation of the digital system; and if so, adjusting the sampling period to increase the sampling rate of the periodically determining of the maximum frequency of operation of the digital system.
Viewed from a second aspect, the present invention provides a system of monitoring reliability of a digital system, the system comprising: control logic adapted to periodically determine a maximum frequency of operation of the digital system, and generate a warning signal indicative of reliability degradation of the digital system if at least one of: (i) a measured or estimated maximum frequency of operation of the digital system is below a warning threshold frequency of operation for the digital system, wherein the warning threshold frequency is greater than or equal to a manufacturer specified minimum frequency of operation for the digital system; or (ii) a rate of change in a difference between measured maximum frequencies of operation of the digital system exceeds an acceptable rate of change threshold for the digital system.
Preferably, the present invention provides a system wherein the control logic is further adapted to periodically determine a rate of change in the difference between measured maximum frequencies of operation of the digital system, and wherein generating the warning signal comprises generating the warning signal if a current rate of change in the difference between measured maximum frequencies of operation of the digital system exceeds the acceptable rate of change threshold for the digital system.
Preferably, the present invention provides a system wherein the control logic is further adapted to periodically determine a rate of change in the difference between measured maximum frequencies of operation of the digital system, and employ multiple determined rates of change in the difference between measured maximum frequencies of operation of the digital system in estimating a next rate of change employing linear model estimation, and employ the estimated next rate of change to estimate a next maximum frequency of operation of the digital system at a next sample time determined by the period of the periodically determining of the maximum frequency of operation of the digital system, and wherein generating the warning signal comprises generating the warning signal if the estimated next maximum frequency of operation of the digital system is below the warning threshold frequency of operation of the digital system.
Preferably, the present invention provides a system wherein the control logic is further adapted to employ multiple measured maximum frequencies of operation to estimate a next maximum frequency of operation utilizing linear model estimation, and wherein the generating of the warning signal comprises generating the warning signal if the estimated maximum frequency of operation of the digital system is below the warning threshold frequency of operation.
Preferably, the present invention provides a system wherein the control logic is further adapted to employ multiple determined rates of change in the difference between measured maximum frequencies of operation of the digital system and historical aging data for the digital system type to estimate a next rate of change in the difference between maximum frequencies of operation of the digital system, and to estimate the maximum frequency of operation of the digital system from the estimated rate of change in the difference between maximum frequencies of operation of the digital system and a sampling period employed by the periodically determining of the maximum frequency of operation of the digital system, and wherein generating the warning signal comprises generating the warning signal if the estimated maximum frequency of operation of the digital system is below the warning threshold frequency of operation.
Preferably, the present invention provides a system wherein the control logic is further adapted to employ multiple measured maximum frequencies of operation of the digital system to estimate a next maximum frequency of operation of the digital system employing historical aging data for the digital system type, and wherein generating the warning signal comprises generating the warning signal if the estimated maximum frequency of operation of the digital system is below the warning threshold frequency of operation for the digital system.
Preferably, the present invention provides for a system wherein the control logic is further adapted to dynamically adjust a sampling period employed by the periodically determining of the maximum frequency of operation of the digital system, the dynamically adjusting comprising: determining whether a maximum frequency of operation of the digital system is below a first predefined threshold, and if so, adjusting the sampling period of the periodically determining to increase the sampling rate of the periodically determining.
Preferably, the present invention provides a system wherein the first predefined threshold is a first predefined threshold frequency which is greater than the predefined warning threshold frequency, and wherein the dynamically adjusting of the sampling period further comprises determining whether the maximum frequency of operation of the digital system is less than or equal to the predefined warning threshold frequency of operation, and if so, further adjusting the sampling period to further increase the sampling rate of the periodically determining of the maximum frequency of operation of the digital system.
Viewed from a third aspect, the present invention provides a computer program product loadable into the internal memory of a digital computer, comprising software code portions for carrying out, when said product is run on a computer, the invention as described above.
Brief Description of the Drawings
Embodiments of the invention are described below in detail, by way of example only, with reference to the accompanying drawings in which:
FIG. 1 is a graph of a typical digital system life cycle, illustrating hard failure and soft aging, both of which can be identified in accordance with an aspect of a preferred embodiment of the present invention;
FIG. 2 depicts one embodiment of a digital system and control logic implementing reliability monitoring and warning signal generation, in accordance with an aspect of a preferred embodiment of the present invention;
FIG. 2A is an alternate embodiment of a digital system and control logic implementing reliability monitoring and warning signal generation, in accordance with an aspect of a preferred embodiment of the present invention;
FIG. 3 is a flowchart of one embodiment of logic for periodically determining a maximum frequency of operation of a digital system, in accordance with an aspect of a preferred embodiment of the present invention;
FIG. 4 is a flowchart of one embodiment of logic for tracking maximum frequency of operation of a digital system, in accordance with an aspect of a preferred embodiment of the present invention;
FIG. 5 graphically depicts periodically determining the maximum frequency of operation of a digital system, and signaling a warning when the maximum frequency of operation falls below a predefined warning threshold frequency of operation, in accordance with an aspect of a preferred embodiment of the present invention;
FIG. 6 is a flowchart of one embodiment of logic for trend analysis of a digital system and for generating a warning signal based thereon, in accordance with an aspect of a preferred embodiment of the present invention;
FIG. 7 is a flowchart of an alternate embodiment of logic for performing trend analysis and for generating a warning signal, in accordance with an aspect of a preferred embodiment of the present invention;
FIG. 8 is a flowchart of an alternate embodiment of logic for trend analysis of a digital system and for generating a warning signal based thereon, in accordance with an aspect of a preferred embodiment of the present invention;
FIG. 9 is a flowchart of a further embodiment of logic for trend analysis of a digital system and for generating a warning signal based thereon, in accordance with an aspect of a preferred embodiment of the present invention;
FIG. 10 is a flowchart of an alternate embodiment of logic for trend analysis of a digital system and for generating a warning signal based thereon, in accordance with an aspect of a preferred embodiment of the present invention;
FIG. 11 is a flowchart of one embodiment of logic implementing a variable sampling period for determining maximum frequency of operation of a digital system, in accordance with an aspect of a preferred embodiment of the present invention;
FIG. 12 is a flowchart of an alternate embodiment of logic implementing a variable sampling period for determining a maximum frequency of operation of a digital system, in accordance with an aspect of a preferred embodiment of the present invention;
FIG. 13 is a flowchart of an alternate embodiment of logic implementing sampling period analysis for determining a sample time for next maximum frequency of operation measurement of a digital system, in accordance with an aspect of a preferred embodiment of the present invention; and
FIG. 14 depicts one embodiment of a computer program product to incorporate one or more aspects of preferred embodiments of the present invention.
Detailed Description of the Invention
As noted, presented herein are a method, system and program product for actively monitoring or measuring aging, and hence reliability, of a specific digital system, and for issuing a warning signal if, for example, degradation of operation of the system exceeds a prespecified threshold. The "digital system" refers to any digital system or circuit, and includes, for example, a processor, as well as simple or complex non-processor based digital logic, memory, etc. As one specific example, the digital system is a microprocessor, and the specified threshold is a predefined acceptable level for the maximum frequency of operation of the digital system.
More particularly, presented herein is a technique for periodically determining a maximum frequency of operation of a digital system, and generating a warning signal indicative of reliability degradation of the digital system if at least one of: (i) a measured or estimated maximum frequency of operation of the digital system is below a warning threshold frequency of operation for the digital system, wherein the warning threshold frequency is greater than or equal to a manufacturer specified minimum frequency of operation for the digital system; or (ii) a rate of change in a difference between measured maximum frequencies of operation of the digital system exceeds an acceptable rate of change threshold for the digital system. These and other aspects of the present invention are described below with reference to FIGS. 1-14.
FIG. 1 depicts a typical digital system lifecycle model. This diagram illustrates that the digital system has a higher maximum frequency of operation (FMAX) than a specified (i.e., required) maximum frequency of operation for the digital system (FSPEC) when manufactured and beginning its lifecycle. As the digital system ages, several factors may degrade system performance, and hence decrease maximum operating frequency. Factors which degrade digital system performance depend upon the particular system at issue and the environment within which the system is used. For example, if the digital system comprises a processor, aging can be caused by a variety of factors including hot electron effects, electromigration and thermal expansion of the digital system.
Two failure modes are illustrated in FIG. 1. First a hard failure, illustrated by the dashed lines, is representative of an abrupt failure of the digital system (e.g., resulting from aging of the digital system). Soft aging is also shown wherein operation of the digital system gradually decreases to a level at or below the manufacturer specified maximum frequency of operation (FSPEC). Due to the gradual nature of this aging, the soft aging failure may go unnoticed, which implies that such aging may cause undetected errors in computation and data. Once the maximum frequency of operation of the digital system (FMAX) is known to fall below the manufacturer specified minimum frequency of operation of the digital system (FSPEC) (meaning that the digital system fails to operate at the required conditions (e.g., due to a hard failure or soft failure)), then the system must be replaced or repaired. Although a hard failure is readily recognizable, an accumulated aging effect with the system operating at or near the manufacturer specified maximum frequency of operation (FSPEC) might result in a single bit error in a block of data, and it is hard to detect occurrence of such an error employing a test instruction vector. This traditionally makes it difficult to distinguish the boundary between good and bad data results in an aging digital system.
FIGS. 2 & 2A depict embodiments of a digital system and control logic implementing reliability monitoring and warning signal generation, in accordance with an aspect of the present invention. In FIG. 2, a digital system 200 is shown to be driven by an (optionally) adjustable supply voltage 220 and an adjustable clock rate 210. (Conventionally, a fixed power supply voltage and fixed clock frequency are provided to a digital system for operation.) Control logic 230 senses and controls the adjustable supply voltage and clock rate in order to periodically measure the digital system's maximum operating frequency. As illustrated in the digital system lifecycle model of FIG. 1, a digital system qualified for a manufacturer specified maximum frequency of operation (FSPEC) actually has a maximum operating frequency (FMAX) which is higher than FSPEC when manufactured. Degradation of FMAX is an indication of processor soft aging. Thus, the system's age and lifetime can be measured by, for example, sweeping the supply voltage (VDD) and clock frequency (FCLK), sending worst-case test instructions, and finding a maximum successful frequency at a given voltage (or over several voltages). These measurements can be saved to facilitate the control logic performing trend analysis on the maximum frequency of operation of the digital system for proactively issuing a warning signal prior to failure of the digital system. The collected data can also be used to estimate a digital system's status or age, and the rate of aging.
In FIG. 2A, the digital system 200' is assumed to comprise a processor and to be capable of incorporating therein control logic 230' for reliability monitoring and warning signal generation of the digital system. An adjustable clock frequency 210 and, in one embodiment, an adjustable supply voltage VDD 220 are again provided, which are sensed and controlled by control logic 230' during reliability monitoring, as described herein.
FIG. 3 depicts one embodiment of a testing protocol of control logic 230/230' for periodically determining a maximum frequency of operation of a digital system, in accordance with an aspect of the present invention. Upon power-up of the digital system 300, the adjustable clock frequency (FCLK) is set to the manufacturer specified maximum frequency of operation (FSPEC) for the digital system 305. A test instruction vector is sent to the digital system and results are analyzed 310. Logic then determines whether the test passed 315. If so, then the clock frequency (FCLK) is raised 320, and a test instruction vector is again sent to the digital system and analyzed 325 to determine whether the test passed at the raised clock frequency 330. This process continues until the results from the test instruction vector no longer pass, after which the maximum measured frequency of operation of the digital system (FMAX) is recorded as the highest clock frequency (FCLK) with passing results 335. This measured maximum frequency of operation of the digital system (FMAX) is then returned 340, for example, for saving in a trend database (or memory) 455 (see FIG. 4).
Assuming that the executed test did not pass with the clock frequency set to the manufacturer specified maximum frequency of operation of the digital system (FSPEC), then the clock frequency is lowered 345, and the test is re-executed to determine if the digital system passes 355. This process continues until the clock frequency is low enough that the digital system passes the test, and the maximum frequency of operation of the digital system (FMAX) is again recorded as the highest passing clock frequency (FCLK) 335 and returned as the FMAX 340 for the digital system.
Note that in the FMAX search protocol of FIG. 3, the gradient by which FCLK is raised when the test passes 320 may be the same as or different from the gradient by which the clock frequency (FCLK) is lowered when failing the test 345. The gradient(s) selected may depend upon the age of the digital system and the accuracy desired. As explained further herein, this accuracy can be noted and changed as the digital system ages.
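By way of illustration only, the FMAX search of FIG. 3 can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of control logic 230/230': the callable passes_test (standing in for sending a test instruction vector at a given FCLK and analyzing the results), the bounds f_ceiling and f_floor, and all function and parameter names are hypothetical and do not appear in the figures.

```python
from typing import Callable


def measure_fmax(passes_test: Callable[[float], bool],
                 f_spec: float,
                 step_up: float,
                 step_down: float,
                 f_ceiling: float,
                 f_floor: float = 0.0) -> float:
    """Return the highest tested clock frequency at which the test passes.

    passes_test, f_ceiling and f_floor are assumptions made for this
    sketch; step_up and step_down are the raising/lowering gradients.
    """
    f_clk = f_spec  # start the search at FSPEC
    if passes_test(f_clk):
        # Test passed at FSPEC: keep raising FCLK until a test fails
        # (or the hypothetical ceiling is reached).
        while f_clk + step_up <= f_ceiling and passes_test(f_clk + step_up):
            f_clk += step_up
        return f_clk  # highest FCLK with passing results
    # Test failed at FSPEC: keep lowering FCLK until a test passes.
    while f_clk - step_down > f_floor:
        f_clk -= step_down
        if passes_test(f_clk):
            return f_clk
    raise RuntimeError("no passing clock frequency found above f_floor")
```

The returned value is only as accurate as the chosen gradients: a smaller step_up or step_down narrows the gap between the reported FMAX and the true maximum, at the cost of more test executions.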
FIG. 4 depicts one embodiment of logic for tracking maximum frequency of operation of a digital system, in accordance with an aspect of the present invention. Upon power-up of the digital system 400, the current maximum frequency of operation (FMAX) is read at a time TK 410. In one embodiment, reading of the current maximum frequency of operation is synonymous with measuring the current maximum frequency of operation using the protocol described above in connection with FIG. 3. The logic also fetches the previous maximum frequency of operation of the digital system (FMAX (FK-1)) at time TK-1 420. This previous maximum frequency of operation can be retrieved from trend database 455, which is accessible by the control logic. The difference (DK) between the previous maximum frequency of operation and the current maximum frequency of operation is determined, and the rate of change (RK) in the difference is calculated 440. The measured maximum frequency of operation of the digital system (FK), at time TK is then recorded in the trend database, along with the rate of change (RK) in the difference (DK) between measured maximum frequencies of operation of the digital system 450. After this, trend analysis 460 may be performed, either commensurate with each periodic determination of the maximum frequency of operation of the digital system, or at some other specified interval.
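The bookkeeping of FIG. 4 can be sketched, by way of example only, as shown below. The class and field names are hypothetical, and the sign convention is an assumption of this sketch: DK is taken as FK-1 minus FK, so that RK is positive while FMAX is falling, which matches the comparison of RK against a threshold in FIG. 6.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TrendRecord:
    t: float                # sample time T_K
    f_max: float            # measured maximum frequency F_K
    rate: Optional[float]   # R_K; None for the very first sample


@dataclass
class TrendDatabase:
    """In-memory stand-in for the trend database (hypothetical layout)."""
    records: List[TrendRecord] = field(default_factory=list)

    def record(self, t: float, f_max: float) -> TrendRecord:
        rate = None
        if self.records:
            prev = self.records[-1]          # F_(K-1) at time T_(K-1)
            if t <= prev.t:
                raise ValueError("sample times must be strictly increasing")
            d_k = prev.f_max - f_max         # D_K, positive when FMAX drops
            rate = d_k / (t - prev.t)        # R_K = D_K / (T_K - T_(K-1))
        rec = TrendRecord(t, f_max, rate)
        self.records.append(rec)             # save F_K and R_K for trend analysis
        return rec
```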
In a simplest method of trend analysis, the currently measured maximum frequency of operation of the digital system (FMAX) is compared against a predefined warning threshold frequency of operation (FWARN) for the digital system. The predefined warning threshold frequency (FWARN) may be greater than or equal to the manufacturer specified maximum frequency of operation for the digital system (FSPEC). In the lifecycle illustration of FIG. 5, the warning threshold frequency of operation is above the manufacturer specified minimum frequency of operation, and when the maximum frequency of operation of the digital system drops to or below the warning threshold frequency of operation, a warning signal is generated by the control logic and sent, for example, to an operating system of the digital system. In this embodiment, the warning signal indicates that the maximum frequency of operation of the digital system may be slower than the manufacturer specified maximum frequency of operation (FSPEC) in the near future (e.g., due to continued soft aging of the system, or resulting from a hard failure due to aging of the system). At this point, the warning signal may also be provided to a user of the digital system so that an appropriate procedure, such as shutdown, can be taken. As explained further below, when the maximum frequency of operation of the digital system is at or below the warning threshold frequency of operation of the digital system, the sampling rate for determining the maximum frequency of operation for the digital system may also be increased to more accurately monitor the digital system's status.
FIGS. 6-10 depict various further embodiments for performing trend analysis on the maximum frequency of operation of the digital system, in accordance with aspects of the present invention.
FIG. 6 presents a trend analysis approach wherein instead of employing the maximum frequency of operation of the digital system in comparison with a prespecified threshold to generate the warning signal, the rate of change (RK) in the difference between measured maximum frequencies of operation of the digital system is employed, in accordance with an aspect of the present invention. Specifically, a most recent rate of change (RK) in the difference between measured maximum frequencies of operation of the digital system is retrieved 600 from the trend database 455. This recent rate of change (RK) is then compared against a specified threshold rate of change (RTH) in the difference between measured maximum frequencies of operation 610. By way of example, the specified threshold rate of change (RTH) may be chosen based on historical aging information for the type of digital system being monitored. If the recently determined rate of change (RK) is greater than the specified threshold rate of change (RTH), then a warning signal is generated 620, thereby completing trend analysis 630. However, if the rate of change in the difference between measured maximum frequencies of operation of the digital system is below the specified rate of change threshold, then no warning signal is generated and trend analysis is complete, allowing the digital system to return to normal operation.
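The two simplest warning conditions, namely the frequency comparison of FIG. 5 and the rate of change comparison of FIG. 6, can be sketched together as follows. The function and parameter names are hypothetical; the sketch assumes that the rate of change is expressed as a positive number while FMAX is falling, and it treats an FMAX equal to FWARN as a warning condition, following the "drops to or below" wording used for FIG. 5.

```python
from typing import Optional


def should_warn(f_max: float,
                f_warn: float,
                rate: Optional[float] = None,
                rate_threshold: Optional[float] = None) -> bool:
    """Return True if either warning condition is met."""
    # FIG. 5: measured FMAX has dropped to or below FWARN.
    if f_max <= f_warn:
        return True
    # FIG. 6: most recent rate of change R_K exceeds the threshold R_TH.
    if rate is not None and rate_threshold is not None:
        return rate > rate_threshold
    return False
```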
FIG. 7 depicts an alternate trend analysis embodiment wherein N most recent rates of change in the difference between measured maximum frequencies of operation are fetched 700 from the trend database 455. From these values, a next rate of change (R'K+1) in the difference between measured maximum frequencies of operation is estimated 710. This estimated next rate of change (R'K+1) in the difference between measured maximum frequencies of operation can be determined employing conventional linear model estimation, such as a linear N-order model, wherein a linear prediction is made from the previous N rate of change determinations. From this estimated rate of change, the next maximum frequency of operation of the digital system (F'K+1) is estimated 730, after which the logic determines whether the estimated maximum frequency of operation of the digital system (F'K+1) is less than the predefined warning threshold frequency (FWARN) 740. If so, then a warning signal is generated 750, which completes trend analysis 760. Assuming that the estimated next frequency of operation of the digital system (F'K+1) is greater than the predefined warning threshold frequency (FWARN), then no warning signal is generated, and trend analysis is finished 760.
The logic of FIG. 8 is similar to the logic of FIG. 7, with the exception that the next maximum frequency of operation of the digital system (F'K+1) is estimated directly from N prior saved measured maximum frequencies of operation of the digital system. Specifically, the most recent N measured maximum frequencies of operation of the digital system are fetched 800 from the trend database 455, and from these values, the next maximum frequency of operation of the digital system (F'K+1) is estimated 810 using, for example, linear N-order model analysis 720. If the estimated next maximum frequency of operation of the digital system (F'K+1) is less than the predefined warning threshold frequency (FWARN) 820, then a warning signal is generated 830, thereby completing trend analysis 840. No warning signal is generated if the estimated next maximum frequency of operation of the digital system is above the warning threshold frequency.
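The estimation steps of FIGS. 7 and 8 can be sketched as follows. The description does not fix a particular form for the linear N-order model, so this sketch assumes one possible reading: a least-squares straight line fitted through the N most recent, evenly spaced values and extrapolated one sample ahead. All names are hypothetical, and the rate of change is again assumed to be positive while FMAX is falling.

```python
from typing import Sequence


def linear_extrapolate(values: Sequence[float]) -> float:
    """Fit a least-squares line through evenly spaced samples and
    evaluate it one step past the last sample."""
    n = len(values)
    if n == 0:
        raise ValueError("at least one value is required")
    if n == 1:
        return float(values[0])
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    slope = sxy / sxx
    return y_mean + slope * (n - x_mean)


def estimate_fmax_from_rates(f_k: float,
                             rates: Sequence[float],
                             period: float) -> float:
    """FIG. 7 style: estimate R'_(K+1), then F'_(K+1) = F_K - R'_(K+1) * P."""
    r_next = linear_extrapolate(rates)
    return f_k - r_next * period


def estimate_fmax_from_fmax(f_values: Sequence[float]) -> float:
    """FIG. 8 style: estimate F'_(K+1) directly from the last N measurements."""
    return linear_extrapolate(f_values)
```

In either case the estimate would then be compared against FWARN, and a warning signal generated if the estimate is less than FWARN.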
FIG. 9 depicts an alternate approach wherein historical aging information saved in a historical aging database (or memory) 915 is employed in estimating a next rate of change in the difference between measured maximum frequencies of operation of the digital system in place of a linear N-order model, such as employed in the processing of FIG. 7. The historical aging information may comprise a database of aging information gathered through conventional burn-in testing on the type of digital system being monitored. Alternatively, the historical aging information could be derived from measuring aging of other digital systems of the same type as the current digital system being monitored. Depending upon the digital system, this historical aging information may provide a more accurate estimate of a next maximum frequency of operation and/or a next rate of change in the difference between measured maximum frequencies of operation than a linear progression model.
In the protocol of FIG. 9, the most recent N determined rates of change in the difference between measured maximum frequencies of operation of the digital system are retrieved 900 from the trend database 455 and employed with the historical aging information or nominal aging model (from database 915) to estimate a next rate of change (R'K+1) in the difference between measured maximum frequencies of operation of the digital system 910. From this estimated next rate of change, a next maximum frequency of operation of the digital system (F'K+1) is estimated 920, for example, by adding to the prior maximum frequency of operation of the digital system the estimated rate of change in the difference between the measured maximum frequencies of operation multiplied by the difference in time between measurements. The estimated next maximum frequency of operation (F'K+1) is then compared against the predefined warning threshold frequency (FWARN) 925, and a warning signal is generated if the estimated next maximum frequency of operation is below the warning threshold 930. Otherwise, no warning signal is generated and trend analysis is complete 935.
FIG. 10 depicts an alternate analysis approach wherein N recently measured maximum frequencies of operation are retrieved 1000 from the trend database 455 and employed to directly estimate a next maximum frequency of operation of the digital system (F'K+1) 1010 employing historical aging information from the historical aging database 915. The estimated next maximum frequency of operation of the digital system is then compared against the predefined warning threshold frequency 1020, and if less, a warning signal is generated 1030. Otherwise, no warning signal is generated and trend analysis is complete 1040, that is, until a next analysis interval.
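The description leaves open how the measured rates of change are combined with the historical aging information. Purely as an illustrative assumption, the sketch below scales the nominal (historical) next rate by the average ratio of this unit's recent rates to the nominal rates at the same points in the lifecycle. The function names, the scaling rule, and the representation of the historical data as a plain list of nominal rates are all hypothetical.

```python
from typing import Sequence


def estimate_next_rate_from_history(observed_rates: Sequence[float],
                                    nominal_rates: Sequence[float],
                                    nominal_next_rate: float) -> float:
    """Estimate R'_(K+1) by scaling the nominal next rate.

    observed_rates: last N measured rates of change for this unit.
    nominal_rates: historical rates for the same N lifecycle points.
    nominal_next_rate: historical rate for the next lifecycle point.
    """
    if not observed_rates or len(observed_rates) != len(nominal_rates):
        raise ValueError("need equally sized, non-empty rate sequences")
    # Compare this unit with the nominal aging curve, skipping points
    # where the nominal rate is not positive.
    ratios = [o / n for o, n in zip(observed_rates, nominal_rates) if n > 0]
    if not ratios:
        return nominal_next_rate
    scale = sum(ratios) / len(ratios)
    return scale * nominal_next_rate


def estimate_next_fmax(f_k: float, r_next: float, period: float) -> float:
    """F'_(K+1) = F_K - R'_(K+1) * P, with the rate positive when FMAX falls.

    This is equivalent to the "adding" described in the text when the
    rate is instead signed negative for a falling FMAX.
    """
    return f_k - r_next * period
```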
FIGS. 11-13 depict alternate embodiments for analyzing and dynamically adjusting the sampling period employed by the control logic in periodically determining the maximum frequency of operation of the digital system.
In FIG. 11, the most recent measured maximum frequency of operation of the digital system 1100 is, for example, fetched from the trend database 455 and compared against a first predefined threshold (FTH1) 1110. In one embodiment, the first predefined threshold is a frequency threshold that is greater than the predefined warning threshold frequency. If the maximum frequency of operation is above the first predefined threshold, no action is necessary and sampling period analysis is complete 1130. However, if the most recently determined maximum frequency of operation is below the first predefined threshold, then the sampling period is adjusted to a new sampling period P1 1120. Sampling period analysis might then be complete (not shown), or alternatively, the most recent measured maximum frequency of operation of the digital system may be compared against a second predefined threshold (FTH2) 1140. This second predefined threshold is a frequency that may be, in one embodiment, equal to the predefined warning threshold frequency (FWARN). If the most recent measured maximum frequency of operation is again above the second threshold frequency, then sampling period analysis is finished 1130 and the new sampling period P1 is employed. However, if the most recent measured maximum frequency of operation (FMAX (FK)) is also less than the second predefined threshold (FTH2), then the sampling period employed by the control logic in periodically determining the maximum frequency of operation is set to sample period P2 1150. This example assumes that sample period P2 is less than sample period P1, thereby providing a greater sampling rate for the periodically determining of the maximum frequency of operation of the digital system.
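The two-threshold period selection of FIG. 11 can be sketched as follows. The names are hypothetical. The handling of equality is an assumption of this sketch: an FMAX exactly equal to the first threshold leaves the period unchanged, while an FMAX less than or equal to the second threshold selects P2, following the "less than or equal to" wording used for the warning threshold in the summary above.

```python
def adjust_sampling_period(f_k: float,
                           current_period: float,
                           f_th1: float,
                           f_th2: float,
                           p1: float,
                           p2: float) -> float:
    """Return the sampling period to use after the latest FMAX measurement."""
    if not (f_th1 > f_th2 and p1 > p2):
        raise ValueError("expected f_th1 > f_th2 and p1 > p2")
    if f_k >= f_th1:
        return current_period  # FMAX not below the first threshold: no action
    if f_k > f_th2:
        return p1              # below FTH1 only: sample more often
    return p2                  # at or below FTH2: sample most often
```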
FIG. 12 depicts an alternate embodiment to the protocol of FIG. 11. In this embodiment, the rate of change in the difference between measured maximum frequencies of operation of the digital system is employed in adjusting the sample period for the periodically determining of the maximum frequency of operation. As shown, a most recent rate of change RK is, for example, retrieved 1200 from the trend database 455 and is compared to a first predefined rate of change threshold (RTH1) 1210. If the most recent rate of change (RK) is less than the first predefined rate of change threshold (RTH1), then no action is taken and sampling period analysis is complete 1230. However, if it is greater than or equal to the first threshold, then the sampling period is adjusted to sample period P1 1220, which (in one embodiment) completes adjustment of the sample period 1230. In the embodiment of FIG. 12, the most recent determined rate of change RK is further compared against a second predefined rate of change threshold (RTH2) 1240. If the most recent rate of change RK is less than the second predefined rate of change threshold, then the sampling period remains at period P1 and analysis is finished 1230. Otherwise, the sampling period is set to a second sampling period P2 1250, wherein it is assumed that sample period P2 is smaller than sample period P1, meaning that the sampling rate of the periodically determined maximum frequency of operation of the digital system is greater.
FIG. 13 depicts an alternate processing approach for determining a next time in which to sample the maximum frequency of operation of the digital system. In this approach, the most recently determined rate of change in the difference between measured maximum frequencies of operation, as well as the most recently measured maximum frequency of operation of the digital system, are retrieved 1300 from the trend database 455 and used to estimate a time interval (T'K+1) for when an estimated maximum frequency of operation of the digital system (F'K+1) will be equal to the warning threshold frequency of operation 1310. This estimate can be obtained using either historical aging information on the digital system type, for example, retrieved from a historical aging database 915, or by linear progression analysis using a linear N-order model 720. The estimated sampling time at which the estimated maximum frequency of operation of the digital system will be at the predefined warning threshold frequency is then used to determine an estimated sampling period to arrive at that predefined warning threshold frequency 1320. This estimated sampling period (P'K+1) is compared against the previously employed sampling period PK used in measuring the most recent maximum frequency of operation of the digital system. If the previously employed sampling period is greater than the estimated sampling period to arrive at the predefined warning threshold frequency, then the sampling time employed for the next measurement of the maximum frequency of operation of the digital system is the prior sampling time plus the estimated sampling period until the maximum frequency of operation reaches the predefined warning threshold frequency 1340. Alternatively, if the previously employed sampling period is less than the estimated sampling period until the maximum frequency of operation reaches the predefined warning threshold frequency (P'K+1), then the next sampling time is the prior sampling time plus the previously employed sampling period PK 1360. Once the sampling time for the next determination of the maximum frequency of operation of the digital system is determined, sampling period analysis is finished 1350.
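The sampling time selection of FIG. 13 can be sketched as follows, assuming the simplest linear case in which FMAX continues to fall at the most recent rate RK (taken as positive while FMAX is falling). The names are hypothetical, and the min_period floor is an addition of this sketch, not part of the figure, so that the next sample time always advances.

```python
def next_sample_time(t_k: float,
                     f_k: float,
                     rate_k: float,
                     f_warn: float,
                     period_k: float,
                     min_period: float = 1.0) -> float:
    """Return the time of the next FMAX measurement.

    The estimated period P'_(K+1) is the time for FMAX to fall from
    F_K to FWARN at rate R_K; the shorter of P'_(K+1) and the
    previously employed period P_K is used.
    """
    if rate_k <= 0:
        # No measurable degradation: keep the previous period.
        return t_k + period_k
    p_est = max((f_k - f_warn) / rate_k, min_period)
    return t_k + min(period_k, p_est)
```

For example, with FK 50 units above FWARN and RK of 2 units per unit time, the estimated period is 25, so a previously employed period of 50 would be shortened to 25, whereas a previously employed period of 10 would be retained.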
Advantageously, those skilled in the art will note from the above description that provided herein are various protocols for actively monitoring or measuring aging, and hence reliability, of a specific digital system, and for issuing a warning signal if degradation of operation of the system exceeds a specified threshold. In accordance with the protocols presented, actual measurement of digital system performance is performed by evaluating a maximum frequency of operation of the digital system at periodic intervals. A variable clock frequency is employed (along with, in certain embodiments, a variable power supply) in implementing the concepts presented. Measured maximum frequencies of operation, as well as determined rates of change in the difference between measured maximum frequencies of operation of the digital system are saved (for example, in a trend database) for subsequent trend analysis and warning signal generation.
Advantageously, the sampling period for determining the maximum frequency of operation of the digital system may be changed with aging of the digital system. In generating the warning signal, either the measured maximum frequency of operation of the digital system or the estimated maximum frequency of operation of the digital system, or the actual rate of change in the difference between measured maximum frequencies of operation of the digital system may be employed in evaluating whether to issue a warning signal.
In certain embodiments, more than one warning threshold frequency and/or more than one rate of change threshold may be employed, for example, in either generating different levels of warning signals, or dynamically adjusting the sampling period employed in periodically monitoring the maximum frequency of operation of the digital system.
Advantageously, the approach presented herein does not require burn-in testing of the digital system, and is based on measurements derived from the actual digital system itself, rather than historical data for the particular type of digital system. The protocols presented are an in situ aging prediction and warning signal generation technique. The approach may be utilized for a wide variety of digital systems, including processor based systems, as well as non-processor based systems.
One or more aspects of the present invention can be included in an article of manufacture (e.g., one or more computer program products) having, for instance, computer usable media. The media has therein, for instance, computer readable program code means or logic (e.g., instructions, code, commands, etc.) to provide and facilitate the capabilities of the present invention. The article of manufacture can be included as a part of a computer system or sold separately.
One example of an article of manufacture or a computer program product incorporating one or more aspects of the present invention is described with reference to FIG. 14. A computer program product 1400 includes, for instance, one or more computer usable media 1402 to store computer readable program code means or logic 1404 thereon to provide and facilitate one or more aspects of the present invention. The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
A sequence of program instructions or a logical assembly of one or more interrelated modules defined by one or more computer readable program code means or logic direct the performance of one or more aspects of the present invention.
Advantageously, a data structure of readily accessible units of memory is provided. By employing this data structure, memory access and system performance are enhanced (e.g., faster). The data structure includes designations (e.g., addresses) of one or more units of memory (e.g., pages) that while in the data structure do not need address translation or any other test to be performed in order to access the unit of memory. This data structure can be used in any type of processing environment including emulated environments.
Although various embodiments are described above, these are only examples. For instance, one or more aspects of the present invention can be included in environments that are not emulated environments. Further, one or more aspects of the present invention can be used in emulated environments that have a native architecture that is different from the one described above and/or emulates an architecture other than the z/Architecture®. Various emulators can be used. Emulators are commercially available and offered by various companies. Additional details relating to emulation are described in Virtual Machines: Versatile Platforms For Systems and Processes (The Morgan Kaufmann Series in Computer Architecture and Design), Jim Smith and Ravi Nair, June 3, 2005, which is hereby incorporated herein by reference in its entirety.
Input/Output or I/O devices (including, but not limited to, keyboards, displays, pointing devices, DASD, tape, CDs, DVDs, thumb drives and other memory media, etc.) can be coupled to the system either directly or through intervening I/O controllers. Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the available types of network adapters.
The capabilities of one or more aspects of the present invention can be implemented in software, firmware, hardware, or some combination thereof. At least one program storage device readable by a machine embodying at least one program of instructions executable by the machine to perform the capabilities of the present invention can be provided.
The flow diagrams depicted herein are just examples. There may be many variations to these diagrams or the steps (or operations) described therein without departing from the spirit of the invention. For instance, the steps may be performed in a differing order, or steps may be added, deleted, or modified. All of these variations are considered a part of the claimed invention.
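By way of a non-limiting illustration, the warning decision of the monitoring flow might be sketched in Python as shown below. The sketch assumes evenly spaced samples and interprets the "rate of change" as the drop in maximum frequency between two consecutive measurements divided by the sampling period. The names `Thresholds`, `decline_rate`, `estimate_next` and `should_warn`, and the choice of an ordinary least-squares line for the linear model estimation, are assumptions made for this example only and are not taken from the figures.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    warning_freq: float      # warning threshold frequency (hypothetical units: MHz)
    min_spec_freq: float     # manufacturer specified minimum frequency (MHz)
    max_decline_rate: float  # acceptable rate of change threshold (MHz per time unit)


def decline_rate(f_prev, f_curr, period):
    """Drop in maximum frequency per unit time between two consecutive samples."""
    return (f_prev - f_curr) / period


def estimate_next(samples, period):
    """Fit a least-squares straight line to the samples and extrapolate one period ahead."""
    n = len(samples)
    times = [i * period for i in range(n)]
    t_mean = sum(times) / n
    f_mean = sum(samples) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (f - f_mean) for t, f in zip(times, samples))
    slope = sxy / sxx
    next_time = times[-1] + period
    return f_mean + slope * (next_time - t_mean)


def should_warn(samples, period, th):
    """Return True if either warning condition holds for the measured history."""
    if th.warning_freq < th.min_spec_freq:
        raise ValueError("warning threshold must not be below the specified minimum")
    if len(samples) < 2 or period <= 0:
        raise ValueError("need at least two samples and a positive sampling period")
    latest = samples[-1]
    predicted = estimate_next(samples, period)
    # Condition (i): measured or estimated maximum frequency below the warning threshold.
    below_threshold = latest < th.warning_freq or predicted < th.warning_freq
    # Condition (ii): the most recent decline is faster than the acceptable rate.
    too_fast = decline_rate(samples[-2], latest, period) > th.max_decline_rate
    return below_threshold or too_fast
```

In this sketch a history that declines slowly and stays well above the warning threshold produces no warning, whereas a history whose extrapolated next value falls below the threshold produces one even though the latest measurement is still acceptable.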

Claims

1. A method of monitoring reliability of a digital system, the method comprising: periodically determining a maximum frequency of operation of the digital system; and generating a warning signal indicative of reliability degradation of the digital system if at least one of:
(i) a measured or estimated maximum frequency of operation of the digital system is below a warning threshold frequency of operation for the digital system, wherein the warning threshold frequency is greater than or equal to a manufacturer specified minimum frequency of operation for the digital system; or
(ii) a rate of change in a difference between measured maximum frequencies of operation of the digital system exceeds an acceptable rate of change threshold for the digital system.
2. The method of claim 1, further comprising periodically determining a rate of change in the difference between measured maximum frequencies of operation of the digital system, and wherein generating the warning signal comprises generating the warning signal if a current rate of change in the difference between measured maximum frequencies of operation of the digital system exceeds the acceptable rate of change threshold for the digital system.
3. The method of claim 1, further comprising periodically determining a rate of change in the difference between measured maximum frequencies of operation of the digital system, and employing multiple determined rates of change in the difference between measured maximum frequencies of operation of the digital system in estimating a next rate of change employing linear model estimation, and employing the estimated next rate of change to estimate a next maximum frequency of operation of the digital system at a next sample time determined by the period of the periodically determining of the maximum frequency of operation of the digital system, and wherein generating the warning signal comprises generating the warning signal if the estimated next maximum frequency of operation of the digital system is below the warning threshold frequency of operation of the digital system.
4. The method of claim 1, further comprising employing multiple measured maximum frequencies of operation to estimate a next maximum frequency of operation utilizing linear model estimation, and wherein the generating of the warning signal comprises generating the warning signal if the estimated maximum frequency of operation of the digital system is below the warning threshold frequency of operation.
5. The method of claim 1, further comprising employing multiple determined rates of change in the difference between measured maximum frequencies of operation of the digital system and historical aging data for the digital system type to estimate a next rate of change in the difference between maximum frequencies of operation of the digital system, and estimating the maximum frequency of operation of the digital system from the estimated rate of change in the difference between maximum frequencies of operation of the digital system and a sampling period employed by the periodically determining of the maximum frequency of operation of the digital system, and wherein generating the warning signal comprises generating the warning signal if the estimated maximum frequency of operation of the digital system is below the warning threshold frequency of operation.
6. The method of claim 1, further comprising employing multiple measured maximum frequencies of operation of the digital system to estimate a next maximum frequency of operation of the digital system employing historical aging data for the digital system type, and wherein generating the warning signal comprises generating the warning signal if the estimated maximum frequency of operation of the digital system is below the warning threshold frequency of operation for the digital system.
7. The method of claim 1, further comprising dynamically adjusting a sampling period employed by the periodically determining of the maximum frequency of operation of the digital system, the dynamically adjusting comprising: determining whether a maximum frequency of operation of the digital system is below a first predefined threshold, and if so, adjusting the sampling period of the periodically determining to increase the sampling rate of the periodically determining.
8. The method of claim 7, wherein the first predefined threshold is a first predefined threshold frequency which is greater than the warning threshold frequency, and wherein the dynamically adjusting of the sampling period further comprises determining whether the maximum frequency of operation of the digital system is less than or equal to the warning threshold frequency of operation, and if so, further adjusting the sampling period to further increase the sampling rate of the periodically determining of the maximum frequency of operation of the digital system.
9. The method of claim 1, further comprising dynamically adjusting a sampling period employed by the periodically determining of the maximum frequency of operation of the digital system, the dynamically adjusting comprising: determining whether a rate of change in the difference between measured maximum frequencies of operation of the digital system is above a first predefined threshold, and if so, adjusting the sampling period of the periodically determining to increase the sampling rate of the periodically determining.
10. The method of claim 9, wherein the dynamically adjusting of the sampling period further comprises determining whether the rate of change in the difference between measured maximum frequencies of operation of the digital system is greater than a second predefined threshold rate, and if so, further adjusting the sampling period to further increase the sampling rate of the periodically determining of the maximum frequency of operation of the digital system, wherein the second predefined threshold is greater than the first predefined threshold.
11. The method of claim 1, further comprising controlling a sampling period for the periodically determining of the maximum frequency of operation of the digital system, wherein controlling the sampling period comprises: estimating a time interval from a most recent determination of maximum frequency of operation of the digital system to the digital system reaching a maximum frequency of operation equal to the warning threshold frequency; employing the estimated time interval in setting a next sampling period for determining a maximum frequency of operation of the digital system; determining whether the next sampling period is less than a previous sampling period employed in the periodically determining of the maximum frequency of operation of the digital system; and if so, increasing the sampling period to increase the sampling rate of the periodically determining of the maximum frequency of operation of the digital system.
12. A system of monitoring reliability of a digital system, the system comprising: control logic adapted to periodically determine a maximum frequency of operation of the digital system, and generate a warning signal indicative of reliability degradation of the digital system if at least one of:
(i) a measured or estimated maximum frequency of operation of the digital system is below a warning threshold frequency of operation for the digital system, wherein the warning threshold frequency is greater than or equal to a manufacturer specified minimum frequency of operation for the digital system; or
(ii) a rate of change in a difference between measured maximum frequencies of operation of the digital system exceeds an acceptable rate of change threshold for the digital system.
13. The system of claim 12, wherein the control logic is further adapted to periodically determine a rate of change in the difference between measured maximum frequencies of operation of the digital system, and wherein generating the warning signal comprises generating the warning signal if a current rate of change in the difference between measured maximum frequencies of operation of the digital system exceeds the acceptable rate of change threshold for the digital system.
14. The system of claim 12, wherein the control logic is further adapted to periodically determine a rate of change in the difference between measured maximum frequencies of operation of the digital system, and employ multiple determined rates of change in the difference between measured maximum frequencies of operation of the digital system in estimating a next rate of change employing linear model estimation, and employ the estimated next rate of change to estimate a next maximum frequency of operation of the digital system at a next sample time determined by the period of the periodically determining of the maximum frequency of operation of the digital system, and wherein generating the warning signal comprises generating the warning signal if the estimated next maximum frequency of operation of the digital system is below the warning threshold frequency of operation of the digital system.
15. The system of claim 12, wherein the control logic is further adapted to employ multiple measured maximum frequencies of operation to estimate a next maximum frequency of operation utilizing linear model estimation, and wherein the generating of the warning signal comprises generating the warning signal if the estimated maximum frequency of operation of the digital system is below the warning threshold frequency of operation.
16. The system of claim 12, wherein the control logic is further adapted to employ multiple determined rates of change in the difference between measured maximum frequencies of operation of the digital system and historical aging data for the digital system type to estimate a next rate of change in the difference between maximum frequencies of operation of the digital system, and to estimate the maximum frequency of operation of the digital system from the estimated rate of change in the difference between maximum frequencies of operation of the digital system and a sampling period employed by the periodically determining of the maximum frequency of operation of the digital system, and wherein generating the warning signal comprises generating the warning signal if the estimated maximum frequency of operation of the digital system is below the warning threshold frequency of operation.
17. The system of claim 12, wherein the control logic is further adapted to employ multiple measured maximum frequencies of operation of the digital system to estimate a next maximum frequency of operation of the digital system employing historical aging data for the digital system type, and wherein generating the warning signal comprises generating the warning signal if the estimated maximum frequency of operation of the digital system is below the warning threshold frequency of operation for the digital system.
18. The system of claim 12, wherein the control logic is further adapted to dynamically adjust a sampling period employed by the periodically determining of the maximum frequency of operation of the digital system, the dynamically adjusting comprising: determining whether a maximum frequency of operation of the digital system is below a first predefined threshold, and if so, adjusting the sampling period of the periodically determining to increase the sampling rate of the periodically determining.
19. The system of claim 18, wherein the first predefined threshold is a first predefined threshold frequency which is greater than the predefined warning threshold frequency, and wherein the dynamically adjusting of the sampling period further comprises determining whether the maximum frequency of operation of the digital system is less than or equal to the predefined warning threshold frequency of operation, and if so, further adjusting the sampling period to further increase the sampling rate of the periodically determining of the maximum frequency of operation of the digital system.
20. A computer program product loadable into the internal memory of a digital computer, comprising software code portions for carrying out, when said product is run on a computer, the invention as claimed in claims 1 to 11.
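By way of a non-limiting illustration, the sampling-period adjustments recited in claims 7 to 11 might be sketched in Python as follows. The function names, the scaling factors `faster` and `fastest`, and the `fraction` of the estimated time interval used for the next period are hypothetical values chosen for this example; the claims do not specify them. The sketch shortens the sampling period whenever it raises the sampling rate.

```python
def period_from_frequency(f_max, base_period, first_threshold, warning_freq,
                          faster=0.5, fastest=0.25):
    """Sample more often as the maximum frequency approaches the warning threshold."""
    if first_threshold <= warning_freq:
        raise ValueError("first threshold must be greater than the warning threshold")
    if f_max <= warning_freq:
        return base_period * fastest  # at or below the warning threshold: highest rate
    if f_max < first_threshold:
        return base_period * faster   # below the first threshold: increased rate
    return base_period


def period_from_rate(rate, base_period, first_rate, second_rate,
                     faster=0.5, fastest=0.25):
    """Sample more often as the decline rate exceeds the first, then second, threshold."""
    if second_rate <= first_rate:
        raise ValueError("second threshold must be greater than the first threshold")
    if rate > second_rate:
        return base_period * fastest
    if rate > first_rate:
        return base_period * faster
    return base_period


def period_from_time_to_warning(f_max, rate, warning_freq, previous_period,
                                fraction=0.5):
    """Set the next period from the estimated time until the warning threshold is reached."""
    if rate <= 0 or f_max <= warning_freq:
        # No measurable decline, or already at the threshold: keep the period (assumption).
        return previous_period
    time_to_warning = (f_max - warning_freq) / rate
    candidate = time_to_warning * fraction
    # Only ever shorten the period; a longer candidate leaves it unchanged.
    return min(previous_period, candidate)
```

For example, with a base period of 60 time units, a first threshold of 995 MHz and a warning threshold of 990 MHz, a measurement of 993 MHz would halve the period under these assumed factors, and a measurement at 990 MHz would quarter it.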
PCT/EP2008/052021 2007-04-10 2008-02-19 Monitoring reliability of a digital system WO2008122459A2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
KR1020097015211A KR101114054B1 (en) 2007-04-10 2008-02-19 Monitoring reliability of a digital system
JP2010502470A JP5181312B2 (en) 2007-04-10 2008-02-19 Method and system for monitoring the reliability of a digital system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US11/733,318 US8094706B2 (en) 2007-04-10 2007-04-10 Frequency-based, active monitoring of reliability of a digital system
US11/733,318 2007-04-10

Publications (2)

Publication Number Publication Date
WO2008122459A2 (en) 2008-10-16
WO2008122459A3 (en) 2008-11-27

Family

ID=39592107

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2008/052021 WO2008122459A2 (en) 2007-04-10 2008-02-19 Monitoring reliability of a digital system

Country Status (4)

Country Link
US (1) US8094706B2 (en)
JP (1) JP5181312B2 (en)
KR (1) KR101114054B1 (en)
WO (1) WO2008122459A2 (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5088541B2 (en) * 2007-06-11 2012-12-05 ソニー株式会社 Information processing apparatus and method, and program
JP5018510B2 (en) * 2008-01-29 2012-09-05 富士通株式会社 Failure prediction apparatus, failure prediction method, and failure prediction program
US20090288092A1 (en) * 2008-05-15 2009-11-19 Hiroaki Yamaoka Systems and Methods for Improving the Reliability of a Multi-Core Processor
US7900114B2 (en) * 2009-02-27 2011-03-01 Infineon Technologies Ag Error detection in an integrated circuit
TW201140308A (en) 2010-03-15 2011-11-16 Kyushu Inst Technology Semiconductor device, detection method, and program
US8729920B2 (en) 2010-11-24 2014-05-20 International Business Machines Corporation Circuit and method for RAS-enabled and self-regulated frequency and delay sensor
JP5805962B2 (en) * 2011-02-25 2015-11-10 ナブテスコ株式会社 Electronic equipment health monitoring device
JP2013088394A (en) * 2011-10-21 2013-05-13 Renesas Electronics Corp Semiconductor device
JP5630453B2 (en) * 2012-02-16 2014-11-26 日本電気株式会社 Degradation detection circuit and semiconductor integrated device
JP5997476B2 (en) * 2012-03-30 2016-09-28 ラピスセミコンダクタ株式会社 Operating margin control circuit, semiconductor device, electronic device, and operating margin control method
US8587351B1 (en) 2012-05-11 2013-11-19 Hamilton Sundstrand Corporation Method for synchronizing sampling to sinusoidal inputs
US9317389B2 (en) * 2013-06-28 2016-04-19 Intel Corporation Apparatus and method for controlling the reliability stress rate on a processor
JP2016188825A (en) 2015-03-30 2016-11-04 ルネサスエレクトロニクス株式会社 Semiconductor device and system
US10310548B2 (en) 2016-11-07 2019-06-04 Microsoft Technology Licensing, Llc Expected lifetime management

Citations (6)

Publication number Priority date Publication date Assignee Title
US5175751A (en) * 1990-12-11 1992-12-29 Intel Corporation Processor static mode disable circuit
WO2002093380A1 (en) * 2001-05-16 2002-11-21 General Electric Company System, method and computer product for performing automated predictive reliability
US6557035B1 (en) * 1999-03-30 2003-04-29 International Business Machines Corporation Rules-based method of and system for optimizing server hardware capacity and performance
US20030233624A1 (en) * 2002-06-13 2003-12-18 Texas Instruments Incorporated Method for predicting the degradation of an integrated circuit performance due to negative bias temperature instability
US20050134394A1 (en) * 2003-12-23 2005-06-23 Liu Jonathan H. On-chip transistor degradation monitoring
US6937965B1 (en) * 1999-12-17 2005-08-30 International Business Machines Corporation Statistical guardband methodology

Family Cites Families (20)

Publication number Priority date Publication date Assignee Title
US4443116A (en) * 1981-01-09 1984-04-17 Citizen Watch Company Limited Electronic timepiece
US5625288A (en) * 1993-10-22 1997-04-29 Sandia Corporation On-clip high frequency reliability and failure test structures
KR960014952A (en) * 1994-10-14 1996-05-22 가즈오 가네코 Burn-in and test method of semiconductor wafer and burn-in board used for it
JP3815835B2 (en) * 1997-02-18 2006-08-30 本田技研工業株式会社 Semiconductor device
US6091287A (en) * 1998-01-23 2000-07-18 Motorola, Inc. Voltage regulator with automatic accelerated aging circuit
US6233536B1 (en) * 1998-11-30 2001-05-15 General Electric Company Remote lifecycle monitoring of electronic boards/software routines
US6400173B1 (en) * 1999-11-19 2002-06-04 Hitachi, Ltd. Test system and manufacturing of semiconductor device
JP3211957B2 (en) * 1999-12-07 2001-09-25 川崎重工業株式会社 Device diagnostic method and diagnostic device
JP2002099448A (en) * 2000-09-21 2002-04-05 Ntt Data Corp Performance monitoring apparatus and its method
US7031950B2 (en) * 2000-12-14 2006-04-18 Siemens Corporate Research, Inc. Method and apparatus for providing a virtual age estimation for remaining lifetime prediction of a system using neural networks
US20030158803A1 (en) * 2001-12-20 2003-08-21 Darken Christian J. System and method for estimation of asset lifetimes
JP2005004336A (en) * 2003-06-10 2005-01-06 Hitachi Ltd Resource monitoring method and device, and resource monitoring program
US7450655B2 (en) * 2003-07-22 2008-11-11 Intel Corporation Timing error detection for a digital receiver
US20050175751A1 (en) * 2004-02-09 2005-08-11 Cheng-Jen Lin Method of extracting isoflavon from soybeans
US7506216B2 (en) * 2004-04-21 2009-03-17 International Business Machines Corporation System and method of workload-dependent reliability projection and monitoring for microprocessor chips and systems
EP1807760B1 (en) * 2004-10-25 2008-09-17 Robert Bosch Gmbh Data processing system with a variable clock speed
US8189484B2 (en) * 2005-04-26 2012-05-29 Reich Jr Richard D System for data archiving and system behavior prediction
US7412353B2 (en) * 2005-09-28 2008-08-12 Intel Corporation Reliable computing with a many-core processor
US7617427B2 (en) * 2005-12-29 2009-11-10 Lsi Corporation Method and apparatus for detecting defects in integrated circuit die from stimulation of statistical outlier signatures
FR2897173A1 (en) * 2006-02-03 2007-08-10 St Microelectronics Sa METHOD FOR DETECTING DYSFUNCTION IN A STATE MACHINE

Also Published As

Publication number Publication date
US8094706B2 (en) 2012-01-10
KR101114054B1 (en) 2012-02-29
WO2008122459A3 (en) 2008-11-27
US20080253437A1 (en) 2008-10-16
JP5181312B2 (en) 2013-04-10
KR20090130167A (en) 2009-12-18
JP2010524101A (en) 2010-07-15

Similar Documents

Publication Publication Date Title
US8094706B2 (en) Frequency-based, active monitoring of reliability of a digital system
US7495519B2 (en) System and method for monitoring reliability of a digital system
US7707461B2 (en) Digital media drive failure prediction system and method
US8340923B2 (en) Predicting remaining useful life for a computer system using a stress-based prediction technique
KR100888271B1 (en) Device having a capability of detecting deterioration of an air-flow generating capability of fan, cooling function monitoring apparatus, and fan deterioration monitoring program storing medium
CN102812373B (en) Semiconductor device and detection method
EP3737953A1 (en) Integrated circuit workload, temperature and/or sub-threshold leakage sensor
JP4500063B2 (en) Electronic device, prediction method, and prediction program
US9117011B2 (en) Characterization and functional test in a processor or system utilizing critical path monitor to dynamically manage operational timing margin
JP2004150439A (en) Method for determining failure phenomenon
JP2008250594A (en) Device diagnostic method, device-diagnosing module and device with device-diagnosing module mounted thereon
US20070220340A1 (en) Using a genetic technique to optimize a regression model used for proactive fault monitoring
JPWO2009011028A1 (en) Electronic device, host device, communication system, and program
JP2004156616A (en) Method of diagnosing prognosis of turbine blade (bucket) by monitoring health condition thereof by using neural network utilizing diagnosing method in relation to pyrometer signals
US20100217562A1 (en) Operating parameter control of an apparatus for processing data
US7725285B2 (en) Method and apparatus for determining whether components are not present in a computer system
US7984353B2 (en) Test apparatus, test vector generate unit, test method, program, and recording medium
US7483816B2 (en) Length-of-the-curve stress metric for improved characterization of computer system reliability
US8120379B2 (en) Operating characteristic measurement device and methods thereof
US6620639B1 (en) Apparatus to evaluate hot carrier injection performance degradation and method therefor
CN117093821B (en) Energy efficiency and water efficiency measuring system and method for washing machine
KR102538542B1 (en) Method and apparatus for diagnosis of motor using current signals
CN117637004A (en) Test result data-based test index optimization method
EP4070315A1 (en) Memory device degradation monitoring

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 08716955; Country of ref document: EP; Kind code of ref document: A2
WWE WIPO information: entry into national phase. Ref document number: 1020097015211; Country of ref document: KR
ENP Entry into the national phase. Ref document number: 2010502470; Country of ref document: JP; Kind code of ref document: A
NENP Non-entry into the national phase. Ref country code: DE
122 Ep: PCT application non-entry in European phase. Ref document number: 08716955; Country of ref document: EP; Kind code of ref document: A2