US 6725208 B1

Abstract

An optimization system is provided utilizing a Bayesian neural network calculation of a derivative wherein an output is optimized with respect to an input utilizing a stochastical method that averages over many regression models. This is done such that constraints from first principles models are incorporated in terms of prior distributions.
Claims (80)

1. A method for determining the optimum operation of a system, comprising the steps of:
receiving the outputs of the system and the measurable inputs to the system; and
optimizing select ones of the outputs as a function of the inputs by minimizing an objective function J to provide optimal values for select ones of the inputs;
wherein the step of optimizing includes the step of predicting the select ones of the outputs with a plurality of models of the system, each model operable to map the inputs through a representation of the system to provide predicted outputs corresponding to the select ones of the outputs which predicted outputs of each of the plurality of models are combined in accordance with a predetermined combination algorithm to provide a single output corresponding to each of the select ones of the outputs.
2. The method of
3. The method of
4. The method of
5. The method of
6. The method of
7. The method of
determining the average predicted output of the plurality of models <y(t)>;
determining the average derivative of the average predicted output <y(t)> with respect to the inputs x(t) as ∂<y(t)>/∂x(t);
the objective function J being a function of <y(t)> and determining a derivative of the objective function J with respect to <y(t)> as ∂J/∂<y(t)>;
determining with the chain rule the relationship ∂J/∂x(t); and
determining the minimum of the objective function J.
8. The method of
9. A method for optimizing the parameters of a system having a vector input x(t) and a vector output y(t), comprising the steps of:
storing a representation of the system in a plurality of models, each model operable to map the inputs through a representation of the system to provide a predicted output, each of the models operable to predict the output of the system for a given input value of x(t),
providing predetermined optimization objectives; and
determining a single optimized input vector value {circumflex over (x)}(t) by applying a predetermined optimization algorithm to the plurality of models to achieve a minimum error to the predetermined optimization objective.
10. The method of
11. The method of
12. The method of
where P(y^{(i)}|x^{(i)}, ω) is the likelihood, P(ω) is a prior distribution of the parameters ω of the model, and their product is the posterior distribution.
13. The method of
14. The method of
15. The method of
16. The method of
17. The method of
18. The method of
19. The method of
where P(y^{(i)}|x^{(i)}, ω) is the likelihood, P(ω) is a prior distribution of the parameters ω of the model, and their product is the posterior distribution.
20. The method of
determining the weighted average of the predicted output of each of the models by the following relationship:
where P(y^{(i)}|x^{(i)}, ω)P(ω) represents the posterior probability of the model indexed by w, and N_{w} represents the maximum number of models in the stochastic relationship, and wherein the stored representation of the system in each of the plurality of models is related in such a manner wherein the parameters of each of the models are stochastically related to each other;
determining the derivatives ∂J/∂<y(t)> as the variation of the predetermined optimization objective with respect to the predicted output y(t); and
determining by the chain rule the following:
21. A method for determining the dynamic operation of a system, comprising the steps of:
receiving the outputs of the system and the measurable inputs to the system; and
optimizing select ones of the outputs as a function of the inputs over a future horizon by minimizing an objective function J to achieve a predetermined desired setpoint to provide optimal values for select ones of the inputs over a trajectory to the desired setpoint in incremental time intervals;
wherein the step of optimizing includes the step of predicting as predicted outputs the select ones of the outputs over the trajectory at each of the incremental time intervals from the current value to the setpoint with a plurality of models of the system, each model operable to map the inputs through a representation of the system to provide predicted outputs corresponding to the select ones of the outputs, which predicted outputs of the plurality of models are averaged.
22. The method of
23. The method of
24. The method of
25. The method of
26. The method of
determining the average predicted output of the plurality of models <y(t)>;
determining the average derivative of the average predicted output <y(t)> with respect to the inputs x(t) as ∂<y(t)>/∂x(t);
the objective function J being a function of <y(t)> and determining a derivative of the objective function J with respect to <y(t)> as ∂J/∂<y(t)>;
determining with the chain rule the relationship ∂J/∂x(t); and
determining the minimum of the objective function J.
27. The method of
28. A method for optimizing the parameters of a system having a vector input x(t) and a vector output y(t) and with respect to the dynamic operation thereof from a current operating point to a desired setpoint for the output y(t), comprising the steps of:
storing a representation of the system in a plurality of models, each of the models operable to predict the output of the system for a given input value of x(t);
providing predetermined optimization objectives; and
determining a single optimized input vector value {circumflex over (x)}(t) for each of a plurality of time increments between the current value and the desired setpoint over a future horizon by applying a predetermined optimization algorithm to the plurality of models to achieve a minimum error to the predetermined optimization objective at each of the plurality of time increments between the current value and the desired setpoint over the future horizon, each model operable to map the inputs through a representation of the system to provide a predicted output.
29. The method of
30. The method of
31. The method of
where P(y^{(i)}|x^{(i)}, ω) is the likelihood, P(ω) is a prior distribution of the parameters ω of the model, and their product is the posterior distribution.
32. The method of
33. The method of
34. The method of
35. The method of
36. The method of
37. The method of
38. The method of
39. The method of
where P(y^{(i)}|x^{(i)}, ω) is the likelihood, P(ω) is a prior distribution of the parameters ω of the model, and their product is the posterior distribution.
40. The method of
determining the weighted average of the predicted outputs of each of the models at each of the increments of time by the following relationship:
where P(y^{(i)}|x^{(i)}, {right arrow over (ω)})P({right arrow over (ω)}) represents the posterior probability of the model indexed by w, and N_{w} represents the maximum number of models in the stochastic relationship, and wherein the stored representation of the system in each of the plurality of non-linear or linear networks is related in such a manner wherein the parameters of each of the non-linear or linear networks are stochastically related to each other;
determining the derivatives ∂J/∂<y(t)> as the variation of the predetermined optimization objective with respect to the output y(t) at each of the plurality of time increments between the current value and the setpoint; and
determining by the chain rule the following:
41. An optimizing system for determining the optimum operation of a system, comprising:
an input for receiving the outputs of the system and the measurable inputs to the system; and
an optimizer for optimizing select ones of the outputs as a function of the inputs by minimizing an objective function J to provide optimal values for select ones of the inputs;
said optimizer including a plurality of models of the system, each model operable to map the inputs through a representation of the system to provide predicted outputs corresponding to the select ones of the outputs, each of the models for predicting the select ones of the outputs of the system, which predicted outputs of said plurality of models are combined in accordance with a predetermined combination algorithm to provide a single predicted output corresponding to each of the select ones of the outputs.
42. The optimizing system of
43. The optimizing system of
44. The optimizing system of
45. The optimizing system of
46. The optimizing system of
means for determining the average predicted output of the plurality of models <y(t)>;
means for determining the average derivative of the average predicted output <y(t)> with respect to the inputs x(t) as ∂<y(t)>/∂x(t);
the objective function J being a function of <y(t)> and means for determining a derivative of the objective function J with respect to <y(t)> as ∂J/∂<y(t)>;
means for determining with the chain rule the relationship ∂J/∂x(t); and
means for determining the minimum of the objective function J.
47. The optimizing system of
48. The optimizing system of
49. An optimizing system for optimizing the parameters of a system having a vector input x(t) and a vector output y(t), comprising:
a plurality of models of the system, each for storing a representation of the system, each of said models operable to predict the output as a predicted vector output of the system for a given input value of x(t), each model operable to map the inputs x(t) through a representation of the system to provide a predicted output vector corresponding to the vector output y(t); and
an optimizer for determining a single optimized input vector value {circumflex over (x)}(t) by applying a predetermined optimization algorithm to the plurality of models to achieve a minimum error to a predetermined optimization objective for the predicted output vectors for each of the models.
50. The optimizing system of
51. The optimizing system of
52. The optimizing system of
where P(y^{(i)}|x^{(i)}, ω) is the likelihood, P(ω) is a prior distribution of the parameters ω of said associated one of said models, and their product is the posterior distribution.
53. The optimizing system of
54. The optimizing system of
55. The optimizing system of
56. The optimizing system of
57. The optimizing system of
58. The optimizing system of
59. The optimizing system of
where P(y^{(i)}|x^{(i)}, ω) is the likelihood, P(ω) is a prior distribution of the parameters ω of each of said models, and their product is the posterior distribution.
60. The optimizing system of
means for determining the weighted average of the predicted output of each of said models by the following relationship:
where P(y^{(i)}|x^{(i)}, ω)P(ω) represents the posterior probability of said each model indexed by w, and N_{w} represents the maximum number of said models in the stochastic relationship, and wherein said stored representation of the system in each of said plurality of models is related in such a manner wherein the parameters of each of said models are stochastically related to each other;
means for determining the derivatives ∂J/∂<y(t)> as the variation of the predetermined optimization objective with respect to the output y(t); and
means for determining by the chain rule the following:
61. An optimizing system for determining the dynamic operation of a system, comprising:
an input for receiving the outputs of the system and the measurable inputs to the system; and
an optimizer for optimizing select ones of the outputs as a function of the inputs over a future horizon by minimizing an objective function J to achieve a predetermined desired setpoint to provide optimal values for select ones of the inputs over a trajectory to the desired setpoint in incremental time intervals;
said optimizer operable to predict the select ones of the outputs over the trajectory at each of the incremental time intervals from the current value to the setpoint with a plurality of models of the system, which predicted outputs of said plurality of models are combined in accordance with a predetermined combination algorithm to provide a single predicted output corresponding to each of the select ones of the outputs.
62. The optimizing system of
63. The optimizing system of
64. The optimizing system of
65. The optimizing system of
66. The optimizing system of
means for determining the average predicted output of said plurality of models <y(t)>;
means for determining the average derivative of the average predicted output <y(t)> with respect to the inputs x(t) as ∂<y(t)>/∂x(t);
the objective function J being a function of <y(t)> and means for determining a derivative of the objective function J with respect to <y(t)> as ∂J/∂<y(t)>;
means for determining with the chain rule the relationship ∂J/∂x(t); and
means for determining the minimum of the objective function J.
67. The optimizing system of
68. An optimizing system for optimizing the parameters of a system having a vector input x(t) and a vector output y(t) and with respect to the dynamic operation thereof from a current operating point to a desired setpoint for the output y(t), comprising:
a plurality of models, each for storing a representation of the system, each of said models operable to predict the output of the system for a given input value of x(t), each model operable to map the vector input x(t) through a representation of the system to provide a predicted output vector corresponding to the vector output y(t); and
an optimizer for determining a single optimized input vector value {circumflex over (x)}(t) for each of a plurality of time increments between the current value and the desired setpoint over a future horizon by applying a predetermined optimization algorithm to the plurality of models to achieve a minimum error to a predetermined optimization objective at each of the plurality of time increments between the current value and the desired setpoint over the future horizon.
69. The optimizing system of
70. The optimizing system of
71. The optimizing system of
where P(y^{(i)}|x^{(i)}, ω) is the likelihood, P(ω) is a prior distribution of the parameters ω of each of said models, and their product is the posterior distribution.
72. The optimizing system of
73. The optimizing system of
74. The optimizing system of
75. The optimizing system of
76. The optimizing system of
77. The optimizing system of
{circumflex over (x)}(t) by determining the derivative of the predetermined optimization objective relative to the input vector x(t) as ∂J/∂x(t), where J represents the predetermined optimization objective between the current value and the desired setpoint.
78. The optimizing system of
79. The optimizing system of
where P(y^{(i)}|x^{(i)}, ω) is the likelihood, P(ω) is a prior distribution of the parameters ω of each of said models, and their product is the posterior distribution.
80. The optimizing system of
means for determining the weighted average of the predicted outputs of each of said models at each of the increments of time by the following relationship:
where P(y^{(i)}|x^{(i)}, {right arrow over (ω)})P({right arrow over (ω)}) represents the posterior probability of said associated one of said models indexed by w, and N_{w} represents the maximum number of said models in the stochastic relationship, and wherein the stored representation of the system in each of said models is related in such a manner wherein the parameters of each of said models are stochastically related to each other;
means for determining the derivatives ∂J/∂<y(t)> as the variation of the predetermined optimization objective with respect to the predicted output y(t) at each of the plurality of time increments between the current value and the setpoint; and
means for determining by the chain rule the following:
Description The present application is a Continuation-in-Part of, and claims priority in, U.S. Provisional Patent Application Serial No. 60/103,269, entitled Bayesian Neural Networks For Optimization and Control, and filed Oct. 6, 1998. The present invention pertains in general to neural networks for use with optimization of plants and, more particularly, to the use of Bayesian-trained neural networks for optimization and control. In general, modeling techniques for a plant involve the generation of some type of model. This is typically done utilizing a single model, either a linear model or a non-linear model. However, another technique of generating a model is to utilize a plurality of models that can be utilized to define a predicted vector output y(t) of values y Given a set n of measured process data points:
and assuming that an underlying mapping of the form y(t) = F(x(t))
exists, a stochastical method for generating y(t) with respect to x(t) can be defined by averaging over many (non-linear) regression models F The present invention disclosed and claimed herein comprises a method for optimizing a system in which a plant is provided for optimization. A training network having an input layer for receiving inputs to the plant, an output layer for outputting predicted outputs, and a hidden layer for storing a learned representation of the plant for mapping the input layer to the output layer is also provided. A method for training the neural network in utilizing the stochastical method of a Bayesian-type is provided. In another aspect of the present invention, a method utilizing the network in an optimization mode in feedback from the output of the plant to the input of the plant to optimize the output with respect to the input via the stochastical Bayesian method is provided. For a more complete understanding of the present invention and the advantages thereof, reference is now made to the following description taken in conjunction with the accompanying Drawings in which: FIG. 1 illustrates a block diagram of the present invention utilizing the optimizer of the disclosed embodiment; FIG. 2 illustrates a diagrammatic view of the optimizer of FIG. 1; FIG. 3 a block diagram of the combination of the models utilizing a weighted average; FIG. 4 illustrates a diagram depicting the training operation for the network; FIG. 5 illustrates a process flow for the training operation of the multiple models; FIG. 6 illustrates a block diagram of the optimizer wherein a single optimized value is determined averaged over all of the models; FIG. 7 illustrates a block diagram depicting the optimizer wherein each model is optimized and then the optimized values averaged; FIG. 8 illustrates a diagram for projecting a prediction over a horizon for a dynamic model in accordance with the disclosed embodiment; FIG. 8 FIG. 8 FIG. 
9 illustrates a diagrammatic view of the optimization process for control; FIG. 10 illustrates a block diagram of the plant utilizing a multiple model feedback control network for predicting a trajectory over the control horizon; FIGS. 11 and 12 illustrate block diagrams of an the implementation of the network for a dynamic model; FIG. 13 illustrates a block diagram of the dynamic model corresponding to FIG. 8; FIG. 14 illustrates a block diagram of the dynamic model utilizing a steady state model to fix the gain; and FIG. 15 illustrates an alternate embodiment of the dynamic model of FIG. Referring now to FIG. 1, there is illustrated a block diagram of the system of the disclosed embodiment for optimizing/controlling the operation of a plant. A plant The values for operating the plant in the form of the x(t) variables are generated by a controller The optimizer Referring now to FIG. 2, there is illustrated a block diagram of the optimizer
This will provide a plurality of the models
There are “w” of these models, such that there are also y The method of optimizing an output with respect to inputs described hereinabove, with the option of being subject to constraints from first principles models, provides some advantages over the standard neural network methods primarily by giving rise to high quality solutions in the system identification phase in a parameter insensitive way that avoids overfitting. Furthermore, by having a clean statistical interpretation, the approach easily lends itself to estimating confidence levels and related quantities. The prediction operation will be described for the stochastic method in a more detailed manner in the following. The data is contained in a dataset D with an index n representing the portion thereof that is associated with training. Indices exceeding n (n+1, n+2, . . . ) refers to data not included in the training process, this being the testing data, and no index refers to an arbitrary data point. Subscripted values x In the first step, it is necessary to predict y where P(ω|D) is the conditional probability (called the posterior) for the model F where is the likelihood, P(ω) is a prior distribution of the parameters or model weights ω, and their product is the posterior distribution. Assuming (not necessary) also a Gaussian distribution for the likelihood distribution of the weights of the model, the average predicted output relationship is as follows: where: the first term representing the summed square error over the dataset D with n being the number of patterns, and the second term corresponding to the prior penalizes large weights (a regularizer). The third term, also part of the prior, is written as a generic constraint that could include, for instance, fitting on different levels to first principles knowledge. The value of i ranges over the dataset D, with n being the number of patterns. Referring now to FIG. 
3, there is illustrated a block diagram of the average predicted output y In the situation wherein the models are utilized in a feedback mode, i.e., for the purpose of predicting input values, and which feedback is utilized for control, the gain is an important factor. Therefore, during the training of the model, it is necessary to take into account the gain as one constraint on the training. This is reflected in the term H(ω, D) which, when gain constraints are considered, results in the following relationship: where f( ) measures whether the argument satisfies the known constraint, and the index i in the sum indicates the x The models F Referring now to FIG. 4, there is illustrated a diagrammatic view for the “thermalized” static behavior utilizing the training operation and, also referring to FIG. 5, there is illustrated a diagrammatic flow for the training operation. Both of FIGS. 4 and 5 will be referred to. With specific reference to FIG. 5, the training operation is initiated by a set of weights ω The steps noted hereinabove, between blocks In addition to being able to compute y(t)=F (1) sensitivity analysis, and (2) optimization and/or control Each of the N The average derivatives, weighted over the ensemble of models, can then be calculated by the following relationship: In this relationship the values for derivatives are averaged over the models. To reduce computation time, it may be desirable to estimate Equation (10) instead of computing it fully. The best single-term estimate would be the one with the largest posterior (or probability weighting factor) for weighting the gains: In Bayesian terminology, any such estimate is called the MAP (maximum a posteriori) estimate. In order for this MAP estimate to significantly reduce computing time, it would be necessary to have access to the ensemble of models already sorted in posterior magnitude order: a sorted index to the models at the completion of the training procedure could quickly and easily be created. 
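The posterior-weighted average prediction of FIG. 3 and the ensemble-averaged derivative of Equation (10) can be sketched as follows. The dataset, candidate models, and weight-decay coefficient below are illustrative assumptions rather than values from the patent; the posterior weights are computed from the summed squared error plus the quadratic weight penalty described above, with the generic constraint term H(ω, D) omitted:

```python
import math

# Illustrative candidate models F_w(x) = w*x and a tiny dataset; the
# values below are assumptions for this sketch, not data from the patent.
DATA = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.1)]   # (x, y) pairs
CANDIDATES = [1.8, 2.0, 2.2]                  # parameter w of each model
ALPHA = 0.01                                  # weight-decay (prior) strength

def neg_log_posterior(w):
    """Summed squared error over the dataset (the likelihood term) plus
    a quadratic penalty on the weights (the Gaussian prior); the generic
    constraint term H(w, D) is omitted in this sketch."""
    sse = sum((y - w * x) ** 2 for x, y in DATA)
    return sse + ALPHA * w ** 2

# Unnormalized posterior probability of each model in the ensemble.
POSTS = [math.exp(-neg_log_posterior(w)) for w in CANDIDATES]
TOTAL = sum(POSTS)

def avg_prediction(x):
    """Average predicted output <y(t)>: the posterior-weighted mean of
    the individual model predictions F_w(x)."""
    return sum(p * w * x for p, w in zip(POSTS, CANDIDATES)) / TOTAL

def avg_derivative(x):
    """Average derivative d<y>/dx: the posterior-weighted mean of the
    per-model derivatives dF_w/dx (constant in x for linear models)."""
    return sum(p * w for p, w in zip(POSTS, CANDIDATES)) / TOTAL
```

The MAP estimate described in the text corresponds to keeping only the candidate with the largest entry in POSTS (here w = 2.0, which fits the data best).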
Since this would be done only once, the required computing time would be insignificant. Referring now to FIG. 6, there is illustrated a block diagram depicting the operation illustrated in Equation (10), the models This basic idea of estimating using the single MAP model can be iterated to improve the estimation to the desired level of accuracy. The second level estimate would consist of taking the model with the next highest posterior (next model in the indexed list) and averaging it with the first (the MAP) model, to yield a two-model average. This process could be iterated, incrementally improving the estimate, with some stopping criterion defined to halt the procedure. A stopping criterion might be to halt when the change in the estimate due to adding the next model is less than some threshold. The extreme of this process is of course the full sum of Equation (10). The above discussion of Equation (10) or its estimation, involved taking the derivative ∂y where < >D indicates the average over the dataset of vector points x In addition, statistics over the dataset other than the average can often yield useful information, such as the median, the standard deviation, and so forth. Process optimization ordinarily refers to determining the optimal input vector {circumflex over (x)}(t) that will minimize a defined objective function J while satisfying any defined constraint functions C “Optimization” typically means “steady-state optimization” (finding an optimal point in operating space using a steady-state model), while “control” typically means “dynamic control” (finding an optimal trajectory in operating space using a dynamic model). Both are “optimization problems.” In optimization or control, an optimization algorithm uses the process model to find the optimal {circumflex over (x)}(t), given the objective J and constraint C Nonlinear constrained optimization algorithms that make use of derivatives generally execute much faster than those that do not. 
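The incremental estimation procedure just described — start from the MAP model, then average in the remaining models in decreasing posterior order until the change in the estimate falls below a threshold — might be sketched as follows (the models and posterior weights here are hypothetical):

```python
def incremental_map_estimate(x, models, tol=1e-3):
    """Approximate the full ensemble average by adding models one at a
    time in decreasing posterior order, halting when the change in the
    estimate due to the next model is less than tol (the stopping
    criterion described in the text)."""
    ranked = sorted(models, key=lambda m: m[1], reverse=True)
    num = den = 0.0
    estimate = None
    for predict, posterior in ranked:
        num += posterior * predict(x)
        den += posterior
        refined = num / den
        if estimate is not None and abs(refined - estimate) < tol:
            return refined          # stopping criterion met early
        estimate = refined
    return estimate                 # fell through: the full sum

# Illustrative ensemble (assumed posterior weights, highest first = MAP).
models = [
    (lambda x: 2.00 * x, 0.6),
    (lambda x: 2.01 * x, 0.3),
    (lambda x: 2.50 * x, 0.1),
]
```

With a loose tolerance the procedure stops after the second model; with tol = 0 it reduces to the full sum of Equation (10).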
A variety of such nonlinear constrained optimization programs are commercially available. The most popular codes are based on the Sequential Quadratic Programming (SQP) or the Generalized Reduced Gradient (GRG) methods. A prototypical objective function is J=Σ In order to optimize the objective function J with respect to the input variable x(t), it is necessary to determine ∂J/∂x because the output variables <y Therefore, for purposes of (derivative-based) process optimization or control using Bayesian modeling, the (q, p) derivative matrix ∂y In each method, any nonlinear constrained optimization code, such as an SQP or GRG code, may be used to perform the optimization. Any such code searches the x-space for the x-value that will minimize the given J while satisfying the given C There are at least two fundamentally different ways that optimization over a Bayesian ensemble of models may be carried out. Roughly speaking, method (1) performs a single optimization over the entire ensemble, and method (2) performs multiple optimizations, one for each model in the ensemble, and when finished combines the results. Optimization Method (1): In this method, the optimization routine performs a single optimization over all of the models in the ensemble and returns a single optimal value for {circumflex over (x)}(t). When the optimizer requests the values of the functions and derivatives evaluated at a point x(t), a user-supplied subroutine must compute the derivative values ∂(y Referring now to FIG. 7, there is illustrated a block diagram of the first optimization method. In this method, the models Optimization Method (2): In this method, each model in the Bayesian ensemble is optimized separately, yielding an optimal {circumflex over (x)} In addition, the distribution of {circumflex over (x)}(t) values may hold useful information for process operation in addition to the single averages. 
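The two fundamental optimization methods can be contrasted on a toy problem with objective J = (<y> − y_target)^2, using plain unconstrained gradient descent as a stand-in for the commercial SQP/GRG codes; all gains, posterior weights, and the target below are assumed values:

```python
def descend(grad, x0, rate=0.05, steps=200):
    """Generic gradient descent, standing in for an SQP/GRG optimizer."""
    x = x0
    for _ in range(steps):
        x -= rate * grad(x)
    return x

# Illustrative ensemble of linear models y = g*x with posterior weights p.
models = [(1.8, 0.25), (2.0, 0.5), (2.2, 0.25)]   # (gain g, weight p)
TARGET = 4.0

# Method (1): a single optimization over the ensemble-average model,
# whose derivative d<y>/dx is the posterior-weighted average gain.
avg_gain = sum(p * g for g, p in models) / sum(p for _, p in models)
x_method1 = descend(lambda x: 2.0 * (avg_gain * x - TARGET) * avg_gain, 0.0)

# Method (2): optimize each model separately, then combine the per-model
# optima by the same posterior weighting.
optima = [(descend(lambda x, g=g: 2.0 * (g * x - TARGET) * g, 0.0), p)
          for g, p in models]
x_method2 = sum(p * x for x, p in optima) / sum(p for _, p in optima)
```

Note that the two methods need not agree: here method (1) yields exactly 2.0, while method (2) averages the individual optima 4/1.8, 4/2.0, and 4/2.2 to roughly 2.01.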
It should be understood that combinations of these two fundamental optimization methods are to be considered and that the disclosed embodiment is not exhaustive. Referring now to FIG. 8, there is illustrated a block diagram depicting the second method of optimization. In this diagrammatic view, the models Referring now to FIG. 8 The above discussion has been described with respect to steady-state process models. The indices k described hereinabove describe new data (n+k), whereas a dynamic system utilizes the index k to represent time intervals, which need not represent equally spaced time intervals. In general, a trajectory of output values {y(t+1) . . . y(t+k When using the above-described optimization/control procedures for dynamic models, which are iterated out to the control horizon in time, the optimization is performed over the entire trajectory (time interval (t+1, t+k Referring now to FIG. 9, there is illustrated a diagrammatic view of the trajectory of y Referring now to FIG. 10, there is illustrated a block diagram for the plant In utilizing dynamic models, the models can be of differing types. In the disclosed embodiment, the dynamic model is a linear model which is defined by the following relationship:
where y(t) is the output vector, u(t) is the input vector, d is a delay value, and a, b are the parameters of the linear model. One example of the linear model is set by the following relationship:
Although Equation (17) is set forth as a linear equation with a linear model, additional non-linear terms can be attached to this equation to result in a non-linear model. However, the parameters of this model are set by the a's and b's, i.e., the parameter values of the model. This also pertains to the gain, this described in detail in U.S. patent application Ser. No. 08/643,464, which was incorporated by reference hereinabove. When identifying the stochastically-related model via the various techniques described hereinabove, the disclosed one being the Bayesian technique, the models are trained in substantially the same way as the non-linear and neural networks, described with respect to the steady-state process hereinabove. This will yield w models which are stochastically related by the following relationship:
This will provide the models y where P(a, b|D) is a conditional probability for the model G where is the likelihood. P(a, b) is the prior distribution of the parameters (a, b) of the model, and their product is the posterior distribution, as was described hereinabove with respect to the steady-state case. All of the above-noted equations apply to the dynamic case. The only difference is that the input is now u(t) and the parameters of the model are (a, b), as compared to ω. In order to perform a sensitivity analysis or to perform an optimization and/or control, each of the N As noted hereinabove, it is then necessary to determine the average derivatives, weighted over the ensemble of models by the following relationship similar to Equation (21): Referring now to FIG. 11, there is illustrated a block diagram depicting the operation illustrated in Equation (22) for a dynamic model to determine the average derivative. This basically parallels the operation of the embodiment in the FIG. 6, described hereinabove with respect to steady-state models. There are provided a plurality of dynamic models Once the average derivative is determined for the dynamic model, then this can be optimized, utilizing the optimization method (1) or the optimization method (2) described hereinabove, except that a dynamic model is used. This is illustrated in FIG. 12 which parallels FIG. 7 for the static model. The derivatives of each of the models output from the derivative block In optimization method (2), the dynamic model representation is illustrated in FIG. 13, which parallels FIG. In U.S. patent application Ser. No. 08/643,464, incorporated herein by reference, there was disclosed a technique for defining the gain of dynamic models as a function of the gain of the steady-state neural network model. The gain of the steady-state model is referred to by the term “K Since the gain K This makes a dynamic model consistent with its steady-state counterpart, as described in U.S. 
patent application Ser. No. 08/643,464, which was incorporated by reference hereinabove. Therefore, each time the steady-state value changes such that the operating region of the steady-state model is different, this will correspond to a potentially different gain K

Referring now to FIG. 14, there is illustrated a block diagram of the optimizer Each of the blocks The b Although the index for the steady-state model

In an alternate embodiment, as illustrated in FIG. 15, the gain of a single steady-state model

In summary, there has been provided a method and apparatus by which a stochastical method is utilized for optimizing y(t) with respect to x(t) through the use of averaging over multiple regression models F

Although the preferred embodiment has been described in detail, it should be understood that various changes, substitutions and alterations can be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
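As a sketch of the dynamic-model gain relationship discussed above, the following assumes the second-order linear form y(t) = -a1*y(t-1) - a2*y(t-2) + b1*u(t-d-1) + b2*u(t-d-2) and rescales the b coefficients so that the dynamic steady-state gain matches a given steady-state gain K_ss. This is one plausible consistency scheme under that assumed form, not necessarily the exact procedure of the referenced application; all coefficient values are illustrative:

```python
def simulate(a, b, u, y0=(0.0, 0.0), d=0):
    """Second-order linear dynamic model
        y(t) = -a1*y(t-1) - a2*y(t-2) + b1*u(t-d-1) + b2*u(t-d-2),
    one example of the general linear form in the text."""
    a1, a2 = a
    b1, b2 = b
    y = list(y0)
    for t in range(2, len(u)):
        y.append(-a1 * y[t - 1] - a2 * y[t - 2]
                 + b1 * u[t - d - 1] + b2 * u[t - d - 2])
    return y

def dynamic_gain(a, b):
    """Steady-state gain k_d of the dynamic model: with y and u held
    constant, y*(1 + a1 + a2) = (b1 + b2)*u, so
    k_d = (b1 + b2) / (1 + a1 + a2)."""
    return (b[0] + b[1]) / (1.0 + a[0] + a[1])

def match_gain(a, b, k_ss):
    """Rescale the b coefficients so the dynamic gain equals the
    steady-state model's gain K_ss (an assumed consistency scheme)."""
    scale = k_ss / dynamic_gain(a, b)
    return (b[0] * scale, b[1] * scale)

a = (-0.5, 0.1)              # stable example dynamics (assumed values)
b = (0.2, 0.1)
b_matched = match_gain(a, b, k_ss=2.5)
```

A unit-step simulation of the rescaled model settles at the imposed gain, confirming that the dynamic model's steady-state behavior now agrees with its steady-state counterpart.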