US 20070136035 A1
Abstract
Systems, methods and recordable media for predicting multi-variable outcomes based on multi-variable inputs. Additionally, the models described can be used to predict the multi-variable inputs themselves, based on the multi-variable inputs, providing a smoothing function, acting as a noise filter. Both multi-variable inputs and multi-variable outputs may be simultaneously predicted, based upon the multi-variable inputs. The models find a critical subset of data points, or “tent poles,” to optimally model all outcome variables simultaneously to leverage communalities among outcomes.
Claims(3) 1-26. (canceled) 27. A method of generating a predictor model for predicting multivariable outcomes (a matrix of rows of Y-profiles) based upon multivariable inputs (a matrix of rows of X-profiles) with consideration of nuisance variables, said method comprising the steps of:
analyzing each X-profile row of multivariable inputs as an object; calculating similarity among the objects; selecting tent poles determined to be critical profiles in supporting a prediction function for predicting the Y-profiles; optimizing the number of tent poles to minimize the error between the X-profiles and the Y-profiles; and performing at least one of storing and outputting a prediction function for predicting the Y-profiles that results from said analyzing, calculating, selecting and optimizing wherein said Y-profiles are calculatable for continuous variables, logistic variables and ordinal variables. 28-35. (canceled)
Description
This application claims the benefit of U.S. Provisional Application No. 60/368,586, filed Mar. 29, 2002, which application is incorporated herein, in its entirety, by reference thereto.
The present invention relates to software, methods, and devices for evaluating correlations between observed phenomena and one or more factors having putative statistical relationships with such observed phenomena. More particularly, the software, methods, and devices described herein relate to the prediction of the suitability of new compounds for drug development, including predictions for diagnosis, efficacy, toxicity, and compound similarity among others. The present invention may also be applicable in making predictions relating to other complex, multivariate fields, including earthquake predictions, economic predictions, and others. For example, the transmission of seismic signals through a particular fault may exhibit significant changes in properties prior to fault shifting. One could use the seismic transmissions of the many small faults that are always active near major fault lines.
The application of statistical methods to the treatment of disease, through drug therapy, for example, provides valuable tools to researchers and practitioners for effective treatment methodologies based not only on the treatment regimen, but taking into account the patient profile as well. Using statistical methodologies, physicians and research scientists have been able to identify sources, behaviors, and treatments for a wide variety of illnesses. Thus, for example, in the developed world, diseases such as cholera have been virtually eliminated due in great part to the understanding of the causes of, and treatments for, these diseases using statistical analysis of the various risk and treatment factors associated with these diseases. The most widely used statistical methods currently used in the medical and drug discovery fields are generally limited to conventional regression methods which relate clinical variables obtained from patients being treated for a disease with the probable treatment outcomes for those patients, based upon data relating to the particular drug, drugs or treatment methodology being performed on that patient. For example, logistic regression methods are used to estimate the probability of defined outcomes as impacted by associated information. Typically, these methods utilize a sigmoidal logistic probability function (Dillon and Goldstein 1984) that is used to model the treatment outcome. The values of the model's parameters are determined using maximum likelihood estimation methods. The nonlinearity of the parameters in the logistic probability function, coupled with the use of the maximum likelihood estimation procedure, makes logistic regression methods complicated. Thus, such methods are often ineffective for complex models in which interactions among the various clinical variables being studied are present, or where multivariable characterizations of the outcomes are desired, such as when characterizing an experimental drug.
In addition, the coupling of logistic and maximum likelihood methods limits the validation of logistic models to retrospective predictions that can overestimate the model's true abilities. Such conventional regression models can be combined with discriminant analysis to consider the relationships among the clinical variables being studied to provide a linear statistical model that is effective to discriminate among patient categories (e.g., responder and non-responder). Often these models comprise multivariate products of the clinical data being studied and utilize modifications of the methods commonly used in the purely regression-based models. In addition, the combined regression/discriminant models can be validated using prospective statistical methods in addition to retrospective statistical methods to provide a more accurate assessment of the model's predictive capability. However, these combined models are effective only for limited degrees of interactions among clinical variables and thus are inadequate for many applications.
The Similarity Least Square Modeling Method (SMILES) disclosed in U.S. Pat. No. 5,860,917 (of which the present inventor is a co-inventor), and which is hereby incorporated, in its entirety, by reference thereto, is capable of predicting an outcome (Y) as a function of a profile (X) of related measurements and observations based on a viable definition of similarity between such profiles. SMILES fails, however, to provide a means to effectively handle multiple outcome variables or outcomes of different types. For multiple outcome variables, or Y-variables, SMILES analyzes each Y-variable separately as independent measurements or observations. Thus, one obtains a separate model for each Y-variable. When the Y-variables measure the same phenomena, they likely have induced interdependencies or communalities. It becomes difficult to perform analysis with separate independent models.
Nuisance and noise factors complicate this task even further.
What is needed, therefore, are methods of providing statistically meaningful models for analyzing the Y-variables as an ensemble of related observations, to produce a common model for all Y-variables as a function of multiple X-variables to obtain a more efficient model with better leverage on common phenomena and less noise.
The present invention includes systems, methods and recordable media for predicting multi-variable outcomes based on multi-variable inputs. In one aspect of the invention, a predictor model is generated by: a) defining an initial model as Model Zero and inputting Model Zero as one or more initial columns of a similarity matrix T; b) performing an optimization procedure (e.g., least squares regression or other linear regression procedure, nonlinear regression procedure, maximum entropy procedure, mini-max entropy procedure or other optimization procedure) to solve for matrix values of an α matrix which is a transformation of outcome profiles associated with input profiles; c) calculating a residual matrix ε based on the difference between the actual outcome values and the predicted outcome values determined through a product of matrix T and matrix α; d) selecting a row of the residual matrix ε which contains an error value most closely matching a pre-defined error criterion; e) identifying a row from a matrix of the multivariable inputs which corresponds to the selected row from the residual matrix ε; f) calculating similarity values between the identified row and each of the rows in the matrix of the multivariable inputs, including the identified row with itself; g) populating the next column of similarity matrix T with the calculated similarity values if it is determined that such column of the identified row is not collinear or nearly collinear with Model Zero and columns of previously identified rows, the similarity values for which were used to populate such
previous columns of similarity matrix T; and h) repeating steps b) through g) until a predefined stopping criterion has been reached.
In another aspect of the present invention, the predictor model may be used to predict multi-variable outcomes for multi-variable input data of which the outcomes are not known. In another aspect of the present invention, the model learns to represent a process from process profile data such as process input, process output, process parameters, process controls and/or process metrics, so that the trained model is useful for process optimization, model-based process control, statistical process control and/or quality assurance and control. In another aspect of the present invention, a model may be used to self-predict multi-variable profiles, wherein the input multivariable profiles are used to predict the input multivariable profiles themselves as multi-variable outputs. In another aspect of the present invention, the self-prediction model is used iteratively to impute data values to missing data values in the multivariable input profiles. In another aspect of the present invention, a model is used to simultaneously predict both multi-variable X-input profiles and multi-variable Y-output profiles based on the multi-variable X-input profiles. In another aspect, Y-columns may be similarity values of a select subset of the original Y-variables by analogy to S-columns as similarity values of the X-variables. In another aspect of the present invention, score functions may be optimally assigned to the predicted multi-variable outcomes for use in any multivariate distribution process, such as ordinal, logistic, and survival probability analysis and predictions. In yet another aspect, the identified rows, also described as math-functional “tent pole” locations, may be tested for ellipticity as a function of the X-space, using the Marquardt-Levenberg algorithm, and then ranked according to the testing.
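Steps a) through h) can be illustrated with a minimal sketch. It is not the patented implementation: the Gaussian similarity function, the `decay` parameter, the intercept-only Model Zero, the maximum-absolute-error selection rule, the `max_r2` collinearity threshold, and all function names are assumptions chosen only for illustration.

```python
import numpy as np

def similarity_column(X, pole, decay=1.0):
    # Assumed similarity: Gaussian decay of squared Euclidean distance to the pole.
    d2 = np.sum((X - pole) ** 2, axis=1)
    return np.exp(-d2 / decay)

def select_tent_poles(X, Y, max_poles=5, tol=1e-8, max_r2=0.99, decay=1.0):
    n = X.shape[0]
    T = np.ones((n, 1))                  # a) Model Zero (assumed: one intercept column)
    poles = []
    tried = np.zeros(n, dtype=bool)      # rows already accepted or rejected
    while len(poles) < max_poles and not tried.all():
        alpha, *_ = np.linalg.lstsq(T, Y, rcond=None)   # b) least squares for alpha
        eps = Y - T @ alpha                             # c) residual matrix
        row_err = np.abs(eps).max(axis=1)
        row_err[tried] = -np.inf
        i = int(np.argmax(row_err))                     # d), e) worst-fit row
        if row_err[i] < tol:                            # h) stopping criterion
            break
        tried[i] = True
        col = similarity_column(X, X[i], decay)         # f) similarity to every row
        # g) reject the column if the existing columns of T already explain it
        coef, *_ = np.linalg.lstsq(T, col, rcond=None)
        ss_res = np.sum((col - T @ coef) ** 2)
        ss_tot = np.sum((col - col.mean()) ** 2)
        if ss_tot == 0.0 or 1.0 - ss_res / ss_tot > max_r2:
            continue
        T = np.column_stack([T, col])
        poles.append(i)
    alpha, *_ = np.linalg.lstsq(T, Y, rcond=None)       # final alpha for the kept columns
    return poles, T, alpha
```

Here `X` is an N×n array of X-profiles and `Y` an N×M array of Y-profiles; the returned `poles` are row indices of the selected profiles. A different stopping rule or similarity definition could be substituted without changing the overall loop.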
Still further, the present invention may include determining one or more decay constants for each of the identified rows of X-profiles (tent pole locations) used to calculate similarity values to populate the T matrix (similarity matrix).
Methods, systems and recordable media are disclosed for generating a predictor model for predicting multi-variable outcomes (a matrix of rows of Y-profiles) based upon multivariable inputs (a matrix of rows of X-profiles) with consideration of nuisance or noise variables, by analyzing each X-profile row of multivariable inputs as an object; calculating similarity among the objects; selecting tent pole locations determined to be critical profiles in supporting a prediction function for predicting the Y-profiles; determining a maximum number of such profiles by model properties such as collinearity or max fit error or least squares sum of squared errors; and optimizing the final number of tent poles by prospective “true” prediction properties such as the minimum of the sum of squared “prospective errors or ensemble errors” between the Y-profile predictions and the known Y-profile value(s).
According to the present invention, the dimensions of the data can be reduced to a lower dimension as defined only by necessary critical components to represent the phenomenon being modeled. Hence, in general, the present invention is valuable to help researchers “see” the high-dimensional patterns from limited noisy data on complex phenomena that can involve multiple inputs and multiple consequential outputs (e.g., outcomes or responses). The present invention can optimize the model fit and/or the model predictions and provides diagnostics that measure the predictive and fit capabilities of a derived model. Input profile components may simultaneously be included as outcome variables and vice versa, thus enabling a nonlinear version of partial least squares that induces proper matrix-eigenvalue matching between input and output matrices.
Eigenvalue matching is well-practiced as linear transformations related to generalized singular value decompositions (GSVD). The present invention can also be used for self-prediction imputation and smoothing, e.g., predicting smoothed and missing values in input data based on key profiles in the input data. The present invention includes the capability to measure the relative importance of individual input variables to the prediction and fit process by nonlinear statistical parameters calculated by the Marquardt-Levenberg algorithm. The present invention can also associate decay constants with each location (tent pole), which is useful to quantify types and scopes of the influence of that profile on the model, i.e., local and/or global effect. The present invention finds a critical subset of data points to optimally model all outcome variables simultaneously to leverage both communalities among outcomes and uniqueness properties of each outcome. The method relates measured variables associated with a complex phenomenon using a simple direct functional process that eliminates artifactual inferences even if the data is sparse or limited and the variable space is high dimensional. The present invention can also be layered to model higher-ordered features, e.g., output of a GSMILES network can be input to a second GSMILES network. Such GSMILES networks may include feedback loops. If profiles include one or more ordered indices such as “time,” GSMILES networks can incorporate the ordering of such indices (i.e., “time” series). GSMILES also provides statistical evaluations and diagnostics of the analysis, in both retrospective and prospective scenarios. GSMILES reduces random noise by combining data from replicate and nearby adjacent information (i.e., pseudo-replicates).
Before the present invention is described, it is to be understood that this invention is not limited to particular statistical methods described, as such may, of course, vary.
It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting, since the scope of the present invention will be limited only by the appended claims. Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limits of that range is also specifically disclosed. Each smaller range between any stated value or intervening value in a stated range and any other stated or intervening value in that stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included or excluded in the range, and each range where either, neither or both limits are included in the smaller ranges is also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and systems similar or equivalent to those described herein can be used in the practice or testing of the present invention, the preferred methods and systems are now described. All publications mentioned herein are incorporated herein by reference to disclose and describe the methods and/or systems in connection with which the publications are cited. It must be noted that as used herein and in the appended claims, the singular forms “a”, “an”, and “the” include plural referents unless the context clearly dictates otherwise.
Thus, for example, reference to “a variable” includes a plurality of such variables and reference to “the column” includes reference to one or more columns and equivalents thereof known to those skilled in the art, and so forth. The publications discussed herein are provided solely for their disclosure prior to the filing date of the present application. Nothing herein is to be construed as an admission that the present invention is not entitled to antedate such publication by virtue of prior invention. Further, the dates of publication provided may be different from the actual publication dates which may need to be independently confirmed.
“Microarrays” measure the degree to which genes are expressed in a particular cell or tissue. One-channel microarrays attempt to estimate an absolute measure of expression. Two-channel microarrays compare two different cell types or tissues and output a measure of relative strength of expression.
“RTPCR” designates Real Time Polymerase Chain Reaction, and includes techniques such as Taqman™, for example, for high resolution gene expression profiling.
“Bioassays” are experiments that determine properties of biological systems and measure certain quantities. Microarrays are an example of bioassays. Other bioassays are fluorescence assays (which cause a cell to fluoresce if a certain biological event occurs) and yeast two-hybrids (which determine whether two proteins of interest bind to each other or not).
“Chemical data” include the chemical structure of compounds, chemical and physical properties of compounds (such as solubility, pH value, viscosity, etc.), and properties of compounds that are of interest in pharmacology, e.g., toxicity for particular tissues in particular species, etc.
“Process control” includes all methods such as feed-forward, feed-backward, and model-based control loops and policies used to stabilize, reduce noise, and/or control any process (e.g., production lines in factories), based on inherent correlations between systematic components and noise components of the process.
“Statistical process control” refers to statistical evaluation of process parameters and/or process-product parameters to verify process stability and/or product quality based on non-correlated noise.
“Genomics databases” contain nucleotide sequences. Nucleotide sequences include DNA (the information in the nucleus of eukaryotes that is propagated in cell division and is the basis for transcription), messenger RNA (the transcripts that are then translated into proteins), and ribosomal and transfer RNA (part of the translation machinery).
“Proteomics databases” contain amino acid sequences, both sequences inferred from genomic data and sequences found through various bioassays and experiments that reveal the sequences of proteins and peptides.
“Publications” include Medline (the collection of biomedical abstracts distributed by the National Library of Medicine), biomedical journals, journal articles from related fields, such as chemistry and ecology, or articles, books or any other published material in the field being examined, whether it be geology, economics, etc.
“Patent” includes U.S. patents and patents throughout the world, as well as pending patent applications that are published.
“Proprietary documents” include those documents which have not been published, or are not intended to be published.
“Medical data” include all data that are generated by diagnostic devices, such as urinalysis, blood tests, and data generated by devices that are currently under investigation for their diagnostic potential (e.g., microarrays, mass spectroscopy data, etc.).
“Patient records” are the records that physicians and nurses maintain to record a patient's medical history. Increasingly, information is captured electronically as a patient interacts with hospitals and practitioners. Any textual data captured electronically in this context may be part of patient records.
When one location is indicated as being “remote” from another, this means that the two locations are at least in different buildings, and these locations may be at least one mile, ten miles or at least one hundred miles apart.
“Transmitting” information refers to sending the data representing that information as electrical signals over a suitable communication channel (e.g., a private or public network).
“Forwarding” a result refers to any means of getting that result from one location to the next, whether by transmitting data representing the result or physically transporting a medium carrying the data or communicating the data.
A “result” obtained from a method of the present invention includes one directly or indirectly obtained from use of the present invention. For example, a directly obtained “result” may include a predictor model generated using the present invention. An indirectly obtained “result” may include a clinical diagnosis, treatment recommendation, or a prediction of patient response to a treatment which was made using a predictor model which was generated by the present invention.
The present invention provides methods and systems for extracting meaningful information from the rapidly growing amount of genomic and clinical data, using sophisticated statistical algorithms and natural language processing. The block diagram in The ETL module Once the ETL module extracts the data, it may transform the data with simple preprocessing steps. For example, the ETL module may normalize the data and filter out noise and non-relevant data points.
The ETL module then loads the data into the RDBMS (i.e., relational database management system) in a form that is usable in the GSMILES process, e.g., the input and output variables according to the GSMILES model. Specifically, the ETL module loads the extracted (and preferably preprocessed) data into the RDBMS in fields corresponding to the input and output variables for the entities to which the data relate. The ETL module may be run in two modes. If a data source is available permanently, data are processed in batch mode and stored in the RDBMS. If a data source is interactively supplied by the user, it will be processed interactively by the ETL module. The text mining module In one embodiment, text mining module In one embodiment, text mining module The Blast or Homology module Data interpretation module Client The client By processing the preprocessed data received from ETL Information may be exchanged with Text Index Module(s) One important aspect of the methods and systems disclosed concerns their use in the prediction of the suitability of new compounds for drug development. GSMILES predictor The present system may utilize the Generalized Similarity Least Squares (GSMILES) modeling method to reveal association patterns within genomic, proteomic, clinical, and chemical information and predict related outcomes such as disease state, response to therapy, survival time, toxic events, genomic properties, immune response/rejection level, and measures of kinetics/efficacy of single or multiple therapeutics. The GSMILES methodology performed by GSMILES module
The GSMILES Methodology
A useful method and system for extracting meaningful information from the genomic and clinical data requires an efficient algorithm, an effective model, helpful diagnostic measures and, most importantly, the capability to handle multiple outcomes and outcomes of different types.
The ability to handle multiple outcomes and outcomes of different types is necessary for many types of complex modeling. For example, genomic and clinical data are typically represented as related series of data values or profiles, requiring a multi-variate analysis of outcomes. The Similarity Least Square Modeling Method (SMILES) disclosed in U.S. Pat. No. 5,860,917 (of which the present inventor is a co-inventor, and which was incorporated by reference above), is capable of predicting an outcome (Y) as a function of a profile (X) of related measurements and observations based on a viable definition of similarity between such profiles. SMILES fails, however, to provide a means to effectively handle multiple outcome variables or outcomes of different types. For multiple outcome variables, or Y-variables, SMILES analyzes each Y-variable separately as independent measurements or observations. Thus, one obtains a separate model for each Y-variable. When the Y-variables measure the same phenomena, they likely have induced interdependencies or communalities. It becomes difficult to perform analysis with separate independent models. Nuisance and noise factors complicate this task even further.
GSMILES remedies this deficiency by analyzing the Y-variables as an ensemble of related observations. GSMILES produces a common model for all Y-variables as a function of multiple X-variables to obtain a more efficient model with better leverage on common phenomena with less noise. This aspect of GSMILES allows a user to find strategic gene-compound associations that involve multiple-X/multiple-Y variables on noisy cell functions or responses to stimuli.
GSMILES treats each profile of associated measurements of variables as an object with three classes of information: predictor/driver variables (X-variables), predictee/consequential variables (Y-variables), and nuisance variables (noise variables, known and unknown).
Note that these classes are not mutually exclusive; hence, a variable can belong to one or more of such GSMILES classes as dictated by each application. GSMILES calculates similarity among all such objects using a definition of similarity based on the X-variables. Note that similarity may be compound, e.g., a combination of similarity measures, where each similarity component is specific to a subset of profile X-variables. GSMILES uses such similarity values to predict the Y-variables. It selects a critical subset of objects that can optimally predict the Y-values of all objects within the precision limitations imposed by nuisance effects, assured by statistically valid criteria. An iterative algorithm as discussed below may make the selection. Affine prospective predictions of Y-profiles may be performed to predict profiles (i.e., row vectors) in the Y-outcome-variable matrix:
$$Z = S\,R \qquad (1)$$
where Z is an N×M matrix of predicted Y values (where N and M are positive integers); S is an N×P matrix of similarity values between profiles in matrix X (where N and P are positive integers, which may further include one or more columns of Model Zero values, as will be discussed below); and R is an X-nonlinear transformation of P Y-profiles associated with P strategic X profiles (also referred to as “α” values, below). The final prediction model according to this methodology is prospective, since each predicted row of Y in turn is used to estimate a prospective error, the sum of squares of which determines the optimal number of model terms by minimization. The transforms are optimized to minimize the least-squares error between Z and Y. Thus, R is a P×M matrix of P optimal transforms of Y-profiles and the similarity values in each row of S are the strategic affine coefficients for these optimal profiles to predict the associated row in Y. In this way, GSMILES not only represents Y efficiently, but reduces noise by functional smoothing.
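The relationship among Z, S and R described above (predicted Y-profiles as the product of the N×P similarity matrix and the P×M transform matrix, with R chosen by least squares) can be sketched as follows. The Gaussian similarity, the `decay` parameter, the function name `affine_fit`, and the choice of strategic rows are illustrative assumptions, not details taken from the source.

```python
import numpy as np

def affine_fit(X, Y, strategic_rows, decay=1.0):
    poles = X[strategic_rows]                            # P strategic X-profiles
    # Squared Euclidean distance from every X-profile to every strategic profile
    d2 = ((X[:, None, :] - poles[None, :, :]) ** 2).sum(axis=2)
    S = np.exp(-d2 / decay)                              # N x P similarity matrix (assumed form)
    R, *_ = np.linalg.lstsq(S, Y, rcond=None)            # P x M, minimizes ||Y - S R||
    Z = S @ R                                            # N x M predicted Y-profiles
    return S, R, Z
```

Each row of `S` supplies the coefficients that combine the P rows of `R` into one predicted Y-profile, which is the smoothing effect the text describes.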
Equation (1) can be easily transformed into a mixture representation by normalizing each row of S to sum to unity as follows:
$$Z = D\,S\,R \qquad (2)$$
where D is a diagonal matrix of the inverse of the sum of each row of matrix S. The GSMILES methodology finds the strategic locations in matrix X Referring to Each row of matrix With any set of data being analyzed, such as the data in matrix Conceptually speaking, if a function To solve for the critical profiles, an initial model (called Model Zero (Model 0)) is inputted to the system, in matrix T (See A least squares regression algorithm is next performed to solve for coefficients α
$$(Y - T\,\alpha) = \varepsilon \qquad (3)$$
where The error matrix ε resulting from processing, using the example shown in Whatever technique is used to determine the maximum absolute error, the row from which the maximum absolute error is noted and used to identify the row (X-profile) from matrix Assuming, for exemplary purposes, that the row from which the maximum absolute error was found in matrix ε was the seventh, GSMILES then identifies the seventh row in matrix The X-profile row selected for calculating the similarity values marks the location of the first critical profile or “tent pole” identified by GSMILES for the model. A least squares regression algorithm is again performed next, this time to solve for coefficients α Again, GSMILES determines the row of the ε matrix which has the maximum absolute value of error, in a manner as described above. Whatever technique is used to determine the maximum absolute error, the row from which the maximum absolute error is noted and used to identify the row (X-profile) from matrix Alternatively, GSMILES may continue iterations as long as no two identified tent poles have locations that are too close to one another so as to be statistically indistinct from one another, i.e., significantly collinear. Put another way, GSMILES will not use two tent poles which are highly correlated and hence produce highly correlated similarity columns, i.e., which are collinear or nearly collinear (e.g., correlation squared (R When a tent pole (row from matrix The last calculated α matrix (α profile from the last iteration performed by GSMILES) contains the values that are used in the model for predicting the Y-profile with an X-profile input. Thus, once GSMILES determines the critical support profiles and the α values associated with them, the model can be used to predict the Y-profile for a new X-profile.
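As a rough illustration of using a finished model on a new X-profile, the sketch below builds one row of T for the new profile (a Model Zero entry followed by its similarity to each tent pole) and multiplies it by the α matrix. The intercept-style Model Zero, the Gaussian similarity with a `decay` parameter, and the name `predict_profiles` are assumptions made for this example only.

```python
import numpy as np

def predict_profiles(X_new, X_poles, alpha, decay=1.0):
    X_new = np.atleast_2d(X_new)                         # allow a single profile
    # Squared distance from each new profile to each tent-pole profile
    d2 = ((X_new[:, None, :] - X_poles[None, :, :]) ** 2).sum(axis=2)
    S_new = np.exp(-d2 / decay)                          # assumed similarity function
    # Assumed Model Zero: a constant column placed ahead of the similarity columns
    T_new = np.column_stack([np.ones(len(X_new)), S_new])
    return T_new @ alpha                                 # predicted Y-profiles

# Two tent poles in a two-variable X-space, one Y-variable
X_poles = np.array([[0.0, 0.0], [1.0, 0.0]])
alpha = np.array([[1.0], [2.0], [3.0]])                  # rows: Model Zero, pole 1, pole 2
y_hat = predict_profiles([0.0, 0.0], X_poles, alpha)
```

In this toy case the new profile coincides with the first tent pole, so its similarity to that pole is 1 and the prediction is 1 + 2 + 3·exp(−1).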
Referring now to Again for simplicity, the example uses only a single X* profile, so that only a single row is added to the X-profile Because the X-profile matrix has been expanded to N+1 rows, Model Zero in this case will also contain N+1 components (i.e., is an (N+1)×1 vector) as shown in GSMILES then utilizes the α matrix to solve for the Y where, for this example, The error values will be within the acceptable range of permitted error designed into the GSMILES predictor according to the iterations performed in determining the tent poles as described above. Typically, GSMILES overfits the data, i.e., noise is fit as systematic effects when in truth it tends to be random effects. The GSMILES model is trimmed back to the minimum of the sum of squared prospective ensemble errors to optimize prospective predictions, i.e., to remove tent poles that contribute to overfitting of the model to the data used to create the model, where even the noise associated with this data will tend to be modeled with too many tent poles. Once the model is determined, the Z-columns of distribution-based U's are treated as linear score functions where the associated distribution, such as the binomial logistic model, for example, assigns probability to each of the score values. The initial such Y-score function is estimated by properties of the associated distribution, e.g., for a two-category logistic, assign the value +1 for one class and the value −1 for the other class. Another method uses a high-order polynomial in a conventional distribution analysis to provide the score vector. The high-order polynomial is useless for making any type of prediction, however. The GSMILES model according to the present invention predicts this score vector, thereby producing a model with high quality and effective prediction properties.
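The trimming step can be sketched under one simplifying assumption: that the "prospective" error of a row is its leave-one-out prediction error when the columns of T are held fixed and only α is refit. Under that assumption the leave-one-out residuals of a least squares fit follow from the ordinary residuals and the leverages (the PRESS identity), so no refitting loop is needed. The function names and the assumption that T has full column rank with all leverages below 1 are hypothetical choices for this example.

```python
import numpy as np

def prospective_sse(T, Y):
    # Reduced QR of T; assumes full column rank
    Q, _ = np.linalg.qr(T)
    leverage = np.sum(Q ** 2, axis=1)            # diagonal of the hat matrix
    resid = Y - Q @ (Q.T @ Y)                    # ordinary least squares residuals
    loo = resid / (1.0 - leverage)[:, None]      # leave-one-out residuals (PRESS identity)
    return float(np.sum(loo ** 2))

def trim_model(T, Y):
    # scores[k]: prospective error using Model Zero (column 0) plus the first k tent poles
    scores = [prospective_sse(T[:, : k + 1], Y) for k in range(T.shape[1])]
    return int(np.argmin(scores)), scores
```

`trim_model` returns the number of tent-pole columns to keep; columns beyond that count are the ones that mainly fit noise under this error measure.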
The GSMILES model can be further optimized by using the critical S-columns of the similarity matrix directly in the distributional optimization that could also include conventional X-variables and/or Model Zero. Hence, GSMILES provides a manageable set of high-leverage terms for distributional optimizations such as provided by generalized linear, mixed, logistic, ordinal, and survival model regression applications. In this fashion, GSMILES is not restricted to univariate binomial logistic distributions, because GSMILES can predict multiple columns of Y (in the Y-profile GSMILES can also fit disparate properties at the same time and provide score functions for them. For example, the Y columns may include distributional, text and continuous variables, all within the same matrix, which can be predicted by the model according to the present invention. GSMILES can also perform predictions and similarity calculations on textual values. When text variables are included in the X-profile and/or the Y-profile, similarity calculations are performed among the rows of text, so that similarity values are also placed into the Y-profile, where the regression is performed with both predictor similarity values and predictee similarity values (i.e., similarity values are inserted on both sides of the equation, both in the X-profile, as well as the Y-profile). The GSMILES methodology can also be performed on a basis of dissimilarity, by forming a dissimilarity matrix according to the same techniques described above. Since dissimilarity, or distance, has an inverse relationship to similarity, one of ordinary skill in the art would readily be able to apply the techniques disclosed herein to form a GSMILES model based upon dissimilarity between the rows of the X-profile.
Leave-One-Out Cross-Validation
When modeling according to the GSMILES methodology, as with any type of prediction model, both fit error (training error) and validation error (test error) are encountered.
In this case, fit error is the error that results in the ε matrix at the final iteration of determining the α matrix according to the above-described methodology, as GSMILES optimizes the training set (the N×n matrix). In general, to determine test or validation error, the model determined with the training set is applied to an independent set of data (the test or validation set) which has known Y-outcome values. The model is applied to the X-profile of the test set to determine the Y-profile. The calculated Y-profile is then compared with the known Y-profile to calculate the test or validation error, and the test or validation error is then examined to determine whether it is within the preset, acceptable range of error permitted by the model. If the test or validation error is within the predefined limits of the error range, then the model passes the validation test. Otherwise, it may be determined that the model needs further revision, or that other factors prevent the model from being used with the test profile. For example, the test profile may contain some X values that are outside the range of X-values on which the present model can effectively form predictions.
Some of the X-variables may have little association with the Y-profiles, and hence they contribute non-productive variations, thereby reducing the efficiency of the GSMILES modeling process. Hence, more data would be required to randomize out the useless variations of such non-productive X-variables. Optionally, one can identify and eliminate such noisy X-variables, since they tend to have very low rank via the Marquardt-Levenberg (ML) ranking method described in this document. To identify a rank threshold between legitimate and noisy X-variables, an intentionally noisy variable may be included in the X-profile and its ML rank noted.
Repetition of this procedure with alternate versions of the noisy X-column, e.g., by random number generation, produces a distribution of such noise ranks, whose statistical properties may be used to set an X-noise threshold.
The leave-one-out cross-validation technique involves estimating the validation error through use of the training set. As an example, one row is extracted from the training data set and held out. Using the altered training data set, an α matrix is solved for using the techniques described above with regard to the GSMILES least squares methodology. After determining the α matrix, this α matrix is then used to predict the outcome for the extracted row (i.e., the test set row). The same procedure may be carried out for each row of the original training data set.
For simplicity and clarity, standard notation is used in the following discussion, wherein a single variable denoted y is a function of a vector of variables denoted by x. Note that this x actually represents the T-rows in the GSMILES formalism referred to above. Without loss of generality, consider a single y-variable as a function of multiple x-variables. Consider the Leave-One-Out (LOO) cross-validation statistic for a model $f(x;\alpha)$ trained on a data set $D=\{(x_i, y_i)\}_{i=1}^{n}$, with $x_i \in \mathbb{R}^m$ and $y_i \in \mathbb{R}$. Removing a single data point $(x_i, y_i)$ results in a training set $D_i$ and a predictor $f_i(x, \alpha)$. The difference between the observation $y_i$ and what the model predicts in the absence of $(x_i, y_i)$ is $\varepsilon_i = y_i - f_i(x_i, \alpha)$. The Leave-One-Out (LOO) cross-validation statistic predicts the variance in this error:
Rather than evaluating LOO by retraining the model n times, a formulation which relates σ ^{m}. If the data matrix and response vector are defined as:
then the linear least squares solution α and corresponding residual ρ are:
$$\alpha = (X^T X)^{-1} X^T y, \qquad \rho = y - X\alpha = P y$$
where $P \equiv I - X(X^T X)^{-1} X^T$ is the n×n projection matrix. If the first data point is partitioned from the data matrix, the abbreviated training set defines a matrix The least squares solution of the truncated data set is:
The prediction error resulting from the removal of the first row is therefore:
^{T} x _{1} (16)
The relationships defined in equations (12), (13) and (14) are next used to replace For the sake of abbreviation, define F=(X Returning to the prediction error of equation (16) and substituting with the above developed relationship gives:
By noting that y From this it can be observed that the prediction error resulting from the removal of the first data point is the ratio of the first element of the residual and the first diagonal element of the projection matrix. Since any data point (x In order to compute σ When n is large, forming the projection matrix P in order to extract its diagonal elements is impractical, requiring n×n memory, which could exceed the limits of current hardware. It is also computationally expensive, making it infeasible to re-compute at every iteration k. Instead, the QR factorization of X Where X ^{n×k}, Q_{k}ε ^{n×n}, R_{k}ε ^{n×k}, ^{k×k}. _{k }is orthogonal. By design, it is also non-singular. Q_{k} ^{T }is a product of Householder matrices, as follows:
$$Q_k^T = H_k H_{k-1} \cdots H_1 \qquad (41)$$
Each Householder matrix depends only on the Householder vector $\nu_k \in \mathbb{R}^n$:
$$H_k = I - T_k \nu_k \nu_k^T \qquad (42)$$
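The leave-one-out result stated above (the prediction error for a removed point equals its residual divided by the corresponding diagonal element of the projection matrix) can be checked numerically. The following is a minimal sketch, assuming plain linear least squares on a single y-column; it is not the GSMILES implementation. The function names and the synthetic data are hypothetical, and the diagonal of P is obtained from NumPy's reduced QR factorization rather than from explicit Householder vectors, so that the n×n matrix P is never formed.

```python
import numpy as np

def loo_errors_qr(X, y):
    """Leave-one-out prediction errors for linear least squares, without retraining.

    Uses e_i = r_i / P_ii, where r is the ordinary residual and
    P = I - X (X^T X)^{-1} X^T. With the thin factorization X = Q R,
    P = I - Q Q^T, so P_ii = 1 - ||Q[i, :]||^2.
    """
    Q, _ = np.linalg.qr(X, mode="reduced")   # Q is n x k with orthonormal columns
    residual = y - Q @ (Q.T @ y)             # r = P y, computed without forming P
    p_diag = 1.0 - np.sum(Q * Q, axis=1)     # diagonal of P from the rows of Q
    return residual / p_diag

def loo_errors_brute_force(X, y):
    """Reference: refit n times, each time leaving one row out."""
    n = X.shape[0]
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        alpha_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        errors[i] = y[i] - X[i] @ alpha_i    # y_i minus prediction made without row i
    return errors

# Hypothetical synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=30)

fast = loo_errors_qr(X, y)
slow = loo_errors_brute_force(X, y)
loo_sum_of_squares = float(np.sum(fast ** 2))  # LOO sum of squared residuals for this y
```

The two routes agree to rounding error, while the QR route needs only one factorization and n×k memory. For several y-columns, the same `p_diag` can be reused for every column.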
In equation (42), $T_k$ is a scalar determined by the Householder vector $\nu_k$. The result is an LOO sum of squared residuals for every y-column in matrix Y. Optionally, using an ensemble error for each row produces an ensemble LOO sum of squared residuals, as is used by GSMILES.
Referring now to GSMILES calculates similarity among all objects at step Using the similarity values, GSMILES selects a critical subset of objects (identifying the locations of the tent poles) at step Upon identification of the tent pole locations and similarity values representing the tent poles, as well as an estimation of the X-nonlinear transformation (“α values”) of the Y-profiles associated with the strategic X-profiles (tent poles) by least squares regression or other optimization technique, GSMILES maximizes the number of tent poles at step After optimization of the model, the model is ready to be used in calculating predictions at step Next, the residuals (prediction errors ε) are calculated at step GSMILES then selects the X-profile row from the input matrix (e.g., matrix If it is determined that the values are not collinear or nearly collinear with a previously selected tent pole profile, then the similarity values calculated in step Processing then returns to step An optional stopping method is shown in step
As alluded to above, the GSMILES predictor model can be used to fit a matrix to a matrix, e.g., to fit a matrix of X-profiles to itself, inherently using eigenvalue analysis and partial least squares processing. Thus, the X-profile values may be used to fit themselves through a one dimensional linear transformation, i.e., a bottleneck, based on the largest singular-value eigenvalue of that matrix. Using the techniques described above, the same procedure is used to develop a similarity matrix, only the X-profile matrix replaces the Y-profile matrix referred to above. This technique is useful for situations where some of the X values are missing in the X-profile (missing data), for example.
In these situations, a row of X-profile data may contain known, useful values that the researcher does not necessarily want to throw out just because not all values of that row are present. In such an instance, imputation data may be employed, where GSMILES (or the user) puts in some estimates of what the missing values are. Then GSMILES can use the completed X-profile matrix to predict itself. This produces predictions for the missing values which are different from the estimates that were put in. The predictions are better because they are more consistent with all the values in the matrix, since all of the other values in the matrix were used to determine the missing value predictions. Initial estimates of the missing values may be average X values, or some other starting values which are reasonable for the particular application being studied. When the predictions are output from GSMILES, they can then be plugged into the missing data locations, and the process may be repeated to get more refined predictions. Iterations may be performed until the differences between the current replacement values and those of the previous iteration are less than a pre-defined threshold value of correction difference.
Another use for this type of processing is as an effective noise filter for the X-profile, wherein cycling the X-profile data through GSMILES as described above (whether there is missing data or not) effectively smooths the X-profile function, reducing noise levels and acting as a filter. This results in a “cleaner” X-profile.
Still further, GSMILES may be used to predict both X- and Y-profiles simultaneously, using the X-profile also to produce tent poles. This again is related to eigenvalue analysis and partial least squares processing, and dimensional reduction or bottlenecking transformations. Note that GSMILES inherently produces a nonlinear analogy of partial least squares.
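The iterative imputation loop described above (start from average values, let the matrix predict itself, replace only the missing entries, and repeat until the corrections fall below a threshold) can be sketched as follows. GSMILES itself is not reproduced here. As a loudly labeled stand-in, the self-prediction step is a rank-one fit through the largest singular value, i.e., the linear bottleneck mentioned earlier. The function names, tolerance, and synthetic data are all assumptions made for illustration.

```python
import numpy as np

def rank_one_fit(X):
    """Stand-in self-predictor (assumption): column means plus the largest singular component."""
    col_mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - col_mean, full_matrices=False)
    return col_mean + s[0] * np.outer(U[:, 0], Vt[0])

def iterative_impute(X, missing, tol=1e-8, max_iter=500):
    """Fill the entries flagged in `missing`, refining until updates fall below `tol`."""
    filled = X.copy()
    # Initial estimates: column averages over the observed values.
    observed_mean = np.nanmean(np.where(missing, np.nan, X), axis=0)
    filled[missing] = np.broadcast_to(observed_mean, X.shape)[missing]
    for _ in range(max_iter):
        predicted = rank_one_fit(filled)
        change = np.max(np.abs(predicted[missing] - filled[missing]))
        filled[missing] = predicted[missing]   # observed entries are never overwritten
        if change < tol:                       # correction difference below threshold
            break
    return filled

# Hypothetical data: column offsets plus one underlying factor, three entries missing.
rng = np.random.default_rng(1)
t = rng.normal(size=40)
w = np.array([1.0, 2.0, -1.0, 0.5, 3.0])
X_true = 10.0 + np.outer(t, w)
missing = np.zeros_like(X_true, dtype=bool)
missing[3, 0] = missing[17, 1] = missing[28, 3] = True
X_obs = np.where(missing, np.nan, X_true)

imputed = iterative_impute(X_obs, missing)
```

Replacing `rank_one_fit` with a richer self-predictor leaves the loop unchanged. Applying the self-predictor to a complete matrix, with no missing entries, gives the smoothing or noise-filter use described above.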
Partial least squares processing, however, may incorrectly match information (eigenvalues) of the X- and Y-matrices. To prevent this possibility, GSMILES may optionally use the X-profile matrix to simultaneously predict both X- and Y-values in the form of a combined matrix, either stacked vertically or concatenated horizontally. If the relative weight of each matrix within the combination is about equal, then one achieves correct matching of the eigenvalues. The nonlinear version of this method is accomplished by using the X-profile to predict both the X- and Y-profiles using GSMILES.
Still further, it is possible to simultaneously remove noise, impute missing X-values, and analyze causal relationships between the rows (profiles) of the concatenated version X/Y of the two matrices (X- and Y-profiles), by using GSMILES to model X/Y as both input and output. Optionally, to enhance causal leverage, GSMILES is not allowed to use Y-profiles in the input X/Y for tent-pole selection. Hence, strategic profiles may be found in the X-profile part of the X/Y input matrix to optimally predict all profiles in X stacked on Y, symbolized by X/Y. GSMILES can then cluster the resulting profiles in the prediction-enhanced X/Y matrix. This is a form of synchronization that tends to put associated heterogeneous profiles, such as phenotypic properties versus gene-expression properties, for example, into the same cluster. This method is useful to identify gene expression profiles and compound activity profiles that tend to synchronize or anti-synchronize together, suggesting some kind of interaction between the genes and compounds in each cluster.
The importance of each X-variable is determined by the Marquardt-Levenberg (ML) method applied to the GSMILES model. Hence, this process is leveraged by all Y-variables and their internal relationships, such as communalities induced by common phenomena, which are often unknown.
GSMILES may multiply a coefficient onto each variable to express the ellipticity of the basis set as a function of the X space. Typically, these coefficients are assumed to be constant with a value of unity, i.e., signifying global radial symmetry over the X space. The Marquardt-Levenberg algorithm can be used to test this assumption. A byproduct of use of the Marquardt-Levenberg algorithm in this manner is the model leverage associated with each coefficient and hence, each variable. This leverage may be used to rank the X-variables.
The GSMILES nodes (tent poles) are localized basis functions based on similarity between locations in the model domain (X-space). The span of influence of each basis function is determined by that function's particular decay constants. The bigger a constant is, the faster the decay, and hence the smaller the influence region of the node surrounding its domain location. The best decay value depends on the density of data adjacent to the node location, the clustering properties of the data, and the functional complexity of the Y-ensemble there. For example, if the Y-ensemble is essentially constant in the domain region containing the node location, then all adjacent data are essentially replicates. Hence, the node function should essentially average these adjacent Y-values. However, beyond such adjacent data, the node influence should decay appropriately to maintain its localized status. If decay is too fast, then the basis function begins to act like a delta function or dummy spike variable and cannot represent the possible systematic regional trends. If decay is too slow, the basis function begins to act like a constant. The same concept applies to data clusters in place of individual data points. In that respect, note that individual data points may be considered as clusters of size or membership of one element.
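The effect of the decay constant on a localized basis function can be illustrated with a small sketch. The exponential form, the one-dimensional domain, and the specific decay values below are assumptions chosen only for illustration; this section does not fix the functional form of the similarity function, and `node_basis` is a hypothetical name.

```python
import numpy as np

def node_basis(x, node_location, decay):
    """Localized basis function whose influence fades with distance from the node.

    The exponential form is an assumption for illustration only.
    """
    distance = np.abs(x - node_location)
    return np.exp(-decay * distance)

x = np.linspace(0.0, 1.0, 11)   # a small one-dimensional domain
node = 0.5                      # node (tent pole) location

too_fast = node_basis(x, node, decay=500.0)   # acts like a spike at the node
too_slow = node_basis(x, node, decay=1e-4)    # acts like a constant over the domain
moderate = node_basis(x, node, decay=5.0)     # influences a neighborhood, then fades
```

With the large constant, the function is essentially zero everywhere except at the node itself, so it cannot represent regional trends. With the tiny constant, it is essentially one everywhere and no longer localized. The moderate value keeps substantial weight on nearby points while fading toward the edges of the domain.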
To determine appropriate decay constants for each domain location in the data, GSMILES determines the working dimension of the domain at each data location, and then computes a domain simplex of data adjacent to each such location. The decay constant for each location is set to the inverse of the largest of the dissimilarity values between each location and the simplex of adjacent data. This normalizes the dissimilarity function for each node according to the data density at the node. In this case, the normalized dissimilarity becomes unity at the most dissimilar location within the simplex of adjacent data for each location in the domain (X-space) of the data. Optionally, GSMILES can add a few points (degrees of freedom) of data to each simplex to form a complex. However, too few points can cause “data clumping” and too many points can compromise the efficacy of GSMILES. Data clumping occurs when the decay constant is too high for a particular data location of a data point or cluster of data points, so that it tends to be isolated from the rest of the data and cannot link properly due to insufficient overlap with other nodes. This results in a spike node at that location that cannot interpolate or predict properly within its adjacent domain region. In summary, data clumping can be localized, as with singular data points, or it can be more global in terms of the distribution of data clusters.
While the present invention has been described with reference to the specific embodiments thereof, it should be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the true spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation, system, process, process step or steps, algorithm, hardware or software, to the objective, spirit and scope of the present invention. All such modifications are intended to be within the scope of the claims appended hereto.