US 7188032 B2
Abstract
A method for determining Teiresias patterns. Provided as input to the method are: a set S′_{0} of n sequences denoted as S_{1}, S_{2}, . . . S_{n}; positive integers L, W, and K; and Teiresias patterns P′_{0} consisting of all <L, W, K> patterns for the set S′_{0}. Each sequence of the n sequences consists of characters from an alphabet. A sequence index i equals 1. A sequence S_{n+1} is supplied to form a set S′_{i} consisting of S′_{i−1}∪S_{n+1}, where S_{n+1} consists of characters from the alphabet. The Teiresias patterns P′_{i} are determined by performing an algorithm that utilizes S′_{i−1}, L, W, K, P′_{i−1}, and S_{n+i} as input. The Teiresias patterns P′_{i} consist of all <L, W, K> patterns for the set S′_{i}.
Claims (30)
1. A method for determining Teiresias patterns, said method comprising the steps of:
providing a set S′_{0} of n sequences denoted as S_{1}, S_{2}, . . . S_{n}, positive integers L, W, and K, and Teiresias patterns P′_{0} consisting of all <L, W, K> patterns for the set S′_{0}, each sequence of the n sequences consisting of characters from an alphabet, wherein a sequence index i equals 1;
supplying a sequence S_{n+1} to form a set S′_{i} consisting of S′_{i−1}∪S_{n+1}, wherein S_{n+1} consists of characters from the alphabet; and
determining Teiresias patterns P′_{i} consisting of all <L, W, K> patterns for the set S′_{i} by performing an algorithm that utilizes S′_{i−1}, L, W, K, P′_{i−1}, and S_{n+i} as input.
2. The method of
3. The method of
4. The method of
5. The method of
_{0} comprises determining P′_{0} by performing a standard Teiresias algorithm.
6. The method of
_{0} does not comprise determining P′_{0} by performing a standard Teiresias algorithm.
7. The method of
performing a transcription step that utilizes W, P′_{i−1}, and S_{n+i} as input and outputs an abridged sequence;
performing a slicing step that utilizes the abridged sequence as input and outputs seqlets;
performing a combinatorial generation step that utilizes L, W, and the seqlets as input and outputs candidate elementary patterns;
performing a check support step that utilizes S′_{i−1}, K, and the candidate elementary patterns as input and outputs elementary patterns;
performing a convolve step that utilizes S′_{i−1}, P′_{i−1}, L, W, K and the elementary patterns as input and outputs new patterns P′_{i−1}Δ; and
performing a merge step that utilizes P′_{i−1} and the new patterns P′_{i−1}Δ as input and outputs P′_{i}.
8. The method of
9. The method of
10. The method of
_{i} in gene sequencing or in express sequence tags (EST) clustering, said utilizing step being performed after said determining step.
11. A computer program product, comprising a computer usable medium having a computer readable program code embodied therein, said computer readable program code comprising an algorithm adapted to implement a method for determining Teiresias patterns, said method comprising the steps of:
providing a set S′_{0} of n sequences denoted as S_{1}, S_{2}, . . . S_{n}, positive integers L, W, and K, and Teiresias patterns P′_{0} consisting of all <L, W, K> patterns for the set S′_{0}, each sequence of the n sequences consisting of characters from an alphabet, wherein a sequence index i equals 1;
supplying a sequence S_{n+1} to form a set S′_{i} consisting of S′_{i−1}∪S_{n+1}, wherein S_{n+1} consists of characters from the alphabet; and
determining Teiresias patterns P′_{i} consisting of all <L, W, K> patterns for the set S′_{i} by performing an algorithm that utilizes S′_{i−1}, L, W, K, P′_{i−1}, and S_{n+i} as input.
12. The computer program product of
ascertaining whether there is an additional sequence to be processed, and if said ascertaining ascertains that there is not said additional sequence is to be processed then ending said method else incrementing i by 1 followed by performing said supplying, determining, and ascertaining steps, said ascertaining step being performed after said determining step.
13. The computer program product of
14. The computer program product of
15. The computer program product of
_{0} comprises determining P′_{0} by performing a standard Teiresias algorithm.
16. The computer program product of
_{0} does not comprise determining P′_{0} by performing a standard Teiresias algorithm.
17. The computer program product of
performing a transcription step that utilizes W, P′_{i−1}, and S_{n+i} as input and outputs an abridged sequence;
performing a slicing step that utilizes the abridged sequence as input and outputs seqlets;
performing a combinatorial generation step that utilizes L, W, and the seqlets as input and outputs candidate elementary patterns;
performing a check support step that utilizes S′_{i−1}, K, and the candidate elementary patterns as input and outputs elementary patterns;
performing a convolve step that utilizes S′_{i−1}, P′_{i−1}, L, W, K and the elementary patterns as input and outputs new patterns P′_{i−1}Δ; and
performing a merge step that utilizes P′_{i−1} and the new patterns P′_{i−1}Δ as input and outputs P′_{i}.
18. The computer program product of
19. The computer program product of
20. The computer program product of
utilizing P′_{i} in gene sequencing or in express sequence tags (EST) clustering, said utilizing step being performed after said determining step.
21. A process for integrating computing infrastructure, said process comprising integrating computer-readable code into a computing system, wherein the code in combination with the computing system is capable of performing a method for determining Teiresias patterns, said method comprising the steps of:
providing a set S′_{0} of n sequences denoted as S_{1}, S_{2}, . . . S_{n}, positive integers L, W, and K, and Teiresias patterns P′_{0} consisting of all <L, W, K> patterns for the set S′_{0}, each sequence of the n sequences consisting of characters from an alphabet, wherein a sequence index i equals 1;
supplying a sequence S_{n+1} to form a set S′_{i} consisting of S′_{i−1}∪S_{n+1}, wherein S_{n+1} consists of characters from the alphabet; and
determining Teiresias patterns P′_{i} consisting of all <L, W, K> patterns for the set S′_{i} by performing an algorithm that utilizes S′_{i−1}, L, W, K, P′_{i−1}, and S_{n+i} as input.
22. The process of
23. The process of
24. The process of
25. The process of
_{0} comprises determining P′_{0} by performing a standard Teiresias algorithm.
26. The process of
_{0} does not comprise determining P′_{0} by performing a standard Teiresias algorithm.
27. The process of
performing a transcription step that utilizes W, P′_{i−1}, and S_{n+i} as input and outputs an abridged sequence;
performing a slicing step that utilizes the abridged sequence as input and outputs seqlets;
performing a combinatorial generation step that utilizes L, W, and the seqlets as input and outputs candidate elementary patterns;
performing a check support step that utilizes S′_{i−1}, K, and the candidate elementary patterns as input and outputs elementary patterns;
performing a convolve step that utilizes S′_{i−1}, P′_{i−1}, L, W, K and the elementary patterns as input and outputs new patterns P′_{i−1}Δ; and
performing a merge step that utilizes P′_{i−1} and the new patterns P′_{i−1}Δ as input and outputs P′_{i}.
28. The process of
29. The process of
30. The process of
_{i} in gene sequencing or in express sequence tags (EST) clustering, said utilizing step being performed after said determining step.
Description
1. Technical Field
The present invention relates to a method for determining Teiresias patterns and more particularly to a method for incrementally determining Teiresias patterns.
2. Related Art
Pattern discovery methods for solving problems in computational biology are fast becoming a tool of choice. The standard Teiresias algorithm is a powerful pattern discovery tool that uses a combinatorial method to discover rigid patterns in a given set of sequences according to the specified parameters. However, determining Teiresias patterns by direct execution of the standard Teiresias algorithm may be inefficient for circumstances in which sequences of Teiresias patterns are to be successively computed.
Thus, there is a need for a more efficient method of determining Teiresias patterns than exists in the prior art for circumstances in which sequences of Teiresias patterns are to be successively computed.
The present invention provides a method for determining Teiresias patterns, said method comprising the steps of:
The present invention provides a computer program product, comprising a computer usable medium having a computer readable program code embodied therein, said computer readable program code comprising an algorithm adapted to implement a method for determining Teiresias patterns, said method comprising the steps of:
The present invention provides a process for integrating computing infrastructure, said process comprising integrating computer-readable code into a computing system, wherein the code in combination with the computing system is capable of performing a method for determining Teiresias patterns, said method comprising the steps of:
The present invention advantageously provides a more efficient method of determining Teiresias patterns than exists in the prior art for circumstances in which sequences of Teiresias patterns are to be successively computed.
In certain pattern discovery applications (e.g., applications involving clustering) it is useful to incrementally discover the Teiresias patterns. The Teiresias patterns are the output patterns obtained from execution of the standard Teiresias algorithm. The Teiresias patterns may be obtained via direct execution of the standard Teiresias algorithm. The present invention discloses an alternative method for determining Teiresias patterns, namely a method for determining Teiresias patterns incrementally for a set of n+1 given sequences. In accordance with the present invention, a set of n sequences is supplied along with their corresponding Teiresias patterns. The Teiresias patterns corresponding to the supplied n sequences may be determined by, inter alia, direct execution of the standard Teiresias algorithm. If a (n+1)
The remaining portion of the detailed description is presented infra in four sections. The first section (Section 1) provides a description of the standard Teiresias algorithm. The second section (Section 2) provides a description of the incremental determination of Teiresias patterns in accordance with the present invention. The third section (Section 3) provides an example illustrating incremental determination of Teiresias patterns in accordance with the present invention. The fourth section (Section 4) provides a description of a computer system which may be used to incrementally determine Teiresias patterns in accordance with the present invention.
1. The Standard Teiresias Algorithm
Given a set S of input sequences, and three parameters L, W and K (defined infra), the standard Teiresias algorithm discovers patterns called Teiresias patterns which are rigid patterns.
Teiresias belongs to the genre of pattern discovery algorithms which are capable of detecting and reporting all existing patterns in a set of input sequences without enumerating the entire solution space and without using alignment. Furthermore, the patterns reported are maximal, i.e., they are as specific as possible. Formally, a pattern P is more specific than a pattern Q if any sequence which matches P also matches Q (e.g., the pattern XYZ is more specific than the pattern X.Z). A pattern P in the set S of input sequences is maximal if there is no other pattern in S more specific than P with the same number of occurrences as P.
The standard Teiresias algorithm utilizes the following definitions and assumptions:
- 1. A sequence is a string of characters from a specific alphabet. For example, a DNA sequence is a string of characters from the nucleotide alphabet (A, C, G and T). The alphabet gives all the characters that a sequence can have. The sequences can be of arbitrary length. A set of such sequences is given as input to the standard Teiresias algorithm.
- 2. The characters in the sequence may each represent a residual structural unit, called a residue, of a composite. An example of such composite is a molecular structure or complex molecule such as a protein molecule (e.g., an amino acid residue from hydrolysis of protein).
- 3. A dot (.) is referred to as a “don't care” character. This means that any valid character from the alphabet can appear in its place.
- 4. A pattern is a string of characters that begins and ends with a letter (not a dot, but a character from the alphabet), and can have zero or more letters or dots in between. For example, AC..H is a pattern, but AC. is not (since it does not end with a letter). Note that the dots can be any character from the alphabet, and therefore, a pattern is a regular expression and represents a set of concrete strings. Thus for the example cited above, ACAGH, ACCCH, ACAHH are all valid strings represented by the pattern.
- 5. L, W and K are numbers provided by the user. L represents a number of letters and W represents a window length (i.e., a total number of consecutive characters, letters and dots alike). Together, these two parameters L and W represent the density constraint. The parameter K represents the support that a pattern must have in order to be reported to the user. Also, L<=W. An <L, W> pattern is one which has at least L letters in any W consecutive characters of the pattern. This constrains the number of dots that appear in the pattern. For example, if L=3 and W=4, then any 4 consecutive characters in a pattern must include at least 3 letters (i.e., at most 1 dot). So, AC.H is a <3, 4> pattern while A.C.H is not.
- 6. An <L, W, K> pattern is one which is an <L, W> pattern and appears in at least K sequences from the given input sequence set. K is called the support parameter.
- 7. An elementary pattern is an <L, W> pattern that contains exactly L residues. For example, if S={s_{1}=SDFBASTS, s_{2}=LFCASTS, s_{3}=FDASTSNP} then the set of all elementary <3, 4> patterns with support at least 3 is {“F.AS”, “AST”, “AS.S”, “STS”, “A.TS”}.
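The definitions above can be checked mechanically. The following is a minimal Python sketch, not part of the patent: the function names `matches_at`, `support`, and `is_lw_pattern` are hypothetical, and the density test assumes the sliding-window reading of item 5 (every W consecutive characters hold at least L letters, and a pattern shorter than W holds at least L letters overall).

```python
def matches_at(pattern: str, sequence: str, offset: int) -> bool:
    """True if `pattern` occurs in `sequence` at `offset`; '.' matches any character."""
    if offset + len(pattern) > len(sequence):
        return False
    return all(p == "." or p == sequence[offset + j] for j, p in enumerate(pattern))


def support(pattern: str, sequences: list[str]) -> int:
    """Number of sequences that contain at least one occurrence of `pattern`."""
    return sum(
        any(matches_at(pattern, s, i) for i in range(len(s) - len(pattern) + 1))
        for s in sequences
    )


def is_lw_pattern(pattern: str, L: int, W: int) -> bool:
    """Density check under the sliding-window reading of item 5 (an assumption)."""
    if not pattern or pattern[0] == "." or pattern[-1] == ".":
        return False  # a pattern must begin and end with a letter
    letters = [c != "." for c in pattern]
    if len(pattern) <= W:
        return sum(letters) >= L
    return all(sum(letters[i:i + W]) >= L for i in range(len(pattern) - W + 1))


S = ["SDFBASTS", "LFCASTS", "FDASTSNP"]
for p in ["F.AS", "AST", "AS.S", "STS", "A.TS"]:
    print(p, is_lw_pattern(p, 3, 4), support(p, S))
```

Under these assumptions each of the five patterns listed in item 7 passes the <3, 4> check and has support 3 in the example set, while A.C.H fails the check because its first four characters contain only two letters.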
Using the preceding definitions the standard Teiresias algorithm discovers the <L, W, K> patterns for the set of given sequences. The parameters L, W and K are provided by the user. There are two phases in the execution of the standard Teiresias algorithm: scanning and convolution. The scanning phase precedes the convolution phase.
The scanning phase scans all the sequences in the input set S and locates all elementary patterns with support at least K. Note that the standard Teiresias algorithm considers all the characters in the alphabet and resorts to a combinatorial approach of generating all the possible combinations of characters (with dots as well), for all possible sizes (constrained by the requirement that it should be an elementary pattern). For each of these combinations generated, the standard Teiresias algorithm checks for support in the given set of sequences. Those combinations that have the required support will be put into the generated elementary pattern set.
The convolution phase utilizes, as input, the set of elementary patterns generated in the scanning phase. In the convolution phase, these elementary patterns will be combined to form larger patterns. These larger patterns will then be checked for support and will be retained if they have the necessary support. These retained larger patterns will be used for further convolution to obtain yet bigger patterns. This process goes on recursively until all patterns are discovered.
The way in which convolution occurs is described as follows. Two patterns A and B can be pieced together to form a bigger pattern if the suffix of A is the same as the prefix of B. For example, F.AS and AST can be combined to form F.AST. Similarly, F.AST and STS can be combined to form F.ASTS. In this manner, larger patterns can be formed by convolution. To make the description more formal, the two functions of prefix and suffix are defined.
Let prefix(P) be the uniquely defined sub-pattern of P that has exactly (L−1) letters and is a prefix of P. Similarly, let suffix(P) be the uniquely defined sub-pattern of P that has exactly (L−1) letters and is a suffix of P. Thus, for given patterns P and Q, if suffix(P) is the same as prefix(Q), then the resulting convolution pattern R will be PQ′, where Q=prefix(Q)Q′. If the suffix and prefix do not match, then the convolution pattern will be null.
Using the preceding convolution process, the standard Teiresias algorithm methodically processes the set of patterns (starting from elementary patterns) until the final set of maximal patterns is obtained. The algorithm uses a stack-based approach to process all the intermediate patterns. The standard Teiresias algorithm is available for public use at a website whose web address is a concatenation of “http” and “://cbcsrv.watson.ibm.com/Tspd.html”.
The following illustrative example comprises Teiresias input of L=3, W=5, and K=2 along with the input sequences S1, S2, and S3 as shown in Table 1. The resultant output patterns from executing the standard Teiresias algorithm for this example are likewise S
The standard Teiresias algorithm is explained in detail in the following references:
- 1) Floratos, A., and Rigoutsos, I. (1998). “Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm”, Bioinformatics, Vol 14, No. 1, 1998;
- 2) Anthony P. Burgard, Gregory L. Moore and Costas D. Maranas, “Review of the TEIRESIAS-Based Tools of the IBM Bioinformatics and Pattern Discovery Group”, Metabolic Engineering 3, 285–288 (2001); and
- 3) Website address formed by concatenation of “http” and “://cbcsrv.watson.ibm.com/Tspd.html”.
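The prefix, suffix, and convolution operations defined in this section can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, L is assumed to be at least 2, and the stack-based bookkeeping and support checks of the full algorithm are deliberately left out.

```python
from typing import Optional


def prefix(p: str, L: int) -> str:
    """Shortest prefix of `p` that contains exactly L-1 letters (assumes L >= 2)."""
    count = 0
    for i, c in enumerate(p):
        if c != ".":
            count += 1
            if count == L - 1:
                return p[: i + 1]
    raise ValueError("pattern has fewer than L-1 letters")


def suffix(p: str, L: int) -> str:
    """Shortest suffix of `p` that contains exactly L-1 letters."""
    return prefix(p[::-1], L)[::-1]


def convolve(p: str, q: str, L: int) -> Optional[str]:
    """Return P followed by Q' (where Q = prefix(Q) Q'), or None if suffix(P) != prefix(Q)."""
    pre = prefix(q, L)
    if suffix(p, L) != pre:
        return None
    return p + q[len(pre):]


print(convolve("F.AS", "AST", 3))
print(convolve("F.AST", "STS", 3))
```

With L=3 the two calls reproduce the combinations described in the text: F.AS with AST gives F.AST, and F.AST with STS gives F.ASTS. A null convolution is represented here by `None`.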
2. Incremental Determination of Teiresias Patterns
Let Σ be the alphabet of characters that can occur in the sequences. Given a set of n sequences, and three parameters L, W and K, the standard Teiresias algorithm discovers patterns with the following characteristics. A pattern is defined as any string that begins and ends with a character (from Σ), and contains an arbitrary combination of characters (from Σ) and ‘.’ characters. The ‘.’ character (referred to as a “don't care” character) is used to denote a position that can be occupied by an arbitrary character. For any pattern P, any substring of P that itself is a pattern is called a subpattern of P. For example, ‘H..E’ is a subpattern of the pattern ‘A.CH..E’. A pattern P is called an <L, W> pattern (with L≦W) if every subpattern of P with length W or more contains at least L characters. A pattern P is called an <L, W, K> pattern if it is an <L, W> pattern and occurs in at least K sequences (from the given input sequence set). The standard Teiresias algorithm discovers all <L, W, K> patterns from the given input sequence set and reports only maximal patterns, as described supra. The present invention incrementally determines the Teiresias patterns by a method having inputs and outputs listed in Table 2.
A set of n sequences (S
There can be several applications for such a method of the present invention. These techniques of the present invention will be useful in scenarios where the sequences will be generated one after another, and there is a need to study the patterns as the sequences are added. In such scenarios, it makes more sense to use an incremental algorithm than to rerun the original algorithm over the entire data set each time. In clustering applications, such as Expressed Sequence Tags (EST) clustering, Gene Sequencing, etc., there are occasions when a cluster would have its pattern set already discovered and new sequences might have to be added to the cluster, or when two clusters have to be merged. In such circumstances, the techniques of the present invention will prove to be useful. The techniques of the present invention can be used as a basis for clustering using Teiresias patterns.
The approach followed in the present invention is to compute the new elementary patterns that are generated due to the introduction of the (n+1)
In The transcription step, slicing step, combinatorial generation step, check support step, convolve step, and merge step of
2.1 Transcription Step
The inputs to the transcription step
Step Steps -
- 1. At the boundary areas of these pattern occurrences, some patterns can be found and should therefore not be neglected.
- 2. Due to the addition of S_{n+1}, some patterns in P_{next} may become more specific and still hold support and therefore should also be considered.
In the transcription stage 21 of FIG. 1, case 1 of the above possibilities is handled, whereas case 2 is handled later, during the convolution step 25 of FIG. 12.
In step In step Let (n+1)
Note that in Table 4, i_p is an abbreviation for inside_pattern vector and tr is an abbreviation for transcribe vector. The output of the transcription stage will be the following abridged sequence:
CTGATTCxGACAGATTT
In this example, the reduction in the number of residues considered for generation of elementary patterns is 5. The length of S
In the pseudo code description of the transcription stage given in Table 3 supra, the computation of the transcribe vector requires one to establish if a character of S
Occurrence of TT.C at location 5 is assigned prime number 3.
Occurrence of TT.C at location 9 is assigned prime number 5.
Occurrence of GAT at location 3 is assigned prime number 7.
Occurrence of GAT at location 17 is assigned prime number 11.
Occurrence of TT.CT.AC.AC at location 5 is assigned prime number 13.
Once these assignments are done, the transcribe vector can be computed as follows. First an intermediate vector called t′ is computed as follows. The t′ vector is of the same length as S
The pseudocode of Table 5 assigns all characters not occurring in any pattern to have the value of 2 (the first prime number) and all other elements to have a value of 1. The t′ vector is then updated according to the pattern occurrences as shown in the pseudocode of Table 6.
Once the t′ vector is computed in the preceding manner, the transcribe vector can be efficiently computed as shown in the pseudocode of Table 7.
Note that the gcd (a, b) function in the pseudocode of Table 7 returns the greatest common divisor of a and b. Also note that array index out of bounds conditions are appropriately handled. The vector t′ for the running example is computed as follows. Note that in the following Table 8, tr is an abbreviation for transcribe vector.
The output of the transcription process will be an abridged sequence as shown for the example above. Further processing will next be done on this abridged sequence. The portion of the new sequence that was not transcribed will not be used for pattern discovery at all.
2.2 Slicing Step
The slicing step
Input Abridged Sequence: ATTGxTTGGGTGxTGxACAGxCCG
Output Seqlets After Slicing: Seqlets={ATTG, TTGGGTG, TG, ACA, CCG}
Each of these individual seqlets is next used in the generation of candidate elementary patterns.
2.3 Combinatorial Generation Step
In the combinatorial generation step
With respect to this problem, incremental discovery has an advantage while generating <L,W> elementary patterns, because only one new sequence needs to be processed. Therefore, it is not wise to take the approach taken in the standard Teiresias algorithm and generate all combinations of elementary patterns over the alphabet. With incremental discovery, however, it makes sense to combinatorially generate all elementary patterns over the new input sequence. This is the approach taken in the algorithm of the present invention. The following pseudo code in Table 9 performs the job of generating all <L,W> candidate elementary patterns. They are called candidate elementary patterns because their support has not yet been checked. The candidate elementary patterns become the actual elementary patterns once they have secured the required support (i.e., occur in at least K sequences, where K is the specified Teiresias parameter).
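Returning to the slicing step of Section 2.2, the example there suggests that seqlets are obtained by cutting the abridged sequence at each ‘x’. The sketch below assumes exactly that, and further assumes that ‘x’ is a separator that never occurs in the alphabet; the function name `slice_abridged` is hypothetical.

```python
def slice_abridged(abridged: str, separator: str = "x") -> list[str]:
    """Split an abridged sequence into seqlets at each separator, dropping empty pieces."""
    return [piece for piece in abridged.split(separator) if piece]


# Abridged sequence from the slicing example in Section 2.2.
print(slice_abridged("ATTGxTTGGGTGxTGxACAGxCCG"))
```

Note that a plain split returns ACAG as the fourth piece, whereas the seqlet listing in Section 2.2 shows ACA; the sketch does not attempt to reproduce that difference.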
In the pseudo code of Table 9, s.length refers to the length of the string s and the function min(a, b) returns the minimum of a and b. The function s.substring(i, j) returns the substring of s between the indices i and j inclusive. A sub-routine called permuteDots is used. This is a recursive routine that generates all combinations of strings from the string parameter provided with don't care characters (dots) in it as per the other integer parameter. The pseudo code for permuteDots is given in Table 10. An example is provided infra.
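As an illustration of generating candidate elementary patterns from one seqlet, the sketch below enumerates every window of length L to W and keeps exactly L letters in it. It is not the pseudo code of Tables 9 and 10: it swaps the recursive permuteDots routine for `itertools.combinations`, the function name is hypothetical, and L is assumed to be at least 2.

```python
from itertools import combinations


def candidate_elementary_patterns(seqlet: str, L: int, W: int) -> set[str]:
    """All patterns with exactly L letters and length <= W occurring in `seqlet` (L >= 2)."""
    candidates = set()
    for start in range(len(seqlet)):
        longest = min(W, len(seqlet) - start)
        for length in range(L, longest + 1):
            window = seqlet[start:start + length]
            interior = range(1, length - 1)
            # Keep the first and last letters plus L-2 interior letters; dot out the rest.
            for kept in combinations(interior, L - 2):
                keep = {0, length - 1, *kept}
                candidates.add(
                    "".join(c if i in keep else "." for i, c in enumerate(window))
                )
    return candidates


print(sorted(candidate_elementary_patterns("ATTG", 3, 4)))
```

For the seqlet ATTG with L=3 and W=4 this yields ATT, TTG, AT.G and A.TG. These are only candidates; whether each one becomes an elementary pattern depends on the support check that follows.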
In the routine in Table 10, the parameter ‘slice’ is the string from which the elementary patterns are to be generated. The parameter ‘nDots’ specifies the maximum number of dots allowed in the generated patterns. The function slice.setAt(i, ‘.’) sets the i
The elementary patterns shown in Table 11 are the only elementary patterns that have any chance of generating any maximal Teiresias patterns due to the addition of S
2.4 Check Support Step
In the check support step of
2.5 Convolve Step
Once the elementary patterns are generated in the check support step
Note the following two observations. The first observation is that before the generated elementary patterns can be convolved, the patterns in P
The second observation is that because of the addition of S
An aspect of generating these specific patterns is the order in which these specific patterns are generated and the tests for maximality that are made. The pseudo code in Table 12 gives a description of this part of the algorithm.
The iterations are performed over all occurrences of all patterns from the P
In Table 13, K is the specified Teiresias parameter, and K′ is the support of P
The patterns returned by the procedure specificPatterns and the elementary patterns from the check support step
2.6 Merge Step
The merge step
In summary, there is a given sequence set S
There are various applications for the method of the present invention. These techniques of the present invention will be useful in scenarios where the sequences will be generated one after another, and there is a need to study the patterns as the sequences arrive. In such scenarios, it makes more sense to use an incremental algorithm than to rerun the original algorithm over the entire data set each time. In clustering applications (for example in EST clustering, or Gene Sequencing), there will be occasions when a cluster would have its pattern set already discovered and new sequences might have to be added to the cluster, or when two clusters have to be merged. In such circumstances, the techniques of the present invention will be useful. In fact this technique can be used as a basis for clustering using Teiresias patterns.
The preceding applications of the incremental Teiresias pattern determinations (e.g., EST clustering, Gene Sequencing, etc.) may be implemented in accordance with the following iterative process for incrementally determining successive <L, W, K> Teiresias patterns associated with each of M successively added sequences to the base set S of sequences S
Let S
Let S′
Accordingly, Step Noting that i is a sequence index for the additional M sequences, step Step Step Step If step If step
3. Example of Incremental Determination of Teiresias Patterns
This section presents an illustrated example of using the present invention to incrementally determine Teiresias patterns. The following information is provided to the incremental Teiresias Algorithm as input. This includes the initial set of input sequences S, the pre-discovered Teiresias patterns P, the Teiresias parameters L, W and K, and the new sequences to be added to S (denoted as S
Each process of the algorithm specified in
3.1. Transcription (Step
The inputs are pattern set P, the new sequence S
Using the vectors above, the transcription algorithm will produce the following abridged sequence: JSABCDxNOPQKLMAHUK.
3.2. Slicing (Step 22)
The input to this stage is the abridged-sequence shown above from transcription step
3.3 Combinatorial Generation (Step
Candidate elementary patterns are generated from the set of seqlets of the previous slicing process step
3.4. Check Support (Step 24)
Out of these candidate-elementary-patterns obtained from the combinatorial generation step
3.5. Convolve (Step
The convolve step Therefore, the set of patterns given to the convolution process is the union of elementary-patterns, P
The output of the convolution process is the following set P_{increment} of patterns:
3.6. Merge (Step 26):
The incrementally calculated patterns are merged with the original set as follows.
The pattern set P′ is the final output of the algorithm. Note that the original Teiresias algorithm has been run on the input set S={S1, S2, S3, S4}, which resulted in computed output patterns matching the output patterns P′ obtained by the previous calculations in accordance with the algorithm of the present invention. For this example, the agreement confirms that the algorithm of the present invention reproduces the output of the standard Teiresias algorithm.
4. Computer System
Thus the present invention discloses a process for deploying or integrating computing infrastructure, comprising integrating computer-readable code into the computer system
While embodiments of the present invention have been described herein for purposes of illustration, many modifications and changes will become apparent to those skilled in the art. Accordingly, the appended claims are intended to encompass all such modifications and changes as fall within the true spirit and scope of this invention.