Publication number | US7403943 B2 |
Publication type | Grant |
Application number | US 11/402,774 |
Publication date | Jul 22, 2008 |
Filing date | Apr 12, 2006 |
Priority date | Apr 15, 2005 |
Fee status | Paid |
Also published as | US20060253518 |
Publication number | 11402774, 402774, US 7403943 B2, US 7403943B2, US-B2-7403943, US7403943 B2, US7403943B2 |
Inventors | David Ruben Argentar |
Original Assignee | E. I. Du Pont De Nemours And Company |
This application claims the benefit of U.S. Provisional Application 60/672,176, filed Apr. 15, 2005, the entire content of which is herein incorporated by reference.
Subject matter disclosed herein is disclosed and claimed in the following copending applications, all filed contemporaneously herewith and all assigned to the assignee of the present invention:
Fundamental Pattern Discovery Using The Position Indices Of Symbols In A Sequence Of Symbols (CL-3064);
Identifying Patterns of Symbols In Sequences of Symbols Using A Binary Array Representation of The Sequence (CL-3079);
Eliminating Redundant Patterns in a Method Using Position Indices of Symbols to Discover Patterns In Sequences of Symbols (CL-3070); and
Using Binary Array Representations of Sequences to Eliminate Redundant Patterns In Discovered Patterns of Symbols (CL-3073).
The present invention relates to a computationally efficient computer-implemented method of finding patterns in sequences of symbols and to a computer readable medium having instructions for controlling a computer system to perform the method.
Prior art methods of discovering patterns of symbols in a family of symbol sequences are computationally intensive. The computational intensity is dependent upon the lengths of the sequences (i.e., number of symbols in each sequence) and the size of the alphabet (i.e., the number of distinct symbols found in each sequence). Running time (i.e., the number of computational steps required) for the prior art methods tends to increase in proportion to the product of the lengths of the sequences and decrease in proportion to the alphabet size.
Patterns that occur in (i.e., are common to) “q” number of sequences in a family of “k” sequences are said to have q “levels of support”. For example, patterns that are common to two sequences are said to have a level of support of two. Patterns that are common to a greater number of sequences in a family are said to have a greater level of support. Patterns with greater levels of support are usually more descriptive of so-called “features”, or properties, of the underlying system. In biology, for example, these features characterize chemical or physical properties of proteins or nucleic acids.
The method of published United States Patent Application 2003-0220771-A1, Vaidyanathan et al., assigned to the assignee of the present invention, discovers patterns in two or more sequences. The method of this application first discovers patterns of symbols in pairs of sequences, then finds patterns of symbols at increasingly higher levels of support based upon the patterns found in the pairs. The identity of the symbols in the patterns is retained throughout the practice of this method, and all calculations are done with the alphabet of those symbols. Retaining the symbol identity may detract from the efficiency of the method.
In view of the foregoing it is believed advantageous to be able to discover patterns common to two or more sequences in a family of sequences in a more computer-efficient manner.
In a first aspect the present invention is directed to methods for identifying patterns in a set of k-sequences of symbols, where k is greater than two (k>2) and wherein the location of a symbol in a sequence is denoted by a position index. In another aspect the present invention is directed to a computer-readable medium containing instructions for controlling a computer system to discover one or more patterns in two or more sequences of symbols by performing the method described.
The group of patterns of symbols produced by the combination of “n” sequences is termed an “n-tuple” (“tuple of order n”). Any n-tuple, for order n=2 to order n=(k−1), is identifiable by the sequence indices of the n sequences combined to produce the patterns within that n-tuple.
As a first step in accordance with the method of the present invention patterns of symbols produced by each pair-wise combination of sequences (each “2-tuple”) are identified. Each identified pattern of symbols is represented by either a position index numerical array (PINA) or a position index binary array (PIBA). The position index numerical array (PINA) representation of a pattern is a set of position indices, each of which denotes the location in a selected reference sequence at which each symbol in the pattern occurs. The position index binary array (PIBA) representation of a pattern is a set of binary digits. The binary digit in each place in the array that corresponds to a location in the selected reference sequence of a symbol in the identified pattern has a first predetermined binary value (e.g., a binary “1”). All of the other binary digits in the array have a second predetermined binary value (i.e., a binary “0”).
The pattern representations of each tuple at any tuple order “n” may be combined with the pattern representations of all other tuples at that order “n” sharing a common reference sequence, provided patterns exist in each n-tuple.
Thus, as a second step of the method of the present invention all 2-tuples that share a common reference sequence are taken in pair-wise combinations to identify patterns common to 3-tuples also sharing that same reference sequence. The 2-tuples may be pair-wise combined using either: (i) the position index numerical array (PINA) representations of patterns; (ii) the position index binary array (PIBA) representations of patterns; or (iii) the position index binary array (PIBA) representations of one 2-tuple taken with the position index numerical array (PINA) representations of the other 2-tuple.
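The grouping of 2-tuples by reference sequence can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `three_tuple_ids` is hypothetical, tuple identifiers are modeled as Python tuples of sequence indices with the reference sequence listed first, and the sketch ignores the requirement that patterns actually exist in each 2-tuple.

```python
from itertools import combinations

def three_tuple_ids(two_tuple_ids):
    """Pair up 2-tuples that share a reference sequence (first index)
    and return the identifiers of the resulting 3-tuples."""
    by_reference = {}
    for ref, other in two_tuple_ids:
        by_reference.setdefault(ref, []).append(other)
    result = []
    for ref in sorted(by_reference):
        others = sorted(by_reference[ref])
        # Every pair-wise combination of 2-tuples with this reference
        for a, b in combinations(others, 2):
            result.append((ref, a, b))
    return result

# Assumed example: all 2-tuples of a family of k = 5 sequences,
# combined in ascending sequence index order.
two_tuples = list(combinations(range(5), 2))
ids = three_tuple_ids(two_tuples)
```

For five sequences this yields ten 3-tuple identifiers, six of which have sequence 0 as the reference sequence.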
In the first instance, when using the position index numerical array (PINA) representations of the patterns in each 2-tuple, patterns in the resulting 3-tuple are identified from the position index numerical arrays (PINAs) produced by the intersection of the set of position indices in each position index numerical array (PINA) in one 2-tuple with the set of position indices in each position index numerical array (PINA) in the other 2-tuple. The sets of position indices are intersected by sequentially comparing each position index of one pattern with each of the position indices of the other pattern. The position index numerical array (PINA) representing the identified pattern in the resulting 3-tuple is converted into its corresponding symbols by mapping the indices in the numerical array to the respective symbols in the reference sequence.
In the second instance, when using the position index binary array (PIBA) representations of patterns in each 2-tuple, the set of binary digits of the position index binary array (PIBA) of each pattern from one 2-tuple is intersected with the set of binary digits of the position index binary array (PIBA) of each pattern from the other 2-tuple. Each intersection of these binary arrays defines the position index binary array (PIBA) representation of a pattern in a 3-tuple. The intersection is accomplished logically, as by performing a logical AND operation in a bit-by-bit manner on the binary arrays. The binary array representation produced by the logical AND operation is used to identify the common pattern. Using the places in the position index binary array (PIBA) produced by the intersection having the first predetermined binary value as a guide, the symbols in corresponding locations in the reference sequence are identified. These symbols comprise the symbols in the identified pattern in the 3-tuple.
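The logical AND of two binary arrays can be illustrated with a short Python sketch. Representing a PIBA as a Python integer (bit i set when position index i belongs to the pattern) is an assumption made for brevity, and the helper names and the two example patterns are hypothetical.

```python
def piba_from_indices(indices):
    """Build a PIBA as an int: bit i is 1 when position index i is in the pattern."""
    bits = 0
    for i in indices:
        bits |= 1 << i
    return bits

def indices_from_piba(bits):
    """List the position indices whose bit is set, in ascending order."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]

# Hypothetical patterns from two 2-tuples sharing a reference sequence
piba_a = piba_from_indices([4, 5, 6, 14])
piba_b = piba_from_indices([5, 6, 9, 14])

common = piba_a & piba_b                    # bit-by-bit logical AND
common_indices = indices_from_piba(common)  # [5, 6, 14]
```

The surviving position indices would then be mapped back to symbols through the reference sequence.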
In the hybrid combination technique, a position index binary array (PIBA) representing each pattern in a first identified 2-tuple of patterns is created. The position index numerical array (PINA) representing each pattern of symbols in the second identified 2-tuple of patterns is also created. The binary arrays are assembled into a “scoreboard”. Each position index in the position index numerical array (PINA) representing each pattern in the second 2-tuple is used to interrogate the places in the “scoreboard” of binary arrays from the first 2-tuple. As a result of the interrogation those places in each binary array in the first 2-tuple having the first predetermined binary value are identified. The symbols at the locations in the reference sequence corresponding to the identified places in the position index binary arrays (PIBAs) (i.e., those places having the first predetermined binary value) define the identified pattern of symbols. The binary arrays that are assembled into the scoreboard may be indirectly created by first creating the position index numerical arrays (PINAs) for each pattern in the first 2-tuple and thereafter converting each of those numerical arrays into its corresponding binary array.
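A minimal Python sketch of the scoreboard interrogation is given below. It assumes each PIBA is a list of 0/1 values and each PINA is a list of position indices; the names `make_piba` and `hybrid_combine` and the example data are hypothetical, and results with fewer than two symbols are discarded because a pattern is defined as having two or more symbols.

```python
def make_piba(indices, length):
    """Binary array with 1 at each listed position index, 0 elsewhere."""
    piba = [0] * length
    for i in indices:
        piba[i] = 1
    return piba

def hybrid_combine(scoreboard, pinas):
    """Interrogate each PIBA in the scoreboard with each PINA and keep
    the position indices whose place in the PIBA holds a 1."""
    found = []
    for piba in scoreboard:
        for pina in pinas:
            hits = [i for i in pina if piba[i] == 1]
            if len(hits) >= 2:  # a pattern needs two or more symbols
                found.append(hits)
    return found

LENGTH = 20  # assumed length of the reference sequence
scoreboard = [make_piba(p, LENGTH) for p in ([13, 14, 16], [4, 5, 6, 14])]
pinas = [[5, 6, 14], [13, 16]]  # hypothetical PINAs of the second 2-tuple
result = hybrid_combine(scoreboard, pinas)  # [[13, 16], [5, 6, 14]]
```

Each lookup is a direct array access, so the cost of interrogating one binary array depends on the number of indices in the PINA rather than on the length of the reference sequence.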
In order to avoid redundancies produced by combinations at the 2-tuple order, sequences should be combined in either ascending sequence index order or descending sequence index order.
The teachings of the present invention as summarized above may be extended to higher order n-tuples.
A method in accordance with the present invention may also include steps wherein the pattern representations of each tuple at any tuple order “n”, for n=3 to n=(k−1), may be combined with the pattern representations of all other tuples at that order “n” sharing a common reference sequence, provided patterns exist in each n-tuple. Such pair-wise combinations may again be effected using either: (i) the position index numerical array (PINA) representations of patterns; (ii) the position index binary array (PIBA) representations of patterns; or (iii) the hybrid method using position index binary array representations of one tuple taken with the position index numerical array representations of the other tuple.
Combination of such higher order n-tuples may produce resultant tuples at the next-higher order [i.e., at order (n+1)] or may “leapfrog” to still-higher orders [i.e., orders (n+2) or above], up to the (k−1)-order. The order of the resultant tuple is determined by the number of different sequence indices in the tuple identifiers of one tuple as against the sequence indices in the tuple identifier of the other tuple being pair-wise combined.
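How the order of the resultant tuple follows from the tuple identifiers can be sketched as below. The function name `resultant_tuple` is hypothetical, and identifiers are assumed to be Python tuples of sequence indices with the shared reference sequence listed first.

```python
def resultant_tuple(id_a, id_b):
    """Identifier of the tuple produced by combining two tuples that
    share a reference sequence (the first-listed index)."""
    if id_a[0] != id_b[0]:
        raise ValueError("tuples must share a reference sequence")
    ref = id_a[0]
    others = sorted((set(id_a) | set(id_b)) - {ref})
    return (ref, *others)

# One differing sequence index: next-higher order (n + 1)
next_order = resultant_tuple((0, 1, 2), (0, 1, 3))  # (0, 1, 2, 3)

# Two differing sequence indices: "leapfrog" to order (n + 2)
leapfrog = resultant_tuple((0, 1, 2), (0, 3, 4))    # (0, 1, 2, 3, 4)
```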
The “leapfrog effect” is especially advantageous when large numbers of long sequences are involved since it allows patterns having high levels of support to be found without the necessity of first finding all patterns at all lower levels of support.
However, pair-wise combinations of n-tuples of the same higher order also result in redundant pattern identifications. In order to reduce redundant pattern identifications the representations of the patterns in a first n-tuple should be combined only with pattern representations of those other n-tuples that include in their tuple identifiers at least one sequence index greater than the sequence indices included in the tuple identifier of the first n-tuple. Redundancies involving pair-wise combinations of n-tuples that share the same reference sequence may be eliminated provided that, aside from the reference sequence, all of the sequence indices in the identifier of one n-tuple are different from those of the other n-tuple.
It also lies within the contemplation of a method of the present invention that pattern representations in any higher order tuple may also be combined pair-wise with the pattern representations in any selected lower-order tuple. That is, the representations in any n-tuple may be combined with the pattern representations in any selected m-tuple, where m may have any integer value from 2 to (n−1).
Such pair-wise combinations may again be effected using either: (i) the position index numerical array (PINA) representations of patterns; (ii) the position index binary array (PIBA) representations of patterns; or (iii) the hybrid method using position index binary array representations of one tuple taken with the position index numerical array representations of the other tuple.
The resulting tuple may be one or more higher orders (leapfrog effect), again depending upon the number of different sequence indices in the tuple identifiers of the tuples combined.
Pair-wise combinations of an n-tuple with a lower order tuple may also result in redundant pattern identifications. Accordingly, in order to reduce redundant pattern identifications the representations of the patterns in an n-tuple should be only combined with pattern representations of a lower-order tuple that includes in its tuple identifier at least one sequence index greater than the sequence indices included in the tuple identifier of the n-tuple. To avoid redundancies involving pair-wise combinations of representations of patterns in an n-tuple with a lower order tuple that shares the same reference sequence, all of the sequence indices of the lower order m-tuple other than the reference sequence index must be different from those of the n-tuple.
The most preferred pair-wise combinations are those involving the representations of patterns in a higher order n-tuple [n=3 to n=(k−1)] with the representations of patterns in a 2-tuple that shares the same reference sequence and whose tuple identifier includes a sequence index greater than the sequence indices included in the identification of the n-tuple, provided there exist patterns in each n-tuple and 2-tuple. Combining an n-tuple with such a 2-tuple ensures that no redundant pattern representations are produced by the comparison, while finding all patterns at successive levels of support.
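The selection rule for the most preferred combinations can be expressed as a small predicate. This is a sketch under assumptions: the names `is_preferred_pair` and `preferred_partners` are hypothetical, identifiers are tuples of sequence indices in ascending order with the reference sequence first, and the check that patterns exist in both tuples is left out.

```python
def is_preferred_pair(n_tuple_id, two_tuple_id):
    """True when the 2-tuple shares the n-tuple's reference sequence and
    contributes a sequence index greater than every index in the n-tuple."""
    same_reference = n_tuple_id[0] == two_tuple_id[0]
    adds_higher_index = two_tuple_id[1] > max(n_tuple_id)
    return same_reference and adds_higher_index

def preferred_partners(n_tuple_id, k):
    """All 2-tuple identifiers that qualify for a family of k sequences."""
    ref = n_tuple_id[0]
    return [(ref, j) for j in range(max(n_tuple_id) + 1, k)]

partners = preferred_partners((0, 1, 2), 5)  # [(0, 3), (0, 4)]
```

Under this rule each resulting (n+1)-tuple identifier is reached from exactly one n-tuple and 2-tuple pairing, which is why no redundant combinations arise.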
The invention will be more fully understood from the following detailed description, taken in connection with the accompanying drawings, which form a part of this application.
Throughout the following detailed description, similar reference numerals refer to similar elements in all figures of the drawings.
In one aspect the present invention is directed toward a computer-implemented method useful in identifying patterns of symbols in a set “S” containing “k” sequences of symbols, where k is greater than two (k>2), that is, there are three or more sequences, thus:
S={S_{0}, S_{1}, S_{2}, . . . , S_{k-1}}.
The basic implementation of the method of the present invention may be understood by considering the following set of five sequences S_{0 }through S_{4}:
By convention, each sequence is assigned a predetermined sequence index, indicated by the respective subscripts 0, 1, 2, 3, and 4, to order the sequences. The sequence indexes (or the more preferable plural form used herein, “indices”) are assigned in any desired manner. Sequences S_{0} through S_{4} are derived from a biological system of G-protein-coupled receptors and have been modified to better illustrate the principles of the present invention.
It should be noted that each sequence S_{0 }through S_{4 }has an arbitrary length determined by the source from which the sequence is derived. The sequences may have equal, or as seen above, different lengths.
The present invention is independent of the particular alphabet in which sequences are presented. In fact, a useful preliminary step is to discover all of the symbols in the alphabet in which the sequence data are written. The term “alphabet” is meant to include any collection of letters or other characters (including numerals). For example, sequences describing DNA are typically written in a four-symbol alphabet consisting of the symbols {A,G,C,T}. Protein sequences are written in a twenty-symbol alphabet representing the amino acids, consisting of the symbols {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}.
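The preliminary step of discovering the alphabet can be sketched in a few lines of Python. The function name `discover_alphabet` and the two DNA-like example strings are hypothetical and are not taken from the sequences of the present description.

```python
def discover_alphabet(sequences):
    """Return the sorted list of distinct symbols found in the sequences."""
    return sorted(set().union(*sequences))

# Hypothetical DNA-like sequences
alphabet = discover_alphabet(["GATTACA", "CCGTA"])  # ['A', 'C', 'G', 'T']
```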
POSITION INDEX
The top row of numerals in the table, labeled “Position Index”, ascribes numeric values to locations in the sequences (from 0, 1, . . . , 22 for the lengths of sequences illustrated). The location of any given symbol in a sequence is denoted by its “position index”, that is, the numeric value of the location that the symbol occupies in that sequence, as measured from the beginning of the sequence. It is noted that, by convention, the first location in each sequence is assigned the position index 0.
A “position index” of a symbol has meaning only relative to the particular sequence in which the symbol occurs. For example, in sequence S_{0 }the symbol “M” occupies location 0 and, thus, has position index 0; the symbol “S” occupies locations 4 and 13 and, thus, has position index 4 and position index 13. In the sequence S_{3 }the symbol “M” occupies locations 0 and 19 and, thus, has position indices 0 and 19; the symbol “S” occupies locations 4, 5, 6, 12 and 21 and thus, has corresponding position indices 4, 5, 6, 12 and 21, respectively.
Conversely, in sequence S_{0}, at the locations corresponding to position indices 5, 14, 15 and 17, the symbol “P” appears. In sequence S_{3 }the locations corresponding to position indices 5, 14, 15, and 17 are occupied by the symbols “S”, “E”, “L”, and “N”, respectively.
A “pattern” is defined as any distributed substring of two or more symbols that occurs in (i.e., is common to) at least two sequences. The symbols comprising a pattern may be separated within the sequence by gaps. In this description of the present invention, when expressing patterns, dots will be used to represent gaps, i.e., locations where the symbols in the two sequences do not match, and are thus considered placeholder positions in the pattern.
In general, a sequence may be considered in combination with one or more of the other sequences in the set S. The group of patterns of symbols common to combinations of sequences is known as an “n-tuple”, where “n” is the order of the tuple denoting the number of sequences being combined. For any set of k sequences, assuming the numeration of the sequence index begins at zero, the order number “n” may take any value up to (k−1). For example, as used herein, the group of patterns of symbols produced when sequences are taken together in pair-wise combination is referred to as a “2-tuple” (i.e., n=2). The group of patterns of symbols produced when sequences are considered in combination three-at-a-time may be referred to as a “3-tuple” (i.e., n=3).
Identification of Patterns
The first step of a method in accordance with the present invention is the identification of patterns of symbols common to each pair-wise combination of sequences (i.e., identifying the 2-tuple of patterns).
Preferably, any of the pattern identification methods disclosed in published U.S. patent application Ser. No. 2003-0220771-A1, Vaidyanathan et al., assigned to the assignee of the present invention, may be used. Published U.S. patent application Ser. No. 2003-0220771-A1 is hereby incorporated by reference herein.
The basic implementation of the method of the referenced incorporated patent application in the context of the present invention may be understood by considering the twenty-place sequence S_{0 }and the sixteen-place sequence S_{1 }of the set of sequences S_{0 }through S_{4}, thus:
The MOT Table Data Structure
The method of the referenced incorporated patent application is based upon the translation of a sequence written as a list of symbols into a position-based data structure that groups, for each symbol in the sequence, the position in the sequence occupied by each occurrence of that symbol, that is, by its position index. This position-based data structure is called the “Master Offset Table”, also referred to as a “MOT table”.
The MOT tables for S_{0 }and S_{1 }are as shown in
Thus, from the S_{0} MOT table it may be observed that the symbol “S” occurs at position indices 4 and 13 and the symbol “P” occurs at position indices 5, 14, 15, and 17 in the first sequence S_{0}. Similarly, from the S_{1} MOT table it may be observed that the symbol “S” occurs at position indices 2 and 15 and the symbol “P” occurs at position indices 3, 9, 10, 11, and 12 in the second sequence S_{1}.
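A MOT table can be sketched in Python as a mapping from each symbol to the list of position indices at which it occurs. The function name `build_mot` is hypothetical. The string `s0_stand_in` is also hypothetical: it is not the sequence S_{0} itself, but a twenty-place stand-in that places only the symbols whose locations are stated in this description (M, S, P, G, A) and fills every other location with the placeholder "X".

```python
from collections import defaultdict

def build_mot(sequence):
    """Map each symbol to the ascending list of its position indices."""
    mot = defaultdict(list)
    for index, symbol in enumerate(sequence):
        mot[symbol].append(index)
    return dict(mot)

# Hypothetical stand-in for S0: 'X' marks locations not specified in the text
s0_stand_in = "MXXXSPGXGXXXXSPPAPXX"
mot0 = build_mot(s0_stand_in)

s_positions = mot0["S"]  # [4, 13]
p_positions = mot0["P"]  # [5, 14, 15, 17]
```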
Pattern Map Data Structure
For all of the symbols in one sequence the difference-in-position between each occurrence of a symbol in that sequence and each occurrence of that same symbol in the other sequence is determined. The difference-in-position between an occurrence of a symbol of interest in the first sequence S_{0} and an occurrence of the same symbol in the second sequence S_{1} is the sum of:
Difference-in-position is determined by constructing another data structure called the “Pattern Map”. The Pattern Map is a table of difference-in-position values. In forming the Pattern Map only index differences from corresponding MOT columns are computed (i.e., A's from A's, C's from C's, etc.). By focusing on position differences the computational cost of exhaustive symbol-by-symbol comparison of the two sequences is avoided. The value of each row number in the Pattern Map corresponds to a value of a difference-in-position of a corresponding number of position indices. Thus, row “6” of the Pattern Map lists symbols that have a difference-in-position value of six, that is, that are six position indices apart.
The value of a difference-in-position between a symbol in the sequence S_{0} and an occurrence of that same symbol in the sequence S_{1} can be determined in several ways. In a preferred implementation, in order to compute the Pattern Map, all of the indices in one MOT table (e.g., the MOT table corresponding to sequence S_{1}) are offset by the length of the sequence S_{0}. In effect, the sequence S_{1} and the sequence S_{0} are concatenated. It should be noted that the order of concatenation is immaterial. For clarity of presentation the following description describes a situation where sequence S_{1} follows the sequence S_{0}. This offset results in non-negative indices in the Pattern Map. Then, for each element of each MOT table column, the index in MOT_{0} is subtracted from the offset index of MOT_{1}. The result (i.e., the difference-in-position) is the row index of the Pattern Map, and the value stored in that row is the position index from MOT_{0} (again by convention).
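The construction of a Pattern Map from two MOT tables can be sketched as follows. This is an illustrative sketch, not the implementation of the referenced application: the names `build_mot` and `build_pattern_map` and the two short DNA-like sequences are hypothetical, and the offset is exposed as a parameter that defaults to the length of the first sequence, which is one reading of the offset described above.

```python
from collections import defaultdict

def build_mot(sequence):
    """Map each symbol to the list of its position indices."""
    mot = defaultdict(list)
    for index, symbol in enumerate(sequence):
        mot[symbol].append(index)
    return dict(mot)

def build_pattern_map(seq0, seq1, offset=None):
    """Rows are difference-in-position values; each row lists the position
    indices in seq0 of symbols found in seq1 at that difference."""
    if offset is None:
        offset = len(seq0)  # assumed offset: length of the first sequence
    mot0, mot1 = build_mot(seq0), build_mot(seq1)
    pattern_map = defaultdict(list)
    for symbol, positions0 in mot0.items():
        # Only like symbols are compared (A's with A's, C's with C's, ...)
        for i1 in mot1.get(symbol, []):
            for i0 in positions0:
                pattern_map[(i1 + offset) - i0].append(i0)
    return {row: sorted(values) for row, values in pattern_map.items()}

# Hypothetical toy sequences
pm = build_pattern_map("ACGTAC", "TACGA")
# pm[7] == [0, 1, 2]: "ACG" occurs at 0-2 in the first sequence and 1-3 in the second
# pm[3] == [3, 4, 5]: "TAC" occurs at 3-5 in the first sequence and 0-2 in the second
```

Rows holding two or more position indices are the candidate patterns; rows with a single entry, such as `pm[10]`, are isolated matches.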
Referring to
The Pattern Map tabulates the symbols that have a given difference-in-position (that is, symbols that are that distance apart). The symbols are identified in the Pattern Map by their position index in the sequence S_{0}. The Pattern Map sets forth, for each value of a difference-in-position, the position in the sequence S_{0} of each symbol therein that appears in the sequence S_{1} at that difference-in-position. Thus, for example, referring to the Pattern Map of
As another example the row index numbered “14” tabulates the three symbols that are spaced apart by (that is, have a difference-in-position value of) fourteen. The numbers “14”, “15” and “17” appearing on that line of the table refer to those symbols that appear in the sequence S_{1} at a distance of fourteen from the appearance of that same symbol in the sequence S_{0}. By consulting sequence S_{0} it may be appreciated that:
position index “14” corresponds to symbol “P”;
position index “15” corresponds to symbol “P”; and
position index “17” corresponds to symbol “P”.
These symbols P, P and P comprise a pattern that occurs at a difference-in-position value of fourteen. Thus, a second pattern of symbols common to the pair-wise combination of sequences S_{0} and S_{1} (i.e., the 2-tuple of patterns [0,1]) is “PP•P”.
As another example the row index numbered “15” tabulates the three symbols that are spaced apart by (that is, have a difference-in-position value of) fifteen. The numbers “8”, “14” and “15” appearing on that line of the table refer to those symbols that appear in the sequence S_{1} at a distance of fifteen from the appearance of that same symbol in the sequence S_{0}. By consulting sequence S_{0} it may be appreciated that:
position index “8” corresponds to symbol “G”;
position index “14” corresponds to symbol “P”; and
position index “15” corresponds to symbol “P”.
These symbols G, P and P comprise a pattern that occurs at a difference-in-position value of fifteen. Thus, a third pattern of symbols common to the pair-wise combination of sequences S_{0 }and S_{1 }(i.e., the 2-tuple of patterns [0,1]) is “G•••••PP”.
As still another example the row index numbered “16” tabulates the three symbols that are spaced apart by (that is, have a difference-in-position value of) sixteen. The numbers “14”, “15” and “16” appearing on that line of the table refer to those symbols that appear in the sequence S_{1} at a distance of sixteen from the appearance of that same symbol in the sequence S_{0}. By consulting sequence S_{0} it may be appreciated that:
position index “14” corresponds to symbol “P”;
position index “15” corresponds to symbol “P”; and
position index “16” corresponds to symbol “A”.
These symbols P, P and A comprise a pattern that occurs at a difference-in-position value of sixteen. Thus, a fourth pattern of symbols common to the pair-wise combination of sequences S_{0 }and S_{1 }(i.e., the 2-tuple of patterns [0,1]) is “PPA”.
As yet another example the row index numbered “17” tabulates the four symbols that are spaced apart by (that is, have a difference-in-position value of) seventeen. The numbers “4”, “5”, “6” and “14” appearing on that line of the table refer to those symbols that appear in the sequence S_{1} at a distance of seventeen from the appearance of that same symbol in the sequence S_{0}. By consulting sequence S_{0} it may be appreciated that:
position index “4” corresponds to symbol “S”;
position index “5” corresponds to symbol “P”;
position index “6” corresponds to symbol “G”; and
position index “14” corresponds to symbol “P”.
These symbols S, P, G and P comprise a pattern that occurs at a difference-in-position value of seventeen. Thus, a final pattern of symbols common to the pair-wise combination of sequences S_{0} and S_{1} (i.e., the 2-tuple of patterns [0,1]) is “SPG•••••••P”.
Summarizing, the patterns SP•A, PP•P, G•••••PP, PPA, and SPG•••••••P are found in both of the sequences S_{0} and S_{1}, and thus comprise the group of patterns in the 2-tuple [0,1].
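The conversion of a row of position indices into a pattern with dots for gaps can be sketched as below. The function name `render_pattern` is hypothetical, and `s0_stand_in` is a hypothetical twenty-place stand-in for S_{0} that places only the symbols whose locations are stated in this description and uses "X" for every other location.

```python
def render_pattern(reference, indices, gap="•"):
    """Spell out the pattern between the first and last position index,
    writing a gap marker at every location not in the index set."""
    indices = sorted(indices)
    wanted = set(indices)
    first, last = indices[0], indices[-1]
    return "".join(reference[i] if i in wanted else gap
                   for i in range(first, last + 1))

# Hypothetical stand-in for S0: 'X' marks locations not specified in the text
s0_stand_in = "MXXXSPGXGXXXXSPPAPXX"

patterns = [render_pattern(s0_stand_in, row) for row in (
    [13, 14, 16],     # SP•A
    [14, 15, 17],     # PP•P
    [8, 14, 15],      # G•••••PP
    [14, 15, 16],     # PPA
    [4, 5, 6, 14],    # SPG•••••••P
)]
```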
In a similar manner the patterns of symbols common to each pair-wise combination of sequences (i.e., the 2-tuples of patterns) may be identified.
With reference to
In
For example, the [0,1] 2-tuple contains five patterns of symbols, labeled as “(a)” through “(e)” respectively (as identified above), viz.,
Similarly, the [0,2] 2-tuple contains two patterns of symbols, labeled as “(f)” and “(g)” respectively, viz.,
These patterns, as labeled above, are used in connection with fuller explanations of various aspects of the present invention hereinafter set forth.
The 2-tuples produced by the combination of the sequence S_{0 }with each of the other four sequences are shown across the top row of
Since patterns occur in combinations of sequences regardless of the order in which the sequences are combined, sequences need be combined only once. Thus, combinations of sequences need appear only once. In the context of
In general, by convention herein, sequences are combined in ascending sequence index order. The listing of sequences in a tuple identifier in all Figures reflects this convention. By combining sequences in an ascending sequence index order (the second sequence index of a pair-wise combination always being higher than the first sequence index) the identification of redundant patterns at the 2-tuple level is avoided. A convention which pair-wise combines sequences in descending sequence index order could also be used to avoid redundancies.
In any n-tuple one of the sequences is selected as a reference sequence. In practice, it is believed convenient to select the sequence having the lowest sequence index as the reference sequence. By convention, the first-listed sequence index in the tuple identifier for that combination of sequences designates the selected reference sequence. It should be understood that any other notational convention may be adopted. It should also be understood that any of the sequences in a combination may be selected as the reference sequence.
Position Index Numerical Array
The next step in the method in accordance with one embodiment of the invention is the creation of a position index numerical array (herein also referred to by the acronym “PINA”) for each identified pattern of symbols. The position index numerical array (PINA) representation of a pattern is an array of numerical values listing the set of position indices, each of which denotes the location in a selected reference sequence at which a symbol in that pattern occurs.
By way of example,
As may be seen from
With respect to the sequence S_{0} the symbols in the pattern “SP•A” occur at locations corresponding to position indices 13, 14, and 16, respectively. However, in the sequence S_{1} the symbols in the pattern “SP•A” occur at locations corresponding to position indices 2, 3, and 5, respectively.
Under the convention adopted herein the sequence S_{0}, having the lower sequence index, is selected as the reference sequence. Accordingly, a position index numerical array (PINA) comprising the set of position indices {13, 14, 16} represents the pattern “SP•A” by denoting the position index in the selected reference sequence (sequence S_{0}) of the 2-tuple at which each respective symbol in that pattern occurs.
In a similar manner the position index numerical array (PINA) for each pattern produced by each pair-wise combination of sequences may be derived. In
A pseudo-code program for creating the position index numerical array (PINA) representing a pattern is as follows:
parameter: symbol-index tuple T
begin;
    allocate empty destination PINA tuple D;
    allocate empty scratch PINA S;
    for each symbol-index pattern P in T
    {
        for each symbol-index-pair Y in P
        {
            append Y.index to S;
        }
        copy S to D;
        empty S;
    }
end;
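A runnable Python rendering of the PINA pseudo-code might look as follows. It is a sketch under assumptions: the function name `create_pinas` is hypothetical, and a symbol-index tuple is modeled as a list of patterns, each pattern being a list of (symbol, position index) pairs.

```python
def create_pinas(symbol_index_tuple):
    """Return one PINA (list of position indices) per pattern in the tuple."""
    destination = []
    for pattern in symbol_index_tuple:
        # Keep only the position index of each symbol-index pair
        scratch = [index for _symbol, index in pattern]
        destination.append(scratch)
    return destination

# Two of the [0,1] patterns, indexed against the reference sequence S0
two_tuple = [
    [("S", 13), ("P", 14), ("A", 16)],  # SP•A
    [("P", 14), ("P", 15), ("P", 17)],  # PP•P
]
pinas = create_pinas(two_tuple)  # [[13, 14, 16], [14, 15, 17]]
```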
Position Index Binary Array
Each identified pattern of symbols for a 2-tuple may alternatively be represented in the form of a position index binary array (herein also referred to by the acronym “PIBA”). A position index binary array (PIBA) is a set of binary digits. Each place in the binary array corresponds to a location in the sequence. The binary digit in each place in a position index binary array (PIBA) that corresponds to a location in a selected reference sequence having a symbol in an identified pattern is assigned a first predetermined binary value (e.g., “1”). All other binary digits in the position index binary array (PIBA) are assigned the second predetermined binary value (i.e., “0”).
It is apparent that a position index binary array (PIBA) must have a length (i.e., number of places) at least equal to the number of locations in the sequence to which the array corresponds. When two sequences of unequal length are combined to identify patterns the position index binary array (PIBA) used to represent each pattern must have a length at least equal to the length of the reference sequence. It may have a length at least equal to the length of the longer of the sequences in the combination. It may be practical in some implementations to make the length of all position index binary arrays (PIBAs) at least as long as the length of the longest sequence in the set of sequences being considered. Preferably, the length of the position index binary arrays (PIBAs) should be an integral number of word lengths used by the architecture of the computing system implementing the method of the present invention.
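The preferred sizing rule (a length that covers the longest sequence and is an integral number of machine words) can be sketched as below. The function name `piba_length`, the default 64-bit word size, and the example sequence lengths are assumptions made for illustration.

```python
def piba_length(sequence_lengths, word_bits=64):
    """Smallest multiple of word_bits that covers the longest sequence."""
    longest = max(sequence_lengths)
    words = -(-longest // word_bits)  # ceiling division
    return words * word_bits

# Hypothetical sequence lengths
length_64 = piba_length([20, 16, 23])              # 64
length_8 = piba_length([20, 16, 23], word_bits=8)  # 24
```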
With respect to the reference sequence S_{0} it may be seen that the symbols in the pattern “SP•A” occur at locations corresponding to position indices 13, 14 and 16, respectively. Accordingly, a position index binary array (PIBA) representing the pattern “SP•A” has a binary digit with the value “1” in the places in the position index binary array (PIBA) corresponding to the position indices 13, 14 and 16, respectively.
In
A pseudo-code program for creating a position index binary array (PIBA) representing a pattern is as follows:
parameters: symbol-index tuple T, length of PIBAs L
begin;
  allocate empty destination PIBA tuple D;
  allocate scratch PIBA S of length L;
  for each symbol-index pattern P in T
  {
    for each bit S_{i} in S
    {
      S_{i} = 0;
    }
    for each symbol-index-pair Y in P
    {
      S_{Y.index} = 1;
    }
    copy S to D;
  }
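The creation of a position index binary array may be sketched in Python as follows. This is an illustration under stated assumptions: the array is held as a list of 0/1 digits, the names `make_piba` and `padded_length` are hypothetical, and the 32-bit word length is an assumed value, since the specification states only that an integral number of word lengths is preferred.

```python
from typing import List

WORD_BITS = 32  # assumed machine word length; not fixed by the specification


def padded_length(sequence_length: int, word_bits: int = WORD_BITS) -> int:
    """Round a PIBA length up to a whole number of machine words."""
    return -(-sequence_length // word_bits) * word_bits


def make_piba(indices: List[int], length: int) -> List[int]:
    """Return `length` binary digits with a 1 at each position index of the pattern."""
    bits = [0] * length
    for index in indices:
        bits[index] = 1
    return bits


# Pattern "SP•A" at position indices 13, 14 and 16 of a twenty-symbol reference sequence.
piba = make_piba([13, 14, 16], 20)
print("".join(map(str, piba)))  # 00000000000001101000
print(padded_length(20))        # 32
```

The leftmost digit corresponds to position index 0, matching the convention used for the binary arrays written out in the text.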
Creating 3-Tuples of Patterns
The next step of the method of the present invention is to take pair-wise combinations of all 2-tuples that share a common reference sequence to identify patterns of symbols in the resulting 3-tuples.
As is depicted in
In accordance with the present invention 2-tuples may be pair-wise combined using either the position index numerical array (PINA) representation of patterns (
When using position index numerical arrays (PINAs) patterns are identified from the position index numerical arrays (PINAs) produced by the intersection of the set of position indices in each position index numerical array (PINA) in one 2-tuple with the set of position indices in each position index numerical array (PINA) in the other 2-tuple. Each position index numerical array (PINA) so defined represents a pattern in a 3-tuple of patterns.
As shown in
These sets of position indices are intersected by sequentially comparing each position index of one position index numerical array (PINA) with each of the position indices of the other position index numerical array (PINA).
As specifically depicted in
Next, the second position index in pattern (g) (here, “5”) is compared with each of the indices (4, 5, 6, and 14) of pattern (e). Again, a matching index resulting from this comparison (here, “5”) is stored.
Finally, the third position index in pattern (g) (here, “14”) is compared with each of the indices (4, 5, 6, and 14) of pattern (e). The resulting matching index (“14”) is stored.
The set of stored matching position indices {4, 5, 14} collectively defines a position index numerical array (PINA) representing an identified pattern in the [0,1,2] 3-tuple. The position index numerical array (PINA) representing the identified pattern is converted into the corresponding symbols by mapping the indices (“4, 5, 14”) in the array to the respective symbols in the reference sequence S_{0}. The identified pattern of symbols is “SP••••••••P”.
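The conversion of a position index numerical array back into a pattern of symbols may be sketched in Python as follows. The reference sequence used here is hypothetical: it was chosen only so that “S” falls at index 4 and “P” at indices 5 and 14, and it is not the sequence S_{0} of the specification. The function name `render_pattern` is likewise an assumption.

```python
def render_pattern(reference: str, indices, wildcard: str = "•") -> str:
    """Spell out the pattern from its first to its last position index.

    Positions that belong to the pattern show the reference symbol;
    positions in between that do not belong show the wildcard.
    """
    chosen = set(indices)
    first, last = min(chosen), max(chosen)
    return "".join(
        reference[i] if i in chosen else wildcard
        for i in range(first, last + 1)
    )


# Hypothetical twenty-symbol reference sequence (illustration only).
reference = "ACDESPKLMNQRTSPGAVWY"
print(render_pattern(reference, [4, 5, 14]))  # SP••••••••P
```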
A pseudo-code program for creating the intersection of the set of position indices of one position index numerical array (PINA) with the set of position indices of another position index numerical array (PINA) is as follows:
parameters: PINA tuple T, PINA tuple U
begin;
  determine length L of longest pattern in T;
  allocate empty destination PINA tuple D;
  allocate empty scratch PINA S;
  for each pattern P in T
  {
    for each pattern Q in U
    {
      for each numeric index M in Q
      {
        if (M appears in P) append M to S;
      }
      if (S is non-empty)
      {
        copy S into D;
        empty S;
      }
    }
  }
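The intersection of position index numerical arrays may be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the function name `intersect_pina_tuples` is hypothetical, and a Python `set` is used for the membership test in place of the index-by-index comparison described in the text, which gives the same result.

```python
from typing import List


def intersect_pina_tuples(t: List[List[int]], u: List[List[int]]) -> List[List[int]]:
    """Intersect every PINA in tuple t with every PINA in tuple u.

    As in the pseudo-code, every non-empty intersection is kept,
    including one that holds only a single position index.
    """
    result = []
    for p in t:
        p_set = set(p)  # assumption: a set replaces the pairwise comparison
        for q in u:
            common = [m for m in q if m in p_set]
            if common:
                result.append(common)
    return result


pattern_e = [4, 5, 6, 14]  # from the [0,1] 2-tuple
pattern_g = [4, 5, 14]     # from the [0,2] 2-tuple
print(intersect_pina_tuples([pattern_e], [pattern_g]))  # [[4, 5, 14]]
```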
As previously noted 2-tuples may be pair-wise combined using the position index binary array (PIBA) representation of patterns.
The sequence S_{0} has twenty symbols located in position indices 0 through 19. The sequence S_{1} has sixteen symbols located in position indices 0 through 15. The sequence S_{2} contains nineteen symbols located in position indices 0 through 18.
Since sequence S_{0} is the reference sequence, the length of the position index binary array (PIBA) representations for patterns in these 2-tuples is determined by the length of the reference sequence S_{0}.
As shown in
By way of example, the position index binary array (PIBA) representation of the pattern (e) in the [0,1] 2-tuple is: 00001110000000100000.
The position index binary array (PIBA) representation of the pattern (g) in the [0,2] 2-tuple is: 00001100000000100000.
To define the position index binary array (PIBA) that represents a pattern in a 3-tuple the set of binary digits of the position index binary array (PIBA) of the pattern (e) from one 2-tuple is intersected with the set of binary digits of the position index binary array (PIBA) of the pattern (g) from the other 2-tuple. The intersection is accomplished by performing a logical AND operation in a bit-by-bit manner on the position index binary arrays (PIBAs).
The position index binary array (PIBA) representation of the pattern produced by the logical AND operation is used to identify the common pattern. Using the places in the position index binary array (PIBA) produced by the intersection having the first predetermined binary value as a guide, the symbols in corresponding locations in the reference sequence are identified. These symbols comprise the symbols in the identified pattern in the 3-tuple.
Performing the same logical operation using each of the position index binary arrays (PIBAs) in one 2-tuple with each position index binary array (PIBA) in the other 2-tuple yields the position index binary arrays (PIBAs) of all patterns in the 3-tuple. The position index binary arrays (PIBAs) and the common patterns represented thereby for all 3-tuples are shown in
It is noted that, as implemented in the discussed example, the binary value “1” has been used to represent symbols in a pattern and the logical operation used to perform the intersection is the logical AND function. It should be understood that alternative representations of symbols in a pattern and complementary logical operations may also be used and remain within the contemplation of the present invention.
A pseudo-code program for creating the intersection of the set of position indices of one position index binary array (PIBA) with the set of position indices of another position index binary array (PIBA) is as follows:
parameters: PIBA tuple T, PIBA tuple U, length of PIBAs L
begin;
  allocate empty destination PIBA tuple D;
  allocate empty scratch PIBA S of length L;
  for each pattern P in T
  {
    for each pattern Q in U
    {
      S = P bitwise-logical-AND Q;
      if (any bit S_{i} in S is 1)
      {
        copy S into D;
      }
    }
  }
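The bitwise intersection may be sketched in Python as follows, using the binary arrays of patterns (e) and (g) given in the text. It is an assumption of this sketch that each position index binary array is packed into a single Python integer, so that one `&` operation intersects the whole array; the helper names are hypothetical.

```python
from typing import List


def piba_from_string(bits: str) -> int:
    """Pack a string of binary digits (leftmost digit is place 0) into an integer."""
    return int(bits, 2)


def piba_to_string(value: int, length: int) -> str:
    """Unpack an integer into `length` binary digits, restoring leading zeros."""
    return format(value, "0{}b".format(length))


def intersect_piba_tuples(t: List[int], u: List[int]) -> List[int]:
    """AND every PIBA in t with every PIBA in u, keeping results with any 1 bit."""
    result = []
    for p in t:
        for q in u:
            s = p & q
            if s:  # at least one place holds the value 1
                result.append(s)
    return result


pattern_e = piba_from_string("00001110000000100000")  # [0,1] 2-tuple
pattern_g = piba_from_string("00001100000000100000")  # [0,2] 2-tuple
common = intersect_piba_tuples([pattern_e], [pattern_g])
print(piba_to_string(common[0], 20))  # 00001100000000100000
```

Because the AND is applied place by place, the bit order chosen when packing does not affect the result, provided the same order is used for both operands.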
Alternatively, the set of position indices of position index numerical array (PINA) representations of patterns from one 2-tuple may be intersected with the set of position indices of the position index numerical arrays (PINAs) of the patterns from the other 2-tuple by first converting the position index numerical arrays (PINAs) to the corresponding position index binary array (PIBA) representations and logically ANDing the same.
The resultant position index binary array (PIBA) representations are converted back to the position index numerical array (PINA) representations.
A pseudo-code program for implementing this alternative intersection is as follows:
parameters: PINA tuple T, PINA tuple U
begin;
  determine length L of longest pattern in T and U;
  allocate bit arrays B and C of length L;
  allocate scratch bit array S of length L;
  allocate empty scratch PINA R;
  allocate empty destination PINA tuple D;
  for each bit B_{i} in B
  {
    B_{i} = 0;
  }
  for each bit C_{i} in C
  {
    C_{i} = 0;
  }
  for each pattern P in T
  {
    for each numeric index N in P
    {
      B_{N} = 1;
    }
    for each pattern Q in U
    {
      for each numeric index M in Q
      {
        C_{M} = 1;
      }
      S = B bitwise-logical-AND C;
      if (any bit S_{i} in S is 1)
      {
        for each bit S_{i} in S
        {
          if (S_{i} is 1) append i to R;
        }
        copy R to D;
        empty R;
      }
      for each numeric index M in Q
      {
        C_{M} = 0;
      }
    }
    for each numeric index N in P
    {
      B_{N} = 0;
    }
  }
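The alternative intersection (numerical arrays converted to binary arrays, ANDed, and converted back) may be sketched in Python as follows. This is a sketch under stated assumptions: each binary array is packed into a Python integer with bit i standing for position index i, and the function names are hypothetical.

```python
from typing import List


def pina_to_bits(pina: List[int]) -> int:
    """Set bit i of an integer for every position index i in the PINA."""
    bits = 0
    for index in pina:
        bits |= 1 << index
    return bits


def bits_to_pina(bits: int, length: int) -> List[int]:
    """List, in ascending order, the position indices whose bit is 1."""
    return [i for i in range(length) if (bits >> i) & 1]


def intersect_via_bits(t: List[List[int]], u: List[List[int]]) -> List[List[int]]:
    """Intersect every PINA in t with every PINA in u by way of a bitwise AND."""
    # The arrays must be long enough to hold the largest position index.
    length = 1 + max((i for pina in t + u for i in pina), default=-1)
    result = []
    for p in t:
        b = pina_to_bits(p)
        for q in u:
            s = b & pina_to_bits(q)
            if s:
                result.append(bits_to_pina(s, length))
    return result


print(intersect_via_bits([[4, 5, 6, 14]], [[10, 17, 19], [4, 5, 14]]))  # [[4, 5, 14]]
```

Building a fresh bit array for each pattern avoids the need to clear B and C between iterations.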
The identification of common patterns in 3-tuples may be performed by a hybrid operation that uses the position index numerical array (PINA) representations of the patterns in one 2-tuple taken with the position index binary array (PIBA) representations of the patterns in the other 2-tuple. This implementation is illustrated in
In
These position index binary array (PIBA) representations are assembled in a rectangular array resembling a “scoreboard”. The rows of the scoreboard respectively contain the position index binary array (PIBA) representations of the patterns (a) through (e). The columns of the scoreboard identify the places within the position index binary arrays (PIBAs).
For the [0,2] 2-tuple the position index numerical array (PINA) representations of patterns (f) and (g) are created in accordance with the techniques shown in
Each position index in each position index numerical array (PINA) in the [0,2] 2-tuple is used to interrogate the places in the position index binary arrays (PIBAs) of the [0,1] 2-tuple. The interrogation is designed to identify the places in each position index binary array (PIBA) in the [0,1] 2-tuple that have the first predetermined binary value (i.e., “1”). These operations are illustrated in
The pattern (f) is the first pattern in the [0,2] 2-tuple. This position index numerical array (PINA) for pattern (f) contains the numerical position indices 10, 17 and 19.
As shown by the solid line from numeric position index value “10” in the pattern (f), the tenth places in the position index binary arrays (PIBAs) (shown enclosed by the solid oval) are interrogated. It can be seen that none of the binary digits in that tenth place of any of the position index binary arrays (PIBAs) contain the predetermined binary value (i.e., a “1”).
The next position index in the position index numerical array (PINA) for pattern (f) (a value “17”) is next taken as the interrogator. The solid line from numeric index “17” terminates in a solid oval enclosing the seventeenth places in the scoreboard of the position index binary arrays (PIBAs). This interrogation identifies the fact that the predetermined binary value “1” is present in the seventeenth place in the position index binary array (PIBA) for pattern (b).
Similarly, the last position index in the position index numerical array (PINA) for pattern (f) (a value “19”) is next taken as the interrogator. The solid line from the numeric index “19” terminates in a solid oval enclosing the nineteenth places in the scoreboard of the position index binary arrays (PIBAs). None of the binary digits in the nineteenth places of any of the position index binary arrays (PIBAs) in the scoreboard contain a binary “1”.
The interrogation by the position indices in the position index numerical array (PINA) for pattern (f) results in an output numeric array containing only the value “17”, the place in the position index binary arrays (PIBAs) of the [0,1] 2-tuple that contains the binary value “1”. No patterns (i.e., two or more symbols) are identified by this interrogation.
The second pattern in the [0,2] 2-tuple, i.e., the position index numerical array (PINA) for the pattern labeled (g) is considered next. This position index numerical array (PINA) for pattern (g) contains the numerical position indices 4, 5 and 14.
As shown by the dashed line from the first numeric position index value “4” in the position index numerical array (PINA) for the pattern (g), the places in the position index binary arrays (PIBAs) shown enclosed by the dashed oval are interrogated. This interrogation reveals that the predetermined binary value “1” is present only in the place corresponding to the position index “4” in the position index binary array (PIBA) for pattern (e).
The next position index in the position index numerical array (PINA) for the pattern (g) (a value “5”) is next taken as the interrogator. The dashed line from numeric index “5” terminates in a dashed oval enclosing the illustrated places in the position index binary arrays (PIBAs). The predetermined binary value “1” is again found only in the place corresponding to the position index “5” in the position index binary array (PIBA) for pattern (e).
The value “14” is the last position index in the position index numerical array (PINA) for the pattern (g). This value is next taken as the interrogator. The dashed line from this numeric index “14” terminates in a dashed oval enclosing the corresponding places in the position index binary arrays (PIBAs). The predetermined binary value “1” is again found only in the place corresponding to the position index “14” in the position index binary array (PIBA) for pattern (e).
The interrogation by the position indices of the position index numerical array (PINA) for the pattern (g) is seen to produce five output numeric arrays respectively containing the values “14”; “14”; “14”; “14”; and “4, 5, 14”. Thus, the interrogation by the indices in the second pattern of the [0,2] 2-tuple identifies only the pattern represented by the position indices “4, 5, 14” as being present in the 3-tuple.
The identities of those places in the scoreboard of position index binary arrays (PIBAs) having the first predetermined binary value may be used to define one or more position index numerical arrays (PINAs) that each represent a pattern in a 3-tuple of patterns. The position index numerical arrays (PINAs) of the patterns in the 3-tuple of patterns defined in step (d) are then converted into the symbols represented thereby in the same manner as shown in
A pseudo-code program for implementing the “scoreboard” method is as follows:
parameters: PINA tuple T, PINA tuple U
begin;
  determine length L of longest pattern in T;
  allocate bit array B of length L;
  allocate empty destination PINA tuple D;
  allocate empty scratch PINA S;
  for each bit B_{i} in B
  {
    B_{i} = 0;
  }
  for each pattern P in T
  {
    for each numeric index N in P
    {
      B_{N} = 1;
    }
    for each pattern Q in U
    {
      for each numeric index M in Q
      {
        if (B_{M} is 1) append M to S;
      }
      if (S is non-empty)
      {
        copy S into D;
        empty S;
      }
    }
    for each numeric index N in P
    {
      B_{N} = 0;
    }
  }
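The “scoreboard” interrogation may be sketched in Python as follows. The sketch is illustrative only. The function names are hypothetical, and the first scoreboard row (indices 3 and 17) is an invented stand-in, since the text states only that one pattern of the [0,1] 2-tuple has the value “1” in place 17. The second row is pattern (e), and the interrogators are patterns (f) and (g).

```python
from typing import List


def build_scoreboard(pinas: List[List[int]], length: int) -> List[List[int]]:
    """Return one row of 0/1 digits per pattern; columns are position indices."""
    board = []
    for pina in pinas:
        row = [0] * length
        for index in pina:
            row[index] = 1
        board.append(row)
    return board


def interrogate(board: List[List[int]], pinas: List[List[int]]) -> List[List[int]]:
    """Use each position index of each PINA to look up the matching place in each row."""
    outputs = []
    for row in board:
        for pina in pinas:
            hits = [index for index in pina if row[index] == 1]
            if hits:
                outputs.append(hits)
    return outputs


# Row 0 is hypothetical; row 1 is pattern (e) of the [0,1] 2-tuple.
board = build_scoreboard([[3, 17], [4, 5, 6, 14]], 20)
# Patterns (f) and (g) of the [0,2] 2-tuple act as interrogators.
outputs = interrogate(board, [[10, 17, 19], [4, 5, 14]])
print(outputs)  # [[17], [4, 5, 14]]

# Only outputs with two or more position indices qualify as patterns.
print([o for o in outputs if len(o) >= 2])  # [[4, 5, 14]]
```

Each lookup is a direct array access, so the cost of interrogating one row grows with the number of indices in the interrogating array rather than with the length of the sequence.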
Alternatively, the pattern of symbols in the reference sequence S_{0} at the locations “4, 5, 14” (corresponding to the identified places in the scoreboard of position index binary arrays (PIBAs) having the first predetermined binary value) is directly identified in the same manner as shown in
The “scoreboard” of binary array representations may be indirectly assembled by first creating the position index numerical array (PINA) representations of the patterns of the [0,1] 2-tuple using the techniques discussed in connection with
The principles of the present invention hereinbefore set forth and used to illustrate the combination of 2-tuples sharing a common reference sequence to produce a 3-tuple may be readily extended to situations involving greater numbers of sequences than heretofore described (i.e., situations where k is greater than four) and combinations of still higher order n-tuples sharing a common reference sequence than heretofore described, i.e., “n” has any value up to (k−1).
The extension of these principles may be better understood from
In general, tuples at any order “n” that share a common reference sequence may be pair-wise combined. Such pair-wise combinations may be effected using either: (i) the position index numerical array (PINA) representations of patterns as fully discussed in connection with
The pattern representations of each tuple at any order “n” may be combined with the pattern representations of all other tuples at that order sharing a common reference sequence, provided patterns exist in each n-tuple.
Consider the grouping of 4-tuples. Each 4-tuple (as identified by the sequence indices listed in its tuple identifier) may be combined with any other 4-tuple to produce a resultant tuple. For example, the [0234] 4-tuple combined with the [0235] 4-tuple produces the [02345] 5-tuple. The same [0234] 4-tuple combined with the [0145] 4-tuple produces the [012345] 6-tuple.
It should thus be appreciated from the foregoing that combinations of 4-tuples can produce a tuple at the next-higher order [i.e., 5-order] as well as a still-higher 6-order tuple. In general, combination of n-tuples may produce resultant tuples at the next-higher [i.e., (n+1)] or at still-higher [i.e., (n+2) or above] orders, up to the (k−1)-order. The order of the resultant tuple is determined by the number of different sequence indices in the tuple identifiers of one tuple as against the sequence indices in the tuple identifier of the other tuple being pair-wise combined. If “p” is the number of different sequence indices in the tuple identifiers of one tuple as against the sequence indices in the tuple identifier of the other tuple with which it is being pair-wise combined, then resultant tuple is an (n+p)-tuple.
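The rule for the order of the resultant tuple may be sketched in Python as follows. It is an assumption of this sketch that a tuple identifier is held as a Python tuple of sequence indices with the reference sequence index first; the function names are hypothetical.

```python
from typing import Tuple

TupleId = Tuple[int, ...]


def order_gain(a: TupleId, b: TupleId) -> int:
    """Return p, the number of sequence indices in b that do not appear in a."""
    return len(set(b) - set(a))


def combine_identifiers(a: TupleId, b: TupleId) -> TupleId:
    """Return the identifier of the tuple produced by pair-wise combining a and b."""
    if a[0] != b[0]:
        raise ValueError("tuples must share a common reference sequence")
    rest = sorted(set(a[1:]) | set(b[1:]))
    return (a[0],) + tuple(rest)


first = (0, 2, 3, 4)
print(combine_identifiers(first, (0, 2, 3, 5)), order_gain(first, (0, 2, 3, 5)))
# (0, 2, 3, 4, 5) 1
print(combine_identifiers(first, (0, 1, 4, 5)), order_gain(first, (0, 1, 4, 5)))
# (0, 1, 2, 3, 4, 5) 2
```

In both examples the order of the result equals n + p, i.e. 4 + 1 = 5 and 4 + 2 = 6.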
This “leapfrog effect”, i.e., jumping to higher order tuples, is especially advantageous when large numbers of long sequences are involved. Leapfrogging to higher order tuples allows patterns having high levels of support to be found without the necessity of first finding all patterns at all lower levels of support.
However, the ability to leap to higher order tuples has a cost. Pair-wise combinations of n-tuples of the same order result in redundant pattern identifications. For example, if the [0234] 4-tuple is combined with the [0245] 4-tuple the same [02345] 5-tuple as produced earlier is again produced.
In order to reduce redundant pattern identifications the representations of the patterns in a first n-tuple should be only combined with pattern representations of those other n-tuples that include in their tuple identifiers at least one sequence index greater than the sequence indices included in the tuple identifier of the first n-tuple. For example, if the highest sequence index in the tuple identifier of a first n-tuple is the number “x”, in order to avoid redundant identifications, that n-tuple should only be combined with those n-tuples whose tuple identifier includes at least one sequence index having a value greater than “x”.
Redundancies involving pair-wise combinations of n-tuples that share the same reference sequence may be eliminated provided that, aside from the reference sequence, all of the sequence indices in the identifier of one n-tuple are different from those of the other n-tuple.
The pattern representations in any higher order tuple may also be combined pair-wise with the pattern representations of any selected lower-order tuple. That is, the representations in any n-tuple may be combined with the pattern representations in any selected m-tuple, where m may have any integer value from 2 to (n−1). The resulting tuple may be one order higher or more than one order higher (leapfrog effect), again depending upon the number of different sequence indices in the tuple identifiers of the tuples combined.
Referring to
Pair-wise combinations of an n-tuple with a lower order tuple may also result in redundant pattern identifications. For example, if the [1245] 4-tuple is combined with the [156] 3-tuple the same [12456] 5-tuple is again produced.
Accordingly, in order to reduce redundant pattern identifications the representations of the patterns in an n-tuple should be only combined with pattern representations of a lower-order tuple that includes in its tuple identifier at least one sequence index greater than the sequence indices included in the tuple identifier of the n-tuple. If the highest sequence index in the tuple identifier of the n-tuple is the number “y”, that n-tuple should only be combined with a lower-order tuple whose tuple identifier includes at least one sequence index having a value greater than “y”.
To eliminate redundancies involving pair-wise combinations of representations of patterns in an n-tuple with a lower order tuple that shares the same reference sequence, all of the sequence indices of the lower order tuple other than the reference sequence index must be different from those of the n-tuple.
The most preferred pair-wise combinations are those involving the representations of patterns in an n-tuple with the representations of patterns in a 2-tuple that shares the same reference sequence and whose tuple identifier includes a sequence index greater than the sequence indices included in the identification of the n-tuple, provided there exist patterns in each n-tuple and 2-tuple. Combining an n-tuple with such a 2-tuple ensures that no redundant pattern representations are produced by the comparison, while finding all patterns at successive levels of support.
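The selection of the most preferred 2-tuple partners may be sketched in Python as follows. This is a sketch under stated assumptions: sequences are numbered 0 through k−1, a tuple identifier is a Python tuple with the reference sequence index first, and the function name `preferred_partners` is hypothetical. The proviso that patterns must exist in both tuples is not modelled here.

```python
from typing import List, Tuple

TupleId = Tuple[int, ...]


def preferred_partners(identifier: TupleId, k: int) -> List[TupleId]:
    """Return the 2-tuples that share the reference sequence of `identifier`
    and whose other sequence index exceeds every index in `identifier`."""
    reference = identifier[0]
    highest = max(identifier)
    return [(reference, j) for j in range(highest + 1, k)]


three_tuple = (0, 1, 2)
for partner in preferred_partners(three_tuple, 5):
    # Appending the partner's new sequence index yields the next-higher order tuple.
    print(partner, "->", three_tuple + (partner[1],))
# (0, 3) -> (0, 1, 2, 3)
# (0, 4) -> (0, 1, 2, 4)
```

Since each partner contributes exactly one new sequence index, and that index is larger than any already present, each resultant tuple identifier can be reached by only one chain of combinations.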
An example of these most preferred pair-wise combinations is shown in
As seen from
The combination of the [0,1,2] 3-tuple with the [0,3] 2-tuple is indicated by the dashed lines. The next-higher order tuple resulting from this combination is the [0,1,2,3] 4-tuple. The combination of the [0,1,2] 3-tuple with the [0,4] 2-tuple is indicated by the dot-dash lines. The next-higher order tuple resulting from this combination is the [0,1,2,4] 4-tuple.
Similarly, as seen from
The methods of the present invention may be implemented using any suitable computing system, such as a desktop personal computer running under any operating system, such as Windows® (Microsoft Corporation, Redmond, Wash.). Alternatively, a workstation such as that available from Sun Microsystems, Inc., running under a Unix-based operating system may be used. Computer architectures employing wider internal data busses accommodating longer word lengths (e.g., greater than 32 bits) are believed most advantageous.
The program of instructions (typically written in C++ language) and data structures of the present invention may be stored on any suitable computer readable medium, such as a magnetic storage medium (such as a “hard disc” or a “floppy disc”), an optical storage medium (such as a “CD-ROM”), or semiconductor storage medium [such as static or dynamic random access memory (RAM)].
While all of the methods described above operate in a computer-efficient manner, those employing the position index binary array (PIBA) representations of patterns are believed to be the most computer-efficient. That is, they require the minimum of computer resources (amount of memory, number of registers) and execute in the minimum number of machine-language instructions (number of CPU cycles).
The methods employing the position index binary array (PIBA) representations of patterns can also benefit from the use of a vector processor, i.e., an auxiliary processor device that operates on arrays in a single machine cycle. Vector processors having long word lengths, where each word can accommodate an entire position index binary array (PIBA) representation of a pattern, are especially advantageous. The logical AND-ing of entire position index binary array (PIBA) representations of patterns in a single CPU cycle further reduces the time required for a computer to perform the method of the present invention.
Those skilled in the art, having the benefits of the teachings of the present invention as hereinabove set forth, may effect numerous modifications thereto. Such modifications are to be construed as lying within the contemplation of the present invention, as defined by the appended claims.
U.S. Classification | 1/1, 707/999.006, 707/999.101 |
International Classification | G06F17/30 |
Cooperative Classification | Y10S707/99942, G06F7/02, G06F2207/025, Y10S707/99936 |
European Classification | G06F7/02 |
Date | Code | Event | Description |
---|---|---|---|
Nov 21, 2007 | AS | Assignment | Owner name: E. I. DU PONT DE NEMOURS AND COMPANY, DELAWARE Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:ARGENTAR, DAVID RUBEN;REEL/FRAME:020145/0673 Effective date: 20060411 |
Dec 21, 2011 | FPAY | Fee payment | Year of fee payment: 4 |