US 20070185910 A1
Similar concepts are deduced in consideration of relationships among concepts belonging to a plurality of categories. Concepts belonging to a plurality of categories are shown in the form of a graph in which concepts are represented by nodes and relationships between pairs of concepts are represented by edges. The number of crossings of edges linking pairs of concepts belonging to categories is reduced. Similar concepts are deduced multilaterally in consideration of the relationships between the pairs of concepts belonging to the categories.
1. A similar concept extraction system, comprising:
a data reception unit that receives, from a first input, information on concepts to be included in a plurality of layers, and, from a second input, information on databases to be used for at least two adjacent layers, and that receives information on relationships between concepts in and between adjacent layers;
a graph creation unit that creates at least one graph in which nodes represent the concepts acquired by the data reception unit, edges represent the relationships between concepts, and wherein the nodes included in the adjacent layers are linked by the edges;
a number-of-edge crossings reduction unit that modifies arrays of the nodes in and across the adjacent layers to reduce a number of edge crossings in the graph; and
a display device on which the graph is displayed.
2. The similar concept extraction system according to
3. The similar concept extraction system according to
4. The similar concept extraction system according to
5. The similar concept extraction system according to
6. The similar concept extraction system according to
7. The similar concept extraction system according to
8. The similar concept extraction system according to
9. The similar concept extraction system according to
10. The similar concept extraction system according to
11. The similar concept extraction system according to
12. The similar concept extraction system according to
13. A similar concept extraction method comprising the steps of:
creating a graph in which a first set of concepts is regarded as a layer, the concepts included in one of the layers are represented by nodes and arrayed one-dimensionally, relationships between pairs of the concepts are represented by edges, and the nodes included in adjacent ones of the layers are linked by respective ones of the edges;
modifying the arrays of the nodes in a respective one of the layers so as to minimize a number of edge crossings in the graph; and
displaying the graph having the number of the edge crossings minimized.
14. The similar concept extraction method according to
15. The similar concept extraction method according to
16. The similar concept extraction method according to
The present application claims priority from Japanese application JP-2006-030037 filed on Feb. 7, 2006, the content of which is hereby incorporated by reference into this application.
The present invention relates to a system and method for graphically showing concepts and relationships among the concepts, optimizing the graphic structure under certain conditions, and thus extracting similar concepts and relationships among the concepts.
As one method for estimating similarities among concepts, there has been suggested a method of representing the features of each concept with numerical values or vectors whose elements are based on other concepts, and ranking the similarities in descending order of the results of inner product operations (refer to “U-statistic Hierarchical Clustering” (D'Andrade, R., 1978, Psychometrika, Vol. 4, pp. 58-67)).
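The conventional approach described above can be illustrated with a minimal sketch. The feature vectors and the names `inner_product` and `rank_by_similarity` are hypothetical and chosen only for illustration; the cited reference may define its statistic differently.

```python
def inner_product(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_by_similarity(features, target):
    """Return the other concepts in descending order of inner product with target."""
    scores = {
        name: inner_product(features[target], vec)
        for name, vec in features.items()
        if name != target
    }
    # Ties are broken alphabetically so that the result is deterministic.
    return sorted(scores, key=lambda name: (-scores[name], name))


# Hypothetical data: each element marks a relation to some other concept.
features = {
    "geneA": [1, 1, 0],
    "geneB": [1, 1, 1],
    "geneC": [0, 0, 1],
}
ranking = rank_by_similarity(features, "geneA")
```

In this sketch the vector elements are treated as independent of one another, which is the limitation discussed in the following paragraph.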
The conventional method fails to take account of the similarities among elements representing features of concepts. Even if the similarities among concepts concerned are deduced after defining the similarities among elements, since the similarities among elements are predefined, the similarities cannot be defined with the relationships to concepts other than the concepts concerned relatively established.
Along with the further advancement of studies in various scholarly fields, the relationships among concepts will presumably be discussed from unprecedented angles. Assume that medical concepts are classified into significant categories such as genes, approved drugs, and diseases. In this case, a concept often belongs to several significant categories; for example, insulin is both a genetic product and an approved drug. Since a concept does not behave independently among the significant categories, it may be necessary not only to consider the concept in terms of the category concerned but also to estimate the similarity of concepts while taking account of the similarities of concepts relevant to the concept concerned. For example, assume that diverse physiological phenomena, diverse phenotypes, diverse partial compound structures, and diverse gene-compound interactions come to light through comprehensive analysis of genetic variations or analysis of results of experiments in which compounds are administered. If the similarities among the physiological phenomena, those among the phenotypes, or those among the genetic functions are estimated, they should be determined in consideration of correlations, for example, the relationships of the physiological phenomena to any phenotype, or the relationships of the genetic functions to any physiological phenomenon. This is because every concept has multiple aspects. Even when genes A and B are physiologically similar to each other, they may well be dissimilar from each other in terms of a relevant disease. As for concepts belonging to the category of physiological functions or diseases, the similarities among the concepts are themselves uncertain. Therefore, fixed relationships among such concepts cannot be satisfactorily adopted as criteria for measuring the similarities among relevant genes.
An object of the present invention is to provide a method for estimating the similarities among concepts in consideration of the correlations among concepts belonging to other categories.
In order to overcome the foregoing drawbacks, an attribute of a concept or a highly related concept, and a concept relevant to the attribute or highly related concept, should be extracted. The similarity between the concepts or the relevancy to the concept is taken into account in order to calculate a more multifaceted similarity. According to an embodiment of the present invention, relationships among concepts are shown with a graphic structure together with the attributes of the concepts and highly relevant concepts thereof. The number of edge crossings in the graph is reduced in order to extract similar concepts. As a result of the reduction in the number of edge crossings, similar concepts are spatially disposed at close positions and become discernible. At this time, the relationships among similar concepts belonging to the categories are also discernible. According to this method, not only extraction of the similarities among concepts, which is a major object, but also extraction of the similarities among attributes or concepts relating to the concepts can be achieved.
According to an embodiment of the present invention, after relationships among concepts are shown with a graphic structure, the number of edge crossings is reduced in order to extract similar concepts.
According to an embodiment of the present invention, groups of genes whose expressions have changed, as obtained from a micro-array of DNAs, a micro-array of proteins, or any other source, are graphically expressed in terms of a plurality of relationships among physiological functions and molecular functions, whereby the degrees of similarities among genes can be multilaterally deduced. Moreover, for example, concepts belonging to various categories such as a category of genes, a category of physiological functions, a category of biological functions, and a category of molecular functions are expressed in the form of a graphic structure. Thus, not only can the similarities among genes be extracted multilaterally, but the similarities of physiological concepts or concepts belonging to other categories can also be extracted at the same time. Moreover, when concepts belonging to categories of partial compound structures, compounds, genes, side effects, symptoms, and others are expressed in the form of a graphic structure, if the number of edge crossings is reduced, the similarities among the partial compound structures that are likely to cause side effects and the similarities among the side effects can be extracted multilaterally. Moreover, the present invention is not limited to the biological or medical field. When relationships among companies, business lines, product lines, and business relations are graphically expressed, the degrees of similarities or relevancies among the companies can be multilaterally deduced.
According to an embodiment of the present invention, the similarities among concepts belonging to diverse categories can be extracted in consideration of the correlations. For example, the similarities among concepts or natures belonging to diverse categories can be estimated in consideration of the correlations. Herein, the categories refer to proteins or compounds having similar natures, similar structures among proteins or compounds having similar natures, highly-related physiological phenomena, highly-related interactions of drugs, and others.
Referring to the drawings, an embodiment of the present invention will be described below. Herein, a description will be made of a case where the present invention is applied to processing of biomedical terms. Note that the present invention is not limited to the embodiment described below.
Conceivable as concepts represented by nodes, or as categories into which the concepts represented by the nodes are classified, are compounds, diseases, symptoms, proteins or genes, physiological terms, descriptors signifying partial structures or properties of a compound or protein, foods, human beings, organizations, and projects. Any concepts can be adopted as long as they interest a user. Edges represent relationships among concepts. Each edge may express only an intensity of relevancy, only a type of relevancy such as activation, inhibition, equality (is-a), or inclusion (component-of), or both the intensity and type of relevancy. However, the present invention is not limited to this mode.
The preprocessing unit 11 accumulates pieces of information on proteins, pieces of information on interactions between pairs of compounds, and pieces of functional information, which are extracted from literatures stored locally in a document database 20 or from text data, which is stored in a document database 22 at a Web site accessed over a network 21, manually or automatically through syntax analysis or statistical analysis. The preprocessing unit 11 also accumulates as binary relationships various relationships between pairs of concepts such as the relationships between proteins and diseases, the relationships between symptoms and diseases, and the relevancies between physiological phenomena and diseases which are extracted from the literatures or text data. Namely, the relationships between genes and biological functions or other relationships between pairs of concepts which are fetched from any local database or any database at a Web site are accumulated. As for the relationships, when the number of objects is small, pre-calculation is not needed but terms may be indexed at a preprocessing step. Thus, input data may be dynamically produced according to the necessity.
Relationships represented by edges conceivably include relevancies that are obtained statistically and have intensities, relevancies whose intensities are inferred through machine learning, relationships that are obtained through syntax analysis and have types and intensities (frequencies of appearance), binary relationships obtained through reading performed by a human being, and binary relationships described in various databases. However, the present invention is not limited to these relevancies and relationships. A compound may be decomposed into partial structures, and the partial structures may be represented by nodes. Edges linking the compound with the partial structures may represent relationships of inclusion (component-of). Likewise, proteins, domains constituting each protein, and a motif of the domains may be represented by nodes, and the relationships of inclusion (component-of) of each protein to the domains and motif may be represented by edges. Furthermore, the natures of the proteins and other proteins may be expressed using nodes and edges. Object literature includes not only abstracts acquired from the MEDLINE database and full papers sampled from PubMed Central but also biomedical literature including pieces of information on drugs provided by the Food and Drug Administration of the United States Department of Health and Human Services and documents appended to drugs, patent documents, various scientific literature, trade journals, newspapers, and other documents that interest a user.
In efforts to solve the problem posed by synonyms and homonyms, names of genes or proteins, names of compounds, names of diseases acquired from the Online Mendelian Inheritance in Man (OMIM) database, manually controlled terminologies or dictionaries such as the Unified Medical Language System (UMLS), the Systematized Nomenclature of Medicine (SNOMED) International, and Medical Subject Headings (MeSH), or a combination thereof should preferably be used to recognize terms or concepts in advance in consideration of the spelling-related diversity of terms. Alternatively, all nouns contained in a text may be adopted as terms or concepts. Among all nouns contained in a text, only nouns whose use frequencies in the corpus concerned are higher than their use frequencies in newspapers or another corpus may be adopted as terms or concepts. Otherwise, mutual information on neighboring words or the χ²-test may be utilized, or the C-value or NC-value method may be used, to automatically extract a set of words from the object literature. Moreover, when terms or concepts are automatically extracted, categories (significant categories) may have to be appended to them. The categories of concepts are, for example, genes or proteins, compounds, diseases, symptoms, physiological terms, molecular functions, biological functions, or partial compound structures. In order to newly classify concepts into the categories, a thesaurus defining terms and significant categories may be used to create a tagged corpus. The relationships between local contexts of terms or concepts and categories may be automatically learned through the maximum entropy method or through machine learning using a support vector machine or the like.
In order to newly create a category, a corpus accompanied by tags signifying the answers may be created, and the machine learning approach, or both the machine learning approach and a bootstrapping approach, may be used to automatically learn the relationships between local contexts of terms or concepts and categories.
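The frequency-based selection of terms mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the function name `domain_terms` is hypothetical, both corpora are assumed to be already tokenized, and every token is treated as a candidate, whereas the description restricts candidates to nouns.

```python
from collections import Counter


def domain_terms(domain_tokens, reference_tokens):
    """Keep tokens whose relative frequency in the domain corpus exceeds
    their relative frequency in the reference corpus (e.g. newspapers)."""
    d, r = Counter(domain_tokens), Counter(reference_tokens)
    nd, nr = sum(d.values()), sum(r.values())

    def ref_freq(token):
        # A token absent from the reference corpus has relative frequency 0.
        return r[token] / nr if nr else 0.0

    return sorted(t for t in d if d[t] / nd > ref_freq(t))


# Hypothetical toy corpora.
domain = "insulin receptor binds insulin in the cell".split()
reference = "the market fell in the city and in the cell phone shop".split()
terms = domain_terms(domain, reference)
```

Common function words such as "the" and "in" are filtered out because they are relatively more frequent in the reference corpus.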
For analysis of the relationships between pairs of concepts through syntax analysis, there is a method of using a shallow parser or a full parser to extract predicate-argument structures and searching them for relevant structures. Moreover, methods of extracting the relationships between pairs of concepts through statistical analysis include methods utilizing Dice coefficients, mutual information, or singular value decomposition. The statistical relationships between pairs of concepts may be listed in the form of a table in advance or dynamically calculated.
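Two of the statistical measures named above can be sketched from document-level co-occurrence counts. The sketch assumes that each document has already been reduced to a set of recognized concepts; the function names and the toy data are hypothetical, and pointwise mutual information is used here as one common form of mutual information.

```python
import math


def _counts(docs, a, b):
    """Number of documents containing a, containing b, and containing both."""
    na = sum(1 for d in docs if a in d)
    nb = sum(1 for d in docs if b in d)
    nab = sum(1 for d in docs if a in d and b in d)
    return na, nb, nab


def dice(docs, a, b):
    """Dice coefficient: 2 * |A and B| / (|A| + |B|)."""
    na, nb, nab = _counts(docs, a, b)
    return 2 * nab / (na + nb) if na + nb else 0.0


def pmi(docs, a, b):
    """Pointwise mutual information in bits; -inf if a and b never co-occur."""
    na, nb, nab = _counts(docs, a, b)
    if nab == 0:
        return float("-inf")
    return math.log2(nab * len(docs) / (na * nb))


# Hypothetical documents, each reduced to the set of concepts it mentions.
docs = [
    {"insulin", "diabetes"},
    {"insulin", "diabetes", "obesity"},
    {"insulin"},
    {"obesity"},
]
strength = dice(docs, "insulin", "diabetes")
```

Either value could serve as the intensity attached to an edge between the two concepts.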
When Terms or IDs is designated in the input type selection block 32, actual terms or IDs to be assigned to a certain layer are entered in the concept input block 31. If Categories is designated in the input type selection block 32, the system searches a database designated with the database designation block 33, extracts terms belonging to categories, and adopts the terms as terms included in the layer. In the case of, for example, a three-layer graphic structure, if Categories is designated for the first layer using the input type selection block, the system searches a database designated as a database to be used for the first and second layers, and extracts terms belonging to certain categories. If Categories is designated for the second layer using the input type selection block, the system searches and extracts terms belonging to certain categories from terms contained in both the database designated as a database to be used for the first and second layers and the database designated as a database to be used for the second and third layers.
The plotting condition reception unit 13 designates such a plotting condition that a sole category or a plurality of categories is fixed for each set (layer) of concepts and concepts belonging to the other categories are made movable or that a concept concerned is fixed instead of a category or categories and the other concepts are made movable. The plotting condition reception unit 13 can fix the positions of concepts, which belong to a sole category or a plurality of specific categories, in a graph and move the positions of concepts, which belong to the other categories, for the purpose of reducing the number of edge crossings. This method makes it possible to learn the similarities among concepts, which belong to the other categories, from a certain viewpoint.
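The effect of fixing some concepts while leaving others movable can be sketched for a single pair of adjacent layers. The sketch below is an assumption-laden illustration rather than the method of the embodiment: it tries every arrangement of the movable nodes by brute force, which is practical only for very small layers, and the names `count_crossings` and `best_order_with_fixed` are hypothetical.

```python
from itertools import permutations


def count_crossings(upper, lower, edges):
    """Count pairs of edges that cross between two one-dimensional layers."""
    pu = {n: i for i, n in enumerate(upper)}
    pl = {n: i for i, n in enumerate(lower)}
    pts = [(pu[u], pl[v]) for u, v in edges]
    return sum(
        1
        for i in range(len(pts))
        for j in range(i + 1, len(pts))
        if (pts[i][0] - pts[j][0]) * (pts[i][1] - pts[j][1]) < 0
    )


def best_order_with_fixed(layer, other, edges, fixed):
    """Rearrange only the movable nodes of `layer`; fixed nodes keep their slots."""
    slots = [i for i, n in enumerate(layer) if n not in fixed]
    movable = [layer[i] for i in slots]
    best, best_cost = list(layer), count_crossings(layer, other, edges)
    for perm in permutations(movable):
        candidate = list(layer)
        for slot, node in zip(slots, perm):
            candidate[slot] = node
        cost = count_crossings(candidate, other, edges)
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best, best_cost


# Hypothetical example: g2 is fixed, g1 and g3 may move.
layer = ["g1", "g2", "g3"]
other = ["f1", "f2", "f3"]
edges = [("g1", "f3"), ("g2", "f2"), ("g3", "f1")]
order, crossings = best_order_with_fixed(layer, other, edges, fixed={"g2"})
```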
Moreover, a display form in which similar concepts are recognized under a certain condition and highlighted can be designated if necessary. For highlighting, not only cliques but also semi-cliques satisfying a certain condition are searched. As for concepts belonging to the same group or layer, even when no edge exists in reality, calculation may be performed as if edges were present. As a condition for extracting semi-cliques, a threshold may be determined for a quotient of the number of edges included in a sub-graph by the number of edges needed for a clique, a quotient of a minimum degree of a sub-graph by the degree of a clique, or a quotient of the number of nodes linked in common with nodes (belonging to the same category and contained in a sub-graph) by the number of adjacent nodes. However, the present invention is not limited to this method.
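Two of the quotients named above as semi-clique conditions can be sketched for an undirected sub-graph. The function names, the threshold value, and the example graph are hypothetical. The sketch counts only edges that actually exist and does not add the virtual edges within a layer mentioned above.

```python
from itertools import combinations


def edge_density(nodes, edges):
    """Edges present in the sub-graph divided by the edges a clique would need."""
    nodes = list(nodes)
    needed = len(nodes) * (len(nodes) - 1) // 2
    if needed == 0:
        return 0.0
    present = sum(1 for u, v in combinations(nodes, 2) if frozenset((u, v)) in edges)
    return present / needed


def min_degree_ratio(nodes, edges):
    """Minimum degree inside the sub-graph divided by the degree in a clique."""
    nodes = list(nodes)
    if len(nodes) < 2:
        return 0.0
    degrees = [
        sum(1 for v in nodes if v != u and frozenset((u, v)) in edges)
        for u in nodes
    ]
    return min(degrees) / (len(nodes) - 1)


def is_semi_clique(nodes, edges, threshold):
    """Treat a sub-graph as a semi-clique when its edge density reaches the threshold."""
    return edge_density(nodes, edges) >= threshold


# Hypothetical graph: a triangle a-b-c with a pendant node d attached to c.
edges = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]}
density = edge_density(["a", "b", "c", "d"], edges)
```

With a threshold of 0.6, the four-node sub-graph qualifies by edge density even though its minimum-degree ratio is only one third, which shows why the choice of quotient matters.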
When a check box 48 is checked, the fixed information entered in the screen image is utilized. Unless the check box 48 is checked, the information entered in the fixed information input block 41 is not utilized. A check box 49 is checked in a case where weights assigned to edges are to be taken into account at the time of reducing the number of edge crossings. The weights are designated in a weight input block 43. A check box 50 is checked in a case where the types of edges are to be taken into account at the time of reducing the number of edge crossings. For highlighting of similar concepts, a check box 44 is checked, and a threshold for highlighting is designated in an input box 45. For display of similarities using a different color, a check box 46 is checked, and a color is designated in an input box 47.
A graph creation unit 15 constructs an appropriate initial structure of a graph. Various techniques are conceivable as a method by which a number-of-edge crossings reduction unit 14 reduces the number of edge crossings. For example, a bubble sort technique may be applied to each of the layers in order from a start layer to an end layer in order to reduce the number of edge crossings. Furthermore, the bubble sort technique may then be applied to the layers in reverse order, from the end layer to the start layer, so as to minimize the number of edge crossings in the entire graph. An alternative is a statistical thermodynamic method such as the Monte Carlo method, which minimizes the energy of the entire graph on the assumption that a graph containing crossing edges is in a state of a high energy level. However, the present invention is not limited to these methods. The priority for eliminating edge crossings may be differentiated according to the intensity of a relationship or the type thereof.
A number-of-edge crossings reduction unit 14 reduces the number of edge crossings under a designated condition. When the reduction depends on a weight assigned to edges or on a type of relationship such as activation or inhibition, crossings of edges assigned a higher weight may be eliminated with priority, or crossings of different types of edges may be eliminated with priority. Either of the conditions is designated by the plotting condition reception unit 13. Furthermore, apart from crossing edges, for edges emanating from the same node, the number of adjacencies between pairs of different types of edges can be reduced.
For example, assuming that nodes represent partial structures of a compound, side effects of compounds, and physiological actions, the relationships among nodes are graphically expressed. If the number of edge crossings is reduced, similarities among the side effects, similarities among the partial compound structures causing the side effects, and the physiological actions shared by the partial structures can be acquired simultaneously. Techniques for decomposing a compound into partial structures or elements in advance include a compass algorithm and a fingerprint method. However, the present invention is not limited to these techniques.
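The bubble sort technique described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: adjacent nodes in a layer are swapped only when the swap lowers the total number of crossings, the layers are visited from the start layer to the end layer and then back, and all function names and the number of sweeps are hypothetical. A greedy procedure of this kind reduces crossings but may stop at a local minimum.

```python
def count_crossings(upper, lower, edges):
    """Count pairs of edges that cross between two adjacent layers."""
    pu = {n: i for i, n in enumerate(upper)}
    pl = {n: i for i, n in enumerate(lower)}
    pts = [(pu[u], pl[v]) for u, v in edges]
    return sum(
        1
        for i in range(len(pts))
        for j in range(i + 1, len(pts))
        if (pts[i][0] - pts[j][0]) * (pts[i][1] - pts[j][1]) < 0
    )


def total_crossings(layers, edge_sets):
    """edge_sets[k] holds the edges between layers[k] and layers[k + 1]."""
    return sum(
        count_crossings(layers[k], layers[k + 1], edge_sets[k])
        for k in range(len(layers) - 1)
    )


def bubble_pass(layers, edge_sets, k):
    """Swap adjacent nodes of layer k while a swap strictly lowers the total."""
    improved = True
    while improved:
        improved = False
        row = layers[k]
        for i in range(len(row) - 1):
            before = total_crossings(layers, edge_sets)
            row[i], row[i + 1] = row[i + 1], row[i]
            if total_crossings(layers, edge_sets) < before:
                improved = True
            else:
                row[i], row[i + 1] = row[i + 1], row[i]  # undo the swap


def reduce_crossings(layers, edge_sets, sweeps=3):
    """Sweep from the start layer to the end layer and back, several times."""
    layers = [list(layer) for layer in layers]
    forward = list(range(len(layers)))
    for _ in range(sweeps):
        for k in forward + forward[::-1]:
            bubble_pass(layers, edge_sets, k)
    return layers


# Hypothetical two-layer example in which every pair of edges crosses initially.
layers = [["g1", "g2", "g3"], ["f1", "f2", "f3"]]
edge_sets = [[("g1", "f3"), ("g2", "f2"), ("g3", "f1")]]
result = reduce_crossings(layers, edge_sets)
```

Because each accepted swap strictly lowers the total, every pass terminates; recomputing the total after each swap keeps the sketch short at the cost of efficiency.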
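One way to give priority to crossings of heavily weighted edges is to minimize a weighted cost instead of the raw number of crossings. The sketch below charges each crossing the product of the weights of the two edges involved; this cost function, the function name, and the example data are assumptions made for illustration and are not taken from the embodiment.

```python
def weighted_crossing_cost(upper, lower, weighted_edges):
    """Sum, over every crossing pair of edges, of the product of their weights."""
    pu = {n: i for i, n in enumerate(upper)}
    pl = {n: i for i, n in enumerate(lower)}
    items = [(pu[u], pl[v], w) for u, v, w in weighted_edges]
    cost = 0.0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            a, b = items[i], items[j]
            if (a[0] - b[0]) * (a[1] - b[1]) < 0:  # the two edges cross
                cost += a[2] * b[2]
    return cost


# Hypothetical edges (upper node, lower node, weight).
lower = ["x", "y", "z"]
weighted_edges = [
    ("a", "y", 1.0),
    ("b", "x", 5.0),
    ("b", "z", 1.0),
    ("c", "y", 1.0),
]
cost_before = weighted_crossing_cost(["a", "b", "c"], lower, weighted_edges)
cost_after = weighted_crossing_cost(["b", "a", "c"], lower, weighted_edges)
```

In this example both orderings of the upper layer contain two crossings, but the second ordering no longer crosses the edge of weight 5.0, so its weighted cost is lower.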
Next, an example of similar concept extraction to be performed using the system in accordance with the present invention will be described in conjunction with concrete examples.
In the present embodiment, concepts to be included in three layers are, as shown in
In this state, if a Submit button 34 is pressed, the data reception unit 12 extracts terms, which belong to the categories of molecular functions, physiological terms, biological functions, and experimentation techniques, from the MEDLINE Subset 1. Consequently, for example, terms shown in
Assume that nothing is entered in the input screen image supported by the plotting condition reception unit 13 and shown in
For the graph shown in
As shown in
The graph creation unit 15 receives a condition for highlighting from the plotting condition reception unit 13, highlights nodes meeting the condition, and displays them on the display device. Consequently, a graph like the one shown in
Next, a description will be made of a case where similarities among concepts are differently recognized by changing categories adopted for a layer to be used in combination with the first layer.
According to an embodiment of the present invention, a plurality of layers such as two, three, or four layers can be utilized. When the number of layers is increased in order to introduce a new viewpoint from which concepts are assessed, the similarities of the concepts may be recognized differently. Referring to
If the leftmost first layer includes data items representing degrees of gene expression, the data items are sorted in descending order of the level of variation or in order of significance based on a p-value. Thereafter, the number of edge crossings is reduced. Consequently, it can be seen that genes whose degrees of expression have risen and genes whose degrees of expression have fallen differ from each other in terms of molecular functions.
Next, an example in which types of edges are utilized will be described below. In the input screen image supported by the data reception unit 12, as shown in