US20050027681A1 - Methods and systems for model matching - Google Patents

Methods and systems for model matching

Info

Publication number
US20050027681A1
Authority
US
United States
Prior art keywords
elements
model
schema
similarity
similarity coefficients
Prior art date
Legal status
Abandoned
Application number
US10/930,971
Inventor
Philip Bernstein
Jayant Madhavan
Current Assignee
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Corp
Priority date
Filing date
Publication date
Application filed by Microsoft Corp
Priority to US10/930,971
Publication of US20050027681A1
Assigned to MICROSOFT TECHNOLOGY LICENSING, LLC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: MICROSOFT CORPORATION
Current legal status: Abandoned

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/84 Mapping; Conversion
    • G06F16/86 Mapping to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99933 Query processing, i.e. searching
    • Y10S707/99936 Pattern matching access
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99941 Database schema or data structure
    • Y10S707/99942 Manipulating data structure, e.g. compression, compaction, compilation

Definitions

  • the present invention relates to model or schema matching, or more generally to the matching of separate hierarchical data sets. More particularly, the present invention relates to methods and systems for matching models, or schemas, that discover similarity coefficients between schema elements, including analyses based on one or more of schema names, schema data types, schema constraints and schema structure.
  • Match is a schema manipulation operation that takes two schemas, models or otherwise hierarchically represented data as input and returns a mapping that identifies corresponding elements in the two schemas.
  • Schema matching is a critical step in many applications. For example, in E-business, match helps to map messages between different extensible markup language (XML) formats. In data warehousing, match helps to map data sources into warehouse schemas. In mediators, match helps to identify points of integration between heterogeneous databases. Schema matching thus far has primarily been studied as a piece of other applications. For example, schema integration uses matching to find similar structures in heterogeneous schemas, which are then used as integration points. Data translation uses matching to find simple data transformations. Given the continued evolution and importance of XML and other message mapping, match solutions are similarly likely to become increasingly important in the future.
  • Schema matching is challenging for many reasons. First and foremost, schemas for identical concepts may have structural and naming differences. In addition, schemas may model similar, but yet slightly different, content. Schemas may be expressed in different data models. Schemas may use similar words that may nonetheless have different meanings, etc.
  • schema matching is done manually by domain experts, sometimes using a graphical tool that can graphically depict a first schema according to its hierarchical structure on one side, and a second schema according to its hierarchical structure on another side.
  • the graphical tool enables a user to select and visually represent a chosen mapping. Some tools can detect exact matches automatically, although even minor name and structure variations may lead them astray.
  • model matching has not yet been studied independently except as it may apply to other more narrow problems, such as those named above, and thus a generic solution for schema matching that can apply to many different data models and application domains remains to be provided.
  • so wide a variety of tools would benefit from a matching solution that an independent match component or module, which can be incorporated into or downloaded for such tools, would be of great utility.
  • a schema consists of a set of related elements, such as tables, columns, classes, XML elements or attributes, etc.
  • the result of the match operation is a mapping between elements of two schemas.
  • a mapping consists of a set of mapping elements, each of which indicates that certain elements of schema S1 are related to certain elements of schema S2.
  • a mapping between purchase order schemas PO and POrder may include a mapping element that relates element Lines.Item.Line of S1 to element Items.Item.ItemNumber of S2, as shown by the dotted line. While a mapping element may have an associated expression that specifies its semantics, mappings are treated herein as nondirectional.
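As an illustration only, a mapping and its mapping elements might be represented as below. The class name `MappingElement`, its fields, and the helper `related` are hypothetical names chosen for this sketch and do not come from the patent text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappingElement:
    """Relates some elements of schema S1 to some elements of schema S2."""
    s1_elements: frozenset   # element paths in S1
    s2_elements: frozenset   # element paths in S2
    expression: str = ""     # optional expression describing the semantics


# A mapping is a set of mapping elements (the PO / POrder example).
po_to_porder = {
    MappingElement(frozenset({"Lines.Item.Line"}),
                   frozenset({"Items.Item.ItemNumber"})),
}


def related(mapping, path):
    """Return every element related to `path`, looking in both directions,
    since mappings are treated as nondirectional."""
    out = set()
    for m in mapping:
        if path in m.s1_elements:
            out |= m.s2_elements
        if path in m.s2_elements:
            out |= m.s1_elements
    return out
```

Because `related` checks both sides of each mapping element, the same structure answers lookups starting from either schema.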
  • a model or schema is thus a complex structure that describes a design artifact.
  • models are Structured Query Language (SQL) schemas, XML schemas, Unified Modeling Language (UML) models, interface definitions in a programming language, Web site maps, make scripts, object models, project models or any hierarchically organized data sets.
  • SQL Structured Query Language
  • UML Unified Modeling Language
  • Many uses of models require building mappings between models. For example, a common application is mapping one XML schema to another, to drive the translation of XML messages.
  • Another common application is mapping a SQL schema into an XML schema to facilitate the export of SQL query results in an XML format, or to populate a SQL database with XML data based upon an XML schema.
  • a mapping is usually produced by a human designer, often using a visual modeling tool that can graphically represent the models and mappings.
  • a robust algorithm that automatically creates a mapping between two given models.
  • schema matching is inherently subjective. Schemas may not completely capture the semantics of the data they describe, and there may be several plausible mappings between two schemas, making the concept of a single best mapping ill defined. This subjectivity makes it valuable to have user input to guide the match and user validation of the result. This guidance may come via an initial mapping, a dictionary or thesaurus, a library of known mappings, etc.
  • the goal of schema matching, and one not yet adequately achieved by today's algorithms, is: Given two input schemas in any data model, optional auxiliary information and an input mapping, compute a mapping between schema elements of the two input schemas that passes user validation.
  • Schema matchers can be characterized by the following orthogonal criteria. With respect to schema-based vs. instance-based criteria, schema-based matchers consider only schema information, not instance data. Schema information includes names, descriptions, relationships, constraints, etc. Instance-based matchers either use metadata and statistics collected from data instances to annotate the schema, or directly find correlated schema elements, e.g., using machine learning.
  • an element-level matcher computes a mapping between individual schema elements, e.g., an attribute matcher.
  • a structure-level matcher compares combinations of elements that appear together in a schema, e.g., classes or tables whose attribute sets only match approximately.
  • a linguistic matcher uses names of schema elements and other textual descriptions. Name matching involves: putting the name into a canonical form by stemming and tokenization, comparing equality of names, comparing synonyms and hypernyms using generic and domain specific thesauri and matching substrings. Information retrieval (IR) techniques can be used to compare descriptions that annotate some schema elements.
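The name-matching steps listed above (canonical form, equality, synonyms, substrings) can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the tiny `SYNONYMS` thesaurus, the plural-stripping stemmer, and the weights 0.8 and 0.5 are all invented for illustration and are not taken from the patent.

```python
import re

# Hypothetical mini-thesaurus; a real system would load generic or
# domain-specific thesauri.
SYNONYMS = {frozenset({"bill", "invoice"}), frozenset({"qty", "quantity"})}


def tokenize(name):
    """Split names such as 'POBillTo' or 'bill_to' into lowercase tokens."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def stem(token):
    """Very crude stand-in for stemming (assumption): strip a plural 's'."""
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


def token_sim(a, b):
    if a == b:
        return 1.0
    if frozenset({a, b}) in SYNONYMS:
        return 0.8  # assumed weight for synonyms
    if len(a) >= 3 and len(b) >= 3 and (a in b or b in a):
        return 0.5  # assumed weight for substring matches
    return 0.0


def name_sim(n1, n2):
    """Average, over all tokens of both names, of each token's best match."""
    t1 = [stem(t) for t in tokenize(n1)]
    t2 = [stem(t) for t in tokenize(n2)]
    if not t1 or not t2:
        return 0.0
    best1 = sum(max(token_sim(a, b) for b in t2) for a in t1)
    best2 = sum(max(token_sim(a, b) for a in t1) for b in t2)
    return (best1 + best2) / (len(t1) + len(t2))
```

With these assumed weights, `name_sim("BillTo", "InvoiceTo")` scores high because "bill" and "invoice" are listed as synonyms and "to" matches exactly.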
  • IR Information retrieval
  • a constraint-based matcher uses schema constraints, such as data types and value ranges, uniqueness, requiredness, cardinalities, etc.
  • a constraint-based matcher might also use intraschema relationships, such as referential integrity.
  • schema matchers differ in the cardinality of the mappings they compute. Some only produce one to one mappings between schema elements. Others produce n to one mappings, e.g., matchings that map the combination of DailyWages and WorkingDays in the source schema to MonthlyPay in the target.
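To make the n to one case concrete, the sketch below attaches a combining expression to a mapping element. The text only says that the combination of DailyWages and WorkingDays maps to MonthlyPay; the multiplication used here, and the names `monthly_pay`, `n_to_one` and `translate`, are assumptions made for illustration.

```python
def monthly_pay(row):
    # Assumed semantics: pay is the daily wage times the days worked.
    return row["DailyWages"] * row["WorkingDays"]


# One mapping element relating two source elements to one target element.
n_to_one = {
    "sources": ("DailyWages", "WorkingDays"),
    "target": "MonthlyPay",
    "expression": monthly_pay,
}


def translate(row, mapping_element):
    """Apply an n to one mapping element to a source record."""
    value = mapping_element["expression"](row)
    return {mapping_element["target"]: value}
```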
  • schema matchers differ in their use of auxiliary information sources such as dictionaries, thesauri and input match/mismatch information. Reusing past match information can also help, for example, to compute a mapping that is the composition of mappings that were performed earlier.
  • Combinational matchers can be one of two types: hybrid matchers and composite matchers.
  • Hybrid matchers use multiple criteria to perform the matching.
  • Composite matchers run independent match algorithms on the two schemas and combine the results.
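A composite matcher, as characterized above, could be sketched as a weighted average over independently computed scores. The two toy matchers, the element dictionaries with `name` and `type` keys, and the use of a weighted average as the combination rule are assumptions for this example; the text does not prescribe how results are combined.

```python
def exact_name(a, b):
    """Toy element-level matcher: case-insensitive name equality."""
    return 1.0 if a["name"].lower() == b["name"].lower() else 0.0


def same_type(a, b):
    """Toy constraint-based matcher: identical data types."""
    return 1.0 if a["type"] == b["type"] else 0.0


def composite_match(matchers, weights, s1_elems, s2_elems):
    """Run each matcher independently and combine scores by weighted average."""
    total = sum(weights)
    combined = {}
    for a in s1_elems:
        for b in s2_elems:
            score = sum(w * m(a, b) for m, w in zip(matchers, weights))
            combined[(a["name"], b["name"])] = score / total
    return combined
```

A hybrid matcher would instead fold both criteria into a single algorithm rather than combining separately produced results.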
  • the SEMINT system is an instance-based matcher that associates attributes in the two schemas with match signatures.
  • the SEMINT system includes 15 constraint-based and 5 content-based criteria derived from instance values and normalized to the [0,1] interval, so that each attribute is a point in 20-dimensional space. Attributes of one schema are clustered with respect to their Euclidean distance. A neural network is trained on the cluster centers and then is used to obtain the most relevant cluster for each attribute of the second schema.
  • SEMINT is a hybrid element-level matcher, but does not utilize schema structure, as the latter cannot be mapped into a numerical value.
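The idea of treating each attribute as a point in a normalized criteria space can be illustrated as follows. This is not SEMINT itself: a plain nearest-center lookup stands in for the trained neural network, three dimensions stand in for twenty, and the function names are hypothetical.

```python
import math


def normalize(value, lo, hi):
    """Scale a raw criterion value into the [0, 1] interval."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)


def nearest_center(signature, centers):
    """Return the name of the cluster center closest in Euclidean distance.

    Stand-in for the trained network described in the text.
    """
    return min(centers, key=lambda name: math.dist(signature, centers[name]))
```

Schema structure has no obvious coordinate in such a vector, which is consistent with the remark that SEMINT does not utilize it.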
  • the DELTA system groups all available metadata about an attribute into a text string and then applies IR techniques to perform matching. Like SEMINT, the DELTA system does not make much use of schema structure.
  • the LSD system uses a multilevel learning scheme to perform one to one matching of XML Document Type Definition (DTD) tags.
  • DTD Document Type Definition
  • a number of base learners that use different instance-level matching schemes are trained to assign tags of a mediated schema to data instances of a source schema.
  • a metalearner combines the predictions of the base learners. LSD is thus a multi strategy instance-based matcher.
  • the SKAT prototype implements schema-based matching following a rule-based approach. Rules are formulated in first order logic to express match and mismatch relationships and methods are defined to derive new matches.
  • the SKAT prototype supports name matching and simple structural matches based on isA hierarchies.
  • the TranScm prototype uses schema matching to drive data translation.
  • the schema is translated to an internal graph representation. Multiple handcrafted matching rules are applied in order at each node.
  • the matching is done top down with the rules at higher level nodes typically requiring the matching of descendants. This top down approach performs well only when the top level structures of the two schemas are quite similar.
  • the TranScm prototype represents an element level and schema-based matcher.
  • ER Entity Relationship
  • the DIKE system integrates multiple Entity Relationship (ER) schemas by exploiting the principle that the similarity of schema elements depends on the similarity of elements in their vicinity.
  • the relevance of elements is inversely proportional to their distance from the elements being compared, so nearby elements influence a match more than ones farther away.
  • Linguistic matching is based on manual inputs.
  • DIKE is a hybrid schema-based matcher utilizing both element and structure-level information.
  • ARTEMIS, the schema integration component of the MOMIS mediator system, matches classes based on their name affinity and structure affinity. MOMIS has a description logic engine to exploit constraints. The classes of the input schemas are clustered to obtain global classes for the mediated schema. Linguistic matching is based on manual inputs using an interface with WordNet. ARTEMIS is a hybrid schema-based matcher utilizing both element and structure-level information.
  • none of the above solutions provides an adequate solution to the generic problem of matching schemas. While some of the above solutions may be adequate for a given matching task, because they were designed for that particular task, they are not general all purpose approaches to model matching. Others were not designed for matching per se, but rather were designed for some other purpose such as schema integration, and thus the techniques applied to matching for these solutions make compromises that do not generalize adequately. Still other existing algorithms are too slow on today's hardware for interactive use, as a result of exhaustive calculations and the like.
  • the present invention provides systems and methods for automatically and generically matching models, such as may be provided in a matching application or matching component, or provided in a general purpose system for managing models.
  • the methods are generic since the methods apply to hierarchical data sets outside of any particular data model or application. Similarity coefficients are calculated for, and mappings can be discovered between, schema elements based on their names, data types, constraints, and schema structure, using a broad set of techniques. Some of these techniques include the integrated use of linguistic and structural matching, context dependent matching of shared types, and a bias toward subtree structure where much of the schema content resides.
  • FIG. 1 illustrates two exemplary schemas representing an exemplary matching problem solved in accordance with the present invention
  • FIG. 2A is a block diagram representing an exemplary network environment having a variety of computing devices in which the present invention may be implemented;
  • FIG. 2B is a block diagram representing an exemplary non-limiting computing device in which the present invention may be implemented
  • FIG. 3 illustrates two exemplary schemas and corresponding mappings based upon similarity coefficients generated in accordance with the present invention
  • FIG. 4 illustrates an exemplary second pass calculation of structural similarity between two models in accordance with the invention
  • FIG. 5 illustrates an exemplary non-limiting top-level architecture of an exemplary system in which the present invention may operate
  • FIG. 6 illustrates an exemplary process diagram for processing two schemas to produce a mapping therebetween in accordance with the invention
  • FIG. 7 is a block diagram illustrating exemplary relationships among model elements in accordance with a generically defined object model of the invention.
  • FIG. 8 illustrates exemplary handling of multiple paths from the root of a model to a particular model element in accordance with the invention
  • FIG. 9A illustrates exemplary modeling of a foreign key with respect to two SQL tables in accordance with the present invention
  • FIG. 9B illustrates an exemplary RefInt model element that represents a referential integrity constraint in accordance with the invention
  • FIG. 10A illustrates an exemplary model representation of a RefInt in a relational schema in accordance with the invention
  • FIG. 10B illustrates an exemplary model representation of a RefInt in an external Data Representation (XDR) schema in accordance with a non-limiting exemplary embodiment of the invention
  • FIG. 11 illustrates exemplary encoding of a RefInt in a data tree for an SQL schema in accordance with the invention
  • FIG. 12 illustrates exemplary disambiguation of matchings between elements that are referenced by a RefInt in accordance with the invention.
  • FIG. 13 illustrates exemplary introduction of a node in response to encountering a referential constraint, such as a foreign key, in a schema in accordance with the present invention.
  • methods and systems are provided for automatically creating similarity coefficients between elements of two given schemas or models.
  • a mapping between the models can be produced from the similarity coefficients.
  • the algorithm(s) described by the present invention can automatically create similarity coefficients and a mapping between a SQL schema and an XML schema, although it will be appreciated that the invention is generic and not limited to any particular model type or schema. This is primarily accomplished by computing similarity coefficients between pairs of elements, with a pair of elements including one element from the first schema model and one element from the second schema model.
  • the model match algorithm of the invention is driven by at least three kinds of information in a data model: linguistic information about the names of model elements, type information about model elements and structural information about how model elements in a model are related.
  • the algorithm may make use of dictionaries and thesauri to interpret the linguistic information.
  • the present invention thus provides algorithms for generic schema matching, outside of any particular data model or application, showing that a rich range of techniques is available based upon the taxonomy described above in the background.
  • the invention proposes new algorithm(s) that discover similarity coefficients between schema elements based on their names, data types, constraints, and schema structure, using a broader set of techniques than past approaches.
  • the invention includes the integrated use of linguistic and structural matching, context dependent matching of shared types, and a bias toward subtree structure where much of the schema content of the subtree's root node resides.
  • the invention provides a solution to the schema matching problem (1) that includes automatic model matching that is both element-based and structure-based, (2) that utilizes the similarity of the subtrees of the two schemas and that is biased toward similarity of atomic elements, e.g., leaves, of a hierarchical tree, where much content describing the degree of similarity is captured, (3) that exploits internal structure, but is not overly misled by variations in that structure, (4) that exploits keys, referential constraints and views where they exist, (5) that makes context dependent matches of a shared type definition that is used in several larger structures and (6) that generates one to one or one to n mappings, (7) wherein adjustments may be made if desired and wherein a user may make input or correction to the process.
  • the invention is schema-based and not instance-based and assumes some hierarchy to the schemas being matched.
  • the interconnected elements of a schema hierarchy are modeled as a tree structure having branches and leaves.
  • a simple relational schema is an example of a schema tree since such a schema contains tables, which contain columns.
  • An XML schema with no shared elements is another simple example. With such an XML schema, elements include subelements, which in turn include other subelements or attributes.
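A simple relational schema can be turned into such a tree along the following lines. The `Node` class and `relational_schema_tree` function are hypothetical names for this sketch, and the input format (a dictionary of tables to columns) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    dtype: str = ""  # data type; only meaningful for leaves
    children: list = field(default_factory=list)

    def leaves(self):
        """Return the atomic elements (leaves) of the subtree rooted here."""
        if not self.children:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]


def relational_schema_tree(schema_name, tables):
    """Build schema -> tables -> columns.

    tables: {table_name: {column_name: data_type}}
    """
    return Node(schema_name, children=[
        Node(t, children=[Node(c, dt) for c, dt in cols.items()])
        for t, cols in tables.items()
    ])
```

An XML schema without shared elements would map the same way, with subelements and attributes as children.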
  • the model may also be enriched to capture additional semantics, making the invention apply as generically as possible, as described in the below section on modeling and the generic object model.
  • the present invention provides systems and methods that are consistent with a given set of similarity relationships between elements of the two models.
  • the given similarity relationships may include that “PO” is similar to “purchase order” with weight 0.8 and that “PO” is similar to “post office” with weight 0.7. So, an element of one model named “PO” is more similar to a node in the other model named “purchase order” than one named “post office.” Therefore, if model 1 contains an element named “PO” and model 2 contains two elements named “purchase order” and “post office,” then all else being equal, “purchase order” is a better match for “PO” than “post office.”
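The "PO" example can be expressed as a small lookup. The weights 0.8 and 0.7 come from the text; the `THESAURUS` dictionary layout and the function names are assumptions for this sketch.

```python
# Given similarity relationships, with the weights stated in the text.
THESAURUS = {
    ("po", "purchase order"): 0.8,
    ("po", "post office"): 0.7,
}


def similarity(a, b):
    """Look up the similarity of two names, in either order."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 1.0
    return THESAURUS.get((a, b), THESAURUS.get((b, a), 0.0))


def best_match(name, candidates):
    """All else being equal, pick the candidate with the highest weight."""
    return max(candidates, key=lambda c: similarity(name, c))
```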
  • the present invention further provides systems and methods that are consistent with key and foreign key definitions, if any, in the two models. For example, when matching two relational schemas, if a column C1 is a key of a table T1 in model 1, then it is desirable to map C1 to a column C2 that is a key of its table T2 in model 2.
  • the present invention further provides systems and methods that relate objects of similar structure. For example, if an object m1 of model 1 is mapped to an object m2 of model 2, then the objects in m1's neighborhood are mapped to the objects in m2's neighborhood and those neighborhoods are assigned a similar structural relationship to reflect the similarity of object m1 to object m2.
  • the present invention further provides systems and methods that relate objects that have similar leaf sets. For example, if the leaf elements under InvoiceInfo in one model are more similar to those under BillingInfo than to those under EmployeeInfo, then it is better to map InvoiceInfo to BillingInfo than to EmployeeInfo.
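As a rough illustration of preferring the candidate with the most similar leaf set, the sketch below uses Jaccard overlap of leaf names. Jaccard overlap is a deliberately simple stand-in chosen for this example, not the patent's similarity computation, and the leaf names are invented.

```python
def leaf_set_similarity(leaves_a, leaves_b):
    """Jaccard overlap of two leaf-name sets (a simplifying assumption)."""
    a, b = set(leaves_a), set(leaves_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_by_leaves(leaves, candidates):
    """candidates: {element_name: leaf names}. Return the best element."""
    return max(candidates,
               key=lambda n: leaf_set_similarity(leaves, candidates[n]))


invoice_info = {"Street", "City", "Amount", "DueDate"}
candidates = {
    "BillingInfo": {"Street", "City", "Amount", "Terms"},
    "EmployeeInfo": {"Name", "Salary", "City"},
}
```

With these invented leaves, InvoiceInfo shares three of five distinct leaf names with BillingInfo but only one of six with EmployeeInfo.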
  • the algorithm(s) of the present invention are fast, i.e., the algorithm(s) are fast enough, for example, to be used by an interactive design tool or other real-time application.
  • the invention recognizes that two nodes are similar if (1) the model elements corresponding to the two nodes are inherently similar, such as if the model elements are linguistically similar, and if (2) the subtrees rooted at the two nodes are similar.
  • the invention also recognizes that the similarity of two subtrees is not always reflected by the similarity of their immediate children.
  • the leaves of the subtree give a better estimate of the data described by the subtree, since they refer to the atomic data elements that the model is ultimately describing, and since intervening structure may be superfluous.
  • the invention further recognizes that the similarity of two leaves in hierarchical tree structures depends on their similarity and the similarity of their structural vicinity.
  • the matching algorithm of the invention works generally as follows.
  • the structural similarity of each pair of leaf nodes s and t in the source (domain) model and target (range) model, respectively, is initialized.
  • the structural similarity may be initialized to the compatibility of the nodes' corresponding data-types.
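The initialization step might look like the following. The `TYPE_COMPAT` table and its values are hypothetical; the text says only that the initial value may be the compatibility of the leaves' data types.

```python
# Hypothetical compatibility scores for pairs of distinct data types.
TYPE_COMPAT = {
    frozenset({"int", "decimal"}): 0.7,
    frozenset({"string", "text"}): 0.9,
}


def type_compatibility(t1, t2):
    if t1 == t2:
        return 1.0
    return TYPE_COMPAT.get(frozenset({t1, t2}), 0.0)


def init_structural_similarity(source_leaves, target_leaves):
    """Initialize ssim for every (source leaf, target leaf) pair.

    Both arguments map leaf name -> data type.
    """
    return {
        (s, t): type_compatibility(s_type, t_type)
        for s, s_type in source_leaves.items()
        for t, t_type in target_leaves.items()
    }
```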
  • the nodes of the two trees are enumerated in inverse topological order, such as post-order.
  • a weighted similarity calculation is made that takes both inherent and structural similarity of the node pair into account.
  • Inherent similarity takes into account only the individual nodes being compared and may be, for example, their linguistic similarity.
  • Structural similarity takes into account the similarity of the subtrees of the node pair, e.g., the leaf sets of the node pair may be considered.
  • the weighted similarity calculation for the node pair (s, t) may then be utilized in connection with either increasing or decreasing the similarity of the subtrees of the node pairs. This reflects that if the nodes are similar, likely the children or leaves rooted by the nodes will be similar as well and by the same token, that if the nodes are dissimilar, then it is likely that the children or leaves of the nodes will be dissimilar.
  • the weight for computing a weighted mean and various thresholds may be set as tuning parameters.
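The weighted calculation and the subsequent adjustment can be sketched as below. The weighted mean follows the description; the specific threshold values, the multiplicative factors `c_inc` and `c_dec`, and the cap at 1.0 are assumed tuning choices and are not given in the text.

```python
def weighted_similarity(inherent, structural, w_struct=0.5):
    """Weighted mean of inherent and structural similarity of a node pair."""
    if not 0.0 <= w_struct <= 1.0:
        raise ValueError("w_struct must lie in [0, 1]")
    return w_struct * structural + (1.0 - w_struct) * inherent


def adjust_subtree_similarity(ssim, leaf_pairs, wsim,
                              th_high=0.6, th_low=0.35,
                              c_inc=1.2, c_dec=0.9):
    """Raise or lower the similarity of leaf pairs under a node pair (s, t).

    ssim is updated in place. Thresholds and factors are assumed values.
    """
    for pair in leaf_pairs:
        if wsim > th_high:
            ssim[pair] = min(1.0, ssim[pair] * c_inc)
        elif wsim < th_low:
            ssim[pair] = ssim[pair] * c_dec
```

Keeping the weight and thresholds as parameters mirrors the statement that they may be set as tuning parameters.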
  • the structural similarity of the two subtrees is determined based on the best matches between corresponding subtrees, e.g., leaf nodes.
  • a good computation for the structural similarity of a node pair (s, t) returns a high value when the number of strong matches of subtree s and subtree t is above a certain threshold, such as half, and a low value otherwise.
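One possible computation with this behavior counts the leaves that have at least one strong match on the other side. The formula and the `th_accept` threshold below are assumptions chosen to satisfy the description, not a formula quoted from the patent.

```python
def structural_similarity(leaves_s, leaves_t, wsim, th_accept=0.5):
    """Fraction of leaves in both subtrees that have a strong match.

    wsim maps (source leaf, target leaf) -> weighted similarity.
    """
    strong_s = {x for x in leaves_s
                if any(wsim.get((x, y), 0.0) >= th_accept for y in leaves_t)}
    strong_t = {y for y in leaves_t
                if any(wsim.get((x, y), 0.0) >= th_accept for x in leaves_s)}
    total = len(leaves_s) + len(leaves_t)
    return (len(strong_s) + len(strong_t)) / total if total else 0.0
```

The result rises above one half only when more than half of the leaves are strongly matched, which gives the high-or-low behavior described.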
  • the similarity computations of the invention thus have a mutually recursive flavor.
  • Two elements are similar if their subtree node sets are similar.
  • the similarity of the subtree nodes is increased if they have ancestors that are similar.
  • the similarity of intermediate substructure also influences subtree similarity: if the subtree structures of two elements are highly similar, then multiple element pairs in the subtrees will be highly similar, which leads to higher structural similarity of the leaves (due to multiple similarity increases).
  • Inverse topological order traversal of the schemas ensures that before two elements e1 and e2 are compared, all the elements in their subtrees have already been compared.
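A post-order enumeration is one way to obtain this ordering. In the sketch below, trees are plain `(name, children)` tuples, which is a representation assumed for brevity; the element names are invented apart from those borrowed from the purchase order example.

```python
def post_order(tree):
    """Yield node names with every child before its parent.

    tree: (name, [child trees])
    """
    name, children = tree
    for child in children:
        yield from post_order(child)
    yield name


def node_pairs(source, target):
    """Enumerate (s, t) pairs so that subtrees are visited before roots."""
    for s in post_order(source):
        for t in post_order(target):
            yield s, t


po = ("PO", [("Lines", [("Item", [("Line", []), ("Qty", [])])]),
             ("Total", [])])
```

For `po`, the leaves Line and Qty are produced before Item, and the root PO comes last.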
  • the invention thus matches models in a bottom-up fashion, making it rather different from top-down approaches.
  • the disadvantage of a top-down technique is that it depends very heavily on a good matching at the top level of the schema hierarchy. The results will not be good if the children of the roots of the two models are very different, but merely present a different normalization of the same schema.
  • a top-down approach may be more efficient when the schemas are very similar.
  • the bottom-up approach of the invention is more conservative and does not suffer from the case of false-negatives, but at the cost of more computation; nonetheless, the performance of the invention in real-time minimizes the impact of this cost.
  • Various levels of subtree may be considered in accordance with the invention. Instead of comparing all of the leaves of a node pair, the invention may consider only the immediate descendants of the elements being compared. Using the leaves for measuring structural similarity identifies most of the matches that this alternative scheme does. However, using the leaves ensures that schemas, which have a moderately different substructure (e.g., nesting of elements), but essentially the same data content (similar leaves), are correctly matched.
  • a post-processing step may be performed on the structural similarity values to construct a mapping between the two models. For example, as part of the post-processing, the two trees can again be traversed in inverse topological order, and each node of the target can be matched with the node of source with which it has highest structural similarity.
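The post-processing step described above could be sketched as follows. Matching each target node to its highest-scoring source node follows the text; the acceptance threshold `th_accept` is an added assumption so that very weak matches are left unmapped.

```python
def build_mapping(ssim, source_nodes, target_nodes, th_accept=0.5):
    """Match each target node with its most similar source node.

    ssim maps (source, target) -> structural similarity.
    Nodes are expected in inverse topological (e.g. post-) order.
    """
    mapping = {}
    for t in target_nodes:
        best = max(source_nodes,
                   key=lambda s: ssim.get((s, t), 0.0),
                   default=None)
        if best is not None and ssim.get((best, t), 0.0) >= th_accept:
            mapping[t] = best
    return mapping
```

This naive selection yields mappings in which several target nodes may share one source node; other cardinalities would need a different selection rule.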
  • a computer or other client or server device can be deployed as part of a computer network, or in a distributed computing environment.
  • the present invention pertains to any computer system having any number of memory or storage units, and any number of applications and processes occurring across any number of storage units or volumes.
  • the present invention may apply to an environment with server computers and client computers deployed in a network environment or distributed computing environment, having remote or local storage.
  • the present invention may also be applied to standalone computing devices, having programming language functionality, interpretation and execution capabilities for generating, receiving and transmitting information in connection with services.
  • Distributed computing facilitates sharing of computer resources and services by direct exchange between computing devices and systems. These resources and services include the exchange of information, cache storage, and disk storage for files. Distributed computing takes advantage of network connectivity, allowing clients to leverage their collective power to benefit the entire enterprise. In this regard, a variety of devices may have data sets for which it would be desirable to perform the matching algorithms of the present invention.
  • FIG. 2A provides a schematic diagram of an exemplary networked or distributed computing environment.
  • the distributed computing environment comprises computing objects 10 a , 10 b , etc. and computing objects or devices 110 a , 110 b , 110 c , etc.
  • These objects may comprise programs, methods, data stores, programmable logic, etc.
  • the objects comprise portions of the same or different devices such as PDAs, televisions, MP3 players, personal computers, etc.
  • Each object can communicate with another object by way of the communications network 14 .
  • This network may itself comprise other computing objects and computing devices that provide services to the system of FIG. 2A .
  • each object 10 or 110 may contain data such that it would be desirable to match that data to other data of other objects 10 or 110 .
  • one of the objects may possess SQL data
  • another of the objects may possess XML data, and it may be desirable to provide a mapping between the associated schemas.
  • computers, which may have traditionally been used solely as clients, communicate directly among themselves and can act as both clients and servers, assuming whatever role is most efficient for the network. This reduces the load on servers and allows all of the clients to access resources available on other clients, thereby increasing the capability and efficiency of the entire network.
  • Distributed computing can help businesses deliver services and capabilities more efficiently across diverse geographic boundaries. Moreover, distributed computing can move data closer to the point where data is consumed acting as a network caching mechanism. Distributed computing also allows computing networks to dynamically work together using intelligent agents. Agents reside on peer computers and communicate various kinds of information back and forth. Agents may also initiate tasks on behalf of other peer systems. For instance, intelligent agents can be used to prioritize tasks on a network, change traffic flow, search for files locally or determine anomalous behavior such as a virus and stop it before it affects the network. All sorts of other services may be contemplated as well. As one of ordinary skill in the distributed computing arts can appreciate, the matching algorithm(s) of the present invention may be implemented in such an environment.
  • an object such as 110 c
  • an object may be hosted on another computing device 10 or 110 .
  • Although the physical environment depicted may show the connected devices as computers, such illustration is merely exemplary, and the physical environment may alternatively be depicted or described as comprising various digital devices such as PDAs, televisions, MP3 players, etc., and software objects such as interfaces, COM objects and the like.
  • computing systems may be connected together by wireline or wireless systems, by local networks or widely distributed networks.
  • networks are coupled to the Internet, which provides the infrastructure for widely distributed computing and encompasses many different networks.
  • Data Services may enter the home as broadband (e.g., either DSL or Cable modem) and are accessible within the home using either wireless (e.g., HomeRF or 802.11b) or wired (e.g., Home PNA, Cat 5, even power line) connectivity.
  • Voice traffic may enter the home either as wired (e.g., Cat 3) or wireless (e.g., cell phones) and may be distributed within the home using Cat 3 wiring.
  • Entertainment Media may enter the home either through satellite or cable and is typically distributed in the home using coaxial cable.
  • IEEE 1394 and DVI are also emerging as digital interconnects for clusters of media devices. All of these network environments and others that may emerge as protocol standards may be interconnected to form an intranet that may be connected to the outside world by way of the Internet.
  • the matching algorithm(s) of the present invention may provide such common ground by providing mappings between the disparately structured and named data.
  • the Internet commonly refers to the collection of networks and gateways that utilize the TCP/IP suite of protocols, which are well-known in the art of computer networking.
  • TCP/IP is an acronym for “Transmission Control Protocol/Internet Protocol.”
  • the Internet can be described as a system of geographically distributed remote computer networks interconnected by computers executing networking protocols that allow users to interact and share information over the networks. Because of such wide-spread information sharing, remote networks such as the Internet have thus far generally evolved into an open system for which developers can design software applications for performing specialized operations or services, essentially without restriction.
  • the network infrastructure enables a host of network topologies such as client/server, peer-to-peer, or hybrid architectures.
  • the “client” is a member of a class or group that uses the services of another class or group to which it is not related.
  • a client is a process (i.e., roughly a set of instructions or tasks) that requests a service provided by another program.
  • the client process utilizes the requested service without having to “know” any working details about the other program or the service itself.
  • a client/server architecture particularly a networked system
  • a client is usually a computer that accesses shared network resources provided by another computer, e.g., a server.
  • computers 110 a , 110 b , etc. can be thought of as clients and computer 10 a , 10 b , etc. can be thought of as the server where server 10 a , 10 b , etc. maintains the data that is then replicated in the client computers 110 a , 110 b , etc.
  • a server is typically a remote computer system accessible over a remote network such as the Internet.
  • the client process may be active in a first computer system, and the server process may be active in a second computer system, communicating with one another over a communications medium, thus providing distributed functionality and allowing multiple clients to take advantage of the information-gathering capabilities of the server.
  • Client and server communicate with one another utilizing the functionality provided by a protocol layer.
  • Hypertext Transfer Protocol (HTTP) is a common protocol that is used in conjunction with the World Wide Web (WWW) or, simply, the “Web.”
  • a computer network address such as a Uniform Resource Locator (URL) or an Internet Protocol (IP) address is used to identify the server or client computers to each other.
  • the network address can be referred to as a URL address.
  • communication can be provided over a communications medium.
  • the client and server may be coupled to one another via TCP/IP connections for high-capacity communication.
  • FIG. 2A illustrates an exemplary networked or distributed environment, with a server in communication with client computers via a network/bus, in which the present invention may be employed.
  • a number of servers 10 a , 10 b , etc. are interconnected via a communications network/bus 14 , which may be a LAN, WAN, intranet, the Internet, etc., with a number of client or remote computing devices 110 a , 110 b , 110 c , 110 d , 110 e , etc., such as a portable computer, handheld computer, thin client, networked appliance, or other device, such as a VCR, TV, oven, light, heater and the like in accordance with the present invention.
  • the present invention may apply to any computing device in connection with which it is desirable to communicate to another computing device with respect to matching services.
  • the servers 10 can be Web servers with which the clients 110 a , 110 b , 110 c , 110 d , 110 e , etc. communicate via any of a number of known protocols such as HTTP.
  • Servers 10 may also serve as clients 110 , as may be characteristic of a distributed computing environment. Communications may be wired or wireless, where appropriate.
  • Client devices 110 may or may not communicate via communications network/bus 14 , and may have independent communications associated therewith. For example, in the case of a TV or VCR, there may or may not be a networked aspect to the control thereof.
  • Each client computer 110 and server computer 10 may be equipped with various application program modules or objects 135 and with connections or access to various types of storage elements or objects, across which files may be stored or to which portion(s) of files may be downloaded or migrated.
  • Any computer 10 a , 10 b , 110 a , 110 b , etc. may be responsible for the maintenance and updating of a database 20 or other storage element in accordance with the present invention, such as a database 20 for storing schema or model data in accordance with the present invention.
  • the present invention can be utilized in a computer network environment having client computers 110 a , 110 b , etc. that can access and interact with a computer network/bus 14 and server computers 10 a , 10 b , etc. that may interact with client computers 110 a , 110 b , etc. and other devices 111 and databases 20 .
  • FIG. 2B and the following discussion are intended to provide a brief general description of a suitable computing environment in which the invention may be implemented. It should be understood, however, that handheld, portable and other computing devices and computing objects of all kinds are contemplated for use in connection with the present invention. While a general purpose computer is described below, this is but one example, and the present invention requires only a thin client having network/bus interoperability and interaction. Thus, the present invention may be implemented in an environment of networked hosted services in which very little or minimal client resources are implicated, e.g., a networked environment in which the client device serves merely as an interface to the network/bus, such as an object placed in an appliance. In essence, anywhere that data may be stored or from which data may be retrieved is a desirable, or suitable, environment for operation of the matching algorithm(s) of the invention.
  • the invention can be implemented via an operating system, for use by a developer of services for a device or object, and/or included within application software which aids in matching data sets.
  • Software may be described in the general context of computer-executable instructions, such as program modules, being executed by one or more computers, such as client workstations, servers or other devices.
  • program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types.
  • the functionality of the program modules may be combined or distributed as desired in various embodiments.
  • those skilled in the art will appreciate that the invention may be practiced with other computer system configurations.
  • Examples include personal computers (PCs), automated teller machines, server computers, hand-held or laptop devices, multi-processor systems, microprocessor-based systems, programmable consumer electronics, network PCs, appliances, lights, environmental control elements, minicomputers, mainframe computers and the like.
  • the invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network/bus or other data transmission medium.
  • program modules may be located in both local and remote computer storage media including memory storage devices, and client nodes may in turn behave as server nodes.
  • FIG. 2B thus illustrates an example of a suitable computing system environment 100 in which the invention may be implemented, although as made clear above, the computing system environment 100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing environment 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment 100 .
  • an exemplary system for implementing the invention includes a general purpose computing device in the form of a computer 110 .
  • Components of computer 110 may include, but are not limited to, a processing unit 120 , a system memory 130 , and a system bus 121 that couples various system components including the system memory to the processing unit 120 .
  • the system bus 121 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures.
  • such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus (also known as Mezzanine bus).
  • Computer 110 typically includes a variety of computer readable media.
  • Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media.
  • Computer readable media may comprise computer storage media and communication media.
  • Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CDROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110 .
  • Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
  • modulated data signal means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal, such as information processed according to the invention or information incident to carrying out the algorithms of the invention.
  • communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
  • the system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132 .
  • BIOS basic input/output system
  • RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120 .
  • FIG. 2B illustrates operating system 134 , application programs 135 , other program modules 136 , and program data 137 .
  • the computer 110 may also include other removable/non-removable, volatile/nonvolatile computer storage media.
  • FIG. 2B illustrates a hard disk drive 141 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152 , and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk 156 , such as a CD ROM or other optical media.
  • removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like.
  • the hard disk drive 141 is typically connected to the system bus 121 through a non-removable memory interface such as interface 140, and
  • magnetic disk drive 151 and optical disk drive 155 are typically connected to the system bus 121 by a removable memory interface, such as interface 150 .
  • hard disk drive 141 is illustrated as storing operating system 144 , application programs 145 , other program modules 146 , and program data 147 . Note that these components can either be the same as or different from operating system 134 , application programs 135 , other program modules 136 , and program data 137 . Operating system 144 , application programs 145 , other program modules 146 , and program data 147 are given different numbers here to illustrate that, at a minimum, they are different copies.
  • a user may enter commands and information into the computer 110 through input devices such as a keyboard 162 and pointing device 161 , commonly referred to as a mouse, trackball or touch pad.
  • Other input devices may include a microphone, joystick, game pad, satellite dish, scanner, or the like.
  • These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus 121 , but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB).
  • a monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190 .
  • computers may also include other peripheral output devices such as speakers 197 and printer 196 , which may be connected through an output peripheral interface 195 .
  • the computer 110 may operate in a networked or distributed environment using logical connections to one or more remote computers, such as a remote computer 180 .
  • the remote computer 180 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110 , although only a memory storage device 181 has been illustrated in FIG. 2B .
  • the logical connections depicted in FIG. 2B include a local area network (LAN) 171 and a wide area network (WAN) 173 , but may also include other networks/buses.
  • Such networking environments are commonplace in homes, offices, enterprise-wide computer networks, intranets and the Internet.
  • When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170 .
  • When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173 , such as the Internet.
  • the modem 172 , which may be internal or external, may be connected to the system bus 121 via the user input interface 160 , or other appropriate mechanism.
  • program modules depicted relative to the computer 110 may be stored in the remote memory storage device.
  • FIG. 2B illustrates remote application programs 185 as residing on memory device 181 . It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
  • .Net is a computing framework that has been developed in light of the convergence of personal computing and the Internet. Individuals and business users alike are provided with a seamlessly interoperable and Web-enabled interface for applications and computing devices, making computing activities increasingly Web browser or network-oriented.
  • the .Net platform includes servers, building-block services, such as Web-based data storage and downloadable device software.
  • the .Net platform provides (1) the ability to make the entire range of computing devices work together and to have user information automatically updated and synchronized on all of them, (2) increased interactive capability for Web sites, enabled by greater use of XML rather than Hypertext Markup Language (HTML), (3) online services that feature customized access and delivery of products and services to the user from a central starting point for the management of various applications, such as e-mail, for example, or software, such as Office .Net, (4) centralized data storage, which will increase efficiency and ease of access to information, as well as synchronization of information among users and devices, (5) the ability to integrate various communications media, such as e-mail, faxes, and telephones, (6) for developers, the ability to create reusable modules, thereby increasing productivity and reducing the number of programming errors and (7) many other cross-platform integration features as well.
  • portions of the invention may also be implemented via an operating system or a “middle man” object between a network and device or object, such that data matching services may be performed by, supported in or accessed via all of Microsoft's .NET languages and services.
  • Using schemas S3 and S4 of FIG. 3, aspects of the present invention may be illustrated in connection with matching two similar schemas, PO and Purchase Order.
  • the schemas are encoded as graphs, where nodes represent schema elements. Although even a casual observer can see the schemas are very similar, there is much variation in the naming and the structure that makes algorithmic matching quite challenging.
  • the present invention approaches the matching problem by computing similarity coefficients between elements of the two schemas, from which a mapping between the elements may be deduced.
  • the coefficients, in the [0,1] range, are calculated in two phases: inherent matching and structural matching.
  • the first phase, inherent matching, which may be linguistic matching, matches individual schema elements based on their names, data types, domains, etc.
  • One or more dictionaries and/or thesauri can be used to help match names by identifying short forms (Qty for Quantity m2), acronyms (UoM for UnitOfMeasure m3) and synonyms (Bill and Invoice m4).
  • the result is a linguistic similarity coefficient, lsim, between each pair of elements, e.g., lsim1 for m1, lsim2 for m2, etc.
  • the second phase is the structural matching of schema elements based on the similarity of their contexts or vicinities. For example, Line is mapped to ItemNumber m5 because their parents, i.e., Item, match and the other two children of Item, i.e., Qty for Quantity and UoM for UnitOfMeasure, already match.
  • the structural match depends in part on linguistic matches calculated in phase one. For example, City and Street under POBillTo match City m6 and Street m7 under InvoiceTo, rather than under DeliverTo, because Bill is a synonym of Invoice but not of Deliver.
  • the result is a structural similarity coefficient, ssim, e.g., ssim1 for m1, ssim2 for m2, etc. for each pair of elements.
  • wsim weighted similarity
  • mapping generation wherein pairs of schema elements with maximal weighted similarity are chosen for mappings between the schema elements.
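  • As an illustration of how the two coefficients might be combined, the sketch below blends lsim and ssim into wsim as a convex combination controlled by a weight w_struct. The exact formula is an assumption made for this example; the text here only names the weighted similarity and later describes w_struct as the structural contribution to it.

```python
def weighted_similarity(lsim: float, ssim: float, w_struct: float = 0.5) -> float:
    """Blend linguistic and structural similarity into one coefficient.

    Assumed form: wsim = w_struct * ssim + (1 - w_struct) * lsim.
    With lsim, ssim and w_struct all in [0, 1], the result is also in [0, 1].
    """
    for value in (lsim, ssim, w_struct):
        if not 0.0 <= value <= 1.0:
            raise ValueError("lsim, ssim and w_struct must lie in [0, 1]")
    return w_struct * ssim + (1.0 - w_struct) * lsim


# A pair with a strong name match but weak structural support.
print(weighted_similarity(lsim=0.8, ssim=0.4, w_struct=0.5))
```

  • Raising w_struct makes the structural context dominate; lowering it makes the element names dominate.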
  • the linguistic matching of the invention is based primarily on schema element names. In the absence of data instances, such names are probably the most useful source of information for matching. The invention also makes modest use of data types and schema structure in this phase.
  • Inherent matching proceeds in three steps in one embodiment: normalization, categorization and comparison. These steps are described in more detail below in the section relating to inherent similarities. For now, it suffices to note that the comparison step generates a set of inherent similarity coefficients lsim between node pairs of the models being compared.
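  • A minimal sketch of name-based matching follows, covering only a simplified normalization step and a token comparison step (categorization is omitted). The thesaurus entries reuse the examples from the text (Qty, UoM, Bill/Invoice); the tokenizer, the 0.8 synonym score and the averaging rule are assumptions made for illustration, not the method described in this document.

```python
import re

# Hypothetical thesaurus contents, seeded with the examples from the text.
ABBREVIATIONS = {"qty": "quantity", "uom": "unit of measure", "po": "purchase order"}
SYNONYMS = {frozenset({"bill", "invoice"})}
SYNONYM_SCORE = 0.8  # assumed score for a synonym pair


def tokenize(name):
    """Normalize an element name such as 'POBillTo' into lowercase tokens."""
    whole = name.lower()
    if whole in ABBREVIATIONS:  # e.g. 'UoM' is expanded as a whole name
        return ABBREVIATIONS[whole].split()
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    tokens = []
    for part in parts:
        tokens.extend(ABBREVIATIONS.get(part.lower(), part.lower()).split())
    return tokens


def token_similarity(a, b):
    if a == b:
        return 1.0
    if frozenset({a, b}) in SYNONYMS:
        return SYNONYM_SCORE
    return 0.0


def name_similarity(name1, name2):
    """Average, over all tokens of both names, of each token's best match."""
    t1, t2 = tokenize(name1), tokenize(name2)
    if not t1 or not t2:
        return 0.0
    best1 = sum(max(token_similarity(a, b) for b in t2) for a in t1)
    best2 = sum(max(token_similarity(a, b) for a in t1) for b in t2)
    return (best1 + best2) / (len(t1) + len(t2))


print(tokenize("POBillTo"))
print(name_similarity("BillTo", "InvoiceTo") > name_similarity("BillTo", "DeliverTo"))
```

  • In this sketch BillTo scores higher against InvoiceTo than against DeliverTo only because Bill and Invoice are listed as synonyms, which mirrors the role the thesaurus plays in the text.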
  • the structural similarity of two leaves is initialized to the type compatibility of their corresponding data types, although the structural similarity may be initialized to other values respecting subtrees as well.
  • this initialization value ([0,0.5]) is a lookup in a compatibility table. Identical data types have a compatibility of 0.5. As described below, a value of 0.5 allows for later increases or decreases in structural similarity based on increases or decreases in confidence.
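  • The leaf initialization can be sketched as a table lookup. Only the value 0.5 for identical data types comes from the text; the type names and the other table entries below are hypothetical placeholders.

```python
# Hypothetical compatibility table; every value lies in [0, 0.5].
TYPE_COMPATIBILITY = {
    frozenset({"integer", "decimal"}): 0.4,  # assumed value
    frozenset({"string", "integer"}): 0.2,   # assumed value
}


def type_compatibility(type1, type2):
    """Look up the compatibility of two data types (0.5 when identical)."""
    if type1 == type2:
        return 0.5
    return TYPE_COMPATIBILITY.get(frozenset({type1, type2}), 0.0)


def init_leaf_ssim(source_leaves, target_leaves):
    """Initialize ssim for every leaf pair from the types of the two leaves.

    Each argument maps a leaf name to its data type.
    """
    return {
        (s, t): type_compatibility(s_type, t_type)
        for s, s_type in source_leaves.items()
        for t, t_type in target_leaves.items()
    }


ssim = init_leaf_ssim({"Qty": "integer"}, {"Quantity": "integer", "Street": "string"})
print(ssim)
```

  • Starting identical types at 0.5 rather than 1.0 leaves headroom for the later increases and decreases described in the text.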
  • the elements in the two trees are enumerated in inverse topological order, such as post-order, which is uniquely defined for a given tree. Both the inner and outer loops are executed in this order.
  • the first step in the loop computes the structural similarity of two elements (s, t). For leaves, this is the value of ssim that was initialized in the earlier loop.
  • the structural similarity is computed as a measure of the number of leaf level matches in the subtrees rooted at the elements that are being compared, reflecting the intuition that when leaf structure is similar, so will be the structure of the root elements.
  • the invention indicates that a leaf in one schema has a strong link to a leaf in the other schema if their weighted similarity exceeds a threshold th_accept.
  • Exceeding the threshold th_accept indicates a potentially acceptable mapping.
  • the algorithm of the invention recognizes when the leaves in two subtrees match, even if the subtree structures that contain them do not match precisely. This is often the case when the top-level organization of the same data is very different in the two models. This is why it is beneficial to use leaves rather than internal nodes when comparing two subtrees.
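  • The post-order enumeration and the strong-link idea can be sketched as follows. The Node class is a hypothetical stand-in for a schema element, and the specific measure used here (the fraction of leaves in both subtrees that have at least one strong link into the other subtree) is an assumption; the text only says the structural similarity is a measure of the number of leaf-level matches.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical schema-tree node; not a structure defined in the text."""
    name: str
    children: list = field(default_factory=list)


def post_order(node):
    """Yield nodes children-first, so each subtree is visited before its root."""
    for child in node.children:
        yield from post_order(child)
    yield node


def leaves(node):
    return [n for n in post_order(node) if not n.children]


def has_strong_link(leaf, others, wsim, th_accept, leaf_is_source):
    """True if the leaf's wsim with any leaf on the other side exceeds th_accept."""
    for other in others:
        key = (leaf.name, other.name) if leaf_is_source else (other.name, leaf.name)
        if wsim.get(key, 0.0) > th_accept:
            return True
    return False


def structural_similarity(s, t, wsim, th_accept=0.5):
    """Assumed measure: share of leaves under s and t that have a strong link."""
    s_leaves, t_leaves = leaves(s), leaves(t)
    linked = sum(has_strong_link(x, t_leaves, wsim, th_accept, True) for x in s_leaves)
    linked += sum(has_strong_link(y, s_leaves, wsim, th_accept, False) for y in t_leaves)
    return linked / (len(s_leaves) + len(t_leaves))


source = Node("Item", [Node("Line"), Node("Qty"), Node("UoM")])
target = Node("Item", [Node("ItemNumber"), Node("Quantity"), Node("UnitOfMeasure")])
wsim = {("Qty", "Quantity"): 0.9, ("UoM", "UnitOfMeasure"): 0.9, ("Line", "ItemNumber"): 0.3}

print([n.name for n in post_order(source)])
print(structural_similarity(source, target, wsim))
```

  • Because only leaves are counted, the two Item elements score well even though Line and ItemNumber have not yet matched, which is the situation described for m5 in the example schemas.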
  • a second pass in the calculation of the structural similarity may be utilized. For example, in FIG. 4, suppose the subtrees under the Address and Address elements are identical, as shown by the identical triangles underneath. Then, during the post-order traversal, the Address element in Model 1 will have the same structural similarity to both the Address and Address elements of Model 2. Then, suppose the Contact elements in the two models are compared in the structural similarity calculation and it is determined that they have a structural similarity greater than the threshold th_high, thereby causing the similarity of their leaf sets to be increased.
  • the initial structural similarity used for the second pass structural similarity calculation would be higher than its value resulting from the first calculation, because the leaf sets' similarity was raised. Moreover, this higher value would now cause the Address element of Model 1 to have a higher structural similarity to the Address element of Model 2 than to the Address of Model 2 , thereby changing the result of the match.
  • the second pass does not increase the similarity of leaf sets. Therefore, only two passes are utilized, i.e., if a third pass were performed, the third pass would yield the same value as the second pass since none of the inputs to the second pass's structural similarity calculation will have changed, and inherent similarity remains the same.
  • Mapping generation is one process that can benefit from a second pass calculation by recomputing the similarities of the nonleaf elements, since the updating of leaf similarities during tree match may have affected the structural similarity of nonleaf nodes after they were first calculated. After this recalculation, a scheme similar to leaf level mapping generation can be used.
  • the mapping that is produced consists of a list of mapping elements or correspondences.
  • a further step may be to enrich the structure of the map itself. For example, the mapping element between two XML elements e1 and e2 may have as its subelements the mapping elements between matching XML attributes of e1 and e2.
  • mapping elements may be generated by using any one or more of the computed linguistic, structural and weighted similarities.
  • the invention might just use leaf level mapping elements. For each leaf element t in the target schema, if the leaf element s in the source schema with highest weighted similarity to t is acceptable (wsim(s, t) ≥ th_accept), then a mapping element from s to t is returned. This resulting mapping may be 1:n, since a source element may map to many target elements.
  • the exact nature of a mapping is often dependent on requirements of the module that accepts these mappings. For example, query discovery might require a one to one mapping instead of the 1 to n mapping. Such requirements need to be captured by a data model specific or tool specific mapping generator that takes the computed similarities as input.
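  • The leaf-level mapping generation described above can be sketched as follows. The dictionary representation of wsim and the element names are assumptions made for the example; the acceptance test uses "at least th_accept", which is one reading of the comparison in the text.

```python
def generate_leaf_mapping(source_leaves, target_leaves, wsim, th_accept=0.5):
    """For each target leaf, keep its best source leaf if the pair is acceptable.

    wsim maps (source, target) name pairs to weighted similarity; missing
    pairs count as 0.0. The result may be 1:n because the same source leaf
    can be the best candidate for several target leaves.
    """
    mapping = []
    if not source_leaves:
        return mapping
    for t in target_leaves:
        best = max(source_leaves, key=lambda s: wsim.get((s, t), 0.0))
        if wsim.get((best, t), 0.0) >= th_accept:
            mapping.append((best, t))
    return mapping


wsim = {
    ("City", "BillCity"): 0.8,
    ("City", "ShipCity"): 0.7,
    ("Street", "BillCity"): 0.2,
    ("Street", "Fax"): 0.1,
}
print(generate_leaf_mapping(["City", "Street"], ["BillCity", "ShipCity", "Fax"], wsim))
```

  • A consumer that needs a one-to-one mapping, such as the query discovery case mentioned above, would need an additional selection step on top of this output.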
  • initial mappings are provided.
  • the matcher algorithm utilizes a user supplied initial mapping to help initialize leaf similarities prior to structural matching, described above.
  • the linguistic similarity of elements marked as similar in the initial map is initialized to a predefined maximum value.
  • a user can make corrections to a generated result map, and then rerun the match with the corrected input map, thereby generating an improved map.
  • initial maps are a way to incorporate user interaction into the matching process. In one embodiment, an initial map is information stating that two leaves, branches or nodes in the two schemas being matched correspond to each other. The user may also indicate whether each such input is based on actual knowledge of structural information and/or linguistic information.
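  • Seeding the matcher with a user-supplied initial map might look like the sketch below. The dictionary representation and the maximum value of 1.0 are assumptions; the text only says the linguistic similarity of marked pairs is initialized to a predefined maximum value.

```python
def seed_lsim(lsim, initial_map, max_value=1.0):
    """Return a copy of lsim in which user-confirmed pairs get the maximum value.

    lsim maps (source, target) name pairs to linguistic similarity;
    initial_map is an iterable of pairs the user has marked as similar.
    """
    seeded = dict(lsim)
    for pair in initial_map:
        seeded[pair] = max_value
    return seeded


computed = {("Line", "ItemNumber"): 0.1, ("Qty", "Quantity"): 0.9}
seeded = seed_lsim(computed, [("Line", "ItemNumber")])
print(seeded)
```

  • Rerunning the match with a corrected map then amounts to calling the matcher again with the reseeded coefficients.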
  • a pruning leaves process is provided.
  • an element e high in the tree has a large number of leaves.
  • These leaves increase the computation time, even though many of them are irrelevant for matching e. Therefore, it may be better to consider only nodes in a subtree of depth k rooted at node e, thereby pruning the leaves. While comparing nearly identical schemas, it might seem wasteful to compare the leaves. To avoid this, the immediate children of the nodes are first compared. If a very good match is detected, then the leaf level similarity computation is skipped.
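  • Pruning to depth k can be sketched as collecting the frontier of a subtree: nodes exactly k levels below the root, plus any true leaves that occur higher up. The Node class and the element names are hypothetical, and treating the frontier nodes as stand-ins for leaves is an assumption about how the pruned comparison would proceed.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical schema-tree node used only for this illustration."""
    name: str
    children: list = field(default_factory=list)


def pruned_leaves(node, k):
    """Return the nodes that act as leaves when the subtree is cut at depth k."""
    if k == 0 or not node.children:
        return [node]
    frontier = []
    for child in node.children:
        frontier.extend(pruned_leaves(child, k - 1))
    return frontier


po = Node("PO", [
    Node("Header", [Node("Date")]),
    Node("Items", [Node("Item", [Node("Qty"), Node("UoM")])]),
])

for depth in (1, 2, 10):
    print(depth, [n.name for n in pruned_leaves(po, depth)])
```

  • With k = 1 only the immediate children are compared, which corresponds to the quick check the text suggests before deciding whether the full leaf-level computation is needed.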
  • mappings are displayed by a standalone application such as BIZTALK MAPPER®, which can compile them into Extensible Stylesheet Language (XSL) translation scripts.
  • a mapping service may also be downloaded from a server in a network, provided by an application service provider, provided as part of an operating system, etc.
  • the parameter th_high is used in connection with the determination as to whether wsim(s,t) ≥ th_high. If so, then the structural similarity between all pairs of leaves in the two subtrees rooted at s and t is increased. While the invention does not lie in any particular value of this parameter, the parameter should be chosen to be greater than th_accept.
  • An exemplary value for th_high is 0.6.
  • the parameter th_low is used in connection with the determination as to whether wsim(s,t) ≤ th_low. If so, then the structural similarity between all pairs of leaves in the two subtrees rooted at s and t is decreased. While the invention does not lie in any particular value of this parameter, the parameter should be chosen to be less than th_accept.
  • An exemplary value for th_low is 0.35.
  • the parameter c_inc is the multiplicative factor by which leaf structural similarities are increased.
  • the parameter c_inc is typically a function of maximum schema depth or depth to which nodes are considered for structural similarity.
  • An exemplary value for the parameter c_inc is 1.2.
  • the parameter c_dec is the multiplicative factor by which leaf structural similarities are decreased.
  • the parameter c_dec is set to be about c_inc^(-1), i.e., 1/c_inc.
  • an exemplary value for the parameter c_dec is 0.9.
  • the parameter th_accept is used in connection with the determination of whether wsim(s,t) ≥ th_accept, suggesting whether s and t have a strong link or have a valid mapping element.
  • An exemplary value for the parameter th_accept is 0.5.
  • the parameter w_struct is the structural similarity contribution to wsim. Typically, this value is different for leaves and nonleaves, with the value being lower for leaf-leaf pairs than for nonleaf pairs. An exemplary range for this value is from 0.5 to 0.6.
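  • The parameters can be gathered into a small configuration object, with the threshold test applied to the leaf pairs of two subtrees. The defaults are the exemplary values given above; the class and function names, the inclusive comparisons and the cap at 1.0 (to keep coefficients in the [0,1] range) are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchParams:
    """Exemplary parameter values from the text; names are illustrative."""
    th_high: float = 0.6
    th_low: float = 0.35
    th_accept: float = 0.5
    c_inc: float = 1.2
    c_dec: float = 0.9
    w_struct: float = 0.5

    def __post_init__(self):
        # The text asks for th_low < th_accept < th_high.
        if not self.th_low < self.th_accept < self.th_high:
            raise ValueError("expected th_low < th_accept < th_high")


def adjust_leaf_ssim(ssim, leaf_pairs, wsim_st, params):
    """Scale ssim of all leaf pairs under (s, t) according to wsim(s, t).

    ssim is modified in place; values are capped at 1.0 (an assumption).
    """
    if wsim_st >= params.th_high:
        factor = params.c_inc
    elif wsim_st <= params.th_low:
        factor = params.c_dec
    else:
        return  # between the thresholds: leave the leaf similarities alone
    for pair in leaf_pairs:
        ssim[pair] = min(1.0, ssim[pair] * factor)


params = MatchParams()
ssim = {("Qty", "Quantity"): 0.5}
adjust_leaf_ssim(ssim, [("Qty", "Quantity")], wsim_st=0.7, params=params)
print(ssim)
```

  • Keeping th_low and th_high on either side of th_accept means that only clearly good or clearly poor subtree matches feed back into the leaf similarities.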
  • the present invention improves on past methods in many respects, for example, by including a substantial linguistic matching step and by biasing matches by leaves of a schema. While merely one novel feature described herein, no prior art techniques have been known to relate objects that have similar leaf sets in the manner employed by the present invention.
  • the invention makes such consideration due to the observation that leaves describe the technical content of a schema, e.g., the columns of a table or the attributes and leaf elements of an XML model, which is often a more important match criterion than internal structure.
  • the internal structure is sometimes arbitrary, where different designers group the same information in different ways due to differences in taste. Sometimes the differences are due to limitations of the data models in which schemas are represented. For example, in SQL, table definitions are flat, whereas XML schemas can have nested subelements to represent substructure.
  • the algorithm may be implemented as an independent component, or integrated into a particular application.
  • the present invention may also be combined with other techniques, such as machine learning applied to instances, natural language technology, and pattern matching to reuse known matches.
  • the invention thus provides a general-purpose schema matching component that can be used in systems for schema integration, data migration, etc.
  • FIG. 5 illustrates an exemplary non-limiting top-level architecture of an exemplary system in which the present invention may operate.
  • Import-export module 580 and generic model matching algorithm 570 may be combined in a single component 540 , such as a COM component, e.g., a dynamic link library (DLL) that can be loaded by any application that requests component 540 .
  • Two schemas are accepted, encoded in some format such as the XML Document Object Model (DOM) 550 .
  • relational schemas can be represented in the SQL subset of Semantic Modeling Format (SMF), which is an XML-based data exchange format used by the English Query facility in MICROSOFT® SQL Server. XML schemas can be represented in either SMF or XDR format. Both formats are XML, and techniques for parsing them into DOM format are therefore known.
  • the system then produces an output map, which may also be in DOM format.
  • the use of XML DOM 550 as the input and output format to communicate between the graphical user interface (GUI) and model matching component 540 is merely a convenience, and any format may be accommodated since the invention provided is a generic solution. Any format that can be imported into the generic object model is satisfactory.
  • the import/export module 580 converts the DOM representation 550 of the input schemas into the internal object model 560 of model matching component 540 .
  • the matching algorithm 570 operates on two models represented in the internal object model 560 and computes a node similarity matrix, which may be transformed into a map, which is also represented in the internal object model 560 .
  • the algorithm 570 is generic and depends only on the generic object model, which is unaffected by the data model used to represent the input models. Conversion of schemas to a generic object model is described in more detail below in the section regarding generic object modeling.
  • the generic model matching component 540 is designed to be extensible. In one embodiment, its top-level procedure simply calls multiple matching algorithms in sequence, all of which have the same interface. Each matching algorithm can be implemented as a separate sub-module. These sub-modules can pass matching information between each other through the top-level procedure. This modular structure allows new model matching algorithms to be added without altering the overall structure of model matching component 540 .
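  • The modular structure described above might be sketched in Python as follows. The Matcher signature, the two toy matchers, and the function run_matchers are hypothetical names chosen for illustration; the sketch only shows sub-modules with one shared interface being called in sequence and passing matching information through a top-level procedure.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]
Similarities = Dict[Pair, float]
# Every sub-module shares one interface: (source, target, prior) -> similarities.
Matcher = Callable[[List[str], List[str], Similarities], Similarities]


def exact_name_matcher(source: List[str], target: List[str],
                       prior: Similarities) -> Similarities:
    """Toy matcher: case-insensitive name equality scores 1.0."""
    result = dict(prior)
    for s in source:
        for t in target:
            if s.lower() == t.lower():
                result[(s, t)] = 1.0
    return result


def prefix_matcher(source: List[str], target: List[str],
                   prior: Similarities) -> Similarities:
    """Toy matcher: unscored pairs sharing a 3-letter prefix score 0.5."""
    result = dict(prior)
    for s in source:
        for t in target:
            if (s, t) not in result and s[:3].lower() == t[:3].lower():
                result[(s, t)] = 0.5
    return result


def run_matchers(source: List[str], target: List[str],
                 matchers: List[Matcher]) -> Similarities:
    """Top-level procedure: call each sub-module in sequence."""
    info: Similarities = {}
    for matcher in matchers:
        info = matcher(source, target, info)
    return info
```

  • A new matching algorithm is added by appending another function to the list passed to run_matchers, leaving the top-level procedure unchanged.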
  • the exemplary system of FIG. 5 may include two different matching algorithms (i.e., sub-modules) combined to perform the matching algorithms of the present invention, or the two matching algorithms may be integrated.
  • the first algorithm may match individual elements of the schemas by using linguistic information about the name of each element and by using each element's data type. Other type-oriented information can be added to the generic object model so that the algorithm can exploit items such as whether there are null elements, default values, whether values are members of an enumeration and whether elements are mandatory or optional.
  • the second algorithm may be the structure-matching component that exploits the hierarchical or graph-like structure of the schemas.
  • This sub-module may match elements whose neighborhoods in the two schemas also match.
  • These two algorithms may produce corresponding similarity coefficients, from which weighted similarity coefficients may be constructed, and from which a resultant map may be constructed based on a combination thereof.
  • a single component could perform both the linguistic and structural analysis.
  • a modeling application 520 may open 510 or save 501 a file having data sets, or mapping data for the data sets, etc.
  • a driver 530 assists in retrieving 502 or saving 509 data from or to a data store, and also makes calls 503 and receives results from a model match component 540 in accordance with the invention.
  • calls are made in XML DOM format 550 .
  • An import/export object 580 of match component 540 imports models 504 from and exports mappings 507 for the data sets of DOM 550 .
  • the invention abstracts the data sets to a generic object model 560 and calls 505 the model match module 570 to perform the model match algorithm(s) of the invention.
  • Model match module 570 returns 506 the results in terms understood by the generic object model 560 utilized by the invention.
  • the user can modify a generated result map, making corrections, and then perform the model match again with the corrected map as an input, thereby generating an improved map.
  • initial mappings provide a means of capturing user interaction with the model matching process.
  • one implementation of the invention may be to incorporate the algorithm(s) into a matching application or tool that provides a user interface for mapping two schemas, with appropriate user interaction with the mapping process to subjectively validate the quality of result.
  • the performance of the algorithm(s) of the invention may comprise several phases, as shown in FIG. 6 .
  • the inherency matching component involves elements 600 to 660 and operates on the name of model elements and certain other information that may be data model specific, such as data types and names of Strong Containers.
  • the structural matching component involves element 665 . As described earlier, from inherency matching coefficients and structural matching coefficients, a mapping may be produced between two schemas.
  • a Conversion of Names to Normal Form component 600 includes three sub-components, split 605 , expand 610 and eliminate 615 , to normalize the input name data.
  • source SS and target TS schemas are input to any one or more of the embodiments of the model match algorithms of the invention and are tokenized by split sub-component 605 to convert the name(s) of the model elements to a normal form.
  • common abbreviations and acronyms are maintained in a data store 620 and are used to substitute for the true content by expand sub-component 610 .
  • Eliminate sub-component 615 eliminates expletives, prepositions and conjunctions. A list of expletives, prepositions, conjunctions and other unhelpful input items may be stored in a data store 625 .
  • in categorization 630 , after converting a name to a set of word tokens, additional word tokens are added to the normal form to describe each model element's data type, if it has one, and concepts to which it is related. These additions are mostly driven by the content of another data store 635 , which associates words with concepts. It can be appreciated that data stores 620 , 625 , 635 etc. may also be integrated. Categorization is performed separately for each model SS and TS, since the notion of compatibility may be different for a single model than for a pair of models.
  • model elements are grouped into categories based on common tokens. Each category is associated with a set of keywords that describe the category. Once categorized, name similarity is calculated using a name similarity algorithm, which may include an analysis of synonyms and hypernyms 645 and/or an analysis of other relations 650 .
  • the invention is not limited to analysis based upon sub-component 645 .
  • Other options 650 include querying a semantic network tool 660 , which builds relationships and computes similarities among words by parsing a dictionary or thesaurus. However, performing such queries on the fly might be time consuming. On-the-fly querying of the semantic network tool 660 could be avoided by a pre-processing step that uses information in the semantic network tool 660 to populate the thesaurus 655 . Or it could be a post-processing step after the matching process that adds new similarity relationships in the thesaurus for word pairs that were not found during the matching.
  • some model elements are not name matched because they do not have a name or their name is not significant. For example, a key does not have a name, but the columns that comprise the key do. Only model elements that have been tagged to be name-matched are actually name-matched. This tagging is dependent on the mapping of elements of the particular data model to the internal object model. For SQL schemas, the schema, tables and columns are tagged to be matched. For XML, the ElementTypes and AttributeTypes are tagged to be matched.
  • in schema matching component 665 , in addition to linguistic matching, the hierarchical relationships in the schema are leveraged to infer mappings. This is achieved using the above-described tree matching algorithm that matches tree representations of the different data paths in the two schemas. Thus, at some point in the process, a transformation is applied to the schemas to represent them as trees of data paths for structural analysis.
  • the tree-matching algorithm 665 operates on a pair of data path trees to produce structural similarity coefficients. Each pair of nodes of the two trees being compared then have an associated pair of similarity coefficients, namely the inherent similarity of the two model elements to which they correspond, and the structural similarity of the two nodes computed by the schema matching algorithm 665 . The effective similarity is then calculated to be a weighted function of these two coefficients.
  • the tokenization of the invention parses names into tokens by a customizable tokenizer using punctuation, upper case, special symbols, digits, etc. For example, POLines → {PO, Lines}.
  • Abbreviations and acronyms may also be expanded, e.g., {PO, Lines} → {Purchase, Order, Lines}. Elimination is also performed, when appropriate, wherein tokens that are articles, prepositions, expletives or conjunctions are marked to be ignored during comparison.
  • Tagging may also be performed whereby a schema element that has a token related to a known concept is tagged with the concept name, e.g., elements with tokens price, cost and value are all associated with the concept money.
  • each name token is marked as being one of five token types: a number; a special symbol (e.g., #); a common word, a type that includes prepositions and conjunctions; a concept, as explained above; or content (all the rest).
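  • The normalization steps described above (splitting, expansion, concept tagging and token typing) can be sketched in Python. The regular expression, the small ABBREVIATIONS, COMMON_WORDS and CONCEPTS tables, and the function names are hypothetical stand-ins for the data stores and customizable tokenizer; they are assumptions made for illustration only.

```python
import re
from typing import List, Tuple

# Split on case changes, digits and a few special symbols (assumed rule set).
_TOKEN_RE = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+|[#$%&@]")

# Hypothetical stand-ins for the abbreviation, stop-word and concept stores.
ABBREVIATIONS = {"po": ["purchase", "order"], "qty": ["quantity"]}
COMMON_WORDS = {"a", "an", "the", "of", "to", "for", "in", "and", "or"}
CONCEPTS = {"price": "money", "cost": "money", "value": "money"}


def split(name: str) -> List[str]:
    """Tokenize a name, e.g. 'POLines' -> ['PO', 'Lines']."""
    return _TOKEN_RE.findall(name)


def expand(tokens: List[str]) -> List[str]:
    """Lower-case tokens and expand known abbreviations and acronyms."""
    out: List[str] = []
    for tok in tokens:
        out.extend(ABBREVIATIONS.get(tok.lower(), [tok.lower()]))
    return out


def token_type(tok: str) -> str:
    """Classify a token as number, special, common or content."""
    if tok.isdigit():
        return "number"
    if not tok.isalnum():
        return "special"
    if tok in COMMON_WORDS:
        return "common"  # marked so it can be ignored during comparison
    return "content"


def normalize(name: str) -> List[Tuple[str, str]]:
    """Return (token, type) pairs, with concept tags appended at the end."""
    tokens = expand(split(name))
    typed = [(t, token_type(t)) for t in tokens]
    concepts: List[str] = []
    for t in tokens:
        c = CONCEPTS.get(t)
        if c is not None and c not in concepts:
            concepts.append(c)
    return typed + [(c, "concept") for c in concepts]
```

  • In this sketch common words are kept but typed as "common", which mirrors marking them to be ignored rather than deleting them outright.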
  • Thesauri can thus play a role in linguistic matching.
  • the effect of dropping the thesaurus varies.
  • the tokenization performed by the invention, followed by stemming, can aid in the automatic selection of possible word meanings during name matching and make it easier to use off-the-shelf thesauri.
  • One implementation includes using a module to incrementally learn synonyms and abbreviations from mappings that are performed over time. The use of linguistic similarity and structural similarity over time can provide a synergy of benefit to these results.
  • the invention clusters schema elements belonging to the two schemas into categories.
  • a category is a group of elements that can be identified by a set of keywords, which are derived from concepts, data types, and element names.
  • the category money includes each schema element that is associated with money, i.e., “money” appears in its name or it is tagged with the concept of Money.
  • the purpose of categorization is to reduce the number of element-to-element comparisons.
  • the invention may compare those elements that belong to compatible categories. Two categories are compatible if their respective sets of keywords are “name similar,” a phrase defined below.
  • Categories and keywords are determined with the following: concept tagging, data types and containers.
  • Concept tagging refers to assigning a category per unique concept tag in the schema.
  • Data types refer to assigning a category for each broad data type, e.g., all elements with a numeric data type are grouped together in a category with the keyword Number. Like all categorization criteria, data types are used primarily to prune the matching and do not contribute significantly to the linguistic similarity result.
  • For containers, a schema element that “contains” other elements defines a category. For example, Street and City are contained by Address and hence can be grouped into a category with keyword Address. Containment is described in more detail below. The invention constructs separate categories for each schema.
  • For each element, the invention inserts the element into an existing category (same data type, same concept, or same container) if possible, or otherwise creates new categories.
  • each schema element may belong to multiple categories.
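  • The three categorization criteria above (concept tags, broad data types and containers) can be sketched as a single pass over the elements of one schema. The SchemaElement class, the BROAD_TYPES table and the categorize function are hypothetical and serve only to illustrate how one element can land in several categories.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Optional, Set


@dataclass(frozen=True)
class SchemaElement:
    name: str
    data_type: Optional[str] = None
    container: Optional[str] = None
    concepts: FrozenSet[str] = frozenset()


# Hypothetical mapping from concrete data types to broad type keywords.
BROAD_TYPES = {"int": "Number", "decimal": "Number", "float": "Number",
               "varchar": "String", "char": "String"}


def categorize(elements: Iterable[SchemaElement]) -> Dict[str, Set[str]]:
    """Group element names into categories keyed by a keyword."""
    categories: Dict[str, Set[str]] = defaultdict(set)
    for e in elements:
        for concept in e.concepts:           # one category per concept tag
            categories[concept].add(e.name)
        broad = BROAD_TYPES.get(e.data_type) if e.data_type else None
        if broad is not None:                # one category per broad data type
            categories[broad].add(e.name)
        if e.container is not None:          # one category per container
            categories[e.container].add(e.name)
    return dict(categories)
```

  • Because categories are built per schema, this function would be called once for the source schema and once for the target schema.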
  • the similarity of two name tokens t 1 and t 2 is looked up in one or more synonym and/or hypernym thesauri.
  • Each thesaurus entry is annotated with a coefficient in the range [0,1] that indicates the strength of the relationship.
  • the invention matches substrings of the words t 1 and t 2 to identify common prefixes or suffixes.
  • the name similarity (ns) of two sets of name tokens T 1 and T 2 is the average of the best similarity of each token with a token in the other set.
  • $$ns(T_1, T_2) = \frac{\sum_{t_1 \in T_1} \left[ \max_{t_2 \in T_2} sim(t_1, t_2) \right] + \sum_{t_2 \in T_2} \left[ \max_{t_1 \in T_1} sim(t_1, t_2) \right]}{|T_1| + |T_2|}$$
  • Two categories are compatible if the name similarity of their token sets exceeds a given threshold, th ns .
  • the parameter th ns is the name similarity threshold for determining compatible categories. This value is used for pruning the number of element-to-element linguistic comparisons, and thus a variety of choices for assigning the actual value are available. For example, 0.5 may be chosen for th ns , although other values may be suitable depending upon a desired amount of pruning.
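  • The name similarity computation and the compatibility test can be sketched in Python as follows. The tiny THESAURUS, the prefix and suffix scoring rule in sim, and the function names are assumptions made for illustration; only the averaging of best matches and the threshold test follow the description above.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical thesaurus: unordered word pair -> relationship strength in [0, 1].
THESAURUS: Dict[FrozenSet[str], float] = {
    frozenset({"invoice", "bill"}): 0.9,
    frozenset({"customer", "client"}): 0.8,
}


def _common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def sim(t1: str, t2: str) -> float:
    """Token similarity: thesaurus lookup, then common prefix or suffix."""
    if t1 == t2:
        return 1.0
    entry = THESAURUS.get(frozenset({t1, t2}))
    if entry is not None:
        return entry
    prefix = _common_prefix_len(t1, t2)
    suffix = _common_prefix_len(t1[::-1], t2[::-1])
    # ASSUMED scoring: shared affix length relative to the longer token.
    return max(prefix, suffix) / max(len(t1), len(t2))


def ns(T1: Set[str], T2: Set[str]) -> float:
    """Average of each token's best similarity with a token in the other set."""
    if not T1 or not T2:
        return 0.0
    total = sum(max(sim(a, b) for b in T2) for a in T1)
    total += sum(max(sim(a, b) for a in T1) for b in T2)
    return total / (len(T1) + len(T2))


def compatible(keywords1: Set[str], keywords2: Set[str],
               th_ns: float = 0.5) -> bool:
    """Two categories are compatible if ns of their keywords exceeds th_ns."""
    return ns(keywords1, keywords2) > th_ns
```

  • For instance, ns({"purchase", "order"}, {"order"}) evaluates to 2/3 here, because "purchase" finds no match while both occurrences of "order" score 1.0.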
  • the invention calculates the linguistic similarity of each pair of elements from compatible categories.
  • Linguistic similarity is based on the name similarity of elements, which is computed as a weighted mean of the per token type name similarity, wherein each token is one of the exemplary five types listed above.
  • the resulting coefficient is the inherent similarity, or linguistic similarity (lsim), of the pair of elements.
  • the result of this phase is a table of linguistic similarity coefficients between elements in the two schemas.
  • the similarity is assumed to be zero for schema elements that do not belong to any compatible categories.
  • the invention thus matches one data model with another data model, calculating inherent similarity coefficients and structural coefficients, with an emphasis upon similarity of subtree structure.
  • the following description of data models is presented.
  • One of ordinary skill in the art will be able to appreciate that a wide variety of models are contemplated and that any hierarchically organized data that may form a tree structure is suited to the invention's application. How to model particular features common to a variety of particular data models in a generic sense is also described. For instance, the modeling of referential integrity constraints is described in detail to show how some particular data models operate, and how they may be generalized for purposes of applying the matching operations of the invention.
  • a model is a complex structure that describes a design artifact.
  • a relational schema is a model that describes a relational database, i.e., tables, columns, constraints, etc.
  • An XML DTD or an XML schema expressed in XML Schema Definition Language (XSD) or an XDR Schema is a model that describes the structure of an XML document.
  • An object hierarchy is a model that describes the classes, relationships and inheritances of the C++ interfaces in an application or in an object store. Further examples of models are UML descriptions, workflow definitions, Web-site maps, and other models mentioned herein.
  • an object-oriented data model is used to describe models and mappings.
  • Graph-oriented terminology is sometimes used to describe models, such as when referring to objects as nodes and relationships as edges.
  • Each relationship of a model is either a containment or non-containment relationship, and is directed from its origin object to its destination object.
  • a model is identified by a root object and includes all objects that are reachable from the root by following containment relationships in the origin-to-destination, i.e., the container-to-containee direction.
  • a mapping is a model that relates a domain model to a range model, or a source model to a target model.
  • the root of the mapping connects the root of the domain model to the root of the range model. Every other mapping object in the mapping has relationships to zero or more domain objects and relationships to zero or more range objects.
  • a mapping may also contain an expression that explains the relationship between the domain and range objects to which it connects.
  • The Match operation takes models M 1 and M 2 and a similarity relation as its three arguments, and returns a mapping from M 1 to M 2 that is consistent with the similarity relation, which is a binary relation defined over individual objects.
  • Although the similarity relation is shown here as a parameter, it is currently implemented as a combination of context, e.g., a shared thesaurus, and algorithms that may optionally be plugged into a match implementation, as described by the foregoing implementations.
  • a generic object model is defined, which standardizes the comparison of disparately formatted models.
  • any format may be represented with the generic object model, and thus the input format of a hierarchically represented data structure becomes irrelevant to the extent it may be represented with the generic object model.
  • the smallest unit of metadata is termed a model element. Distinguishing the different types of relationships between model elements is a key aspect of designing the generic object model. At least three relationships are common to a wide variety of data models, and these relationships are depicted in FIG. 7 as between model elements 700 , containers 700 a and aggregates 700 b.
  • the Strongly Contains relationship relates a model element, called a container 700 a to another model element 700 .
  • Each model element is strongly contained by at most one container 700 a .
  • the concept of container 700 a is sufficiently useful that in one embodiment, a container is defined as a class, which is a specialization of model element 700 .
  • a Strong Containment relationship captures the following two kinds of semantics: delete propagation and naming. With delete propagation, if a container 700 a is deleted, then all of the model elements 700 it contains are deleted. With naming, a model element 700 can be named by concatenating the name of its container 700 a , a delimiter, e.g., “.” or “/”, and the name of the model element 700 .
  • the Aggregates relationship connects a model element, called an aggregate 700 b , to other model elements 700 . Like Strong Containment, this relationship groups together a set of related model elements 700 . However, the relationship is weaker than Strong Containment, in that it does not propagate delete or affect naming. Rather, the aggregates relationship captures the semantics of prevent delete, i.e., the target of an aggregation relationship cannot be deleted. In other words, the aggregation relationship must be deleted before the target can be deleted. For example, a typical aggregation relationship is the relationship from a compound key to each of the columns that comprise the key.
  • the IsDerivedFrom relationship connects two model elements 700 .
  • the IsDerivedFrom relationship is a generalization of IsA and IsTypeOf relationships, which are used in all object-modeling methodologies.
  • the IsDerivedFrom relationship captures two kinds of semantics: delete prevention and shortcutting. With shortcutting, the target can be replaced by the source. For example, a specialization can be replaced by its generalization, or an object can be replaced by its type definition. These shortcutting semantics of IsDerivedFrom are not commonly used in object modeling; however, shortcutting semantics can be important for model match. Examples of IsDerivedFrom relationships are ones between an element and its ElementType or an attribute and its AttributeType in XDR schemas.
  • model elements can be related by other types of relationships. StronglyContains and IsDerivedFrom relationships are both containment relationships. Thus, a model is defined by a root and contains objects reachable by following Strong Containment and IsDerivedFrom relationships.
  • the present invention distinguishes between model elements that are instantiated as data instances, such as elements and attributes in XDR and tables and columns in SQL, from those that are constraints on instances of other model elements, such as attribute type definitions in XDR and key definitions in SQL.
  • the model element property IsInstantiated is true for the former, false for the latter. This distinction can be useful when performing structural matching of models.
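  • A minimal Python sketch of the generic object model may help make the relationship types concrete. The class and function names (Rel, ModelElement, model_members) and the sample schema are hypothetical; the sketch only encodes that a model consists of the objects reachable from its root through Strong Containment and IsDerivedFrom relationships, and that each element carries an IsInstantiated flag.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Set, Tuple


class Rel(Enum):
    STRONGLY_CONTAINS = "StronglyContains"
    AGGREGATES = "Aggregates"
    IS_DERIVED_FROM = "IsDerivedFrom"


# Only these two relationship types count as containment relationships.
CONTAINMENT = {Rel.STRONGLY_CONTAINS, Rel.IS_DERIVED_FROM}


@dataclass
class ModelElement:
    name: str
    is_instantiated: bool = True
    edges: List[Tuple[Rel, "ModelElement"]] = field(default_factory=list)

    def relate(self, rel: Rel, dest: "ModelElement") -> None:
        """Add a directed relationship from this element to dest."""
        self.edges.append((rel, dest))


def model_members(root: ModelElement) -> Set[str]:
    """Names of all elements reachable from root via containment."""
    visited: Set[int] = set()
    names: Set[str] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if id(node) in visited:
            continue
        visited.add(id(node))
        names.add(node.name)
        for rel, dest in node.edges:
            if rel in CONTAINMENT:
                stack.append(dest)
    return names


# Example: a SQL table with a key that aggregates one of its columns.
schema = ModelElement("Schema")
order = ModelElement("Order")
order_id = ModelElement("OrderID")
pk = ModelElement("PK", is_instantiated=False)  # a constraint, not data
schema.relate(Rel.STRONGLY_CONTAINS, order)
order.relate(Rel.STRONGLY_CONTAINS, order_id)
order.relate(Rel.STRONGLY_CONTAINS, pk)
pk.relate(Rel.AGGREGATES, order_id)
```

  • The Aggregates edge from PK to OrderID is stored but not followed when collecting model members, so OrderID belongs to the model only because the table Order strongly contains it.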
  • model D is the domain of a mapping.
  • D contains a model element d, which has two parents via Strong Containment and/or IsDerivedFrom relationships. Since d may have two different meanings, one for each of its parents, it could be mapped to two different elements of a range model, one for each parent. This implies that a model match algorithm needs to perform context-dependent bookkeeping for each model element.
  • a model that represents an XSD complexType Order has elements Customer and Supplier, as shown in FIG. 8 . Customer and Supplier each have an Addr (i.e., address) element, and these are represented as two separate Addr elements as shown.
  • both Addr elements are of the same complexType, e.g., Address.
  • Address is represented only once and is referenced by the type attribute (shown with a double box) of the two Addr elements.
  • complexType Address has two parents, namely, the two Addr elements via IsDerivedFrom relationships.
  • complexType Address has some XSD attributes, such as Street, City, and State. These attribute definitions explain two different parts of Order, namely the sub-structure of Addr in Customer and of Addr in Supplier. Therefore, when creating a mapping from Order to another model, e.g., Purchase-Order, the attributes of Addr in Customer might map to different model elements in Purchase-Order than the attributes of Addr in Supplier. For example, Order.Customer.Addr.Street might map to Purchase-Order.Customer-Street and Order.Supplier.Addr.Street might map to Purchase-Order.SupplierStreet.
  • the generic object model of the invention considers each path to a node with multiple parents independently. Each such path is a data-path. During the execution of a match operation, all data-paths are expanded, thereby effectively transforming the DAG into a tree. As a side note, while the use of the word data-path comes from the intuition that it is a sequence of “data” containment relationships, a better term might be “name-path” or “ID-path.”
  • a schema is a rooted graph whose nodes are elements.
  • the invention uses the terms nodes and elements interchangeably.
  • in a SQL schema, the elements are tables, columns, user defined types, keys, etc.
  • in an XML schema, the elements are XML elements and attributes (and simpleTypes, complexTypes, and keys/keyrefs in XML Schema (XSD)).
  • Elements are interconnected by three types of relationships, which together lead to nontree schema graphs. The first is containment, which models physical containment in the sense that each element (except the root) is contained by exactly one other element.
  • a table contains its columns, and is contained by its relational schema.
  • An XML attribute is contained by an XML element.
  • the schema trees presented in examples so far are essentially containment hierarchies.
  • a second type of relationship is aggregation. Like containment, aggregation groups elements, but is weaker (allows multiple parents and has no delete propagation). For instance, a compound key aggregates columns of a table. Thus, a schema graph need not be a tree, i.e., a column can have two parents: a table and a compound key.
  • the third type of relationship is IsDerivedFrom, which abstracts IsA and IsTypeOf relationships to model shared type information. Schemas that use them can be arbitrary graphs (e.g., cycles due to recursive types). In XSD, an IsDerivedFrom relationship connects an XML element to its complex type.
  • IsDerivedFrom connects a subtype to its supertype.
  • IsDerivedFrom shortcuts containment: if an element e IsDerivedFrom a type t, then t's members are implicitly members of e. For example, if USAddress specializes Address, then an element Street contained by Address is implicitly contained by USAddress too.
  • Each path defines a context, and thus is a candidate for a different mapping for e.
  • the invention materializes all such paths. To accomplish this, the algorithm performs a preorder traversal of the schema, creating a private copy of the subschema rooted at the target t of each IsDerivedFrom for each of t's parents, which is essentially type substitution.
  • the invention adds a schema tree node whose successors are the nodes corresponding to elements reachable via any number of IsDerivedFrom relationships followed by a single containment. Some elements are tagged not-instantiated (e.g., keys) during the schema tree construction and are ignored during this process.
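  • The expansion of a schema graph into a tree of data paths can be sketched in Python using the Order example of FIG. 8. The dictionaries CONTAINS, DERIVED_FROM and NAMES, the node identifiers CustAddr and SuppAddr, and the function names are hypothetical. The sketch assumes an acyclic schema; recursive types would need a depth limit that is not shown.

```python
from typing import Dict, Iterator, List, Tuple

# Display names for nodes whose identifiers differ from their element names.
NAMES: Dict[str, str] = {"CustAddr": "Addr", "SuppAddr": "Addr"}

# Containment edges (container -> containees).
CONTAINS: Dict[str, List[str]] = {
    "Order": ["Customer", "Supplier"],
    "Customer": ["CustAddr"],
    "Supplier": ["SuppAddr"],
    "Address": ["Street", "City", "State"],
}

# IsDerivedFrom edges (element -> shared type definition).
DERIVED_FROM: Dict[str, List[str]] = {
    "CustAddr": ["Address"],
    "SuppAddr": ["Address"],
}


def successors(node: str) -> List[str]:
    """Nodes reachable via any number of IsDerivedFrom hops, then one containment."""
    result = list(CONTAINS.get(node, []))
    for type_node in DERIVED_FROM.get(node, []):
        result.extend(successors(type_node))
    return result


def data_paths(node: str, prefix: Tuple[str, ...] = ()) -> Iterator[str]:
    """Yield every data path in preorder; shared types are copied per parent."""
    path = prefix + (NAMES.get(node, node),)
    yield ".".join(path)
    for child in successors(node):
        yield from data_paths(child, path)
```

  • Calling list(data_paths("Order")) yields separate paths such as Order.Customer.Addr.Street and Order.Supplier.Addr.Street, so each context of the shared Address type can receive its own mapping.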
  • Strong Containment, IsDerivedFrom, and aggregate relationships can be used to model hierarchical schemas, such as XML schemas without any IDs and IDREFs, or a SQL schema without any foreign keys.
  • This alone places a restriction on the expressive power of a model.
  • a fourth relationship may be introduced, termed a referential integrity relationship or referential integrity constraint in the database literature.
  • a referential integrity relationship models an existential dependency between model elements in different parts of a schema.
  • a model element that represents a referential integrity constraint is called a RefInt.
  • Referential integrity constraints are supported in most data models.
  • referential integrity relationships include the relationship between a foreign key column in a table and the primary key in another table, the relationship between an ID and an IDREF in a DTD and the relationship between a keyref and a key in XSD.
  • Referential constraints are directed from a source, e.g., foreign key column, to a target, e.g., primary key to which the foreign key refers.
  • Such RefInt elements aggregate the source, and reference the target of such a relationship, whereby “reference” is a new relationship type.
  • a RefInt can model compound keys and multi-attribute keyrefs.
  • the modeling of a foreign key 910 with respect to two SQL Tables 900 a and 900 b , foreign key column 920 and primary key column 930 is shown in FIG. 9A .
  • Referential integrity relationships are directed: the foreign key column 920 is the source, and the primary key column 930 by which it is constrained is the target.
  • the source and the target can in general be sets of model elements, e.g., a compound key.
  • the foreign key references the single compound primary key elements 930 of the target table, which aggregates the key columns 920 of that table.
  • FIG. 9B illustrates the relationship between a model element 940 , a model aggregate 950 and a RefInt 960 .
  • a RefInt 960 is a specialization of a model aggregate 950 , which is a specialization of a model element 940 .
  • a RefInt 960 also has a reference relationship.
  • a RefInt 960 aggregates the model elements that are the source and references the target of the referential constraint.
  • a RefInt model element can either be instantiated (e.g., IDREFs) or not instantiated (e.g., foreign keys), as indicated by an isInstantiated flag.
  • The model representation of RefInts in relational and XDR schemas is shown in FIGS. 10A and 10B , respectively.
  • a data path tree may be augmented with additional nodes, where useful. More particularly, a data path tree that is built by exploiting Strong Containment and IsDerivedFrom relationships may be augmented with additional nodes to take advantage of RefInts in the similarity computation.
  • foreign keys are taken advantage of by interpreting them as join views.
  • the foreign key node in the schema is replaced by a single data path node representing the join of the two tables.
  • in addition, since the match algorithm operates by matching data tree elements, representing a referential constraint as such a node makes it the subject of a match.
  • the interpretation of a RefInt as a join view is illustrated in the example below.
  • FIG. 11 illustrates encoding a RefInt in a data tree for SQL schemas.
  • a similar procedure is applicable in XSD and XDR schemas.
  • An additional node is added that has as its children the columns of the two tables, with one exception: the foreign key columns are not duplicated, since they are the same in both tables (the choice of primary or foreign key columns is arbitrary).
  • a data path DAG is formed instead of a tree, because the referenced model elements have two parents, e.g., OrderID, CustomerID, SSN, and Address in FIG. 11 .
  • the augmented node is a child of the schema node of the data path DAG (e.g., OrderFK in FIG.
  • This encoding of a RefInt not only causes foreign keys to be matched between two models, but also disambiguates matchings between elements that are referenced by the RefInt. For example, suppose the model of FIG. 11 is being matched with the model of FIG. 12 . In FIG. 12 , only recent customers (RecentCust) have orders. Old customers (OldCust) are customers who have not placed orders in a long time. Therefore, the foreign key is from CustomerID in Order to RecentCust and not to OldCust. When matching this model against FIG. 11 , the nodes named OrderFK in the two models will be compared for similarity and will be found to match, based both on their linguistic similarity and structural similarity.
  • The approach to XSD includes the following considerations. First, keys and keyrefs in XSD are typed and context-sensitive, qualified by XPath expressions, and not context-free like ID/IDREFs. Only those nodes that match the XPath expressions need to be considered during augmentation. Second, keys and keyrefs can have multiple attributes, but unlike compound foreign keys, these attributes need not be contained by a single parent. This leads to a need for careful consideration of nodes to be assigned as children of the augmented node.
  • the present invention interprets referential constraints as potential join views.
  • the present invention introduces a node that represents the join of the participating tables, illustrated in more detail in FIG. 13 .
  • This technique reifies the referential constraint as a node that can be matched.
  • the technique works since the referential constraint implies that the join is meaningful. It is of note that the join view node has as its children the columns from both the tables. The common ancestor of the two tables is thus made the parent of the new join view node.
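  • The join view encoding described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the Node class, the add_join_view helper and the Customer table name are assumptions made for the example, and only the column names and the OrderFK node name come from the text. The join view shares the existing column nodes rather than copying them, which is what turns the tree into a DAG.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical schema node: a name plus an ordered list of children."""
    name: str
    children: list[Node] = field(default_factory=list)


def add_join_view(schema, source, target, fk_columns, name):
    """Attach a join view node for a foreign key from source to target.

    The new node references (does not copy) the column nodes of both
    tables. Foreign key columns of source are skipped because they hold
    the same values as the referenced key columns of target.
    """
    shared = [c for c in source.children if c.name not in fk_columns]
    shared += target.children
    join = Node(name, shared)
    schema.children.append(join)  # the common ancestor becomes the parent
    return join


# Illustrative tables; the column names follow the FIG. 11 discussion.
order_id, customer_fk = Node("OrderID"), Node("CustomerID")
customer_id, ssn, address = Node("CustomerID"), Node("SSN"), Node("Address")
order = Node("Order", [order_id, customer_fk])
customer = Node("Customer", [customer_id, ssn, address])
schema = Node("Schema", [order, customer])

join = add_join_view(schema, order, customer, {"CustomerID"}, "OrderFK")
print([c.name for c in join.children])
# ['OrderID', 'CustomerID', 'SSN', 'Address']
```

  • Each of the four column nodes now has two parents (its table and OrderFK), so a matcher that walks the structure must visit shared nodes only once.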
  • Augmented nodes have two benefits. First, if two pairs of tables in the two schemas are related by similar referential constraints, then when the join views for the constraints are matched, the structural similarities of those tables' columns are increased. This improves the structural match. Second, this enables the discovery of mappings between a join view in one schema and a single table or other join views in the second schema.
  • the additional join view nodes create a directed acyclic graph (DAG) of schema paths. Since the inverse topological ordering of a DAG, equivalent to post-order for a tree, is not unique, the algorithm is not Church-Rosser, i.e., the final similarities depend on the order in which nodes are compared. To make it Church-Rosser, additional ordering constraints may be added.
  • the RefInt nodes may be compared after the table nodes; however, determining which ordering would be best is still an open problem. If a table has multiple foreign keys, one node may be added for each of them. There is also the option of adding a node for each combination of these foreign keys (valid join views). In the interest of maintaining tractability, however, this step may be skipped. Similarly, the join view node that is added may also have a foreign key column of the target table. The invention may also expand these further, thus escalating expansion of referential constraints, but both for computation reasons and due to the lower relevance of tables at further distances, such a technique may be foregone.
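  • One way to make the enumeration order deterministic is sketched below. It is only an illustration of the kind of extra ordering constraint discussed above, under the assumption that RefInt nodes are visited after their table siblings; the function name, the dictionary representation and the sample schema are hypothetical.

```python
def inverse_topological_order(children, root, is_refint):
    """Return node names so that every child precedes all of its parents.

    children maps a node name to the names of the nodes it contains.
    Among siblings, non-RefInt nodes are visited first (a stable sort
    keeps the declared order otherwise), which fixes one of the many
    valid inverse topological orders of the DAG.
    """
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for kid in sorted(children.get(node, []), key=is_refint):
            visit(kid)
        order.append(node)  # post-order: emitted after all descendants

    visit(root)
    return order


# Illustrative DAG: OrderFK shares the columns of Order and Customer.
dag = {
    "Schema": ["OrderFK", "Order", "Customer"],
    "Order": ["OrderID", "CustomerID"],
    "Customer": ["CustomerID", "SSN", "Address"],
    "OrderFK": ["OrderID", "CustomerID", "SSN", "Address"],
}
order = inverse_topological_order(dag, "Schema", lambda n: n == "OrderFK")
print(order)
# ['OrderID', 'CustomerID', 'Order', 'SSN', 'Address', 'Customer', 'OrderFK', 'Schema']
```

  • For brevity the sketch identifies nodes by name, so the two CustomerID columns collapse into one entry; a fuller version would use distinct node objects.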
  • a feature of optionality is provided.
  • Elements of schemas may be marked as optional, i.e., as nonrequired attributes of XML elements.
  • the leaves reachable from a schema tree node n are divided into two classes: optional and required.
  • a leaf is optional if it has at least one optional node on each path from n to the leaf.
  • The structural similarity coefficient expression is changed to reduce the weight of optional leaves that have no strong links, i.e., such leaves are counted in neither the numerator nor the denominator of the ssim calculation. Therefore, nodes are penalized less for unmappable optional leaves than for unmappable required leaves, so the matching is more tolerant of the former.
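  • The optionality rule can be sketched as follows. Two assumptions are made for the example and are not taken from this passage: the path from n to a leaf is taken to exclude n itself, and ssim is taken to be the fraction of leaves in the two subtrees that have at least one strong link. All function and variable names are hypothetical.

```python
def classify_leaves(children, optional, n):
    """Map each leaf reachable from n to True (optional) or False (required).

    A leaf is optional only if every path from n to it passes through at
    least one node marked optional (the leaf itself counts).
    """
    result = {}

    def walk(node, seen_optional):
        seen_optional = seen_optional or node in optional
        kids = children.get(node, [])
        if not kids:
            # AND across paths: one path without an optional node makes it required.
            result[node] = result.get(node, True) and seen_optional
            return
        for kid in kids:
            walk(kid, seen_optional)

    for kid in children.get(n, []):
        walk(kid, False)
    return result


def ssim(leaves_s, leaves_t, strong_links, optional_s=(), optional_t=()):
    """Assumed ssim: linked leaves / counted leaves, over both subtrees.

    Optional leaves without a strong link are dropped from both the
    numerator and the denominator.
    """
    linked_s = {a for a, _ in strong_links}
    linked_t = {b for _, b in strong_links}
    kept_s = {x for x in leaves_s if x in linked_s or x not in optional_s}
    kept_t = {x for x in leaves_t if x in linked_t or x not in optional_t}
    total = len(kept_s) + len(kept_t)
    if total == 0:
        return 0.0
    return (len(kept_s & linked_s) + len(kept_t & linked_t)) / total


tree = {"n": ["a", "b"], "a": ["x"], "b": ["x", "y"]}
print(classify_leaves(tree, {"a", "y"}, "n"))  # {'x': False, 'y': True}

links = {("Street", "Street"), ("City", "City")}
s_leaves, t_leaves = {"Street", "City", "Fax"}, {"Street", "City"}
print(ssim(s_leaves, t_leaves, links))                      # 0.8
print(ssim(s_leaves, t_leaves, links, optional_s={"Fax"}))  # 1.0
```

  • In the second call the unmatched Fax leaf is optional, so it no longer lowers the coefficient.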
  • In another embodiment of the invention, different views are accommodated. View definitions are treated like referential constraints.
  • a schema tree node is added whose children are the elements specified in the view. Such a schema tree node represents a common context for these elements and can be matched with views or tables of the other schema.
  • a lazy expansion process is provided.
  • a schema tree construction expands elements into each possible context, much like type substitution. This expansion duplicates elements, leading to repeated comparisons of identical subtrees. For example, in the example provided in FIG. 3 , the Address element is duplicated in multiple contexts within the PurchaseOrder schema and each of these duplicates is compared separately to elements of PO. These duplicate comparisons may be avoided by a lazy schema tree expansion, which compares elements of the schema graph before converting it to a tree. The elements are enumerated in inverse topological order of containment and IsDerivedFrom relationships.
  • the invention is able to match POBillTo, POShipTo and POLines to InvoiceTo, DeliverTo and Items respectively.
  • context-dependent mappings generated by constructing schema trees are useful when inferring different mappings for the same element in different contexts.
  • mapping results for a certain tool or application might not be the best achievable by the algorithm since improvements may be possible by adjusting a few of the parameters.
  • Tuning performance parameters in some cases requires expert knowledge of these tools.
  • a module for autotuning parameters is provided. Based upon the analysis of volumes of data, taking the complexity of the structure and linguistics of the schemas into account, a mechanism can be provided for automatically setting the parameters of the invention prior to matching. Alternatively, a “sliding bar” of results may be presented to the user, giving the user an opportunity at a glance to choose results from a variety of parameter sets.
  • the techniques for calculating similarity coefficients and a mapping between models in accordance with the present invention may be applied to a variety of applications and devices.
  • the model matching techniques of the invention may be applied to the operating system of a computing device, provided as a separate object on the device, as part of the object itself, as a downloadable object from a server, as a “middle man” between a device or object and the network, etc.
  • the similarity coefficients and mapping data generated may be stored for later use, or output to another independent, dependent or related process or service.
  • the various techniques described herein may be implemented in connection with hardware or software or, where appropriate, with a combination of both.
  • the methods and apparatus of the present invention may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.
  • the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
  • One or more programs that may utilize the model matching of the present invention are preferably implemented in a high level procedural or object oriented programming language to communicate with a computer system.
  • the program(s) can be implemented in assembly or machine language, if desired.
  • the language may be a compiled or interpreted language, and combined with hardware implementations.
  • The methods and apparatus of the present invention may also be practiced via communications embodied in the form of program code that is transmitted over some transmission medium, such as over electrical wiring or cabling, through fiber optics, or via any other form of transmission, wherein, when the program code is received and loaded into and executed by a machine, such as an EPROM, a gate array, a programmable logic device (PLD), a client computer, a video recorder or the like, or a receiving machine having the matching capabilities as described in exemplary embodiments above, the machine becomes an apparatus for practicing the invention.
  • the program code When implemented
  • model matching algorithm(s) of the various embodiments described herein are generically applicable, independent of any particular data model. Accordingly, it is to be understood that while various examples herein are described in the context of a particular format, such as SQL, XML, UML, DTD, XSD, XDR and the like, this is for illustrative purposes only, and the techniques of the invention can be applied not only to any schema format now known, but also to any hereafter-developed data format.
  • the present invention may be implemented in or across a plurality of processing chips or devices, and storage may similarly be effected across a plurality of devices. Therefore, the present invention should not be limited to any single embodiment, but rather should be construed in breadth and scope in accordance with the appended claims.

Abstract

Systems and methods for automatically and generically matching models are provided, such as may be provided in a matching application or matching component, or provided in a general purpose system for managing models. The methods are generic since the methods apply to hierarchical data sets outside of any particular data model or application. Similarity coefficients are calculated for, and mappings are discovered between, schema elements based on their names, data types, constraints, and schema structure, using a broad set of techniques. Some of these techniques include the integrated use of linguistic and structural matching, context dependent matching of shared types, and a bias toward subtree, or leaf, structure where much of the schema content resides.

Description

    CROSS REFERENCE OF RELATED APPLICATIONS
  • This application is a continuation of co-pending U.S. application Ser. No. 10/028,912, filed on Dec. 12, 2001, entitled “Methods and Systems for Model Matching,” and identified by Attorney Docket No. MSFT-0591/164153.01.
  • COPYRIGHT NOTICE AND PERMISSION
  • A portion of the disclosure of this patent document may contain material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever. The following notice shall apply to this document: Copyright © 2001, Microsoft Corporation.
  • FIELD OF THE INVENTION
  • The present invention relates to model or schema matching, or more generally to the matching of separate hierarchical data sets. More particularly, the present invention relates to methods and systems for matching models, or schemas, that discover similarity coefficients between schema elements, including analyses based on one or more of schema names, schema data types, schema constraints and schema structure.
  • BACKGROUND OF THE INVENTION
  • Match is a schema manipulation operation that takes two schemas, models or otherwise hierarchically represented data as input and returns a mapping that identifies corresponding elements in the two schemas. Schema matching is a critical step in many applications. For example, in Ebusiness, match helps to map messages between different extensible markup language (XML) formats. In data warehousing, match helps to map data sources into warehouse schemas. In mediators, match helps to identify points of integration between heterogeneous databases. Schema matching thus far has primarily been studied as a piece of other applications. For example, schema integration uses matching to find similar structures in heterogeneous schemas, which are then used as integration points. Data translation uses matching to find simple data transformations. Given the continued evolution and importance of XML and other message mapping, match solutions are similarly likely to become increasingly important in the future.
  • Schema matching is challenging for many reasons. First and foremost, schemas for identical concepts may have structural and naming differences. In addition, schemas may model similar, but yet slightly different, content. Schemas may be expressed in different data models. Schemas may use similar words that may nonetheless have different meanings, etc.
  • Given these problems, schema matching today is done manually by domain experts, sometimes using a graphical tool that depicts a first schema according to its hierarchical structure on one side and a second schema according to its hierarchical structure on the other. The graphical tool enables a user to select and visually represent a chosen mapping. At best, such tools detect exact matches automatically, although even minor name and structure variations may lead them astray. Despite match being such a pervasive, important and difficult problem, model matching has not yet been studied independently except as it may apply to other, narrower problems, such as those named above, and thus a generic solution for schema matching that can apply to many different data models and application domains remains to be provided. Moreover, so wide a variety of tools would benefit from a matching solution that an independent match component or module that can be incorporated into or downloaded for such tools would be of great utility.
  • For a more detailed definition, a schema consists of a set of related elements, such as tables, columns, classes, XML elements or attributes, etc. The result of the match operation is a mapping between elements of two schemas. Thus, a mapping consists of a set of mapping elements, each of which indicates that certain elements of schema S1 are related to certain elements of schema S2. For example, as illustrated in FIG. 1, a mapping between purchase order schemas PO and POrder may include a mapping element that relates element Lines.Item.Line of S1 to element Items.Item.ItemNumber of S2, as shown by the dotted line. While a mapping element may have an associated expression that specifies its semantics, mappings are treated herein as nondirectional.
  • A model or schema is thus a complex structure that describes a design artifact. Examples of models are Structured Query Language (SQL) schemas, XML schemas, Unified Modeling Language (UML) models, interface definitions in a programming language, Web site maps, make scripts, object models, project models or any hierarchically organized data sets. Many uses of models require building mappings between models. For example, a common application is mapping one XML schema to another, to drive the translation of XML messages. Another common application is mapping a SQL schema into an XML schema to facilitate the export of SQL query results in an XML format, or to populate a SQL database with XML data based upon an XML schema. Today, a mapping is usually produced by a human designer, often using a visual modeling tool that can graphically represent the models and mappings. To reduce the effort of the human designer, it would be desirable to provide a tool that at a minimum provides an intelligent initial mapping as a starting point for the designer. Thus, it would be desirable to provide a robust algorithm that automatically creates a mapping between two given models.
  • Also, there is a related problem of query discovery, which operates on mapping expressions to obtain queries for actual data translation. Both types of discovery are needed. Each is a rich and complex problem that deserves independent study. Query discovery is already recognized as an independent problem, where it is usually assumed that a mapping either is given or is trivial. Herein, the problem of schema matching is analyzed.
  • It is recognized that the problem of schema matching is inherently subjective. Schemas may not completely capture the semantics of the data they describe, and there may be several plausible mappings between two schemas, making the concept of a single best mapping ill defined. This subjectivity makes it valuable to have user input to guide the match and user validation of the result. This guidance may come via an initial mapping, a dictionary or thesaurus, a library of known mappings, etc. Thus, the goal of schema matching, and one not yet adequately achieved by today's algorithms, is: given two input schemas in any data model, optional auxiliary information and an input mapping, compute a mapping between schema elements of the two input schemas that passes user validation.
  • The following is a taxonomy of currently known matching techniques. Schema matchers can be characterized by the following orthogonal criteria. With respect to schema-based vs. instance-based criteria, schema-based matchers consider only schema information, not instance data. Schema information includes names, descriptions, relationships, constraints, etc. Instance-based matchers either use metadata and statistics collected from data instances to annotate the schema, or directly find correlated schema elements, e.g., using machine learning.
  • With respect to element vs. structure granularity, an element-level matcher computes a mapping between individual schema elements, e.g., an attribute matcher. A structure-level matcher compares combinations of elements that appear together in a schema, e.g., classes or tables whose attribute sets only match approximately.
  • With respect to linguistic-based matching, a linguistic matcher uses names of schema elements and other textual descriptions. Name matching involves: putting the name into a canonical form by stemming and tokenization, comparing equality of names, comparing synonyms and hypernyms using generic and domain specific thesauri and matching substrings. Information retrieval (IR) techniques can be used to compare descriptions that annotate some schema elements.
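  • As a rough illustration of the name canonicalization step, the sketch below tokenizes element names on case changes, digits and separators and then compares token sets. It is an assumption-laden example rather than any particular matcher's method: the regular expression, the function names and the Jaccard-style overlap score are illustrative, and stemming and thesaurus lookup are omitted.

```python
import re

# Acronym runs, capitalized words, lowercase words, or digit runs.
_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+")


def canonical_tokens(name):
    """Split a schema element name into lowercase tokens."""
    return [t.lower() for t in _TOKEN.findall(name)]


def token_overlap(a, b):
    """Share of distinct tokens the two names have in common (0.0 to 1.0)."""
    ta, tb = set(canonical_tokens(a)), set(canonical_tokens(b))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(canonical_tokens("POBillTo"))     # ['po', 'bill', 'to']
print(canonical_tokens("customer_id"))  # ['customer', 'id']
print(token_overlap("POBillTo", "BillTo"))
```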
  • With respect to constraint-based matching, a constraint-based matcher uses schema constraints, such as data types and value ranges, uniqueness, requiredness, cardinalities, etc. A constraint-based matcher might also use intraschema relationships, such as referential integrity.
  • With respect to matching cardinality, schema matchers differ in the cardinality of the mappings they compute. Some only produce one to one mappings between schema elements. Others produce n to one mappings, e.g., matchings that map the combination of DailyWages and WorkingDays in the source schema to MonthlyPay in the target.
  • With respect to auxiliary information, schema matchers differ in their use of auxiliary information sources such as dictionaries, thesauri and input match-mismatch information. Reusing past match information can also help, for example, to compute a mapping that is the composition of mappings that were performed earlier.
  • With respect to individual vs. combinational matching, an individual matcher uses a single algorithm to perform the match. Combinational matchers can be one of two types: hybrid matchers and composite matchers. Hybrid matchers use multiple criteria to perform the matching. Composite matchers run independent match algorithms on the two schemas and combine the results.
  • In light of the above taxonomy, there are a number of known matching algorithms. The SEMINT system is an instance-based matcher that associates attributes in the two schemas with match signatures. The SEMINT system includes 15 constraint-based and 5 content-based criteria derived from instance values and normalized to the [0,1] interval, so that each attribute is a point in 20-dimensional space. Attributes of one schema are clustered with respect to their Euclidean distance. A neural network is trained on the cluster centers and then is used to obtain the most relevant cluster for each attribute of the second schema. SEMINT is a hybrid element-level matcher, but does not utilize schema structure, as the latter cannot be mapped into a numerical value.
  • The DELTA system groups all available metadata about an attribute into a text string and then applies IR techniques to perform matching. Like SEMINT, the DELTA system does not make much use of schema structure.
  • The LSD system uses a multilevel learning scheme to perform one to one matching of XML Document Type Definition (DTD) tags. A number of base learners that use different instance-level matching schemes are trained to assign tags of a mediated schema to data instances of a source schema. A metalearner combines the predictions of the base learners. LSD is thus a multi strategy instance-based matcher.
  • The SKAT prototype implements schema-based matching following a rule-based approach. Rules are formulated in first order logic to express match and mismatch relationships and methods are defined to derive new matches. The SKAT prototype supports name matching and simple structural matches based on isA hierarchies.
  • The TranScm prototype uses schema matching to drive data translation. The schema is translated to an internal graph representation. Multiple handcrafted matching rules are applied in order at each node. The matching is done top down with the rules at higher level nodes typically requiring the matching of descendants. This top down approach performs well only when the top level structures of the two schemas are quite similar. The TranScm prototype represents an element level and schema-based matcher.
  • The DIKE system integrates multiple Entity Relationship (ER) schemas by exploiting the principle that the similarity of schema elements depends on the similarity of elements in their vicinity. The relevance of elements is inversely proportional to their distance from the elements being compared, so nearby elements influence a match more than ones farther away. Linguistic matching is based on manual inputs. DIKE is a hybrid schema-based matcher utilizing both element and structure-level information.
  • ARTEMIS, the schema integration component of the MOMIS mediator system, matches classes based on their name affinity and structure affinity. MOMIS has a description logic engine to exploit constraints. The classes of the input schemas are clustered to obtain global classes for the mediated schema. Linguistic matching is based on manual inputs using an interface with WordNet. ARTEMIS is a hybrid schema-based matcher utilizing both element and structure-level information.
  • However, none of the above solutions adequately solves the generic problem of matching schemas. Some of them may be adequate for a given matching task because they were designed for that particular task, but they are not general, all-purpose approaches to model matching. Others were not designed for matching per se, but rather for some other purpose such as schema integration, and thus the matching techniques they apply make compromises that do not generalize adequately. Still other existing algorithms are too slow on today's hardware for interactive use, as a result of exhaustive calculations and the like.
  • There is thus a need for a mechanism or component that provides a complete general purpose schema matching solution. There is further a need for a general solution that considers all of the issues surrounding the above-described taxonomy, and includes a plurality of optimally combined algorithms. There is further a need for a method that automatically generates similarity coefficients for use in mapping two models. There is still further a need for a solution that is as consistent as possible with a given set of similarity relationships between elements of the two models. There are additional needs to be as consistent as possible with key and foreign key definitions in the two models, to relate objects of similar structure, to relate objects that have similar subtree structure and to relate objects that have similar leaf sets. There is also a need for an algorithm that achieves the above needs, but that is also fast enough to be used in real-time, e.g., by an interactive design tool.
  • SUMMARY OF THE INVENTION
  • In view of the foregoing, the present invention provides systems and methods for automatically and generically matching models, such as may be provided in a matching application or matching component, or provided in a general purpose system for managing models. The methods are generic since the methods apply to hierarchical data sets outside of any particular data model or application. Similarity coefficients are calculated for, and mappings can be discovered between, schema elements based on their names, data types, constraints, and schema structure, using a broad set of techniques. Some of these techniques include the integrated use of linguistic and structural matching, context dependent matching of shared types, and a bias toward subtree structure where much of the schema content resides.
  • Other features and embodiments of the present invention are described below.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The system and methods for model matching are further described with reference to the accompanying drawings in which:
  • FIG. 1 illustrates two exemplary schemas representing an exemplary matching problem solved in accordance with the present invention;
  • FIG. 2A is a block diagram representing an exemplary network environment having a variety of computing devices in which the present invention may be implemented;
  • FIG. 2B is a block diagram representing an exemplary non-limiting computing device in which the present invention may be implemented;
  • FIG. 3 illustrates two exemplary schemas and corresponding mappings based upon similarity coefficients generated in accordance with the present invention;
  • FIG. 4 illustrates an exemplary second pass calculation of structural similarity between two models in accordance with the invention;
  • FIG. 5 illustrates an exemplary non-limiting top-level architecture of an exemplary system in which the present invention may operate;
  • FIG. 6 illustrates an exemplary process diagram for processing two schemas to produce a mapping therebetween in accordance with the invention;
  • FIG. 7 is a block diagram illustrating exemplary relationships among model elements in accordance with a generically defined object model of the invention;
  • FIG. 8 illustrates exemplary handling of multiple paths from the root of a model to a particular model element in accordance with the invention;
  • FIG. 9A illustrates exemplary modeling of a foreign key with respect to two SQL tables in accordance with the present invention;
  • FIG. 9B illustrates an exemplary RefInt model element that represents a referential integrity constraint in accordance with the invention;
  • FIG. 10A illustrates an exemplary model representation of a RefInt in a relational schema in accordance with the invention;
  • FIG. 10B illustrates an exemplary model representation of a RefInt in an XML-Data Reduced (XDR) schema in accordance with a non-limiting exemplary embodiment of the invention;
  • FIG. 11 illustrates exemplary encoding of a RefInt in a data tree for an SQL schema in accordance with the invention;
  • FIG. 12 illustrates exemplary disambiguation of matchings between elements that are referenced by a RefInt in accordance with the invention; and
  • FIG. 13 illustrates exemplary introduction of a node in response to encountering a referential constraint, such as a foreign key, in a schema in accordance with the present invention.
  • DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS
  • Overview
  • In accordance with the present invention, methods and systems are provided for automatically creating similarity coefficients between elements of two given schemas or models. A mapping between the models can be produced from the similarity coefficients. For example, the algorithm(s) described by the present invention can automatically create similarity coefficients and a mapping between a SQL schema and an XML schema, although it will be appreciated that the invention is generic and not limited to any particular model type or schema. This is primarily accomplished by computing similarity coefficients between pairs of elements, with a pair of elements including one element from the first schema model and one element from the second schema model.
  • The model match algorithm of the invention is driven by at least three kinds of information in a data model: linguistic information about the names of model elements, type information about model elements and structural information about how model elements in a model are related. In addition to the models themselves, the algorithm may make use of dictionaries and thesauri to interpret the linguistic information.
  • The present invention thus provides algorithms for generic schema matching, outside of any particular data model or application, showing that a rich range of techniques is available based upon the taxonomy described above in the background. The invention proposes new algorithm(s) that discover similarity coefficients between schema elements based on their names, data types, constraints, and schema structure, using a broader set of techniques than past approaches. In various embodiments, the invention includes the integrated use of linguistic and structural matching, context dependent matching of shared types, and a bias toward subtree structure where much of the schema content of the subtree's root node resides.
  • In various non-limiting embodiments, the invention provides a solution to the schema matching problem (1) that includes automatic model matching that is both element-based and structure-based, (2) that utilizes the similarity of the subtrees of the two schemas and that is biased toward similarity of atomic elements, e.g., leaves, of a hierarchical tree, where much content describing the degree of similarity is captured, (3) that exploits internal structure, but is not overly misled by variations in that structure, (4) that exploits keys, referential constraints and views where they exist, (5) that makes context dependent matches of a shared type definition that is used in several larger structures, (6) that generates one to one or one to n mappings, and (7) wherein adjustments may be made if desired and wherein a user may make input or correction to the process.
  • While the invention shares some general approaches with known algorithms, the invention does not implement any particular one of the algorithms themselves. For instance, while aspects of the overall techniques of the invention include rating match quality in the [0,1] interval and clustering similar terms (SEMINT), as well as matching structures based on a local vicinity (DIKE, ARTEMIS), none of the prior art techniques generates similarity coefficients for each node pair of two models being matched based upon both linguistic and structural similarity, wherein similarities associated with the subtree of a root node are updated in accordance with the similarity coefficient calculations for the root node. Other novel aspects of the invention are described in more detail below.
  • The invention is schema-based and not instance-based and assumes some hierarchy to the schemas being matched. In this regard, the interconnected elements of a schema hierarchy are modeled as a tree structure having branches and leaves. A simple relational schema is an example of a schema tree since such a schema contains tables, which contain columns. An XML schema with no shared elements is another simple example. With such an XML schema, elements include subelements, which in turn include other subelements or attributes. The model may also be enriched to capture additional semantics, making the invention apply as generically as possible, as described in the below section on modeling and the generic object model.
  • The present invention provides systems and methods that are consistent with a given set of similarity relationships between elements of the two models. For example, the given similarity relationships may include that “PO” is similar to “purchase order” with weight 0.8 and that “PO” is similar to “post office” with weight 0.7. So, an element of one model named “PO” is more similar to a node in the other model named “purchase order” than one named “post office.” Therefore, if model1 contains an element named “PO” and model2 contains two elements named “purchase order” and “post office,” then all else being equal, “purchase order” is a better match for “PO” than “post office.”
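  • The “PO” example can be written out as a small lookup. This is a minimal sketch: the table layout and the function names are hypothetical, and only the two weights (0.8 and 0.7) come from the text.

```python
# Given similarity relationships, stored as (term, term) -> weight.
GIVEN_SIMILARITY = {
    ("PO", "purchase order"): 0.8,
    ("PO", "post office"): 0.7,
}


def name_similarity(a, b, table=GIVEN_SIMILARITY):
    """Look up a symmetric similarity weight; identical names score 1.0."""
    if a == b:
        return 1.0
    return table.get((a, b), table.get((b, a), 0.0))


def best_match(name, candidates, table=GIVEN_SIMILARITY):
    """Pick the candidate with the highest given similarity to name."""
    return max(candidates, key=lambda c: name_similarity(name, c, table))


print(best_match("PO", ["post office", "purchase order"]))  # purchase order
```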
  • The present invention further provides systems and methods that are consistent with key and foreign key definitions, if any, in the two models. For example, when matching two relational schemas, if a column C1 is a key of a table T1 in model1, then it is desirable to map C1 to a column C2 that is a key of its table T2 in model2.
  • The present invention further provides systems and methods that relate objects of similar structure. For example, if an object m1 of model1 is mapped to an object m2 of model2, then the objects in m1's neighborhood are mapped to the objects in m2's neighborhood and those neighborhoods are assigned a similar structural relationship to reflect the similarity of object m1 to object m2.
  • The present invention further provides systems and methods that relate objects that have similar leaf sets. For example, if the leaf elements under InvoiceInfo in one model are more similar to those under BillingInfo than to those under EmployeeInfo, then it is better to map InvoiceInfo to BillingInfo than to EmployeeInfo.
  • Lastly, the algorithm(s) of the present invention are fast, i.e., the algorithm(s) are fast enough, for example, to be used by an interactive design tool or other real-time application.
  • The invention recognizes that two nodes are similar if (1) the model elements corresponding to the two nodes are inherently similar, such as if the model elements are linguistically similar, and if (2) the subtrees rooted at the two nodes are similar. The invention also recognizes that the similarity of two subtrees is not always reflected by the similarity of their immediate children. The leaves of the subtree give a better estimate of the data described by the subtree, since they refer to the atomic data elements that the model is ultimately describing, and since intervening structure may be superfluous. The invention further recognizes that the similarity of two leaves in hierarchical tree structures depends on their inherent similarity and the similarity of their structural vicinity.
  • The matching algorithm of the invention works generally as follows. The structural similarity of each pair of leaf nodes s and t in the source (domain) model and target (range) model, respectively, is initialized. For example, the structural similarity may be initialized to the compatibility of the nodes' corresponding data-types. Then, the nodes of the two trees are enumerated in inverse topological order, such as post-order. For each node pair (s, t) encountered during traversal of the two trees, a weighted similarity calculation is made that takes both inherent and structural similarity of the node pair into account. Inherent similarity takes into account only the individual nodes being compared and may be, for example, their linguistic similarity. Structural similarity takes into account the similarity of the subtrees of the node pair, e.g., the leaf sets of the node pair may be considered. The weighted similarity calculation for the node pair (s, t) may then be utilized in connection with either increasing or decreasing the similarity of the subtrees of the node pair. This reflects that if the nodes are similar, the children or leaves rooted by the nodes will likely be similar as well and, by the same token, that if the nodes are dissimilar, the children or leaves of the nodes will likely be dissimilar. The weight for computing a weighted mean and various thresholds may be set as tuning parameters.
  • The structural similarity of the two subtrees is determined based on the best matches between corresponding subtrees, e.g., leaf nodes. A good computation for the structural similarity of a node pair (s, t) returns a high value when the number of strong matches of subtree_s and subtree_t is above a certain threshold, such as half, and a low value otherwise.
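  • The steps described in the two preceding paragraphs can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the Node class, the parameter names and values (W_STRUCT, TH_ACCEPT, TH_HIGH, TH_LOW, C_INC, C_DEC), the initial data-type score of 0.5, the use of the fraction of strongly matched leaves as the structural similarity, and the name-equality stand-in for inherent similarity are all hypothetical choices.

```python
from itertools import product


class Node:
    def __init__(self, name, dtype=None, children=()):
        self.name, self.dtype, self.children = name, dtype, list(children)

    def leaves(self):
        if not self.children:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]

    def post_order(self):
        for c in self.children:
            yield from c.post_order()
        yield self


# Assumed tuning parameters (illustrative values, not taken from the text).
W_STRUCT = 0.5               # weight of structural vs. inherent similarity
TH_ACCEPT = 0.5              # a leaf pair at or above this is a "strong" match
TH_HIGH, TH_LOW = 0.6, 0.35  # thresholds that trigger an increase or decrease
C_INC, C_DEC = 1.2, 0.9      # multiplicative adjustment factors


def match(source, target, lsim):
    """Return {(s, t): weighted similarity}; lsim(s, t) is inherent similarity."""
    ssim, wsim = {}, {}

    def weighted(s, t):
        return W_STRUCT * ssim[s, t] + (1 - W_STRUCT) * lsim(s, t)

    # Step 1: initialize leaf-pair structural similarity from data types.
    for s, t in product(source.leaves(), target.leaves()):
        ssim[s, t] = 0.5 if s.dtype == t.dtype else 0.0

    # Step 2: visit both trees bottom-up (post-order).
    for s in source.post_order():
        for t in target.post_order():
            s_leaves, t_leaves = s.leaves(), t.leaves()
            if s.children or t.children:
                # Fraction of leaves with a strong match in the other subtree.
                strong_s = sum(any(weighted(x, y) >= TH_ACCEPT for y in t_leaves)
                               for x in s_leaves)
                strong_t = sum(any(weighted(x, y) >= TH_ACCEPT for x in s_leaves)
                               for y in t_leaves)
                ssim[s, t] = (strong_s + strong_t) / (len(s_leaves) + len(t_leaves))
            wsim[s, t] = weighted(s, t)
            # Step 3: similar roots raise leaf similarity; dissimilar roots lower it.
            if wsim[s, t] > TH_HIGH:
                factor = C_INC
            elif wsim[s, t] < TH_LOW:
                factor = C_DEC
            else:
                continue
            for x, y in product(s_leaves, t_leaves):
                ssim[x, y] = min(1.0, ssim[x, y] * factor)
    return wsim


# Usage: two small trees; name equality stands in for inherent similarity.
qty_s, uom_s = Node("Qty", "int"), Node("UoM", "str")
qty_t, price_t = Node("Qty", "int"), Node("Price", "float")
src = Node("Item", children=[qty_s, uom_s])
tgt = Node("Item", children=[qty_t, price_t])
result = match(src, tgt, lambda a, b: 1.0 if a.name == b.name else 0.0)
```

  • In this sketch, the pair of Item nodes and the pair of Qty leaves receive higher weighted similarity than the unrelated pair (UoM, Price). Changing the assumed weight and thresholds changes how aggressively leaf similarities are adjusted.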
  • The similarity computations of the invention thus have a mutually recursive flavor. Two elements are similar if their subtree node sets are similar. The similarity of the subtree nodes is increased if they have ancestors that are similar. The similarity of intermediate substructure also influences subtree similarity: if the subtree structures of two elements are highly similar, then multiple element pairs in the subtrees will be highly similar, which leads to higher structural similarity of the leaves (due to multiple similarity increases). Inverse topological order traversal of the schemas ensures that before two elements e1 and e2 are compared, all the elements in their subtrees have already been compared. This ensures that e1's leaves and e2's leaves capture the similarity of e1's intermediate subtree structure and e2's intermediate subtree structure before e1 and e2 are compared. The structural similarity of two nodes with a large difference in the number of leaves is unlikely to be very good. Such comparisons lead to a large number of element similarities that are below a threshold. This can be prevented by the algorithm(s) of the invention by only comparing elements that have a similar number of leaves in their subtrees, e.g., within a factor of 2. In addition to only comparing relevant elements, such a pruning step decreases the number of element pairs for comparison, and thus speeds operation of the algorithm(s).
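  • The pruning step mentioned above can be sketched as a simple predicate on leaf counts. The function names, the element names, and the sample counts are hypothetical; the factor of 2 is the example given in the text.

```python
def comparable(s_leaf_count, t_leaf_count, max_ratio=2.0):
    """True when the two leaf counts are within max_ratio of each other."""
    smaller, larger = sorted((s_leaf_count, t_leaf_count))
    return smaller > 0 and larger <= max_ratio * smaller


def candidate_pairs(source_counts, target_counts, max_ratio=2.0):
    """Keep only element pairs whose subtrees have a similar number of leaves."""
    return [
        (s, t)
        for s, s_count in source_counts.items()
        for t, t_count in target_counts.items()
        if comparable(s_count, t_count, max_ratio)
    ]


# Number of leaves under each element (made-up values for illustration).
source_counts = {"PO": 8, "Item": 3, "Qty": 1}
target_counts = {"PurchaseOrder": 10, "Items": 3, "Quantity": 1}

# Only 3 of the 9 possible pairs survive pruning.
pairs = candidate_pairs(source_counts, target_counts)
```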
  • The invention thus matches models in a bottom-up fashion, making it rather different from top-down approaches. The disadvantage of a top-down technique is that it depends very heavily on a good matching at the top level of the schema hierarchy. The results will not be good if the children of the roots of the two models are very different, but merely present a different normalization of the same schema. However, a top-down approach may be more efficient when the schemas are very similar. The bottom-up approach of the invention is more conservative and does not suffer from the case of false-negatives, but at the cost of more computation; nonetheless, the performance of the invention in real-time minimizes the impact of this cost.
  • Various levels of subtree may be considered in accordance with the invention. Instead of comparing all of the leaves of a node pair, the invention may consider only the immediate descendants of the elements being compared. Using the leaves for measuring structural similarity identifies most of the matches that this alternative scheme does. However, using the leaves ensures that schemas, which have a moderately different substructure (e.g., nesting of elements), but essentially the same data content (similar leaves), are correctly matched.
  • If matches between internal nodes of the data path tree are important to the result, then a second pass in the calculation of structural similarity is utilized. The reason is that the first pass of calculating structural similarities has the effect of updating the similarities of the subtree structures. So, at the end of the calculation the structural similarity of some pairs of elements may no longer be consistent with the similarities of their subtrees.
  • After the calculation of structural similarity is completed for all nodes of the models, i.e., for all nodes that were not pruned from the calculation, a post-processing step may be performed on the structural similarity values to construct a mapping between the two models. For example, as part of the post-processing, the two trees can again be traversed in inverse topological order, and each node of the target can be matched with the node of source with which it has highest structural similarity.
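  • The post-processing step above can be sketched as choosing, for each target node, the source node with the highest similarity. The acceptance threshold, the function name, and the sample coefficients are assumptions added for illustration; the text itself only requires selecting the highest value.

```python
def build_mapping(similarity, accept_threshold=0.5):
    """Map each target node to its most similar source node.

    similarity: dict {(source_name, target_name): coefficient in [0, 1]}.
    Pairs below accept_threshold (an assumed cutoff) are left unmapped.
    """
    best = {}
    for (s, t), value in similarity.items():
        if value >= accept_threshold and value > best.get(t, (None, -1.0))[1]:
            best[t] = (s, value)
    return {t: s for t, (s, _) in best.items()}


# Made-up similarity coefficients for a handful of element pairs.
similarity = {
    ("Qty", "Quantity"): 0.9,
    ("UoM", "Quantity"): 0.2,
    ("UoM", "UnitOfMeasure"): 0.85,
    ("Qty", "UnitOfMeasure"): 0.1,
    ("Line", "Comments"): 0.3,
}

mapping = build_mapping(similarity)
```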
  • Other aspects of the invention are set forth below.
  • Exemplary Networked and Distributed Environments
  • One of ordinary skill in the art can appreciate that a computer or other client or server device can be deployed as part of a computer network, or in a distributed computing environment. In this regard, the present invention pertains to any computer system having any number of memory or storage units, and any number of applications and processes occurring across any number of storage units or volumes. The present invention may apply to an environment with server computers and client computers deployed in a network environment or distributed computing environment, having remote or local storage. The present invention may also be applied to standalone computing devices, having programming language functionality, interpretation and execution capabilities for generating, receiving and transmitting information in connection with services.
  • Distributed computing facilitates sharing of computer resources and services by direct exchange between computing devices and systems. These resources and services include the exchange of information, cache storage, and disk storage for files. Distributed computing takes advantage of network connectivity, allowing clients to leverage their collective power to benefit the entire enterprise. In this regard, a variety of devices may have data sets for which it would be desirable to perform the matching algorithms of the present invention.
  • FIG. 2A provides a schematic diagram of an exemplary networked or distributed computing environment. The distributed computing environment comprises computing objects 10 a, 10 b, etc. and computing objects or devices 110 a, 110 b, 110 c, etc. These objects may comprise programs, methods, data stores, programmable logic, etc. The objects may comprise portions of the same or different devices such as PDAs, televisions, MP3 players, personal computers, etc. Each object can communicate with another object by way of the communications network 14. This network may itself comprise other computing objects and computing devices that provide services to the system of FIG. 2A. In accordance with an aspect of the invention, each object 10 or 110 may contain data such that it would be desirable to match that data to other data of other objects 10 or 110. For example, where one of the objects may possess SQL data, another of the objects may possess XML data, and it may be desirable to provide a mapping between the associated schemas.
  • In a distributed computing architecture, computers, which may have traditionally been used solely as clients, communicate directly among themselves and can act as both clients and servers, assuming whatever role is most efficient for the network. This reduces the load on servers and allows all of the clients to access resources available on other clients thereby increasing the capability and efficiency of the entire network.
  • Distributed computing can help businesses deliver services and capabilities more efficiently across diverse geographic boundaries. Moreover, distributed computing can move data closer to the point where data is consumed acting as a network caching mechanism. Distributed computing also allows computing networks to dynamically work together using intelligent agents. Agents reside on peer computers and communicate various kinds of information back and forth. Agents may also initiate tasks on behalf of other peer systems. For instance, intelligent agents can be used to prioritize tasks on a network, change traffic flow, search for files locally or determine anomalous behavior such as a virus and stop it before it affects the network. All sorts of other services may be contemplated as well. As one of ordinary skill in the distributed computing arts can appreciate, the matching algorithm(s) of the present invention may be implemented in such an environment.
  • It can also be appreciated that an object, such as 110 c, may be hosted on another computing device 10 or 110. Thus, although the physical environment depicted may show the connected devices as computers, such illustration is merely exemplary and the physical environment may alternatively be depicted or described comprising various digital devices such as PDAs, televisions, MP3 players, etc., software objects such as interfaces, COM objects and the like.
  • There are a variety of systems, components, and network configurations that support distributed computing environments. For example, computing systems may be connected together by wireline or wireless systems, by local networks or widely distributed networks. Currently, many of the networks are coupled to the Internet, which provides the infrastructure for widely distributed computing and encompasses many different networks.
  • In home networking environments, there are at least four disparate network transport media that may each support a unique protocol such as Power line, data (both wireless and wired), voice (e.g., telephone) and entertainment media. Most home control devices such as light switches and appliances may use power line for connectivity. Data Services may enter the home as broadband (e.g., either DSL or Cable modem) and is accessible within the home using either wireless (e.g., HomeRF or 802.11b) or wired (e.g., Home PNA, Cat 5, even power line) connectivity. Voice traffic may enter the home either as wired (e.g., Cat 3) or wireless (e.g., cell phones) and may be distributed within the home using Cat 3 wiring. Entertainment Media may enter the home either through satellite or cable and is typically distributed in the home using coaxial cable. IEEE 1394 and DVI are also emerging as digital interconnects for clusters of media devices. All of these network environments and others that may emerge as protocol standards may be interconnected to form an intranet that may be connected to the outside world by way of the Internet. In short, a variety of disparate sources exist for the storage and transmission of data, and consequently, moving forward, computing devices will require ways of sharing data based upon common ground. The matching algorithm(s) of the present invention may provide such common ground by providing mappings between the disparately structured and named data.
  • The Internet commonly refers to the collection of networks and gateways that utilize the TCP/IP suite of protocols, which are well-known in the art of computer networking. TCP/IP is an acronym for “Transmission Control Protocol/Internet Protocol.” The Internet can be described as a system of geographically distributed remote computer networks interconnected by computers executing networking protocols that allow users to interact and share information over the networks. Because of such wide-spread information sharing, remote networks such as the Internet have thus far generally evolved into an open system for which developers can design software applications for performing specialized operations or services, essentially without restriction.
  • Thus, the network infrastructure enables a host of network topologies such as client/server, peer-to-peer, or hybrid architectures. The “client” is a member of a class or group that uses the services of another class or group to which it is not related. Thus, in computing, a client is a process (i.e., roughly a set of instructions or tasks) that requests a service provided by another program. The client process utilizes the requested service without having to “know” any working details about the other program or the service itself. In a client/server architecture, particularly a networked system, a client is usually a computer that accesses shared network resources provided by another computer, e.g., a server. In the example of FIG. 2A, computers 110 a, 110 b, etc. can be thought of as clients and computer 10 a, 10 b, etc. can be thought of as the server where server 10 a, 10 b, etc. maintains the data that is then replicated in the client computers 110 a, 110 b, etc.
  • A server is typically a remote computer system accessible over a remote network such as the Internet. The client process may be active in a first computer system, and the server process may be active in a second computer system, communicating with one another over a communications medium, thus providing distributed functionality and allowing multiple clients to take advantage of the information-gathering capabilities of the server.
  • Client and server communicate with one another utilizing the functionality provided by a protocol layer. For example, Hypertext Transfer Protocol (HTTP) is a common protocol that is used in conjunction with the World Wide Web (WWW) or, simply, the “Web.” Typically, a computer network address such as a Uniform Resource Locator (URL) or an Internet Protocol (IP) address is used to identify the server or client computers to each other. The network address can be referred to as a Uniform Resource Locator address. For example, communication can be provided over a communications medium. In particular, the client and server may be coupled to one another via TCP/IP connections for high-capacity communication.
  • Thus, FIG. 2A illustrates an exemplary networked or distributed environment, with a server in communication with client computers via a network/bus, in which the present invention may be employed. In more detail, a number of servers 10 a, 10 b, etc., are interconnected via a communications network/bus 14, which may be a LAN, WAN, intranet, the Internet, etc., with a number of client or remote computing devices 110 a, 110 b, 110 c, 110 d, 110 e, etc., such as a portable computer, handheld computer, thin client, networked appliance, or other device, such as a VCR, TV, oven, light, heater and the like in accordance with the present invention. It is thus contemplated that the present invention may apply to any computing device in connection with which it is desirable to communicate to another computing device with respect to matching services.
  • In a network environment in which the communications network/bus 14 is the Internet, for example, the servers 10 can be Web servers with which the clients 110 a, 110 b, 110 c, 110 d, 110 e, etc. communicate via any of a number of known protocols such as HTTP. Servers 10 may also serve as clients 110, as may be characteristic of a distributed computing environment. Communications may be wired or wireless, where appropriate. Client devices 110 may or may not communicate via communications network/bus 14, and may have independent communications associated therewith. For example, in the case of a TV or VCR, there may or may not be a networked aspect to the control thereof. Each client computer 110 and server computer 10 may be equipped with various application program modules or objects 135 and with connections or access to various types of storage elements or objects, across which files may be stored or to which portion(s) of files may be downloaded or migrated. Any computer 10 a, 10 b, 110 a, 110 b, etc. may be responsible for the maintenance and updating of a database 20 or other storage element in accordance with the present invention, such as a database 20 for storing schema or model data in accordance with the present invention. Thus, the present invention can be utilized in a computer network environment having client computers 110 a, 110 b, etc. that can access and interact with a computer network/bus 14 and server computers 10 a, 10 b, etc. that may interact with client computers 110 a, 110 b, etc. and other devices 111 and databases 20.
  • Exemplary Computing Device
  • FIG. 2B and the following discussion are intended to provide a brief general description of a suitable computing environment in which the invention may be implemented. It should be understood, however, that handheld, portable and other computing devices and computing objects of all kinds are contemplated for use in connection with the present invention. While a general purpose computer is described below, this is but one example, and the present invention requires only a thin client having network/bus interoperability and interaction. Thus, the present invention may be implemented in an environment of networked hosted services in which very little or minimal client resources are implicated, e.g., a networked environment in which the client device serves merely as an interface to the network/bus, such as an object placed in an appliance. In essence, anywhere that data may be stored or from which data may be retrieved is a desirable, or suitable, environment for operation of the matching algorithm(s) of the invention.
  • Although not required, the invention can be implemented via an operating system, for use by a developer of services for a device or object, and/or included within application software which aids in matching data sets. Software may be described in the general context of computer-executable instructions, such as program modules, being executed by one or more computers, such as client workstations, servers or other devices. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments. Moreover, those skilled in the art will appreciate that the invention may be practiced with other computer system configurations. Other well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers (PCs), automated teller machines, server computers, hand-held or laptop devices, multi-processor systems, microprocessor-based systems, programmable consumer electronics, network PCs, appliances, lights, environmental control elements, minicomputers, mainframe computers and the like. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network/bus or other data transmission medium. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices, and client nodes may in turn behave as server nodes.
  • FIG. 2B thus illustrates an example of a suitable computing system environment 100 in which the invention may be implemented, although as made clear above, the computing system environment 100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing environment 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment 100.
  • With reference to FIG. 2B, an exemplary system for implementing the invention includes a general purpose computing device in the form of a computer 110. Components of computer 110 may include, but are not limited to, a processing unit 120, a system memory 130, and a system bus 121 that couples various system components including the system memory to the processing unit 120. The system bus 121 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus (also known as Mezzanine bus).
  • Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CDROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal, such as information processed according to the invention or information incident to carrying out the algorithms of the invention. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
  • The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation, FIG. 2B illustrates operating system 134, application programs 135, other program modules 136, and program data 137.
  • The computer 110 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, FIG. 2B illustrates a hard disk drive 141 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152, and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk 156, such as a CD ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 141 is typically connected to the system bus 121 through a non-removable memory interface such as interface 140, and magnetic disk drive 151 and optical disk drive 155 are typically connected to the system bus 121 by a removable memory interface, such as interface 150.
  • The drives and their associated computer storage media discussed above and illustrated in FIG. 2B provide storage of computer readable instructions, data structures, program modules and other data for the computer 110. In FIG. 2B, for example, hard disk drive 141 is illustrated as storing operating system 144, application programs 145, other program modules 146, and program data 147. Note that these components can either be the same as or different from operating system 134, application programs 135, other program modules 136, and program data 137. Operating system 144, application programs 145, other program modules 146, and program data 147 are given different numbers here to illustrate that, at a minimum, they are different copies. A user may enter commands and information into the computer 110 through input devices such as a keyboard 162 and pointing device 161, commonly referred to as a mouse, trackball or touch pad. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus 121, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195.
  • The computer 110 may operate in a networked or distributed environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110, although only a memory storage device 181 has been illustrated in FIG. 2B. The logical connections depicted in FIG. 2B include a local area network (LAN) 171 and a wide area network (WAN) 173, but may also include other networks/buses. Such networking environments are commonplace in homes, offices, enterprise-wide computer networks, intranets and the Internet.
  • When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 2B illustrates remote application programs 185 as residing on memory device 181. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
  • .NET Framework
  • .Net is a computing framework that has been developed in light of the convergence of personal computing and the Internet. Individuals and business users alike are provided with a seamlessly interoperable and Web-enabled interface for applications and computing devices, making computing activities increasingly Web browser or network-oriented. In general, the .Net platform includes servers; building-block services, such as Web-based data storage; and downloadable device software.
  • Generally speaking, the .Net platform provides (1) the ability to make the entire range of computing devices work together and to have user information automatically updated and synchronized on all of them, (2) increased interactive capability for Web sites, enabled by greater use of XML rather than Hypertext Markup Language (HTML), (3) online services that feature customized access and delivery of products and services to the user from a central starting point for the management of various applications, such as e-mail, for example, or software, such as Office .Net, (4) centralized data storage, which will increase efficiency and ease of access to information, as well as synchronization of information among users and devices, (5) the ability to integrate various communications media, such as e-mail, faxes, and telephones, (6) for developers, the ability to create reusable modules, thereby increasing productivity and reducing the number of programming errors and (7) many other cross-platform integration features as well. While exemplary embodiments herein are described in connection with software residing on a computing device, portions of the invention may also be implemented via an operating system or a “middle man” object between a network and device or object, such that data matching services may be performed by, supported in or accessed via all of Microsoft's .NET languages and services.
  • Model Matching—Exemplary Computations and Embodiments
  • Having described exemplary computing devices and computing environments in which the present invention may be implemented, various non-limiting embodiments of the systems and methods for automatically and generically matching models in accordance with the invention are set forth below. Various embodiments of the invention described below include one or more of the integrated use of linguistic and structural matching, context dependent matching of shared types and a bias toward subtree structure where much of the schema content resides. The systems and methods are generic since they may be applied to hierarchical data sets outside of any particular data model or application.
  • By way of the exemplary schemas S3 and S4 of FIG. 3, aspects of the present invention may be illustrated in connection with matching two similar schemas PO and Purchase Order. The schemas are encoded as graphs, where nodes represent schema elements. Although even a casual observer can see the schemas are very similar, there is much variation in the naming and the structure that makes algorithmic matching quite challenging.
  • The present invention approaches the matching problem by computing similarity coefficients between elements of the two schemas, from which a mapping between the elements may be deduced. The coefficients, in the [0,1] range, are calculated in two phases: inherent matching and structural matching.
  • The first phase, inherent matching, which may be linguistic matching, matches individual schema elements based on their names, data types, domains, etc. One or more dictionaries and/or thesauri can be used to help match names by identifying short forms (Qty for Quantity m2), acronyms (UoM for UnitOfMeasure m3) and synonyms (Bill and Invoice m4). The result is a linguistic similarity coefficient, lsim, between each pair of elements, e.g., lsim1 for m1, lsim2 for m2, etc.
  • The second phase is the structural matching of schema elements based on the similarity of their contexts or vicinities. For example, Line is mapped to ItemNumber m5 because their parents, i.e., Item, match and the other two children of Item, i.e., Qty for Quantity and UoM for UnitOfMeasure, already match. The structural match depends in part on linguistic matches calculated in phase one. For example, City and Street under POBillTo match City m6 and Street m7 under InvoiceTo, rather than under DeliverTo, because Bill is a synonym of Invoice but not of Deliver. The result is a structural similarity coefficient, ssim, e.g., ssim1 for m1, ssim2 for m2, etc. for each pair of elements.
  • After calculating the inherent and structural coefficients for each node pair, a weighted similarity (wsim) is calculated for each node pair, which is a function, such as the mean or weighted mean, of lsim and ssim. One such weighted similarity calculation is as follows:
    wsim = wstruct * ssim + (1 − wstruct) * lsim,
    where the constant wstruct is in the range 0 to 1.
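  • By way of non-limiting illustration, the weighted similarity calculation above may be sketched in Python as follows. The function name, the default value of w_struct and the sample coefficients are assumptions made for this example only.

```python
def weighted_similarity(ssim: float, lsim: float, w_struct: float = 0.5) -> float:
    """Combine coefficients as wsim = wstruct * ssim + (1 - wstruct) * lsim."""
    if not 0.0 <= w_struct <= 1.0:
        raise ValueError("w_struct must lie in the range 0 to 1")
    return w_struct * ssim + (1.0 - w_struct) * lsim


# Hypothetical coefficients for a single pair of schema elements.
example_wsim = weighted_similarity(ssim=0.8, lsim=0.4, w_struct=0.5)
```

  • With w_struct closer to 1 the structural coefficient dominates; with w_struct closer to 0 the linguistic coefficient dominates.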
  • Then, an optional additional step that may be performed is mapping generation, wherein pairs of schema elements with maximal weighted similarity are chosen for mappings between the schema elements. The inherent matching phase, structural matching phase and mapping generation techniques of the invention are described in more detail in connection with various exemplary embodiments below.
  • The linguistic matching of the invention is based primarily on schema element names. In the absence of data instances, such names are probably the most useful source of information for matching. The invention also makes modest use of data types and schema structure in this phase. Inherent matching, such as linguistic matching, proceeds in three steps in one embodiment: normalization, categorization and comparison. The steps of normalization, categorization and comparison are described in much more detail below in the section relating to inherent similarities. For now, however, it can be understood that as a result of the comparison of the inherent matching, a set of inherent similarity coefficients lsim are generated as between node pairs of the models being compared.
  • For structure matching, an algorithm is presented herein for hierarchical schemas, i.e., tree structures generated from the generic modeling performed in accordance with the invention. As presented in more detail below, the generic modeling may be extended to include richer schemas that have shared data types and referential integrities. For each pair of the source and target tree structures, the algorithm computes a structural similarity, ssim, which is a measure of the similarity of the contexts in which the elements occur in the two schemas. From ssim and lsim, a weighted similarity wsim is computed according to a function, such as a mean calculation, that may be weighted. The above wsim calculation is illustrative in this regard.
  • The below describes an exemplary tree matching algorithm in accordance with the inherent, structural and weighted similarity computations of the invention:
    TreeMatch(SourceTree S, TargetTree T)
      for each s ∈ S, t ∈ T where s, t are leaves
        set ssim(s, t) = datatype-compatibility(s, t)
      S′ = post-order(S), T′ = post-order(T)
      for each s in S′
        for each t in T′
          compute ssim(s, t) = structural-similarity(s, t)
          wsim(s, t) = wstruct * ssim(s, t) + (1 − wstruct) * lsim(s, t)
          if wsim(s, t) > thhigh
            increase-struct-similarity(leaves(s), leaves(t), cinc)
          if wsim(s, t) < thlow
            decrease-struct-similarity(leaves(s), leaves(t), cdec)
  • In one non-limiting embodiment, the structural similarity of two leaves is initialized to the type compatibility of their corresponding data types, although the structural similarity may be initialized to other values respecting subtrees as well. In one implementation, this initialization value, which lies in the range [0, 0.5], is looked up in a compatibility table. Identical data types have a compatibility of 0.5. As described below, a value of 0.5 allows for later increases or decreases in structural similarity based on increases or decreases in confidence.
  • After initialization, the elements in the two trees are enumerated in inverse topological order, such as post-order, which is uniquely defined for a given tree. Both the inner and outer loops are executed in this order. The first step in the loop computes the structural similarity of two elements (s, t). For leaves, this is the value of ssim that was initialized in the earlier loop. When one of the elements is not a leaf, the structural similarity is computed as a measure of the number of leaf level matches in the subtrees rooted at the elements that are being compared, reflecting the intuition that when leaf structure is similar, so will be the structure of the root elements. The invention indicates that a leaf in one schema has a strong link to a leaf in the other schema if their weighted similarity exceeds a threshold thaccept. Exceeding the threshold thaccept indicates a potentially acceptable mapping. In one implementation, the structural similarity is estimated as the fraction of leaves in the two subtrees that have at least one strong link (and are hence mappable) to some leaf in the other subtree, as represented by the following exemplary equation:
    $$\mathrm{ssim}(s,t) = \frac{\left|\{x \mid x \in \mathrm{leaves}(s) \wedge \exists y \in \mathrm{leaves}(t),\ \mathrm{stronglink}(x,y)\} \cup \{x \mid x \in \mathrm{leaves}(t) \wedge \exists y \in \mathrm{leaves}(s),\ \mathrm{stronglink}(y,x)\}\right|}{\left|\mathrm{leaves}(s) \cup \mathrm{leaves}(t)\right|}$$
    where leaves(s) = set of leaves in the subtree rooted at s. Two leaves have a strong link if their weighted similarity is greater than a pre-defined threshold. Once the inherent and structural values are known for the model elements being compared, the weighted similarity is computed.
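  • By way of non-limiting illustration, the strong-link estimate of structural similarity may be sketched in Python as follows. The function name, the dictionary of weighted similarities and the sample values are assumptions made for this example. The sketch further assumes that the leaves of the two subtrees are distinct objects, so that the size of the union of the two leaf sets equals the sum of their sizes.

```python
from typing import Dict, List, Tuple


def structural_similarity(
    leaves_s: List[str],
    leaves_t: List[str],
    wsim: Dict[Tuple[str, str], float],
    th_accept: float = 0.5,
) -> float:
    """Fraction of leaves having at least one strong link into the other subtree."""

    def strong_link(x: str, y: str) -> bool:
        # A strong link exists when the weighted similarity exceeds thaccept.
        return wsim.get((x, y), 0.0) > th_accept

    linked_s = [x for x in leaves_s if any(strong_link(x, y) for y in leaves_t)]
    linked_t = [y for y in leaves_t if any(strong_link(x, y) for x in leaves_s)]
    total = len(leaves_s) + len(leaves_t)
    return (len(linked_s) + len(linked_t)) / total if total else 0.0


# Hypothetical weighted similarities between leaves under two Item elements.
sample_wsim = {("Qty", "Quantity"): 0.9, ("UoM", "UnitOfMeasure"): 0.8}
item_ssim = structural_similarity(
    ["Qty", "UoM", "Line"],
    ["Quantity", "UnitOfMeasure", "ItemNumber"],
    sample_wsim,
)
```

  • In this example four of the six leaves have a strong link, so the two Item elements receive a structural similarity of four sixths even though Line and ItemNumber are not yet linked.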
  • Then, if the two elements being compared are highly similar, i.e., if their weighted similarity exceeds the threshold thhigh, then the structural similarity (ssim) of each pair of leaves in the two subtrees (one from each schema) may be increased by the factor cinc (ssim not to exceed 1 in this example). The rationale is that leaves with highly similar ancestors occur in similar contexts. So the presence of such ancestors should reinforce their structural similarity. For example, in FIG. 3, if POBillTo is highly similar to InvoiceTo, then the structural similarity of their leaves City and Street would be increased, to bind them more tightly than to other City and Street pairs. For similar reasons, if the weighted similarity is less than the threshold thlow, the structural similarities of leaves in the subtrees may be decreased by the factor cdec. The linguistic similarity, however, remains unchanged.
  • The algorithm of the invention recognizes when the leaves in two subtrees match, even if the subtree structures that contain them do not match precisely. This is often the case when the top-level organization of the same data is very different in the two models. This is why it is beneficial to use leaves rather than internal nodes when comparing two subtrees.
  • Since at the end of the calculation the structural similarity of some pairs of elements may no longer be consistent with the similarities of their leaves, a second pass in the calculation of the structural similarity may be utilized. For example, in FIG. 4, suppose the subtrees under the two Address elements of Model2 are identical, as shown by the identical triangles underneath. Then, during the post-order traversal, the Address element in Model1 will have the same structural similarity to both Address elements of Model2. Then, suppose the Contact elements in the two models are compared in the structural similarity calculation and it is determined that they have a structural similarity greater than the threshold thhigh, thereby causing the similarity of their leaf sets to be increased. If the structural similarity of the Address elements were then recalculated during a second pass, the initial structural similarity used for the second pass structural similarity calculation would be higher than its value resulting from the first calculation, because the leaf sets' similarity was raised. Moreover, this higher value would now cause the Address element of Model1 to have a higher structural similarity to one Address element of Model2, namely the one whose leaf similarities were raised, than to the other Address element of Model2, thereby changing the result of the match.
  • Unlike the first pass of calculating structural similarity, however, the second pass does not increase the similarity of leaf sets. Therefore, only two passes are utilized, i.e., if a third pass were performed, the third pass would yield the same value as the second pass since none of the inputs to the second pass's structural similarity calculation will have changed, and inherent similarity remains the same. The second pass calculation may thus be considered an optional further step to the algorithm:
    For each node s of the source tree,
      For each node t of the target tree,
        wsim(s, t) = wstruct * ssim(s, t) + (1 − wstruct) * lsim(s, t)

    where ssim(s, t) may be calculated as before.
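  • By way of non-limiting illustration, the tree matching algorithm and the optional second pass may be sketched together in Python as follows. The Node class, the function names, the name-based lsim lookup, the uniform type compatibility of 0.5 and the miniature schemas are all assumptions made for this example; they are not taken from FIG. 3.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(eq=False)  # identity-based hashing, so nodes can serve as dictionary keys
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


def post_order(n: Node) -> List[Node]:
    out: List[Node] = []
    for c in n.children:
        out.extend(post_order(c))
    return out + [n]


def leaves(n: Node) -> List[Node]:
    return [x for x in post_order(n) if not x.children]


def tree_match(S, T, lsim, type_compat, w_struct=0.5, th_high=0.6,
               th_low=0.35, th_accept=0.5, c_inc=1.2, c_dec=0.9):
    ssim: Dict[Tuple[Node, Node], float] = {}
    for s in leaves(S):  # initialize leaf pairs from data type compatibility
        for t in leaves(T):
            ssim[(s, t)] = type_compat(s, t)

    def wsim(s, t):
        return w_struct * ssim[(s, t)] + (1 - w_struct) * lsim(s, t)

    def structural(s, t):
        if not s.children and not t.children:
            return ssim[(s, t)]  # leaf pairs keep their current value
        ls, lt = leaves(s), leaves(t)
        linked = sum(any(wsim(x, y) > th_accept for y in lt) for x in ls)
        linked += sum(any(wsim(x, y) > th_accept for x in ls) for y in lt)
        return linked / (len(ls) + len(lt))

    for s in post_order(S):  # first pass: compute, then adjust leaf similarities
        for t in post_order(T):
            ssim[(s, t)] = structural(s, t)
            w = wsim(s, t)
            factor = c_inc if w > th_high else c_dec if w < th_low else 1.0
            if factor != 1.0:
                for x in leaves(s):
                    for y in leaves(t):
                        ssim[(x, y)] = min(1.0, ssim[(x, y)] * factor)

    result = {}
    for s in post_order(S):  # second pass: recompute without adjusting leaves
        for t in post_order(T):
            ssim[(s, t)] = structural(s, t)
            result[(s, t)] = wsim(s, t)
    return result


# Hypothetical miniature schemas loosely modeled on the PO example.
city_s, street_s = Node("City"), Node("Street")
bill_to = Node("POBillTo", [city_s, street_s])
po = Node("PO", [bill_to])
city_i, street_i = Node("City"), Node("Street")
city_d, street_d = Node("City"), Node("Street")
invoice_to = Node("InvoiceTo", [city_i, street_i])
deliver_to = Node("DeliverTo", [city_d, street_d])
purchase_order = Node("PurchaseOrder", [invoice_to, deliver_to])

NAME_SIM = {("POBillTo", "InvoiceTo"): 0.8, ("PO", "PurchaseOrder"): 1.0}


def lsim(s: Node, t: Node) -> float:
    return 1.0 if s.name == t.name else NAME_SIM.get((s.name, t.name), 0.0)


result = tree_match(po, purchase_order, lsim, type_compat=lambda s, t: 0.5)
```

  • In this sketch the City leaf under POBillTo ends with a higher weighted similarity to the City leaf under InvoiceTo than to the City leaf under DeliverTo, because the high similarity of POBillTo and InvoiceTo raised the structural similarity of their leaves.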
  • Mapping generation is one process that can benefit from a second pass calculation by recomputing the similarities of the nonleaf elements, since the updating of leaf similarities during tree match may have affected the structural similarity of nonleaf nodes after they were first calculated. After this recalculation, a scheme similar to leaf level mapping generation can be used. The mapping that is produced consists of a list of mapping elements or correspondences. A further step may be to enrich the structure of the map itself. For example, the mapping element between two XML elements e1 and e2 may have as its subelements the mapping elements between matching XML attributes of e1 and e2.
  • The outputs of schema matching are sets of inherent and structural similarity coefficients, from which weighted similarity coefficients are calculated. Thus, with respect to mapping generation more generally, mapping elements may be generated by using any one or more of the computed linguistic, structural and weighted similarities. In the simplest case, the invention might just use leaf level mapping elements. For each leaf element t in the target schema, if the leaf element s in the source schema with highest weighted similarity to t is acceptable (wsim(s, t)≧thaccept), then a mapping element from s to t is returned. This resulting mapping may be 1:n, since a source element may map to many target elements. The exact nature of a mapping is often dependent on requirements of the module that accepts these mappings. For example, query discovery might require a one to one mapping instead of the 1 to n mapping. Such requirements need to be captured by a data model specific or tool specific mapping generator that takes the computed similarities as input.
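  • By way of non-limiting illustration, the simplest leaf level mapping generation described above may be sketched in Python as follows. The function name, the dictionary of weighted similarities and the sample element names and values are assumptions made for this example.

```python
from typing import Dict, List, Tuple


def generate_leaf_mapping(
    source_leaves: List[str],
    target_leaves: List[str],
    wsim: Dict[Tuple[str, str], float],
    th_accept: float = 0.5,
) -> List[Tuple[str, str]]:
    """For each target leaf, keep the best source leaf if it is acceptable."""
    mapping: List[Tuple[str, str]] = []
    for t in target_leaves:
        best = max(source_leaves, key=lambda s: wsim.get((s, t), 0.0), default=None)
        if best is not None and wsim.get((best, t), 0.0) >= th_accept:
            mapping.append((best, t))
    return mapping


# Hypothetical weighted similarities; missing pairs are treated as 0.
sample = {
    ("Qty", "Quantity"): 0.9,
    ("UoM", "UnitOfMeasure"): 0.85,
    ("Line", "ItemNumber"): 0.55,
    ("Qty", "ItemNumber"): 0.3,
    ("Line", "Notes"): 0.2,
}
leaf_mapping = generate_leaf_mapping(
    ["Qty", "UoM", "Line"],
    ["Quantity", "UnitOfMeasure", "ItemNumber", "Notes"],
    sample,
)
```

  • Because the loop runs over target leaves, the same source leaf may be selected for several targets, which yields the 1:n behavior noted above. A one to one generator would need an additional step to resolve such conflicts.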
  • In one embodiment of the invention, initial mappings are provided. In this case, the matcher algorithm utilizes a user supplied initial mapping to help initialize leaf similarities prior to structural matching, described above. The linguistic similarity of elements marked as similar in the initial map is initialized to a predefined maximum value. Such a hint can lead to higher structural similarity of ancestors of the two leaves, and hence a better overall match. Additionally, a user can make corrections to a generated result map, and then rerun the match with the corrected input map, thereby generating an improved map. Thus, initial maps are a way to incorporate user interaction into the matching process. In one embodiment, an initial map is information about two leaves, branches or nodes, one in each of the two schemas being matched, that map to each other. This information may also be broken down by the user as to whether the input is being made based on actual user knowledge of structural information and/or linguistic information.
  • In another embodiment of the invention, a pruning leaves process is provided. In a deeply nested schema tree with a large number of elements, an element e high in the tree has a large number of leaves. These leaves increase the computation time, even though many of them are irrelevant for matching e. Therefore, it may be better to consider only nodes in a subtree of depth k rooted at node e, thereby pruning the leaves. While comparing nearly identical schemas, it might seem wasteful to compare the leaves. To avoid this, the immediate children of the nodes are first compared. If a very good match is detected, then the leaf level similarity computation is skipped.
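  • By way of non-limiting illustration, the pruning of leaves to a subtree of depth k may be sketched in Python as follows. The sketch assumes one possible reading of the paragraph above, in which nodes at depth k below element e (or true leaves reached earlier) stand in for the leaves of e. The Node class, the function name and the sample tree are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


def pruned_leaves(e: Node, k: int) -> List[Node]:
    """Return the frontier of the subtree of depth k rooted at e."""
    if k <= 0 or not e.children:
        return [e]
    frontier: List[Node] = []
    for child in e.children:
        frontier.extend(pruned_leaves(child, k - 1))
    return frontier


# Hypothetical deeply nested schema fragment.
address = Node("Address", [Node("Street"), Node("City")])
order = Node("Order", [Node("Customer", [address]), Node("Total")])
depth_two = [n.name for n in pruned_leaves(order, 2)]
```

  • A small k reduces the number of leaf comparisons for elements high in the tree, at the cost of ignoring detail below the cut.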
  • The invention as described above operates on XML and relational schemas, and the techniques may be applied to other schemas as well. The output mappings are displayed by a standalone application such as BIZTALK MAPPER®, which can compile them into Extensible Stylesheet Language (XSL) translation scripts. As described in the sections regarding exemplary computing and network environments, such a mapping service may also be downloaded from a server in a network, provided by an application service provider, provided as part of an operating system, etc.
  • The following is a brief description of the criteria for setting the different thresholds and parameters used in the algorithm, together with some typical values for them. The exemplary values listed are non-limiting in this regard, and are recited merely to illustrate one example for each. One of ordinary skill in the art can appreciate that parameters, by their very nature, may be changed to reflect various design nuances or challenges.
  • The parameter thhigh is used in connection with the determination as to whether wsim(s,t)≧thhigh. If so, then the structural similarity between all pairs of leaves in the two subtrees rooted at s and t is increased. While the invention does not lie in any particular value of this parameter, the parameter should be chosen to be greater than thaccept. An exemplary value for thhigh is 0.6.
  • The parameter thlow is used in connection with the determination as to whether wsim(s,t)≦thlow. If so, then the structural similarity between all pairs of leaves in the two subtrees rooted at s and t is decreased. While the invention does not lie in any particular value of this parameter, the parameter should be chosen to be less than thaccept. An exemplary value for thlow is 0.35.
  • The parameter cinc is the multiplicative factor by which leaf structural similarities are increased. The parameter cinc is typically a function of maximum schema depth or depth to which nodes are considered for structural similarity. An exemplary value for the parameter cinc is 1.2.
  • The parameter cdec is the multiplicative factor by which leaf structural similarities are decreased. Typically, the parameter cdec is set to be about cinc⁻¹, i.e., about 1/cinc. For example, an exemplary value for the parameter cdec is 0.9.
  • The parameter thaccept is used in connection with the determination of whether wsim(s,t)≧thaccept, suggesting whether s and t have a strong link or have a valid mapping element. An exemplary value for the parameter thaccept is 0.5.
  • The parameter wstruct is the structural similarity contribution to wsim. Typically, this value is different for leaves and nonleaves, with the value being lower for leaf-leaf pairs than for nonleaf pairs. An exemplary range for this value is from 0.5 to 0.6.
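  • By way of non-limiting illustration, the exemplary parameter values recited above may be collected in a single configuration object, sketched in Python as follows. The class and field names are assumptions made for this example, as is the choice of 0.5 for leaf pairs and 0.6 for nonleaf pairs within the exemplary wstruct range. The checks that cinc exceeds 1 and that cdec lies between 0 and 1 are likewise assumptions that follow from their roles as increasing and decreasing multiplicative factors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchParams:
    th_high: float = 0.6            # above this, leaf similarities are increased
    th_low: float = 0.35            # below this, leaf similarities are decreased
    th_accept: float = 0.5          # strong link / acceptable mapping threshold
    c_inc: float = 1.2              # multiplicative increase factor
    c_dec: float = 0.9              # multiplicative decrease factor
    w_struct_leaf: float = 0.5      # assumed split of the exemplary 0.5 to 0.6 range
    w_struct_nonleaf: float = 0.6

    def __post_init__(self) -> None:
        if not self.th_low < self.th_accept < self.th_high:
            raise ValueError("expected th_low < th_accept < th_high")
        if not (self.c_inc > 1.0 and 0.0 < self.c_dec < 1.0):
            raise ValueError("expected c_inc > 1 and 0 < c_dec < 1")
        for w in (self.w_struct_leaf, self.w_struct_nonleaf):
            if not 0.0 <= w <= 1.0:
                raise ValueError("w_struct values must lie in the range 0 to 1")


defaults = MatchParams()
```

  • Validating the ordering of the three thresholds at construction time guards against configurations in which leaf similarities would be increased for pairs that are not even acceptable mappings.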
  • The present invention improves on past methods in many respects, for example, by including a substantial linguistic matching step and by biasing matches by leaves of a schema. While merely one novel feature described herein, no prior art techniques have been known to relate objects that have similar leaf sets in the manner employed by the present invention. The invention makes such consideration due to the observation that leaves describe the technical content of a schema, e.g., the columns of a table or the attributes and leaf elements of an XML model, which is often a more important match criterion than internal structure. The internal structure is sometimes arbitrary, where different designers group the same information in different ways due to differences in taste. Sometimes the differences are due to limitations of the data models in which schemas are represented. For example, in SQL, table definitions are flat, whereas XML schemas can have nested subelements to represent substructure.
  • The algorithm may be implemented as an independent component, or integrated into a particular application. The present invention may also be combined with other techniques, such as machine learning applied to instances, natural language technology, and pattern matching to reuse known matches. The invention thus provides a general-purpose schema matching component that can be used in systems for schema integration, data migration, etc.
  • FIG. 5 illustrates an exemplary non-limiting top-level architecture of an exemplary system in which the present invention may operate. Import-export module 580 and generic model matching algorithm 570 may be combined in a single component 540, such as a COM component, e.g., a dynamic link library (DLL) that can be loaded by any application that requests component 540. Two schemas are accepted, encoded in some format such as the XML Document Object Model (DOM) 550. For example, relational schemas can be represented in the SQL subset of Semantic Modeling Format (SMF), which is an XML-based data exchange format used by the English Query facility in MICROSOFT® SQL Server, while XML schemas can be represented in either SMF or XDR format. Both formats are XML-based and can therefore be parsed into DOM format by known techniques. The system then produces an output map, which may also be in DOM format. There can also be an optional input map that serves as a hint to the matching algorithm.
  • The use of XML DOM 550 as the input and output format to communicate between the graphical user interface (GUI) and model matching component 540 is merely a convenience, and any format may be accommodated since the invention provided is a generic solution. Any format that can be imported into the generic object model is satisfactory. The import/export module 580 converts the DOM representation 550 of the input schemas into the internal object model 560 of model matching component 540.
  • The matching algorithm 570 operates on two models represented in the internal object model 560 and computes a node similarity matrix, which may be transformed into a map, which is also represented in the internal object model 560. Thus, the algorithm 570 is generic and depends only on the generic object model, which is unaffected by the data model used to represent the input models. Conversion of schemas to a generic object model is described in more detail below in the section regarding generic object modeling.
  • The generic model matching component 540 is designed to be extensible. In one embodiment, its top-level procedure simply calls multiple matching algorithms in sequence, all of which have the same interface. Each matching algorithm can be implemented as a separate sub-module. These sub-modules can pass matching information between each other through the top-level procedure. This modular structure allows new model matching algorithms to be added without altering the overall structure of model matching component 540.
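  • By way of non-limiting illustration, a top-level procedure that calls multiple matching algorithms in sequence through a common interface may be sketched in Python as follows. The Matcher protocol, the two toy sub-modules and the sample element names are assumptions made for this example and do not reproduce the interface of component 540.

```python
from typing import Dict, List, Protocol, Tuple

Similarity = Dict[Tuple[str, str], float]


class Matcher(Protocol):
    """Common interface shared by every matching sub-module (hypothetical)."""

    def match(self, source: List[str], target: List[str], prior: Similarity) -> Similarity: ...


class ExactNameMatcher:
    def match(self, source, target, prior):
        out = dict(prior)
        for s in source:
            for t in target:
                if s.lower() == t.lower():
                    out[(s, t)] = 1.0
        return out


class PrefixMatcher:
    def match(self, source, target, prior):
        out = dict(prior)
        for s in source:
            for t in target:
                # Only score pairs that earlier sub-modules left unscored.
                if (s, t) not in out and t.lower().startswith(s.lower()):
                    out[(s, t)] = 0.5
        return out


def run_matchers(matchers: List[Matcher], source: List[str], target: List[str]) -> Similarity:
    """Top-level procedure: pass matching information from one sub-module to the next."""
    info: Similarity = {}
    for matcher in matchers:
        info = matcher.match(source, target, info)
    return info


scores = run_matchers(
    [ExactNameMatcher(), PrefixMatcher()],
    ["City", "Order"],
    ["city", "OrderNumber"],
)
```

  • A new matching algorithm can be added by appending another object with the same match method to the list, without changing run_matchers.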
  • The exemplary system of FIG. 5 may include two different matching algorithms (i.e., sub-modules) combined to perform the matching algorithms of the present invention, or the two matching algorithms may be integrated. The first algorithm may match individual elements of the schemas by using linguistic information about the name of each element and by using each element's data type. Other type-oriented information can be added to the generic object model so that the algorithm can exploit items such as whether there are null elements, default values, whether values are members of an enumeration and whether elements are mandatory or optional.
  • The second algorithm may be the structure-matching component that exploits the hierarchical or graph-like structure of the schemas. This sub-module may match elements whose neighborhoods in the two schemas also match. These two algorithms may produce corresponding similarity coefficients, from which weighted similarity coefficients may be constructed, and from which a resultant map may be constructed based on a combination thereof. As mentioned, a single component could perform both the linguistic and structural analysis.
  • Thus, a modeling application 520 may open 510 or save 501 a file having data sets, or mapping data for the data sets, etc. In this exemplary embodiment, a driver 530 assists in retrieving 502 or saving 509 data from or to a data store, and also makes calls 503 and receives results from a model match component 540 in accordance with the invention. In this embodiment, calls are made in XML DOM format 550. An import/export object 580 of match component 540 imports models 504 from and exports mappings 507 for the data sets of DOM 550. Once imported, the invention abstracts the data sets to a generic object model 560 and calls 505 the model match module 570 to perform the model match algorithm(s) of the invention. Model match module 570 returns 506 the results in terms understood by the generic object model 560 utilized by the invention. The user can modify a generated result map, making corrections, and then perform the model match again with the corrected map as an input, thereby generating an improved map. Thus, initial mappings provide a means of capturing user interaction with the model matching process.
  • Thus, one implementation of the invention may be to incorporate the algorithm(s) into a matching application or tool that provides a user interface for mapping two schemas, with appropriate user interaction with the mapping process to subjectively validate the quality of result.
  • In one configuration, the performance of the algorithm(s) of the invention may comprise several phases, as shown in FIG. 6. The inherency matching component involves elements 600 to 660 and operates on the name of model elements and certain other information that may be data model specific, such as data types and names of Strong Containers. The structural matching component involves element 665. As described earlier, from inherency matching coefficients and structural matching coefficients, a mapping may be produced between two schemas.
  • A Conversion of Names to Normal Form component 600 includes three sub-components, split 605, expand 610 and eliminate 615, to normalize the input name data. First, source SS and target TS schemas are input to any one or more of the embodiments of the model match algorithms of the invention and are tokenized by split sub-component 605 to convert the name(s) of the model elements to a normal form. Common abbreviations and acronyms are maintained in a data store 620 and are used by expand sub-component 610 to replace each abbreviation or acronym with its expanded form. Eliminate sub-component 615 eliminates expletives, prepositions and conjunctions. A list of expletives, prepositions, conjunctions and other unhelpful input items may be stored in a data store 625.
  • As to categorization 630, after converting a name to a set of word tokens, additional word tokens are added to the normal form to describe each model element's data type, if it has one, and concepts to which it is related. These additions are mostly driven by the content of another data store 635, which associates words with concepts. It can be appreciated that data stores 620, 625, 635 etc. may also be integrated. Categorization is performed separately for each model SS and TS, since the notion of compatibility may be different for a single model than for a pair of models.
  • After adding these tokens to the normal form, model elements are grouped into categories based on common tokens. Each category is associated with a set of keywords that describe the category. Once categorized, name similarity is calculated using a name similarity algorithm, which may include an analysis of synonyms and hypernyms 645 and/or an analysis of other relations 650.
  • The invention is not limited to analysis based upon sub-component 645. Other options 650 include querying a semantic network tool 660, which builds relationships and computes similarities among words by parsing a dictionary or thesaurus. However, performing such queries on the fly might be time consuming. On-the-fly querying of the semantic network tool 660 could be avoided by a pre-processing step that uses information in the semantic network tool 660 to populate the thesaurus 655. Alternatively, a post-processing step after the matching process could add new similarity relationships to the thesaurus for word pairs that were not found during the matching.
  • Once tokenized, the linguistic similarity of two model elements s and t, standing for the source and target models, respectively, is computed using the name similarity of the tokenized normal forms and data type similarity.
  • Some model elements are not name matched because they do not have a name or their name is not significant. For example, a key does not have a name, but the columns that comprise the key do. Only model elements that have been tagged to be name-matched are actually name-matched. This tagging is dependent on the mapping of elements of the particular data model to the internal object model. For SQL schemas, the schema, tables and columns are tagged to be matched. For XML, the ElementTypes and AttributeTypes are tagged to be matched.
  • With respect to schema matching component 665, in addition to linguistic matching, the hierarchical relationships in the schema are leveraged to infer mappings. This is achieved using the above-described tree matching algorithm that matches tree representations of the different data paths in the two schemas. Thus, at some point in the process, a transformation is applied to the schemas to represent them as trees of data paths for structural analysis. The tree-matching algorithm 665 operates on a pair of data path trees to produce structural similarity coefficients. Each pair of nodes of the two trees being compared then have an associated pair of similarity coefficients, namely the inherent similarity of the two model elements to which they correspond, and the structural similarity of the two nodes computed by the schema matching algorithm 665. The effective similarity is then calculated to be a weighted function of these two coefficients.
  • Inherent Similarities
  • As related above, one type of similarity that is taken into account by the matching algorithm is inherent similarity. This type of similarity attempts to take into account those kinds of similarities between schemas that do not relate to the structure, i.e., the hierarchical relationships between model elements.
  • As mentioned, prior to the computation of inherent similarity coefficients, certain normalization and categorization of the model elements is performed. With respect to normalization, many semantically similar schema element names contain abbreviations, acronyms, punctuation, etc. that make them syntactically different. To make them comparable, the invention normalizes them into sets of name tokens, as follows:
  • The tokenization of the invention parses names into tokens by a customizable tokenizer using punctuation, upper case, special symbols, digits, etc. For example, POLines→{PO, Lines}. Abbreviations and acronyms may also be expanded, e.g., {PO, Lines}→{Purchase, Order, Lines}. Elimination is also performed, when appropriate, wherein tokens that are articles, prepositions, expletives or conjunctions are marked to be ignored during comparison. Tagging may also be performed whereby a schema element that has a token related to a known concept is tagged with the concept name, e.g., elements with tokens price, cost and value are all associated with the concept money. The abbreviations, acronyms, ignored words and concepts may be determined by one or more thesaurus lookups. A thesaurus can include terms used in common language as well as domain-specific references, e.g., specialized terms used in purchase orders. In an exemplary embodiment, each name token is marked as being one of five token types: a number; a special symbol (e.g., #); a common word, which token type includes prepositions and conjunctions; a concept, as explained above; or content (all the rest).
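  • By way of non-limiting illustration, the tokenization, expansion, elimination and tagging steps may be sketched in Python as follows. The regular expression, the small lookup tables that stand in for the thesaurus, and the function names are assumptions made for this example; a practical tokenizer would be customizable as described above.

```python
import re
from typing import List, Tuple

# Hypothetical lookup tables standing in for thesaurus content.
ABBREVIATIONS = {"po": ["Purchase", "Order"], "qty": ["Quantity"]}
COMMON_WORDS = {"a", "an", "the", "of", "to", "and", "or"}
CONCEPTS = {"price": "Money", "cost": "Money", "value": "Money"}

# Runs of capitals, capitalized words, lower-case words, digits, special symbols.
_TOKEN_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+|[^A-Za-z\d\s]")


def split(name: str) -> List[str]:
    return _TOKEN_RE.findall(name)


def expand(tokens: List[str]) -> List[str]:
    out: List[str] = []
    for tok in tokens:
        out.extend(ABBREVIATIONS.get(tok.lower(), [tok]))
    return out


def classify(token: str) -> str:
    """Mark a token as one of the five exemplary token types."""
    low = token.lower()
    if token.isdigit():
        return "number"
    if not token.isalnum():
        return "special"
    if low in COMMON_WORDS:
        return "common"  # marked to be ignored during comparison
    if low in CONCEPTS:
        return "concept"
    return "content"


def normalize(name: str) -> List[Tuple[str, str]]:
    return [(tok, classify(tok)) for tok in expand(split(name))]


po_lines = normalize("POLines")
```

  • Common words are kept in the output and merely typed as "common", reflecting the statement above that such tokens are marked to be ignored rather than removed.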
  • Thesauri can thus play a role in linguistic matching. The effect of dropping the thesaurus varies. The tokenization performed by the invention, followed by stemming, can aid in the automatic selection of possible word meanings during name matching and make it easier to use off-the-shelf thesauri. One implementation includes using a module to incrementally learn synonyms and abbreviations from mappings that are performed over time. The use of linguistic similarity and structural similarity over time can provide a synergy of benefit to these results.
  • With respect to categorization, the invention clusters schema elements belonging to the two schemas into categories. A category is a group of elements that can be identified by a set of keywords, which are derived from concepts, data types, and element names. For example, the category money includes each schema element that is associated with money, i.e., “money” appears in its name or it is tagged with the concept of Money. The purpose of categorization is to reduce the number of element-to-element comparisons. By clustering similar elements into categories, the invention may compare those elements that belong to compatible categories. Two categories are compatible if their respective sets of keywords are “name similar,” a phrase defined below.
  • Categories and keywords are determined with the following: concept tagging, data types and containers. Concept tagging refers to assigning a category per unique concept tag in the schema. Data types refer to assigning a category for each broad data type, e.g., all elements with a numeric data type are grouped together in a category with the keyword Number. Like all categorization criteria, data types are used primarily to prune the matching and do not contribute significantly to the linguistic similarity result. With respect to containers, a schema element that “contains” other elements defines a category. For example, Street and City are contained by Address and hence can be grouped into a category with keyword Address. Containment is described in more detail below. The invention constructs separate categories for each schema. For each element, the invention inserts the element into an existing category (same data type, same concept, or same container) if possible, or otherwise creates new categories. In this regard, each schema element may belong to multiple categories. Each relationship is either a containment or non-containment relationship, and is directed from its origin object to its destination object. A model is identified by a root object and includes all objects that are reachable from the root by following containment relationships in the origin-to-destination (container-to-containee) direction.
  • With respect to the phrases “name similar” or “name similarity,” the similarity of two name tokens t1 and t2, defined mathematically as sim(t1, t2), is looked up in one or more synonym and/or hypernym thesauri. Each thesaurus entry is annotated with a coefficient in the range [0,1] that indicates the strength of the relationship. In the absence of such entries, the invention matches substrings of the words t1 and t2 to identify common prefixes or suffixes. The name similarity (ns) of two sets of name tokens T1 and T2 is the average of the best similarity of each token with a token in the other set. Name similarity, in an exemplary embodiment, is calculated according to the following equation:
    $$ns(T_1, T_2) = \frac{\sum_{t_1 \in T_1} \max_{t_2 \in T_2} sim(t_1, t_2) + \sum_{t_2 \in T_2} \max_{t_1 \in T_1} sim(t_1, t_2)}{|T_1| + |T_2|}$$
  • Two categories are compatible if the name similarity of their token sets exceeds a given threshold, thns. The parameter thns is the name similarity threshold for determining compatible categories. This value is used for pruning the number of element-to-element linguistic comparisons, and thus a variety of choices for assigning the actual value are available. For example, 0.5 may be chosen for thns, although other values may be suitable depending upon a desired amount of pruning.
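  • The name similarity of token sets and the compatibility test on category keywords can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Thesaurus` type, the rule that identical tokens score 1.0, the treatment of empty token sets, and the way a common prefix or suffix is scored are all assumptions introduced here, since the text does not specify them.

```python
from typing import Dict, FrozenSet, Sequence

# Hypothetical thesaurus: an unordered token pair maps to a strength in [0, 1].
Thesaurus = Dict[FrozenSet[str], float]


def sim(t1: str, t2: str, thesaurus: Thesaurus) -> float:
    """Similarity of two name tokens: thesaurus lookup, else affix overlap."""
    if t1 == t2:
        return 1.0  # assumption: identical tokens are fully similar
    entry = thesaurus.get(frozenset((t1, t2)))
    if entry is not None:
        return entry
    # Assumed fallback: longest common prefix or suffix, scaled by the longer token.
    prefix = 0
    for a, b in zip(t1, t2):
        if a != b:
            break
        prefix += 1
    suffix = 0
    for a, b in zip(reversed(t1), reversed(t2)):
        if a != b:
            break
        suffix += 1
    return max(prefix, suffix) / max(len(t1), len(t2))


def ns(T1: Sequence[str], T2: Sequence[str], thesaurus: Thesaurus) -> float:
    """Average, over both sets, of each token's best match in the other set."""
    if not T1 or not T2:
        return 0.0  # assumption: an empty token set matches nothing
    best_1 = sum(max(sim(t1, t2, thesaurus) for t2 in T2) for t1 in T1)
    best_2 = sum(max(sim(t1, t2, thesaurus) for t1 in T1) for t2 in T2)
    return (best_1 + best_2) / (len(T1) + len(T2))


def compatible(k1: Sequence[str], k2: Sequence[str],
               thesaurus: Thesaurus, th_ns: float = 0.5) -> bool:
    """Two categories are compatible if their keyword sets exceed th_ns."""
    return ns(k1, k2, thesaurus) > th_ns
```

    With a thesaurus entry relating "money" and "currency" at 0.8, the categories with keywords ["money"] and ["currency"] would be compatible at the example threshold of 0.5, whereas ["address"] and ["number"] would not.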
  • With respect to comparison, the invention calculates the linguistic similarity of each pair of elements from compatible categories. Linguistic similarity is based on the name similarity of elements, which is computed as a weighted mean of the per token type name similarity, wherein each token is one of the exemplary five types listed above. If T1i and T2i are the tokens of elements m1 and m2 of type i, the name similarity of m1 and m2 is computed as follows:
    $$ns(m_1, m_2) = \frac{\sum_{i \in TokenType} w_i \times ns(T_{1i}, T_{2i})}{\sum_{i \in TokenType} w_i \times (|T_{1i}| + |T_{2i}|)}, \quad \text{where } \sum_i w_i = 1$$
  • Content and concept tokens are assigned a greater weight (wi), since these token types are more relevant than numbers and conjunctions, prepositions, etc. In one implementation, the inherent similarity, or linguistic similarity (lsim), is computed by scaling the name similarity of the model elements by the maximum similarity of categories to which they belong:
    $$lsim(m_1, m_2) = ns(m_1, m_2) \times \max_{c_1 \in C_1,\, c_2 \in C_2} ns(c_1, c_2)$$
    where C1 and C2 are the sets of categories to which m1 and m2 belong, respectively.
  • The result of this phase is a table of linguistic similarity coefficients between elements in the two schemas. The similarity is assumed to be zero for schema elements that do not belong to any compatible categories.
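  • The comparison step can be illustrated with a short sketch. It follows the prose reading of the element-level name similarity, i.e., a weighted mean of the per-token-type similarities with weights summing to one, and does not reproduce the size-based normalization that appears in the equation. The names `exact_ns`, `element_ns`, the token type labels, and the exact-match token similarity are assumptions made only to keep the example self-contained; any token-set similarity function could be passed in instead.

```python
from typing import Callable, Dict, List, Sequence

TokenSets = Dict[str, List[str]]  # token type label -> tokens of that type
SetSimilarity = Callable[[Sequence[str], Sequence[str]], float]


def exact_ns(T1: Sequence[str], T2: Sequence[str]) -> float:
    """Stand-in for ns(T1, T2) in which sim is 1 for equal tokens, else 0."""
    if not T1 or not T2:
        return 0.0
    s1, s2 = set(T1), set(T2)
    hits = sum(t in s2 for t in T1) + sum(t in s1 for t in T2)
    return hits / (len(T1) + len(T2))


def element_ns(m1: TokenSets, m2: TokenSets, weights: Dict[str, float],
               ns: SetSimilarity = exact_ns) -> float:
    """Weighted mean of per-token-type similarity (weights assumed to sum to 1)."""
    return sum(w * ns(m1.get(kind, []), m2.get(kind, []))
               for kind, w in weights.items())


def lsim(m1: TokenSets, m2: TokenSets,
         cats1: List[List[str]], cats2: List[List[str]],
         weights: Dict[str, float], th_ns: float = 0.5,
         ns: SetSimilarity = exact_ns) -> float:
    """Element name similarity scaled by the best category similarity."""
    if not cats1 or not cats2:
        return 0.0
    best = max(ns(c1, c2) for c1 in cats1 for c2 in cats2)
    if best <= th_ns:
        return 0.0  # no compatible categories: similarity is taken to be zero
    return element_ns(m1, m2, weights, ns) * best
```

    Filling a table of such values for every element pair drawn from compatible categories would yield the table of linguistic similarity coefficients described above.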
  • Models
  • The invention thus matches one data model with another data model, calculating inherent similarity coefficients and structural coefficients, with an emphasis upon similarity of subtree structure. For purposes of construing what is meant by data model, or schema, in accordance with the invention, the following description of data models is presented. One of ordinary skill in the art will be able to appreciate that a wide variety of models are contemplated and that any hierarchically organized data that may form a tree structure is suited to the invention's application. How to model particular features common to a variety of particular data models in a generic sense is also described. For instance, the modeling of referential integrity constraints is described in detail to show how some particular data models operate, and how they may be generalized for purposes of applying the matching operations of the invention.
  • A model is a complex structure that describes a design artifact. For example, a relational schema is a model that describes a relational database, i.e., tables, columns, constraints, etc. An XML DTD or an XML schema expressed in XML Schema Definition Language (XSD) or an XDR Schema is a model that describes the structure of an XML document. An object hierarchy is a model that describes the classes, relationships and inheritances of the C++ interfaces in an application or in an object store. Further examples of models are UML descriptions, workflow definitions, Web-site maps, and other models mentioned herein.
  • In exemplary non-limiting embodiments of the present invention, an object-oriented data model is used to describe models and mappings. Graph-oriented terminology is sometimes used to describe models, such as when referring to objects as nodes and relationships as edges. Each relationship of a model is either a containment or non-containment relationship, and is directed from its origin object to its destination object. A model is identified by a root object and includes all objects that are reachable from the root by following containment relationships in the origin-to-destination, i.e., the container-to-containee direction.
  • A mapping is a model that relates a domain model to a range model, or a source model to a target model. The root of the mapping connects the root of the domain model to the root of the range model. Every other mapping object in the mapping has relationships to zero or more domain objects and relationships to zero or more range objects. A mapping may also contain an expression that explains the relationship between the domain and range objects to which it connects.
  • The match operation on models and mappings is as follows: Match (M1, M2, ≅) returns a mapping from model M1 to M2 that is consistent with the similarity relation ≅, which is a binary relation defined over individual objects. Although the relation ≅ is shown here as a parameter, it is currently implemented as a combination of context, e.g., a shared thesaurus, and algorithms which may be optionally plugged into a match implementation, as described by the foregoing implementations.
  • In accordance with the present invention, a generic object model is defined, which standardizes the comparison of disparately formatted models. As described in more detail below, any format may be represented with the generic object model, and thus the input format of a hierarchically represented data structure becomes irrelevant to the extent it may be represented with the generic object model.
  • For the generic object model, the smallest unit of metadata is termed a model element. Distinguishing the different types of relationships between model elements is a key aspect of designing the generic object model. At least three relationships are common to a wide variety of data models, and these relationships are depicted in FIG. 7 as between model elements 700, containers 700 a and aggregates 700 b.
  • The Strongly Contains relationship relates a model element, called a container 700 a to another model element 700. Each model element is strongly contained by at most one container 700 a. The concept of container 700 a is sufficiently useful that in one embodiment, a container is defined as a class, which is a specialization of model element 700. A Strong Containment relationship captures the following two kinds of semantics: delete propagation and naming. With delete propagation, if a container 700 a is deleted, then all of the model elements 700 it contains are deleted. With naming, a model element 700 can be named by concatenating the name of its container 700 a, a delimiter, e.g., “.” or “/”, and the name of the model element 700.
  • For example, if a relational schema Customers strongly contains a table Customer, which strongly contains a column CName, then the column's full name may be Customers.Customer.CName, which uniquely distinguishes it from any other column. The column ceases to exist if either the table or the schema that contains it is deleted.
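  • The two semantics of Strong Containment, naming and delete propagation, can be sketched with a small class. The class name `ModelElement` and its methods are hypothetical and chosen only for illustration; the sketch assumes each element has at most one container, as stated above.

```python
from typing import List, Optional


class ModelElement:
    """Hypothetical model element with a Strong Containment relationship."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.container: Optional["ModelElement"] = None
        self.contained: List["ModelElement"] = []
        self.deleted = False

    def strongly_contain(self, child: "ModelElement") -> "ModelElement":
        # Each model element is strongly contained by at most one container.
        if child.container is not None:
            raise ValueError(f"{child.name} already has a container")
        child.container = self
        self.contained.append(child)
        return child

    def full_name(self, delimiter: str = ".") -> str:
        # Naming: container name, delimiter, element name, applied recursively.
        parts = []
        node: Optional["ModelElement"] = self
        while node is not None:
            parts.append(node.name)
            node = node.container
        return delimiter.join(reversed(parts))

    def delete(self) -> None:
        # Delete propagation: deleting a container deletes what it contains.
        self.deleted = True
        for child in self.contained:
            child.delete()


schema = ModelElement("Customers")
table = schema.strongly_contain(ModelElement("Customer"))
column = table.strongly_contain(ModelElement("CName"))
```

    Here `column.full_name()` yields Customers.Customer.CName, and calling `table.delete()` marks the column as deleted while leaving the schema intact.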
  • The Aggregates relationship connects a model element, called an aggregate 700 b, to other model elements 700. Like Strong Containment, this relationship groups together a set of related model elements 700. However, the relationship is weaker than Strong Containment, in that it does not propagate delete or affect naming. Rather, the Aggregates relationship captures the semantics of prevent delete, i.e., the target of an aggregation relationship cannot be deleted while the relationship exists. In other words, the aggregation relationship must be deleted before the target can be deleted. For example, a typical aggregation relationship is the relationship from a compound key to each of the columns that comprise the key.
  • The IsDerivedFrom relationship connects two model elements 700. The IsDerivedFrom relationship is a generalization of IsA and IsTypeOf relationships, which are used in all object-modeling methodologies. The IsDerivedFrom relationship captures two kinds of semantics: delete prevention and shortcutting. With shortcutting, the target can be replaced by the source. For example, a specialization can be replaced by its generalization, or an object can be replaced by its type definition. These shortcutting semantics of IsDerivedFrom are not commonly used in object modeling; however, shortcutting semantics can be important for model match. Examples of IsDerivedFrom relationships are ones between an element and its ElementType or an attribute and its AttributeType in XDR schemas.
  • In other embodiments of the invention, model elements can be related by other types of relationships. StronglyContains and IsDerivedFrom relationships are both containment relationships. Thus, a model is defined by a root and contains objects reachable by following Strong Containment and IsDerivedFrom relationships.
  • The present invention distinguishes model elements that are instantiated as data instances, such as elements and attributes in XDR and tables and columns in SQL, from those that are constraints on instances of other model elements, such as attribute type definitions in XDR and key definitions in SQL. The model element property IsInstantiated is true for the former, false for the latter. This distinction can be useful when performing structural matching of models.
  • The present invention assumes that the containment relationships that connect the objects in a model form a directed acyclic graph (DAG). Disallowing cycles implies that recursive types such as bill-of-materials and organization charts cannot be represented. Because a DAG need not be a tree, there can still be multiple paths from the root of a model to a particular model element, and this leads to significant complexity in matching. Suppose model D is the domain of a mapping. Suppose D contains a model element d, which has two parents via Strong Containment and/or IsDerivedFrom relationships. Since d may have two different meanings, one for each of its parents, it could be mapped to two different elements of a range model, one for each parent. This implies that a model match algorithm needs to perform context-dependent bookkeeping for each model element.
  • To make this more concrete, the use of types in a schema expressed in XSD may be considered. Suppose a model that represents an XSD complexType Order has elements Customer and Supplier, as shown in FIG. 8. Suppose Addr, i.e., address, is a sub-element of both Customer and Supplier. In XSD, these are represented as two separate Addr elements as shown. In addition, suppose both Addr elements are of the same complexType, e.g., Address. In XSD, Address is represented only once and is referenced by the type attribute (shown with a double box) of the two Addr elements. Thus, when representing all of these components of Order in the generic object model utilized with the invention to genericize disparate object models, complexType Address has two parents, namely, the two Addr elements via IsDerivedFrom relationships.
  • Suppose complexType Address has some XSD attributes, such as Street, City, and State. These attribute definitions explain two different parts of Order, namely the sub-structure of Addr in Customer and of Addr in Supplier. Therefore, when creating a mapping from Order to another model, e.g., Purchase-Order, the attributes of Addr in Customer might map to different model elements in Purchase-Order than the attributes of Addr in Supplier. For example, Order.Customer.Addr.Street might map to Purchase-Order.CustomerStreet and Order.Supplier.Addr.Street might map to Purchase-Order.SupplierStreet. Unfortunately, since the model element representing attribute Address.Street is shared by Customer.Street and Supplier.Street, if Address.Street is simply mapped to Purchase-Order.CustomerStreet and Purchase-Order.SupplierStreet, an ambiguity results. Namely, it is ambiguous that the relationship Address.Street to Purchase-Order.CustomerStreet is in the context of Customer while the relationship Address.Street to Purchase-Order.SupplierStreet is in the context of Supplier.
  • To avoid the context-dependent bookkeeping implied by this example, the generic object model of the invention considers each path to a node with multiple parents independently. Each such path is a data-path. During the execution of a match operation, all data-paths are expanded, thereby effectively transforming the DAG into a tree. As a side note, while the use of the word data-path comes from the intuition that it is a sequence of “data” containment relationships, a better term might be “name-path” or “ID-path.”
  • The schemas that have been examined herein so far have been trees. Real world schemas are rarely trees, since they share substructure and have referential constraints. The techniques of the present invention may be extended to these cases, but first a generic schema model that captures more semantics is presented, leading to nontree schemas.
  • In a generic schema model, a schema is a rooted graph whose nodes are elements. The invention uses the terms nodes and elements interchangeably. In a relational schema, the elements are tables, columns, user defined types, keys, etc. In an XML schema the elements are XML elements and attributes (and simpleTypes, complexTypes, and keys/keyrefs in XML Schema (XSD)). Elements are interconnected by three types of relationships, which together lead to nontree schema graphs. The first is containment, which models physical containment in the sense that each element (except the root) is contained by exactly one other element. For example, a table contains its columns, and is contained by its relational schema. An XML attribute is contained by an XML element. The schema trees presented in examples so far are essentially containment hierarchies. A second type of relationship is aggregation. Like containment, aggregation groups elements, but is weaker (allows multiple parents and has no delete propagation). For instance, a compound key aggregates columns of a table. Thus, a schema graph need not be a tree, i.e., a column can have two parents: a table and a compound key. The third type of relationship is IsDerivedFrom, which abstracts IsA and IsTypeOf relationships to model shared type information. Schemas that use them can be arbitrary graphs (e.g., cycles due to recursive types). In XSD, an IsDerivedFrom relationship connects an XML element to its complex type. In object oriented models, IsDerivedFrom connects a subtype to its supertype. IsDerivedFrom shortcuts containment: if an element e IsDerivedFrom a type t, then t's members are implicitly members of e. For example, if USAddress specializes Address, then an element Street contained by Address is implicitly contained by USAddress too.
  • With respect to matching shared types, when matching schemas expressed in the above model, the linguistic matching process that was described earlier is unaffected. The invention may, however, choose not to linguistically match certain elements, e.g., those with no significant name, such as keys. Structure matching is affected. Before this step, the schema is converted to a tree, for two reasons: to reuse the structure matching algorithm for schema trees and to cope with context dependent mappings. An element, such as a shared type, can be the target of many IsDerivedFrom relationships. Such an element e might map to different elements relative to each of e's parents. For example, reconsidering the XML schemas in FIG. 3, suppose the PurchaseOrder schema was altered so that Address is a shared element, referenced by both DeliverTo and InvoiceTo. POShipTo.Street and POBillTo.Street now both map to Address.Street in PurchaseOrder, but for each of them the mapping needs to qualify Address.Street to be in the context of either DeliverTo or InvoiceTo. Including both of the mappings without their contexts is ambiguous, e.g., complicating query discovery. Thus, context dependent mappings are needed. The invention achieves this by expanding the schema into a schema tree. There can be many paths of IsDerivedFrom and containment relationships from the root of a schema to an element e. Each path defines a context, and thus is a candidate for a different mapping for e. By converting a schema to a tree, the invention materializes all such paths. To accomplish this, the algorithm performs a preorder traversal of the schema, creating a private copy of the subschema rooted at the target t of each IsDerivedFrom for each of t's parents, which is essentially type substitution. In an exemplary embodiment, the algorithm is as follows:
    schema_tree = construct_schema_tree(schema.root, NULL)
    construct_schema_tree(SchemaElement current_se, SchemaTreeNode current_stn)
      if current_se is the root or current_se was reached
      through a containment relationship
        if current_se is not instantiated then return current_stn
        new_stn = new schema tree node corresponding to current_se
        set new_stn as a child of current_stn
        current_stn = new_stn
      for each outgoing containment or IsDerivedFrom relationship of current_se
        new_se = schema element that is the target of the relationship
        construct_schema_tree(new_se, current_stn)
      return current_stn
  • For each element, the invention adds a schema tree node whose successors are the nodes corresponding to elements reachable via any number of IsDerivedFrom relationships followed by a single containment. Some elements are tagged not-instantiated (e.g., keys) during the schema tree construction and are ignored during this process.
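  • The schema tree construction can be rendered as a short runnable sketch. The class names `SchemaElement` and `TreeNode`, the separate `contains` and `derived_from` lists, and the `via_containment` flag are assumptions made for illustration; the sketch also assumes the schema graph is acyclic, since a recursive type would cause unbounded recursion. The example schema mirrors the Order, Customer, Supplier and Address discussion above, with a hypothetical non-instantiated key added to show that it is skipped.

```python
from typing import List, Optional


class SchemaElement:
    def __init__(self, name: str, instantiated: bool = True) -> None:
        self.name = name
        self.instantiated = instantiated
        self.contains: List["SchemaElement"] = []      # containment targets
        self.derived_from: List["SchemaElement"] = []  # IsDerivedFrom targets


class TreeNode:
    def __init__(self, element: SchemaElement,
                 parent: Optional["TreeNode"] = None) -> None:
        self.element = element
        self.parent = parent
        self.children: List["TreeNode"] = []
        if parent is not None:
            parent.children.append(self)

    def path(self) -> str:
        names, node = [], self
        while node is not None:
            names.append(node.element.name)
            node = node.parent
        return ".".join(reversed(names))


def construct_schema_tree(element: SchemaElement,
                          parent: Optional[TreeNode] = None,
                          via_containment: bool = True) -> Optional[TreeNode]:
    node = parent
    if via_containment:  # the root is treated as reached by containment
        if not element.instantiated:
            return parent  # e.g. keys: no node, subtree not expanded
        node = TreeNode(element, parent)
    # IsDerivedFrom targets get no node of their own; their members are
    # attached to the current node, giving one private copy per context.
    for target in element.contains:
        construct_schema_tree(target, node, True)
    for target in element.derived_from:
        construct_schema_tree(target, node, False)
    return node


def leaf_paths(node: TreeNode) -> List[str]:
    if not node.children:
        return [node.path()]
    return [p for child in node.children for p in leaf_paths(child)]


address = SchemaElement("Address")
address.contains = [SchemaElement("Street"), SchemaElement("City"),
                    SchemaElement("State"),
                    SchemaElement("AddressKey", instantiated=False)]
order = SchemaElement("Order")
for party in ("Customer", "Supplier"):
    addr = SchemaElement("Addr")
    addr.derived_from.append(address)
    holder = SchemaElement(party)
    holder.contains.append(addr)
    order.contains.append(holder)

tree = construct_schema_tree(order)
```

    In the resulting tree, the shared Street attribute appears once per context, as Order.Customer.Addr.Street and as Order.Supplier.Addr.Street, so each copy can receive its own mapping.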
  • At this point, a representation has been formed on which the invention may run the tree match algorithm described in detail above. The similarities computed are now in terms of schema tree nodes. The resulting output mappings identify similar elements, qualified by contexts. This results in more expressive and less ambiguous mappings.
  • Thus, Strong Containment, IsDerivedFrom, and aggregate relationships can be used to model hierarchical schemas, such as XML schemas without any IDs and IDREFs, or a SQL schema without any foreign keys. This alone, however, places a restriction on the expressive power of a model. In order to alleviate this restriction, a fourth relationship may be introduced, termed a referential integrity relationship or referential integrity constraint in the database literature. A referential integrity relationship models an existential dependency between model elements in different parts of a schema. A model element that represents a referential integrity constraint is called a RefInt. Referential integrity constraints are supported in most data models.
  • Three examples of referential integrity relationships include the relationship between a foreign key column in a table and the primary key in another table, the relationship between an ID and an IDREF in a DTD and the relationship between a keyref and a key in XSD. Referential constraints are directed from a source, e.g., foreign key column, to a target, e.g., primary key to which the foreign key refers. Such RefInt elements aggregate the source, and reference the target of such a relationship, whereby “reference” is a new relationship type.
  • Both the aggregates and reference relationships are one to n. Thus, a RefInt can model compound keys and multi-attribute keyrefs. For example, the modeling of a foreign key 910 with respect to two SQL Tables 900 a and 900 b, foreign key column 920 and primary key column 930 is shown in FIG. 9A. Referential integrity relationships are directed. In the case of SQL schemas, the foreign key column 920 is the source, and the primary key column 930 by which it is constrained is the target. The source and the target can in general be sets of model elements, e.g., a compound key. The foreign key references the single compound primary key element 930 of the target table, which aggregates the key columns 920 of that table.
  • In the general case, there is a single reference relationship. Multiple references allow for alternate targets for a single source of a referential constraint, e.g., an IDREF attribute in XDR references all the ID attributes in the schema, because each of them is a candidate target for an IDREF attribute (IDs and IDREFs are untyped in DTDs and XDR schemas). The 1 to n nature of the reference relationship thus allows a single IDREF attribute to reference multiple IDs in an XML DTD.
  • FIG. 9B illustrates the relationship between a model element 940, a model aggregate 950 and a RefInt 960. In this regard, a RefInt 960 is a specialization of a model aggregate 950, which is a specialization of a model element 940. In addition to the aggregate relationship that is inherited, a RefInt 960 also has a reference relationship. A RefInt 960 aggregates the model elements that are the source and references the target of the referential constraint.
  • A RefInt model element can either be instantiated (e.g., IDREFs) or not instantiated (e.g., foreign keys), as indicated by an isInstantiated flag. The model representation of RefInts in relational and XDR schemas is shown in FIGS. 10A and 10B, respectively.
  • Utilizing this knowledge of RefInts, a data path tree may be augmented with additional nodes, where useful. More particularly, a data path tree that is built by exploiting Strong Containment and IsDerivedFrom relationships may be augmented with additional nodes to take advantage of RefInts in the similarity computation.
  • In this regard, foreign keys are taken advantage of by interpreting them as join views. The foreign key node in the schema is replaced by a single data path node representing the join of the two tables. There are two motivations for such an abstraction. The first is that a referential constraint says that a join between two tables makes semantic sense, because values of the foreign key are guaranteed to be present as values of the key being referenced. The second is that since the match algorithm operates by matching data tree elements, representing a referential constraint as such a node makes it the subject of a match. The interpretation of a RefInt as a join view is illustrated in the example below.
  • For the sake of clarity, FIG. 11 illustrates encoding a RefInt in a data tree for SQL schemas. A similar procedure is applicable in XSD and XDR schemas. An additional node is added that has as its children the columns of the two tables, with one exception: the foreign key columns are not duplicated, since they are the same in both tables (the choice of primary or foreign key columns is arbitrary). As a result of this augmentation, a data path DAG is formed instead of a tree, because the referenced model elements have two parents, e.g., OrderID, CustomerID, SSN, and Address in FIG. 11. The augmented node is a child of the schema node of the data path DAG (e.g., OrderFK in FIG. 11, although the schema node is not shown in that figure). Since it is possible to enumerate the nodes of a DAG in inverse topological order, the tree comparison algorithm described earlier is applicable to this DAG as well. However, the algorithm loses its Church-Rosser property. That is, the result of the similarity computation might vary depending upon the order in which the table and view nodes are considered, since there doesn't exist a unique inverse topological order enumeration.
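  • The non-uniqueness of the inverse topological order can be seen in a small sketch. The DAG below is hypothetical, only loosely modeled on the join-view augmentation described above, and its column assignment is an assumption rather than a transcription of FIG. 11. The sketch uses the Python standard library class `graphlib.TopologicalSorter` (Python 3.9 or later); the helper names are illustrative.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Sequence

# Hypothetical data path DAG: parent node -> children.
# The columns have two parents: their table and the join view OrderFK.
children: Dict[str, List[str]] = {
    "Schema": ["Order", "Customer", "OrderFK"],
    "Order": ["OrderID", "CustomerID"],
    "Customer": ["SSN", "Address"],
    "OrderFK": ["OrderID", "CustomerID", "SSN", "Address"],
}


def inverse_topological_order(dag: Dict[str, List[str]]) -> List[str]:
    # TopologicalSorter emits a node only after every node listed under it,
    # so passing children as "predecessors" yields children before parents.
    return list(TopologicalSorter(dag).static_order())


def is_inverse_topological(order: Sequence[str],
                           dag: Dict[str, List[str]]) -> bool:
    position = {node: i for i, node in enumerate(order)}
    return all(position[child] < position[parent]
               for parent, kids in dag.items() for child in kids)


leaves = ["OrderID", "CustomerID", "SSN", "Address"]
tables_first = leaves + ["Order", "Customer", "OrderFK", "Schema"]
view_first = leaves + ["OrderFK", "Order", "Customer", "Schema"]
```

    Both `tables_first` and `view_first` are valid inverse topological orders of this DAG, which is why the similarity result can depend on whether table nodes or view nodes are compared first.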
  • This encoding of a RefInt not only causes foreign keys to be matched between two models, but also disambiguates matchings between elements that are referenced by the RefInt. For example, suppose the model of FIG. 11 is being matched with the model of FIG. 12. In FIG. 12, only recent customers (RecentCust) have orders. Old customers (OldCust) are customers who have not placed orders in a long time. Therefore, the foreign key is from CustomerID in Order to RecentCust and not to OldCust. When matching this model against FIG. 11, the nodes named OrderFK in the two models will be compared for similarity and will be found to match, based both on their linguistic similarity and structural similarity. This match will cause the leaves of the trees rooted at OrderFK in the two models to be reinforced as per the step of the matching algorithm that reinforces similar (and dissimilar) leaf structure of nodes, so that the subtree Customer in FIG. 11 will match RecentCust in FIG. 12 rather than OldCust.
  • This example also illustrates the loss of the Church-Rosser property: The argument of the previous paragraph holds only if the similarity of the OrderFK elements of the two models is calculated before that of Customer and RecentCust. If the latter is calculated first, then the similarity of Customer to RecentCust will be the same as that of Customer to OldCust. This suggests that elements representing foreign keys should be matched before those that represent the (base) elements that those foreign keys connect. It also shows that a second pass calculation of similarities of non-leaf elements would produce more accurate results.
  • The presence of multiple foreign keys, some of which might be compound keys, in a single table presents a combinatorial challenge. Strictly speaking, each conceivable combination of keys presents an alternative view definition, and hence an additional data path node. However, in the interest of preventing a combinatorial blow-up, one additional node per foreign key is added to a table.
  • Another consideration is the cascading of view expansions. The additional node resulting from one join view might contain a column that is a foreign key to a different table. In accordance with the invention, such nodes are not expanded any further, in the interest of keeping the computation manageable.
  • A similar procedure is applicable in the case of ID/IDREF pairs in XDR schemas, with the following changes: First, since each IDREF attribute references every ID attribute in the schema, an extra node is added for each ID/IDREF attribute pair in the schema. Second, while all the children of a SQL table are leaves of the data path DAG, the same is not true for XML elements in an XML schema. Furthermore, a cycle could appear if an IDREF attribute references an ID attribute higher up in the schema. Cycles must be prevented to avoid an infinite loop in the algorithm for constructing data path trees. Cycles can be prevented by performing cycle detection during augmentation and not adding any link that would form a cycle. It is noted that these are cycles in the data path tree, which are different from the cycles of containment and IsDerivedFrom relationships. Third, as there can be multiple ID (IDREF) attributes corresponding to a single Attribute Type in XDR, each such attribute contributes additional nodes to the data path tree.
  • The approach to XSD includes the following considerations: First, keys and keyrefs in XSD are typed and context-sensitive, qualified by Xpath expressions, and not context-free like ID/IDREFs. Only those nodes that match the Xpath expressions need to be considered during augmentation. Second, keys and keyrefs can have multiple attributes, but unlike compound foreign keys, these attributes need not be contained by a single parent. This leads to a need for careful consideration of nodes to be assigned as children of the augmented node.
  • Thus, as mentioned, the present invention interprets referential constraints as potential join views. In one embodiment, for each foreign key, the present invention introduces a node that represents the join of the participating tables, illustrated in more detail in FIG. 13. This technique reifies the referential constraint as a node that can be matched. Intuitively, the technique works since the referential constraint implies that the join is meaningful. It is of note that the join view node has as its children the columns from both the tables. The common ancestor of the two tables is thus made the parent of the new join view node.
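  • A minimal sketch of the join view augmentation follows. The `ForeignKey` tuple, the dictionary representation of tables, and the choice to drop the target table's key columns (rather than the source table's foreign key columns) are assumptions; the text notes that this choice is arbitrary. One node is added per foreign key, and the added nodes are built from the original tables only, so they are not expanded further.

```python
from typing import Dict, List, NamedTuple


class ForeignKey(NamedTuple):
    name: str                      # name of the join view node to add
    source_table: str              # table holding the foreign key columns
    target_table: str              # table holding the referenced key
    target_key_columns: List[str]  # referenced key columns, not duplicated


def add_join_views(tables: Dict[str, List[str]],
                   foreign_keys: List[ForeignKey]) -> Dict[str, List[str]]:
    """Return node -> children, with one extra join view node per foreign key."""
    graph = {table: [f"{table}.{col}" for col in cols]
             for table, cols in tables.items()}
    for fk in foreign_keys:
        source_cols = [f"{fk.source_table}.{col}"
                       for col in tables[fk.source_table]]
        # Key columns carry the same values on both sides, so keep one copy.
        target_cols = [f"{fk.target_table}.{col}"
                       for col in tables[fk.target_table]
                       if col not in fk.target_key_columns]
        graph[fk.name] = source_cols + target_cols
    return graph


tables = {
    "Order": ["OrderID", "CustomerID"],
    "Customer": ["CustomerID", "SSN", "Address"],
}
fks = [ForeignKey("OrderFK", "Order", "Customer", ["CustomerID"])]
graph = add_join_views(tables, fks)
```

    Each column reachable from OrderFK is also a child of its own table, so the result is a DAG of schema paths rather than a tree.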
  • These augmented nodes have two benefits. First, if two pairs of tables in the two schemas are related by similar referential constraints, then when the join views for the constraints are matched, the structural similarities of those tables' columns are increased. This improves the structural match. Second, this enables the discovery of mappings between a join view in one schema and a single table or other join views in the second schema. The additional join view nodes create a directed acyclic graph (DAG) of schema paths. Since the inverse topological ordering of a DAG, equivalent to post-order for a tree, is not unique, the algorithm is not Church-Rosser, i.e., the final similarities depend on the order in which nodes are compared. To make it Church-Rosser, additional ordering constraints may be added. For example, the RefInt nodes may be compared after the table nodes; however, determining which ordering would be best is still an open problem. If a table has multiple foreign keys, one node may be added for each of them. There is also the option of adding a node for each combination of these foreign keys (valid join views). In the interest of maintaining tractability, however, this step may be skipped. Similarly, the join view node that is added may also have a foreign key column of the target table. The invention may also expand these further, thus escalating expansion of referential constraints, but both for computation reasons and due to the lower relevance of tables at further distances, such a technique may be forgone.
  • In one embodiment of the invention, a feature of optionality is provided. Elements of schemas may be marked as optional, i.e., as nonrequired attributes of XML elements. To exploit this knowledge, the leaves reachable from a schema tree node n are divided into two classes: optional and required. A leaf is optional if it has at least one optional node on each path from n to the leaf. The structural similarity coefficient expression is changed to reduce the weight of optional leaves that have no strong links, i.e., they are not considered in both the numerator and denominator of the ssim calculation. Therefore, nodes are penalized less for unmappable optional leaves than unmappable required leaves, so the matching is more tolerant to the former.
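  • The optionality rule can be sketched for a schema tree as follows. The `Node` class and function names are hypothetical. The sketch assumes that the path from n to a leaf is examined below n and includes the leaf itself, and that the structural similarity is the fraction of considered leaves that have a strong link, with leaf identifiers pooled from both subtrees; both points are assumptions about details not spelled out in this passage.

```python
from typing import Iterable, Set, Tuple


class Node:
    def __init__(self, name: str, optional: bool = False,
                 children: Iterable["Node"] = ()) -> None:
        self.name = name
        self.optional = optional
        self.children = list(children)


def classify_leaves(n: Node) -> Tuple[Set[str], Set[str]]:
    """Split the leaves under n into (required, optional) sets of names."""
    required: Set[str] = set()
    optional: Set[str] = set()

    def walk(node: Node, seen_optional: bool) -> None:
        seen = seen_optional or node.optional
        if not node.children:
            (optional if seen else required).add(node.name)
        for child in node.children:
            walk(child, seen)

    for child in n.children:
        walk(child, False)
    return required, optional


def ssim_with_optionality(required: Set[str], optional: Set[str],
                          strong: Set[str]) -> float:
    """Fraction of considered leaves with a strong link.

    Optional leaves without a strong link are left out of both the
    numerator and the denominator.
    """
    considered = required | (optional & strong)
    if not considered:
        return 0.0
    return len(considered & strong) / len(considered)


n = Node("n", children=[
    Node("a"),
    Node("b", optional=True, children=[Node("c")]),
    Node("d", children=[Node("e", optional=True)]),
])
required, optional = classify_leaves(n)
```

    With strong links on leaves a and c, the unmatched optional leaf e is ignored and the score is 1.0, whereas treating every leaf as required would give 2/3.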
  • In another embodiment of the invention, different views are accommodated. View definitions are treated like referential constraints. A schema tree node is added whose children are the elements specified in the view. Such a schema tree node represents a common context for these elements and can be matched with views or tables of the other schema.
  • In another embodiment of the invention, a lazy expansion process is provided. A schema tree construction expands elements into each possible context, much like type substitution. This expansion duplicates elements, leading to repeated comparisons of identical subtrees. For example, in the example provided in FIG. 3, the Address element is duplicated in multiple contexts within the PurchaseOrder schema and each of these duplicates is compared separately to elements of PO. These duplicate comparisons may be avoided by a lazy schema tree expansion, which compares elements of the schema graph before converting it to a tree. The elements are enumerated in inverse topological order of containment and IsDerivedFrom relationships. After comparing an element that is the target t of multiple IsDerivedFrom and containment relationships, multiple copies of the subtree rooted at t are made, including the structural similarities computed so far. This works because when two nodes are compared for the first time, their similarity depends on the similarity of their subtrees. Similarly, the similarity of the leaves reflect those nodes that have already been traversed thus far. Hence, the computed similarity values remain the same as in the case when the schema is expanded a priori. Thus, identical recomputation for the context dependent copies of the subtree may be avoided.
  • The analyses of the matching problem and the provision of a generic solution, as described herein, lead to a variety of observations. For instance, with respect to granularity of similarity computation, it has been observed that class-level similarity computation can sometimes lead to nonoptimal mappings. Thus, with the invention, single classes may be nested or normalized differently, with referential constraints, in different schemas.
  • Using the leaves in the schema tree for the structural similarity computation allows the approach of the invention to match similar schemas that have different nesting. Also, reporting mappings in terms of leaves allows a sophisticated query discovery module to generate the correct queries for data transformations.
  • Moreover, incorporating structure information beyond the immediate vicinity of a schema element leads to better matching. Thus, in the example of FIG. 3, the invention is able to match POBillTo, POShipTo and POLines to InvoiceTo, DeliverTo and Items respectively.
  • Furthermore, context-dependent mappings generated by constructing schema trees are useful when inferring different mappings for the same element in different contexts.
  • Some of the mapping results for a certain tool or application might not be the best achievable by the algorithm since improvements may be possible by adjusting a few of the parameters. Tuning performance parameters in some cases requires expert knowledge of these tools. Thus, in an exemplary embodiment, a module for autotuning parameters is provided. Based upon the analysis of volumes of data, taking the complexity of the structure and linguistics of the schemas into account, a mechanism can be provided for automatically setting the parameters of the invention prior to matching. Alternatively, a “sliding bar” of results may be presented to the user, giving the user an opportunity at a glance to choose results from a variety of parameter sets.
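The "sliding bar" presentation can be pictured with a small sketch. It is only an illustration, assuming the parameter being varied is an acceptance threshold on a per-pair similarity score; the function `mappings_by_threshold`, the threshold values and the scores below are made up for the example and are not results from the specification.

```python
def mappings_by_threshold(similarity, thresholds):
    """For each threshold, keep the element pairs whose score meets it."""
    return {
        th: sorted(pair for pair, score in similarity.items() if score >= th)
        for th in thresholds
    }


# Made-up similarity scores for a few candidate element pairs.
similarity = {
    ("POBillTo", "InvoiceTo"): 0.82,
    ("POShipTo", "DeliverTo"): 0.78,
    ("POLines", "Items"): 0.61,
    ("POBillTo", "DeliverTo"): 0.55,
}

# One candidate mapping per parameter setting, ready to be shown side by side.
results = mappings_by_threshold(similarity, [0.5, 0.6, 0.8])
```

A user interface could then let the user slide between the entries of `results`, from the most permissive setting to the strictest, and pick the mapping that looks best.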
  • As mentioned above, while exemplary embodiments of the present invention have been described in connection with various computing devices and network architectures, the underlying concepts may be applied to any computing device or system in which it is desirable to match models. Thus, the techniques for calculating similarity coefficients and a mapping between models in accordance with the present invention may be applied to a variety of applications and devices. For instance, the model matching techniques of the invention may be applied to the operating system of a computing device, provided as a separate object on the device, as part of the object itself, as a downloadable object from a server, as a “middle man” between a device or object and the network, etc. The similarity coefficients and mapping data generated may be stored for later use, or output to another independent, dependent or related process or service. While exemplary programming languages, names and examples are chosen herein as representative of various choices, these languages, names and examples are not intended to be limiting. One of ordinary skill in the art will recognize that such languages, names and examples are choices that may vary depending upon which type system is implicated, and the rules for the type system. Further, while particular names for software components are utilized herein for distinguishing purposes, any name would be suitable and the present invention does not lie in the particular nomenclature utilized.
  • The various techniques described herein may be implemented in connection with hardware or software or, where appropriate, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention. In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. One or more programs that may utilize the model matching of the present invention, e.g., through the use of a data processing API or the like, are preferably implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language, and combined with hardware implementations.
  • The methods and apparatus of the present invention may also be practiced via communications embodied in the form of program code that is transmitted over some transmission medium, such as over electrical wiring or cabling, through fiber optics, or via any other form of transmission, wherein, when the program code is received and loaded into and executed by a machine, such as an EPROM, a gate array, a programmable logic device (PLD), a client computer, a video recorder or the like, or a receiving machine having the matching capabilities as described in exemplary embodiments above, the machine becomes an apparatus for practicing the invention. When implemented on a general-purpose processor, the program code combines with the processor to provide a unique apparatus that operates to invoke the functionality of the present invention. Additionally, any storage techniques used in connection with the present invention may invariably be a combination of hardware and software.
  • While the present invention has been described in connection with the preferred embodiments of the various figures, it is to be understood that other similar embodiments may be used or modifications and additions may be made to the described embodiment for performing the same function of the present invention without deviating therefrom. For example, while exemplary embodiments of the invention are described in the context of a loosely coupled peer to peer network, one skilled in the art will recognize that the present invention is not limited thereto, and that the methods, as described in the present application may apply to any computing device or environment, such as a gaming console, handheld computer, portable computer, etc., whether wired or wireless, and may be applied to any number of such computing devices connected via a communications network, and interacting across the network. Furthermore, it should be emphasized that a variety of computer platforms, including handheld device operating systems and other application specific operating systems are contemplated, especially as the number of wireless networked devices continues to proliferate. Moreover, it is to be understood that the model matching algorithm(s) of the various embodiments described herein are generically applicable, independent of any particular data model. Accordingly, it is to be understood that while various examples herein are described in the context of a particular format, such as SQL, XML, UML, DTD, XSD, XDR and the like, this is for illustrative purposes only, and the techniques of the invention can be applied not only to any schema format now known, but also to any hereafter-developed data format. Still further, the present invention may be implemented in or across a plurality of processing chips or devices, and storage may similarly be effected across a plurality of devices. 
Therefore, the present invention should not be limited to any single embodiment, but rather should be construed in breadth and scope in accordance with the appended claims.

Claims (32)

1. A method for automatically and generically matching models comprising:
calculating similarity coefficients between schema elements of the models; and
matching the models based upon the calculated similarity coefficients wherein the similarity coefficients are calculated based on a structure of the schema elements and on names of the schema elements.
2. The method of claim 1 wherein the similarity coefficients are calculated based additionally on constraints of the schema elements.
3. The method of claim 2 wherein the similarity coefficients are calculated based additionally on data types of the schema elements.
4. A method for generating similarity coefficients between model elements when comparing a first data model having hierarchically organized first model elements and a second data model having hierarchically organized second model elements, comprising:
first generating a plurality of inherent similarity coefficients for each pair of model elements;
second generating a plurality of structural similarity coefficients for each pair of model elements;
third generating a plurality of weighted similarity coefficients for each pair of model elements; and
for each pair of model elements, altering the similarity of subtree elements rooted by the element pair if a pre-determined condition is met.
5. The method of claim 4 wherein the pre-determined condition being met requires a function based on said weighted similarity coefficient of said element pair meeting a predetermined condition.
6. The method of claim 4 wherein the plurality of weighted similarity coefficients is generated as a weighted function of said plurality of inherent similarity coefficients and said plurality of structural similarity coefficients.
7. The method of claim 4 wherein the plurality of structural similarity coefficients is generated based on a similarity of subtree elements rooted by the element pair, whereby each pair of model elements is assigned an initial structural similarity coefficient.
8. The method of claim 4 wherein each pair of model elements comprises a model element of said first model elements and a model element of said second model elements.
9. A method for mapping messages between different extensible markup language (XML) formats comprising:
calculating similarity coefficients between schema elements of models corresponding to the XML messages wherein the similarity coefficients are calculated based on a structure of the schema elements and on names of the schema elements;
matching the models to the XML messages based upon the calculated similarity coefficients; and
returning a mapping that identifies corresponding elements in the schema elements.
10. The method of claim 9 wherein the similarity coefficients are calculated additionally based on constraints of the schema elements.
11. The method of claim 10 wherein the similarity coefficients are calculated additionally based on data types of the schema elements.
12. A method of processing disparate schemas containing customer information to identify and consolidate matching customer information comprising:
calculating similarity coefficients between schema elements of models corresponding to the schemas containing customer information wherein the similarity coefficients are calculated based on a structure of the schema elements and on names of the schema elements;
matching the models based upon the calculated similarity coefficients; and
returning a mapping that identifies corresponding elements in the schema elements.
13. The method of claim 12 wherein the similarity coefficients are calculated additionally based on constraints of the schema elements.
14. The method of claim 13 wherein the similarity coefficients are calculated additionally based on data types of the schema elements.
15. A system for automatically and generically matching models comprising:
a computing device operable for calculating similarity coefficients between schema elements of a first and a second model wherein the similarity coefficients are calculated based on a structure of the schema elements and on names of the schema elements and wherein the computing device is operable for matching the models based upon the calculated similarity coefficients.
16. The system of claim 15 further comprising a receiver operably coupled to the computing device for receiving data from a user about matching the first and the second model, said data used in connection with initializing the similarity coefficients.
17. The system of claim 16 wherein the similarity coefficients to be initialized are initial structural similarity coefficients.
18. The system of claim 15 wherein the computing device further comprises a post-processor operably embedded in the computing device for post-processing at least one of a plurality of inherent similarity coefficients, a plurality of structural similarity coefficients and weighted similarity coefficients to construct a mapping between the first and second models.
19. The system of claim 18 wherein the computing device is operable for transforming said first and second data models into a generic object model format irrespective of an input format of the first and second data models before such transformation.
20. A method for processing disparate schemas containing customer data comprising:
receiving a first customer data model having hierarchically organized first model elements;
comparing schema elements of the first customer data model elements and schema elements of a second customer data model having hierarchically organized second model elements; and
generating similarity coefficients between the first and second customer model elements based on the comparison of the schema elements wherein the similarity coefficients are calculated based on a structure of the schema elements and names of the schema elements.
21. The method of claim 20 wherein the generating similarity coefficients step comprises:
first generating a plurality of inherent similarity coefficients for the first and second model elements;
second generating a plurality of structural similarity coefficients for the first and second model elements;
third generating a plurality of weighted similarity coefficients for the first and second model elements; and
for the first and second model elements, altering the similarity of subtree elements rooted by the first and second model elements if a pre-determined condition is met.
22. A method for processing disparate schemas containing customer data comprising:
generating a first customer data model having hierarchically organized first model elements;
transmitting the first customer data model to a second location of a second customer data model having hierarchically organized second model elements whereby schema elements of the first customer data model elements and schema elements of the second customer data model may be compared and similarity coefficients between the first and second customer data models may be generated based on a structure of the schema elements and names of the schema elements.
23. A computer readable medium having stored thereon a plurality of computer-executable modules, the computer executable modules, comprising:
means for calculating similarity coefficients between schema elements of the models; and
means for matching the models based upon the calculated similarity coefficients in operable communication with the calculating means wherein the similarity coefficients are calculated based on a structure of the schema elements and on names of the schema elements.
24. The computer readable medium according to claim 23 wherein the similarity coefficients are calculated based additionally on constraints of the schema elements and data types of the schema elements.
25. A computer readable medium having stored thereon a plurality of computer-executable modules, the computer executable modules, comprising:
a similarity coefficient generating mechanism for firstly generating a plurality of inherent similarity coefficients for each pair of model elements, for secondly generating a plurality of structural similarity coefficients for each pair of model elements, and for thirdly generating a plurality of weighted similarity coefficients for each pair of model elements; and
means for altering, for each pair of model elements, the similarity of subtree elements rooted by the element pair if a pre-determined condition is met, said altering means in operable communication with the similarity coefficient generating mechanism.
26. The computer readable medium according to claim 25 wherein the pre-determined condition being met requires a function based on said weighted similarity coefficient of said element pair meeting a predetermined condition.
27. The computer readable medium according to claim 25 wherein the plurality of structural similarity coefficients is generated based on a similarity of subtree elements rooted by the element pair, whereby each pair of model elements is assigned an initial structural similarity coefficient.
28. A computer readable medium having stored thereon a plurality of computer-executable modules for mapping messages between different extensible markup language (XML) formats, the computer executable modules, comprising:
means for calculating similarity coefficients between schema elements of models corresponding to the XML messages wherein the similarity coefficients are calculated based on a structure of the schema elements and on names of the schema elements;
means for matching the models to the XML messages based upon the calculated similarity coefficients, said means in operable communication with the calculating means; and
means for returning a mapping that identifies corresponding elements in the schema elements, said means in operable communication with the matching means.
29. The computer readable medium of claim 28 wherein the similarity coefficients are calculated additionally based on constraints of the schema elements and on data types of the schema elements.
30. A computer readable medium having stored thereon a plurality of computer-executable modules for processing disparate schemas containing customer data, the computer executable modules comprising:
means for receiving a first customer data model having hierarchically organized first model elements;
means for comparing schema elements of the first customer data model elements and schema elements of a second customer data model having hierarchically organized second model elements, said comparing means in operable communication with the receiving means; and
means for generating similarity coefficients between the first and second customer model elements wherein the similarity coefficients are calculated based on a structure of the schema elements and names of the schema elements, said generating means in operable communication with the comparing means.
31. The computer readable medium of claim 30 wherein the generating similarity coefficients means comprises a plurality of computer-executable modules comprising:
a mechanism for firstly generating a plurality of inherent similarity coefficients for the first and second model elements, for secondly generating a plurality of structural similarity coefficients for the first and second model elements, and for thirdly generating a plurality of weighted similarity coefficients for the first and second model elements; and
means for altering, for each pair of model elements, the similarity of subtree elements rooted by the element pair if a pre-determined condition is met, said altering means in operable communication with the similarity coefficient generating mechanism.
32. A computer readable medium having stored thereon a plurality of computer-executable modules for processing disparate schemas containing customer data, the computer executable modules comprising:
means for generating a first customer data model having hierarchically organized first model elements;
transmitting means, in operable communication with the generating means, for transmitting the first customer data model to a second location of a second customer data model having hierarchically organized second model elements whereby schema elements of the first customer data model elements and schema elements of the second customer data model may be compared and similarity coefficients between the first and second customer data models may be generated based on a structure of the schema elements and names of the schema elements.
US10/930,971 2001-12-20 2004-08-31 Methods and systems for model matching Abandoned US20050027681A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US10/930,971 US20050027681A1 (en) 2001-12-20 2004-08-31 Methods and systems for model matching

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10/028,912 US6826568B2 (en) 2001-12-20 2001-12-20 Methods and system for model matching
US10/930,971 US20050027681A1 (en) 2001-12-20 2004-08-31 Methods and systems for model matching

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US10/028,912 Continuation US6826568B2 (en) 2001-12-20 2001-12-20 Methods and system for model matching

Publications (1)

Publication Number Publication Date
US20050027681A1 true US20050027681A1 (en) 2005-02-03

Family

ID=21846208

Family Applications (3)

Application Number Title Priority Date Filing Date
US10/028,912 Expired - Lifetime US6826568B2 (en) 2001-12-20 2001-12-20 Methods and system for model matching
US10/930,971 Abandoned US20050027681A1 (en) 2001-12-20 2004-08-31 Methods and systems for model matching
US10/973,495 Expired - Lifetime US7444330B2 (en) 2001-12-20 2004-10-26 Methods and systems for model matching

Country Status (1)

Country Link
US (3) US6826568B2 (en)

US20030196168A1 (en) * 2002-04-10 2003-10-16 Koninklijke Philips Electronics N.V. Method and apparatus for modeling extensible markup language (XML) applications using the unified modeling language (UML)
KR100484138B1 (en) * 2002-05-08 2005-04-18 삼성전자주식회사 XML indexing method for regular path expression queries in relational database and data structure thereof.
US7548935B2 (en) * 2002-05-09 2009-06-16 Robert Pecherer Method of recursive objects for representing hierarchies in relational database systems
US7457810B2 (en) * 2002-05-10 2008-11-25 International Business Machines Corporation Querying markup language data sources using a relational query processor
US8015143B2 (en) * 2002-05-22 2011-09-06 Estes Timothy W Knowledge discovery agent system and method
US9400589B1 (en) 2002-05-30 2016-07-26 Consumerinfo.Com, Inc. Circular rotational interface for display of consumer credit information
US9710852B1 (en) 2002-05-30 2017-07-18 Consumerinfo.Com, Inc. Credit report timeline user interface
US8200622B2 (en) 2002-05-31 2012-06-12 Informatica Corporation System and method for integrating, managing and coordinating customer activities
US20030236764A1 (en) * 2002-06-19 2003-12-25 Lev Shur Data architecture to support shared data resources among applications
US7200589B1 (en) * 2002-10-03 2007-04-03 Hewlett-Packard Development Company, L.P. Format-independent advertising of data center resource capabilities
JP3880504B2 (en) * 2002-10-28 2007-02-14 インターナショナル・ビジネス・マシーンズ・コーポレーション Structured / hierarchical content processing apparatus, structured / hierarchical content processing method, and program
US7117196B2 (en) * 2002-11-22 2006-10-03 International Business Machines Corporation Method and system for optimizing leaf comparisons from a tree search
US7856454B2 (en) * 2002-12-20 2010-12-21 Siebel Systems, Inc. Data model for business relationships
US8538840B2 (en) 2002-12-20 2013-09-17 Siebel Systems, Inc. Financial services data model
TW200419413A (en) 2003-01-13 2004-10-01 I2 Technologies Inc Master data management system for centrally managing core reference data associated with an enterprise
US8266215B2 (en) * 2003-02-20 2012-09-11 Sonicwall, Inc. Using distinguishing properties to classify messages
US7406502B1 (en) * 2003-02-20 2008-07-29 Sonicwall, Inc. Method and system for classifying a message based on canonical equivalent of acceptable items included in the message
US7840614B2 (en) * 2003-02-20 2010-11-23 Bea Systems, Inc. Virtual content repository application program interface
US7299261B1 (en) 2003-02-20 2007-11-20 Mailfrontier, Inc. A Wholly Owned Subsidiary Of Sonicwall, Inc. Message classification using a summary
US7293286B2 (en) * 2003-02-20 2007-11-06 Bea Systems, Inc. Federated management of content repositories
US8166033B2 (en) * 2003-02-27 2012-04-24 Parity Computing, Inc. System and method for matching and assembling records
US8392298B2 (en) 2003-03-04 2013-03-05 Siebel Systems, Inc. Invoice adjustment data object for a common data object format
US8473399B2 (en) 2003-03-04 2013-06-25 Siebel Systems, Inc. Invoice data object for a common data object format
US6961733B2 (en) 2003-03-10 2005-11-01 Unisys Corporation System and method for storing and accessing data in an interlocking trees datastore
US8489470B2 (en) 2003-03-24 2013-07-16 Siebel Systems, Inc. Inventory location common object
US7904340B2 (en) 2003-03-24 2011-03-08 Siebel Systems, Inc. Methods and computer-readable medium for defining a product model
US7415672B1 (en) 2003-03-24 2008-08-19 Microsoft Corporation System and method for designing electronic forms
US7370066B1 (en) 2003-03-24 2008-05-06 Microsoft Corporation System and method for offline editing of data files
US8510179B2 (en) 2003-03-24 2013-08-13 Siebel Systems, Inc. Inventory transaction common object
US9704120B2 (en) 2003-03-24 2017-07-11 Oracle International Corporation Inventory balance common object
US20070226037A1 (en) * 2003-03-25 2007-09-27 Shailendra Garg Modeling of opportunity data
US7296017B2 (en) 2003-03-28 2007-11-13 Microsoft Corporation Validation of XML data files
US7913159B2 (en) 2003-03-28 2011-03-22 Microsoft Corporation System and method for real-time validation of structured data files
US20040199905A1 (en) * 2003-04-01 2004-10-07 International Business Machines Corporation System and method for translating data from a source schema to a target schema
US7272818B2 (en) * 2003-04-10 2007-09-18 Microsoft Corporation Creation of an object within an object hierarchy structure
US7051042B2 (en) 2003-05-01 2006-05-23 Oracle International Corporation Techniques for transferring a serialized image of XML data
US7386568B2 (en) * 2003-05-01 2008-06-10 Oracle International Corporation Techniques for partial rewrite of XPath queries in a relational database
JP4240293B2 (en) * 2003-05-27 2009-03-18 株式会社ソニー・コンピュータエンタテインメント Multimedia playback apparatus and multimedia playback method
US7451392B1 (en) 2003-06-30 2008-11-11 Microsoft Corporation Rendering an HTML electronic form by applying XSLT to XML using a solution
US9152735B2 (en) * 2003-07-24 2015-10-06 Alcatel Lucent Method and apparatus for composing XSL transformations with XML publishing views
US7568199B2 (en) * 2003-07-28 2009-07-28 Sap Ag. System for matching resource request that freeing the reserved first resource and forwarding the request to second resource if predetermined time period expired
US7703029B2 (en) 2003-07-28 2010-04-20 Sap Ag Grid browser component
US7673054B2 (en) 2003-07-28 2010-03-02 Sap Ag. Grid manageable application process management scheme
US7574707B2 (en) * 2003-07-28 2009-08-11 Sap Ag Install-run-remove mechanism
US7594015B2 (en) * 2003-07-28 2009-09-22 Sap Ag Grid organization
US7631069B2 (en) 2003-07-28 2009-12-08 Sap Ag Maintainable grid managers
US7546553B2 (en) * 2003-07-28 2009-06-09 Sap Ag Grid landscape component
US7203679B2 (en) * 2003-07-29 2007-04-10 International Business Machines Corporation Determining structural similarity in semi-structured documents
US7406660B1 (en) 2003-08-01 2008-07-29 Microsoft Corporation Mapping between structured data and a visual surface
US7334187B1 (en) 2003-08-06 2008-02-19 Microsoft Corporation Electronic form aggregation
US7512912B1 (en) * 2003-08-16 2009-03-31 Synopsys, Inc. Method and apparatus for solving constraints for word-level networks
US8219569B2 (en) 2003-08-25 2012-07-10 Oracle International Corporation In-place evolution of XML schemes
US7814047B2 (en) * 2003-08-25 2010-10-12 Oracle International Corporation Direct loading of semistructured data
US7490093B2 (en) 2003-08-25 2009-02-10 Oracle International Corporation Generating a schema-specific load structure to load data into a relational database based on determining whether the schema-specific load structure already exists
US7546288B2 (en) * 2003-09-04 2009-06-09 Microsoft Corporation Matching media file metadata to standardized metadata
US8694510B2 (en) 2003-09-04 2014-04-08 Oracle International Corporation Indexing XML documents efficiently
US8229932B2 (en) 2003-09-04 2012-07-24 Oracle International Corporation Storing XML documents efficiently in an RDBMS
US20060101018A1 (en) 2004-11-08 2006-05-11 Mazzagatti Jane C Method for processing new sequences being recorded into an interlocking trees datastore
US8516004B2 (en) * 2003-09-19 2013-08-20 Unisys Corporation Method for processing K node count fields using an intensity variable
US20050071359A1 (en) * 2003-09-25 2005-03-31 Elandassery Deepak S. Method for automated database schema evolution
WO2005031603A1 (en) * 2003-09-26 2005-04-07 British Telecommunications Public Limited Company Method and apparatus for processing electronic data
US20050071362A1 (en) * 2003-09-30 2005-03-31 Nelson Brent Dalmas Enterprises taxonomy formation method and system for an intellectual capital management system
US7124142B2 (en) * 2003-11-10 2006-10-17 Conversive, Inc. Method and system for responding to requests relating to complex data maintained in a structured form
US7810090B2 (en) 2003-12-17 2010-10-05 Sap Ag Grid compute node software application deployment
US7124062B2 (en) * 2003-12-30 2006-10-17 Sap Ag Services search method
US20070250464A1 (en) * 2004-01-06 2007-10-25 Neuric Technologies, Llc Historical figures in today's society
US20080243741A1 (en) * 2004-01-06 2008-10-02 Neuric Technologies, Llc Method and apparatus for defining an artificial brain via a plurality of concept nodes connected together through predetermined relationships
JP4398263B2 (en) * 2004-01-13 2010-01-13 富士通株式会社 Route design method
US7340471B2 (en) 2004-01-16 2008-03-04 Unisys Corporation Saving and restoring an interlocking trees datastore
US20050165866A1 (en) * 2004-01-28 2005-07-28 Bohannon Philip L. Method and apparatus for updating XML views of relational data
US8819072B1 (en) 2004-02-02 2014-08-26 Microsoft Corporation Promoting data from structured data files
US8037102B2 (en) 2004-02-09 2011-10-11 Robert T. and Virginia T. Jenkins Manipulating sets of hierarchical data
US7191175B2 (en) 2004-02-13 2007-03-13 Attenex Corporation System and method for arranging concept clusters in thematic neighborhood relationships in a two-dimensional visual display space
CN1658234B (en) * 2004-02-18 2010-05-26 国际商业机器公司 Method and device for generating hierarchy visual structure of semantic network
US7318063B2 (en) * 2004-02-19 2008-01-08 Microsoft Corporation Managing XML documents containing hierarchical database information
US20050187756A1 (en) * 2004-02-25 2005-08-25 Nokia Corporation System and apparatus for handling presentation language messages
US8260764B1 (en) * 2004-03-05 2012-09-04 Open Text S.A. System and method to search and generate reports from semi-structured data
US8312110B2 (en) * 2004-03-12 2012-11-13 Kanata Limited Content manipulation using hierarchical address translations across a network
JP4313703B2 (en) * 2004-03-12 2009-08-12 彼方株式会社 Information processing apparatus, system, method, and program
US20050228816A1 (en) * 2004-04-13 2005-10-13 Bea Systems, Inc. System and method for content type versions
US20060028252A1 (en) * 2004-04-13 2006-02-09 Bea Systems, Inc. System and method for content type management
US7788278B2 (en) * 2004-04-21 2010-08-31 Kong Eng Cheng Querying target databases using reference database records
US7930277B2 (en) 2004-04-21 2011-04-19 Oracle International Corporation Cost-based optimizer for an XML data repository within a database
US7496837B1 (en) 2004-04-29 2009-02-24 Microsoft Corporation Structural editing with schema awareness
US20060030292A1 (en) * 2004-05-20 2006-02-09 Bea Systems, Inc. Client programming for mobile client
US7650432B2 (en) * 2004-05-20 2010-01-19 Bea Systems, Inc. Occasionally-connected application server
US20050273847A1 (en) * 2004-05-21 2005-12-08 Bea Systems, Inc. Programmable message processing stage for a service oriented architecture
US7310684B2 (en) * 2004-05-21 2007-12-18 Bea Systems, Inc. Message processing in a service oriented architecture
US20060031354A1 (en) * 2004-05-21 2006-02-09 Bea Systems, Inc. Service oriented architecture
US20060031432A1 (en) * 2004-05-21 2006-02-09 Bea Systems, Inc. Service oriented architecture with message processing pipelines
US20060069791A1 (en) * 2004-05-21 2006-03-30 Bea Systems, Inc. Service oriented architecture with interchangeable transport protocols
US20060031353A1 (en) * 2004-05-21 2006-02-09 Bea Systems, Inc. Dynamic publishing in a service oriented architecture
US8615601B2 (en) * 2004-05-21 2013-12-24 Oracle International Corporation Liquid computing
US20050270970A1 (en) * 2004-05-21 2005-12-08 Bea Systems, Inc. Failsafe service oriented architecture
US20050273497A1 (en) * 2004-05-21 2005-12-08 Bea Systems, Inc. Service oriented architecture with electronic mail transport protocol
US20050267892A1 (en) * 2004-05-21 2005-12-01 Patrick Paul B Service proxy definition
US20060031433A1 (en) * 2004-05-21 2006-02-09 Bea Systems, Inc. Batch updating for a service oriented architecture
US20060031930A1 (en) * 2004-05-21 2006-02-09 Bea Systems, Inc. Dynamically configurable service oriented architecture
US20050273521A1 (en) * 2004-05-21 2005-12-08 Bea Systems, Inc. Dynamically configurable service oriented architecture
US20050273520A1 (en) * 2004-05-21 2005-12-08 Bea Systems, Inc. Service oriented architecture with file transport protocol
US7774485B2 (en) 2004-05-21 2010-08-10 Bea Systems, Inc. Dynamic service composition and orchestration
US8112296B2 (en) 2004-05-21 2012-02-07 Siebel Systems, Inc. Modeling of job profile data
US7865390B2 (en) 2004-05-21 2011-01-04 Siebel Systems, Inc. Modeling of employee performance result data
US20050267947A1 (en) * 2004-05-21 2005-12-01 Bea Systems, Inc. Service oriented architecture with message processing pipelines
US20050273502A1 (en) * 2004-05-21 2005-12-08 Patrick Paul B Service oriented architecture with message processing stages
US20050273516A1 (en) * 2004-05-21 2005-12-08 Bea Systems, Inc. Dynamic routing in a service oriented architecture
US20050278374A1 (en) * 2004-05-21 2005-12-15 Bea Systems, Inc. Dynamic program modification
US20060007918A1 (en) * 2004-05-21 2006-01-12 Bea Systems, Inc. Scaleable service oriented architecture
US20060136555A1 (en) * 2004-05-21 2006-06-22 Bea Systems, Inc. Secure service oriented architecture
US7281018B1 (en) 2004-05-26 2007-10-09 Microsoft Corporation Form template data source change
US7774620B1 (en) 2004-05-27 2010-08-10 Microsoft Corporation Executing applications at appropriate trust levels
US9646107B2 (en) * 2004-05-28 2017-05-09 Robert T. and Virginia T. Jenkins as Trustee of the Jenkins Family Trust Method and/or system for simplifying tree expressions such as for query reduction
US20050278139A1 (en) * 2004-05-28 2005-12-15 Glaenzer Helmut K Automatic match tuning
US20050273721A1 (en) * 2004-06-07 2005-12-08 Yantis David B Data transformation system
EP1759315B1 (en) 2004-06-23 2010-06-30 Oracle International Corporation Efficient evaluation of queries using translation
US7593923B1 (en) 2004-06-29 2009-09-22 Unisys Corporation Functional operations for accessing and/or building interlocking trees datastores to enable their use with applications software
US7882147B2 (en) * 2004-06-30 2011-02-01 Robert T. and Virginia T. Jenkins File location naming hierarchy
US7620632B2 (en) * 2004-06-30 2009-11-17 Skyler Technology, Inc. Method and/or system for performing tree matching
US20060004729A1 (en) * 2004-06-30 2006-01-05 Reactivity, Inc. Accelerated schema-based validation
GB2416048A (en) * 2004-07-10 2006-01-11 Hewlett Packard Development Co Inferring data type in a multi stage process
WO2006011102A1 (en) * 2004-07-22 2006-02-02 Koninklijke Philips Electronics N.V. Determining a similarity between ontology concepts
US7290003B1 (en) * 2004-08-19 2007-10-30 Sun Microsystems, Inc. Migrating data using an intermediate self-describing format
DE102004043125B4 (en) * 2004-09-07 2017-10-05 Robert Bosch Gmbh throttling device
US7496571B2 (en) * 2004-09-30 2009-02-24 Alcatel-Lucent Usa Inc. Method for performing information-preserving DTD schema embeddings
US7692636B2 (en) 2004-09-30 2010-04-06 Microsoft Corporation Systems and methods for handwriting to a screen
US20060074632A1 (en) * 2004-09-30 2006-04-06 Nanavati Amit A Ontology-based term disambiguation
US7213041B2 (en) 2004-10-05 2007-05-01 Unisys Corporation Saving and restoring an interlocking trees datastore
US7716241B1 (en) 2004-10-27 2010-05-11 Unisys Corporation Storing the repository origin of data inputs within a knowledge store
US7908240B1 (en) 2004-10-28 2011-03-15 Unisys Corporation Facilitated use of column and field data for field record universe in a knowledge store
US7627591B2 (en) 2004-10-29 2009-12-01 Skyler Technology, Inc. Method and/or system for manipulating tree expressions
US7801923B2 (en) 2004-10-29 2010-09-21 Robert T. and Virginia T. Jenkins as Trustees of the Jenkins Family Trust Method and/or system for tagging trees
US20060101397A1 (en) * 2004-10-29 2006-05-11 Microsoft Corporation Pseudo-random test case generator for XML APIs
US7348980B2 (en) 2004-11-08 2008-03-25 Unisys Corporation Method and apparatus for interface for graphic display of data from a Kstore
US7676477B1 (en) 2005-10-24 2010-03-09 Unisys Corporation Utilities for deriving values and information from within an interlocking trees data store
US20070162508A1 (en) * 2004-11-08 2007-07-12 Mazzagatti Jane C Updating information in an interlocking trees datastore
US7499932B2 (en) * 2004-11-08 2009-03-03 Unisys Corporation Accessing data in an interlocking trees data structure using an application programming interface
US7712022B2 (en) 2004-11-15 2010-05-04 Microsoft Corporation Mutually exclusive options in electronic forms
US7721190B2 (en) 2004-11-16 2010-05-18 Microsoft Corporation Methods and systems for server side form processing
US7636727B2 (en) * 2004-12-06 2009-12-22 Skyler Technology, Inc. Enumeration of trees from finite number of nodes
AU2005239653B2 (en) * 2004-11-30 2009-03-12 Canon Kabushiki Kaisha System and method for future-proofing devices using metaschema
US7882149B2 (en) * 2004-11-30 2011-02-01 Canon Kabushiki Kaisha System and method for future-proofing devices using metaschema
US7630995B2 (en) * 2004-11-30 2009-12-08 Skyler Technology, Inc. Method and/or system for transmitting and/or receiving data
US7904801B2 (en) 2004-12-15 2011-03-08 Microsoft Corporation Recursive sections in electronic forms
US8195693B2 (en) * 2004-12-16 2012-06-05 International Business Machines Corporation Automatic composition of services through semantic attribute matching
US7565383B2 (en) * 2004-12-20 2009-07-21 Sap Ag. Application recovery
US7793290B2 (en) 2004-12-20 2010-09-07 Sap Ag Grip application acceleration by executing grid application based on application usage history prior to user request for application execution
US7899834B2 (en) * 2004-12-23 2011-03-01 Sap Ag Method and apparatus for storing and maintaining structured documents
US8316059B1 (en) 2004-12-30 2012-11-20 Robert T. and Virginia T. Jenkins Enumeration of rooted partial subtrees
US8473449B2 (en) * 2005-01-06 2013-06-25 Neuric Technologies, Llc Process of dialogue and discussion
US7908291B2 (en) * 2005-01-07 2011-03-15 Oracle International Corporation Technique for creating self described data shared across multiple services
US7937651B2 (en) 2005-01-14 2011-05-03 Microsoft Corporation Structural editing operations for network forms
US8615530B1 (en) 2005-01-31 2013-12-24 Robert T. and Virginia T. Jenkins as Trustees for the Jenkins Family Trust Method and/or system for tree transformation
US20060173865A1 (en) * 2005-02-03 2006-08-03 Fong Joseph S System and method of translating a relational database into an XML document and vice versa
JP4423327B2 (en) * 2005-02-08 2010-03-03 日本電信電話株式会社 Information communication terminal, information communication system, information communication method, information communication program, and recording medium recording the same
US7523131B2 (en) 2005-02-10 2009-04-21 Oracle International Corporation Techniques for efficiently storing and querying in a relational database, XML documents conforming to schemas that contain cyclic constructs
US8214353B2 (en) * 2005-02-18 2012-07-03 International Business Machines Corporation Support for schema evolution in a multi-node peer-to-peer replication environment
US7587101B1 (en) * 2005-02-28 2009-09-08 Adobe Systems Incorporated Facilitating computer-assisted tagging of object instances in digital images
US7681177B2 (en) 2005-02-28 2010-03-16 Skyler Technology, Inc. Method and/or system for transforming between trees and strings
US7725834B2 (en) 2005-03-04 2010-05-25 Microsoft Corporation Designer-created aspect for an electronic form template
EP1859350B1 (en) * 2005-03-16 2015-06-24 BRITISH TELECOMMUNICATIONS public limited company Monitoring computer-controlled processes
EP1708099A1 (en) * 2005-03-29 2006-10-04 BRITISH TELECOMMUNICATIONS public limited company Schema matching
US8356040B2 (en) 2005-03-31 2013-01-15 Robert T. and Virginia T. Jenkins Method and/or system for transforming between trees and arrays
CA2602640A1 (en) * 2005-04-01 2006-10-05 British Telecommunications Public Limited Company Adaptive classifier, and method of creation of classification parameters therefor
US8175889B1 (en) 2005-04-06 2012-05-08 Experian Information Solutions, Inc. Systems and methods for tracking changes of address based on service disconnect/connect data
US7409380B1 (en) 2005-04-07 2008-08-05 Unisys Corporation Facilitated reuse of K locations in a knowledge store
US8010515B2 (en) 2005-04-15 2011-08-30 Microsoft Corporation Query to an electronic form
US7353226B2 (en) * 2005-04-22 2008-04-01 The Boeing Company Systems and methods for performing schema matching with data dictionaries
US20060248371A1 (en) * 2005-04-28 2006-11-02 International Business Machines Corporation Method and apparatus for a common cluster model for configuring, managing, and operating different clustering technologies in a data center
US7899821B1 (en) 2005-04-29 2011-03-01 Karl Schiffmann Manipulation and/or analysis of hierarchical data
US7644055B2 (en) * 2005-05-02 2010-01-05 Sap, Ag Rule-based database object matching with comparison certainty
US20060253476A1 (en) * 2005-05-09 2006-11-09 Roth Mary A Technique for relationship discovery in schemas using semantic name indexing
US7840610B2 (en) * 2005-05-11 2010-11-23 International Business Machines Corporation Apparatus, system, and method for map definition generation
US7389301B1 (en) 2005-06-10 2008-06-17 Unisys Corporation Data aggregation user interface and analytic adapted for a KStore
US8046348B1 (en) * 2005-06-10 2011-10-25 NetBase Solutions, Inc. Method and apparatus for concept-based searching of natural language discourse
JP4670496B2 (en) * 2005-06-14 2011-04-13 住友電気工業株式会社 Optical receiver
US7496588B2 (en) * 2005-06-27 2009-02-24 Siperian, Inc. Method and apparatus for data integration and management
US8200975B2 (en) 2005-06-29 2012-06-12 Microsoft Corporation Digital signatures for network forms
US20070055655A1 (en) * 2005-09-08 2007-03-08 Microsoft Corporation Selective schema matching
US8306986B2 (en) * 2005-09-30 2012-11-06 American Express Travel Related Services Company, Inc. Method, system, and computer program product for linking customer information
US8073841B2 (en) 2005-10-07 2011-12-06 Oracle International Corporation Optimizing correlated XML extracts
US8001459B2 (en) 2005-12-05 2011-08-16 Microsoft Corporation Enabling electronic documents for limited-capability computing devices
US7882119B2 (en) * 2005-12-22 2011-02-01 Xerox Corporation Document alignment systems for legacy document conversions
US7523121B2 (en) * 2006-01-03 2009-04-21 Siperian, Inc. Relationship data management
US8150803B2 (en) * 2006-01-03 2012-04-03 Informatica Corporation Relationship data management
US20070214179A1 (en) * 2006-03-10 2007-09-13 Khanh Hoang Searching, filtering, creating, displaying, and managing entity relationships across multiple data hierarchies through a user interface
US7512642B2 (en) * 2006-01-06 2009-03-31 International Business Machines Corporation Mapping-based query generation with duplicate elimination and minimal union
US8370125B2 (en) * 2006-01-13 2013-02-05 Research In Motion Limited Handheld electronic device and method for disambiguation of text input providing artificial variants comprised of characters in a core alphabet
US7538692B2 (en) 2006-01-13 2009-05-26 Research In Motion Limited Handheld electronic device and method for disambiguation of compound text input and for prioritizing compound language solutions according to quantity of text components
US8375063B2 (en) * 2006-01-31 2013-02-12 International Business Machines Corporation Method and program product for migrating data from a legacy system
WO2007098396A2 (en) * 2006-02-16 2007-08-30 Gs Industrial Design, Inc. Method of freeing the bound oil present in whole stillage and thin stillage
US20070214153A1 (en) * 2006-03-10 2007-09-13 Mazzagatti Jane C Method for processing an input particle stream for creating upper levels of KStore
US7734571B2 (en) * 2006-03-20 2010-06-08 Unisys Corporation Method for processing sensor data within a particle stream by a KStore
US20080275842A1 (en) * 2006-03-20 2008-11-06 Jane Campbell Mazzagatti Method for processing counts when an end node is encountered
US20070220069A1 (en) * 2006-03-20 2007-09-20 Mazzagatti Jane C Method for processing an input particle stream for creating lower levels of a KStore
US7689571B1 (en) 2006-03-24 2010-03-30 Unisys Corporation Optimizing the size of an interlocking tree datastore structure for KStore
US8238351B2 (en) * 2006-04-04 2012-08-07 Unisys Corporation Method for determining a most probable K location
US20070239742A1 (en) * 2006-04-06 2007-10-11 Oracle International Corporation Determining data elements in heterogeneous schema definitions for possible mapping
US7961189B2 (en) * 2006-05-16 2011-06-14 Sony Corporation Displaying artists related to an artist of interest
US7676330B1 (en) 2006-05-16 2010-03-09 Unisys Corporation Method for processing a particle using a sensor structure
US7774288B2 (en) * 2006-05-16 2010-08-10 Sony Corporation Clustering and classification of multimedia data
US7711755B2 (en) * 2006-05-17 2010-05-04 Topcoder, Inc. Dynamic XSD enumeration
US20070282923A1 (en) * 2006-06-01 2007-12-06 Christopher Ward Method and apparatus for the manipulation, customization, coordination and decomposition of active data models
US7792864B1 (en) * 2006-06-14 2010-09-07 TransUnion Teledata, L.L.C. Entity identification and/or association using multiple data elements
US7533096B2 (en) * 2006-07-12 2009-05-12 International Business Machines Corporation Computer-based method for finding similar objects using a taxonomy
US7676484B2 (en) * 2006-07-30 2010-03-09 International Business Machines Corporation System and method of performing an inverse schema mapping
US20080027930A1 (en) * 2006-07-31 2008-01-31 Bohannon Philip L Methods and apparatus for contextual schema mapping of source documents to target documents
US7813948B2 (en) * 2006-08-25 2010-10-12 Sas Institute Inc. Computer-implemented systems and methods for reducing cost flow models
US9202184B2 (en) 2006-09-07 2015-12-01 International Business Machines Corporation Optimizing the selection, verification, and deployment of expert resources in a time of chaos
US8255790B2 (en) * 2006-09-08 2012-08-28 Microsoft Corporation XML based form modification with import/export capability
US8346725B2 (en) * 2006-09-15 2013-01-01 Oracle International Corporation Evolution of XML schemas involving partial data copy
US8645973B2 (en) * 2006-09-22 2014-02-04 Oracle International Corporation Mobile applications
US7870163B2 (en) * 2006-09-28 2011-01-11 Oracle International Corporation Implementation of backward compatible XML schema evolution in a relational database system
JP2010506308A (en) * 2006-10-03 2010-02-25 キューピーエス テック. リミテッド ライアビリティ カンパニー Mechanism for automatic matching of host content and guest content by categorization
US8055603B2 (en) 2006-10-03 2011-11-08 International Business Machines Corporation Automatic generation of new rules for processing synthetic events using computer-based learning processes
US20080294459A1 (en) * 2006-10-03 2008-11-27 International Business Machines Corporation Health Care Derivatives as a Result of Real Time Patient Analytics
US8145582B2 (en) 2006-10-03 2012-03-27 International Business Machines Corporation Synthetic events for real time patient analysis
US7797310B2 (en) 2006-10-16 2010-09-14 Oracle International Corporation Technique to estimate the cost of streaming evaluation of XPaths
KR101200236B1 (en) * 2006-10-31 2012-11-09 에스케이플래닛 주식회사 terminal having a lazy loading function of the wireless internet platform module and controlling method for the same
US7539701B2 (en) 2006-11-20 2009-05-26 Microsoft Corporation Generic infrastructure for migrating data between applications
US7974993B2 (en) * 2006-12-04 2011-07-05 Microsoft Corporation Application loader for support of version management
US8036859B2 (en) * 2006-12-22 2011-10-11 Merced Systems, Inc. Disambiguation with respect to multi-grained dimension coordinates
NO327323B1 (en) * 2007-02-07 2009-06-08 Fast Search & Transfer As Procedure to interface between applications in a system for searching and retrieving information
EP2126828A4 (en) * 2007-02-16 2012-01-25 Bodymedia Inc Systems and methods for understanding and applying the physiological and contextual life patterns of an individual or set of individuals
US20080208735A1 (en) * 2007-02-22 2008-08-28 American Express Travel Related Services Company, Inc., A New York Corporation Method, System, and Computer Program Product for Managing Business Customer Contacts
US7853611B2 (en) 2007-02-26 2010-12-14 International Business Machines Corporation System and method for deriving a hierarchical event based database having action triggers based on inferred probabilities
US7970759B2 (en) 2007-02-26 2011-06-28 International Business Machines Corporation System and method for deriving a hierarchical event based database optimized for pharmaceutical analysis
US7792774B2 (en) 2007-02-26 2010-09-07 International Business Machines Corporation System and method for deriving a hierarchical event based database optimized for analysis of chaotic events
US8332331B2 (en) * 2007-03-19 2012-12-11 Hewlett-Packard Development Company, L.P. Determining a price premium for a project
US8285656B1 (en) 2007-03-30 2012-10-09 Consumerinfo.Com, Inc. Systems and methods for data verification
US7765241B2 (en) * 2007-04-20 2010-07-27 Microsoft Corporation Describing expected entity relationships in a model
US20080275895A1 (en) * 2007-05-03 2008-11-06 Leroux Daniel D Method, system, and program product for aligning models
EP1990740A1 (en) * 2007-05-08 2008-11-12 Sap Ag Schema matching for data migration
US8996394B2 (en) 2007-05-18 2015-03-31 Oracle International Corporation System and method for enabling decision activities in a process management and design environment
US20080301016A1 (en) * 2007-05-30 2008-12-04 American Express Travel Related Services Company, Inc. General Counsel's Office Method, System, and Computer Program Product for Customer Linking and Identification Capability for Institutions
US8185916B2 (en) * 2007-06-28 2012-05-22 Oracle International Corporation System and method for integrating a business process management system with an enterprise service bus
US8024241B2 (en) * 2007-07-13 2011-09-20 Sas Institute Inc. Computer-implemented systems and methods for cost flow analysis
US8271477B2 (en) 2007-07-20 2012-09-18 Informatica Corporation Methods and systems for accessing data
US8170998B2 (en) * 2007-09-12 2012-05-01 American Express Travel Related Services Company, Inc. Methods, systems, and computer program products for estimating accuracy of linking of customer relationships
US8051075B2 (en) * 2007-09-24 2011-11-01 Merced Systems, Inc. Temporally-aware evaluative score
US8060502B2 (en) * 2007-10-04 2011-11-15 American Express Travel Related Services Company, Inc. Methods, systems, and computer program products for generating data quality indicators for relationships in a database
US7930262B2 (en) 2007-10-18 2011-04-19 International Business Machines Corporation System and method for the longitudinal analysis of education outcomes using cohort life cycles, cluster analytics-based cohort analysis, and probabilistic data schemas
US20090165021A1 (en) * 2007-10-23 2009-06-25 Microsoft Corporation Model-Based Composite Application Platform
US8751626B2 (en) * 2007-10-23 2014-06-10 Microsoft Corporation Model-based composite application platform
US7822841B2 (en) * 2007-10-30 2010-10-26 Modern Grids, Inc. Method and system for hosting multiple, customized computing clusters
US7945525B2 (en) * 2007-11-09 2011-05-17 International Business Machines Corporation Methods for obtaining improved text similarity measures which replace similar characters with a string pattern representation by using a semantic data tree
US7788305B2 (en) * 2007-11-13 2010-08-31 Oracle International Corporation Hierarchy nodes derived based on parent/child foreign key and/or range values on parent node
JP4568320B2 (en) * 2007-11-19 2010-10-27 株式会社日立製作所 Processing procedure generation apparatus and processing procedure generation method
US7865489B2 (en) * 2007-11-28 2011-01-04 International Business Machines Corporation System and computer program product for discovering design documents
US7865488B2 (en) * 2007-11-28 2011-01-04 International Business Machines Corporation Method for discovering design documents
US10157195B1 (en) * 2007-11-29 2018-12-18 Bdna Corporation External system integration into automated attribute discovery
US8127986B1 (en) 2007-12-14 2012-03-06 Consumerinfo.Com, Inc. Card registry systems and methods
US9990674B1 (en) 2007-12-14 2018-06-05 Consumerinfo.Com, Inc. Card registry systems and methods
US7779051B2 (en) * 2008-01-02 2010-08-17 International Business Machines Corporation System and method for optimizing federated and ETL'd databases with considerations of specialized data structures within an environment having multidimensional constraints
US7882120B2 (en) * 2008-01-14 2011-02-01 Microsoft Corporation Data description language for record based systems
US7970778B2 (en) * 2008-01-14 2011-06-28 International Business Machines Corporation Automatically persisting data from a model to a database
US9384175B2 (en) * 2008-02-19 2016-07-05 Adobe Systems Incorporated Determination of differences between electronic documents
US8200518B2 (en) 2008-02-25 2012-06-12 Sas Institute Inc. Computer-implemented systems and methods for partial contribution computation in ABC/M models
US8838652B2 (en) * 2008-03-18 2014-09-16 Novell, Inc. Techniques for application data scrubbing, reporting, and analysis
US9405513B2 (en) * 2008-04-18 2016-08-02 Software Ag Systems and methods for graphically developing rules for transforming models between description notations
US7856416B2 (en) * 2008-04-22 2010-12-21 International Business Machines Corporation Automated latent star schema discovery tool
US8266168B2 (en) 2008-04-24 2012-09-11 Lexisnexis Risk & Information Analytics Group Inc. Database systems and methods for linking records and entity representations with sufficiently high confidence
US8312033B1 (en) 2008-06-26 2012-11-13 Experian Marketing Solutions, Inc. Systems and methods for providing an integrated identifier
US7958112B2 (en) 2008-08-08 2011-06-07 Oracle International Corporation Interleaving query transformations for XML indexes
US9256904B1 (en) 2008-08-14 2016-02-09 Experian Information Solutions, Inc. Multi-bureau credit file freeze and unfreeze
US8180810B2 (en) * 2008-08-21 2012-05-15 International Business Machines Corporation Interactive generation of integrated schemas
US9092517B2 (en) 2008-09-23 2015-07-28 Microsoft Technology Licensing, Llc Generating synonyms based on query log data
US20100088262A1 (en) * 2008-09-29 2010-04-08 Neuric Technologies, Llc Emulated brain
US9183260B2 (en) 2008-10-09 2015-11-10 International Business Machines Corporation Node-level sub-queries in distributed databases
US8301583B2 (en) * 2008-10-09 2012-10-30 International Business Machines Corporation Automated data conversion and route tracking in distributed databases
US8060424B2 (en) 2008-11-05 2011-11-15 Consumerinfo.Com, Inc. On-line method and system for monitoring and reporting unused available credit
US8489388B2 (en) 2008-11-10 2013-07-16 Apple Inc. Data detection
US9588806B2 (en) 2008-12-12 2017-03-07 Sap Se Cluster-based business process management through eager displacement and on-demand recovery
US8335773B2 (en) * 2008-12-17 2012-12-18 Sap Ag Stable linking and patchability of business processes through hierarchical versioning
US8346819B2 (en) * 2008-12-22 2013-01-01 Sap Ag Enhanced data conversion framework
US8150723B2 (en) * 2009-01-09 2012-04-03 Yahoo! Inc. Large-scale behavioral targeting for advertising over a network
EP2211277A1 (en) * 2009-01-19 2010-07-28 BRITISH TELECOMMUNICATIONS public limited company Method and apparatus for generating an integrated view of multiple databases
US8630997B1 (en) * 2009-03-05 2014-01-14 Cisco Technology, Inc. Streaming event processing
EP2406934B1 (en) * 2009-03-10 2014-10-08 Telefonaktiebolaget LM Ericsson (publ) Ip multimedia subsystem service configuration
US8463720B1 (en) 2009-03-27 2013-06-11 Neuric Technologies, Llc Method and apparatus for defining an artificial brain via a plurality of concept nodes defined by frame semantics
GB0906004D0 (en) * 2009-04-07 2009-05-20 Omnifone Ltd MusicStation desktop
US8484148B2 (en) * 2009-05-28 2013-07-09 Microsoft Corporation Predicting whether strings identify a same subject
US8515957B2 (en) 2009-07-28 2013-08-20 Fti Consulting, Inc. System and method for displaying relationships between electronically stored information to provide classification suggestions via injection
CA3026879A1 (en) 2009-08-24 2011-03-10 Nuix North America, Inc. Generating a reference set for use during document review
US20110072023A1 (en) * 2009-09-21 2011-03-24 Yahoo! Inc. Detect, Index, and Retrieve Term-Group Attributes for Network Search
US9020944B2 (en) * 2009-10-29 2015-04-28 International Business Machines Corporation Systems and methods for organizing documented processes
US9411859B2 (en) 2009-12-14 2016-08-09 Lexisnexis Risk Solutions Fl Inc External linking based on hierarchical level weightings
JP5768063B2 (en) * 2010-01-13 2015-08-26 アビニシオ テクノロジー エルエルシー Matching metadata sources using rules that characterize conformance
US20110202484A1 (en) * 2010-02-18 2011-08-18 International Business Machines Corporation Analyzing parallel topics from correlated documents
US8745096B1 (en) * 2010-03-31 2014-06-03 Amazon Technologies, Inc. Techniques for aggregating data from multiple sources
US8682898B2 (en) * 2010-04-30 2014-03-25 International Business Machines Corporation Systems and methods for discovering synonymous elements using context over multiple similar addresses
US8355905B2 (en) 2010-05-14 2013-01-15 International Business Machines Corporation Mapping of relationship entities between ontologies
US9037615B2 (en) 2010-05-14 2015-05-19 International Business Machines Corporation Querying and integrating structured and unstructured data
US9600566B2 (en) 2010-05-14 2017-03-21 Microsoft Technology Licensing, Llc Identifying entity synonyms
CN102906697B (en) 2010-06-03 2015-11-25 国际商业机器公司 For the method and system of user's interface unit adaptation data model
JP5372853B2 (en) * 2010-07-08 2013-12-18 株式会社日立製作所 Digital sequence feature amount calculation method and digital sequence feature amount calculation apparatus
KR101130734B1 (en) * 2010-08-12 2012-03-28 연세대학교 산학협력단 Method for generating context hierarchy and system for generating context hierarchy
US10089390B2 (en) 2010-09-24 2018-10-02 International Business Machines Corporation System and method to extract models from semi-structured documents
US10318877B2 (en) 2010-10-19 2019-06-11 International Business Machines Corporation Cohort-based prediction of a future event
US9147042B1 (en) 2010-11-22 2015-09-29 Experian Information Solutions, Inc. Systems and methods for data verification
EP2469421A1 (en) * 2010-12-23 2012-06-27 British Telecommunications Public Limited Company Method and apparatus for processing electronic data
US8762428B2 (en) * 2011-06-06 2014-06-24 International Business Machines Corporation Rapidly deploying virtual database applications using data model analysis
US9665854B1 (en) 2011-06-16 2017-05-30 Consumerinfo.Com, Inc. Authentication alerts
US9483606B1 (en) 2011-07-08 2016-11-01 Consumerinfo.Com, Inc. Lifescore
US8972387B2 (en) * 2011-07-28 2015-03-03 International Business Machines Corporation Smarter search
US8577938B2 (en) 2011-08-23 2013-11-05 Accenture Global Services Limited Data mapping acceleration
US9106691B1 (en) 2011-09-16 2015-08-11 Consumerinfo.Com, Inc. Systems and methods of identity protection and management
US8738516B1 (en) 2011-10-13 2014-05-27 Consumerinfo.Com, Inc. Debt services candidate locator
CA2860322C (en) 2011-12-23 2017-06-27 Amiato, Inc. Scalable analysis platform for semi-structured data
US9171081B2 (en) * 2012-03-06 2015-10-27 Microsoft Technology Licensing, Llc Entity augmentation service from latent relational data
US20130238669A1 (en) * 2012-03-09 2013-09-12 Business Objects Software Ltd Using Target Columns in Data Transformation
US8676866B2 (en) * 2012-03-19 2014-03-18 Sap Ag Computing canonical hierarchical schemas
US9600795B2 (en) * 2012-04-09 2017-03-21 International Business Machines Corporation Measuring process model performance and enforcing process performance policy
WO2013159246A1 (en) * 2012-04-28 2013-10-31 Hewlett-Packard Development Company, L.P. Detecting valuable sections in webpage
US9853959B1 (en) 2012-05-07 2017-12-26 Consumerinfo.Com, Inc. Storage and maintenance of personal data
GB2502531A (en) * 2012-05-29 2013-12-04 Ibm De-serializing a source object to a target object
US10229200B2 (en) 2012-06-08 2019-03-12 International Business Machines Corporation Linking data elements based on similarity data values and semantic annotations
US10032131B2 (en) 2012-06-20 2018-07-24 Microsoft Technology Licensing, Llc Data services for enterprises leveraging search system data assets
US9594831B2 (en) 2012-06-22 2017-03-14 Microsoft Technology Licensing, Llc Targeted disambiguation of named entities
US9229924B2 (en) 2012-08-24 2016-01-05 Microsoft Technology Licensing, Llc Word detection and domain dictionary recommendation
US8977622B1 (en) * 2012-09-17 2015-03-10 Amazon Technologies, Inc. Evaluation of nodes
US9654541B1 (en) 2012-11-12 2017-05-16 Consumerinfo.Com, Inc. Aggregating user web browsing data
US9916621B1 (en) 2012-11-30 2018-03-13 Consumerinfo.Com, Inc. Presentation of credit score factors
US20140156329A1 (en) * 2012-11-30 2014-06-05 Dassault Systemes DELMIA Corp. Canonical Availability Representations For Bills Of Materials
US10255598B1 (en) 2012-12-06 2019-04-09 Consumerinfo.Com, Inc. Credit card account data extraction
US20140222793A1 (en) * 2013-02-07 2014-08-07 Parlance Corporation System and Method for Automatically Importing, Refreshing, Maintaining, and Merging Contact Sets
US9894163B2 (en) * 2013-03-01 2018-02-13 Nexus Vesting Group, LLC Service request management methods and apparatus
US9697263B1 (en) 2013-03-04 2017-07-04 Experian Information Solutions, Inc. Consumer data request fulfillment system
US9406085B1 (en) 2013-03-14 2016-08-02 Consumerinfo.Com, Inc. System and methods for credit dispute processing, resolution, and reporting
US9870589B1 (en) 2013-03-14 2018-01-16 Consumerinfo.Com, Inc. Credit utilization tracking and reporting
US10102570B1 (en) 2013-03-14 2018-10-16 Consumerinfo.Com, Inc. Account vulnerability alerts
US10664936B2 (en) 2013-03-15 2020-05-26 Csidentity Corporation Authentication systems and methods for on-demand products
JP6416194B2 (en) * 2013-03-15 2018-10-31 アマゾン・テクノロジーズ・インコーポレーテッド Scalable analytic platform for semi-structured data
US9633322B1 (en) 2013-03-15 2017-04-25 Consumerinfo.Com, Inc. Adjustment of knowledge-based authentication
US10685398B1 (en) 2013-04-23 2020-06-16 Consumerinfo.Com, Inc. Presenting credit score information
US9710534B2 (en) 2013-05-07 2017-07-18 International Business Machines Corporation Methods and systems for discovery of linkage points between data sources
CN103294791A (en) * 2013-05-13 2013-09-11 西安电子科技大学 Extensible markup language pattern matching method
US9721147B1 (en) 2013-05-23 2017-08-01 Consumerinfo.Com, Inc. Digital identity
US20140379753A1 (en) * 2013-06-25 2014-12-25 Hewlett-Packard Development Company, L.P. Ambiguous queries in configuration management databases
US9244949B2 (en) * 2013-06-27 2016-01-26 International Business Machines Corporation Determining mappings for application integration based on user contributions
US9477934B2 (en) 2013-07-16 2016-10-25 Sap Portals Israel Ltd. Enterprise collaboration content governance framework
KR101519879B1 (en) * 2013-07-22 2015-05-14 광주과학기술원 Apparatus for recommending contents using hierarchical context model and method thereof
US9311429B2 (en) 2013-07-23 2016-04-12 Sap Se Canonical data model for iterative effort reduction in business-to-business schema integration
US9443268B1 (en) 2013-08-16 2016-09-13 Consumerinfo.Com, Inc. Bill payment and reporting
US9753928B1 (en) * 2013-09-19 2017-09-05 Trifacta, Inc. System and method for identifying delimiters in a computer file
CN104615600B (en) * 2013-11-04 2019-06-28 深圳力维智联技术有限公司 Similitude case compares implementation method and its device
US10325314B1 (en) 2013-11-15 2019-06-18 Consumerinfo.Com, Inc. Payment reporting systems
US10102536B1 (en) 2013-11-15 2018-10-16 Experian Information Solutions, Inc. Micro-geographic aggregation system
US9477737B1 (en) 2013-11-20 2016-10-25 Consumerinfo.Com, Inc. Systems and user interfaces for dynamic access of multiple remote databases and synchronization of data based on user rules
US9529851B1 (en) 2013-12-02 2016-12-27 Experian Information Solutions, Inc. Server architecture for electronic data quality processing
US9600793B2 (en) * 2013-12-09 2017-03-21 International Business Machines Corporation Active odor cancellation
US9898707B2 (en) 2013-12-16 2018-02-20 Dassault Systemes Americas Corp. Validation of end-item completeness for product demands
US10262362B1 (en) 2014-02-14 2019-04-16 Experian Information Solutions, Inc. Automatic generation of code for attributes
USD760256S1 (en) 2014-03-25 2016-06-28 Consumerinfo.Com, Inc. Display screen or portion thereof with graphical user interface
USD759690S1 (en) 2014-03-25 2016-06-21 Consumerinfo.Com, Inc. Display screen or portion thereof with graphical user interface
USD759689S1 (en) 2014-03-25 2016-06-21 Consumerinfo.Com, Inc. Display screen or portion thereof with graphical user interface
US9892457B1 (en) 2014-04-16 2018-02-13 Consumerinfo.Com, Inc. Providing credit data in search results
US10373240B1 (en) 2014-04-25 2019-08-06 Csidentity Corporation Systems, methods and computer-program products for eligibility verification
WO2016048326A1 (en) * 2014-09-25 2016-03-31 Hewlett Packard Enterprise Development Lp Identification of a component for upgrade
US9558244B2 (en) * 2014-10-22 2017-01-31 Conversable, Inc. Systems and methods for social recommendations
US10333696B2 (en) 2015-01-12 2019-06-25 X-Prime, Inc. Systems and methods for implementing an efficient, scalable homomorphic transformation of encrypted data with minimal data expansion and improved processing efficiency
US10331633B2 (en) 2015-06-04 2019-06-25 International Business Machines Corporation Schema discovery through statistical transduction
US9760690B1 (en) * 2016-03-10 2017-09-12 Siemens Healthcare Gmbh Content-based medical image rendering based on machine learning
EP3239863A1 (en) * 2016-04-29 2017-11-01 QlikTech International AB System and method for interactive discovery of inter-data set relationships
US10726036B2 (en) * 2016-05-16 2020-07-28 Sap Se Source service mapping for collaborative platforms
WO2017210618A1 (en) 2016-06-02 2017-12-07 Fti Consulting, Inc. Analyzing clusters of coded documents
US11023483B2 (en) * 2016-08-04 2021-06-01 International Business Machines Corporation Model-driven profiling job generator for data sources
US10324908B2 (en) * 2016-09-01 2019-06-18 Sap Se Exposing database artifacts
JP6764779B2 (en) * 2016-12-26 2020-10-07 株式会社日立製作所 Synonymous column candidate selection device, synonymous column candidate selection method, and synonymous column candidate selection program
US11227001B2 (en) 2017-01-31 2022-01-18 Experian Information Solutions, Inc. Massive scale heterogeneous data ingestion and user resolution
US11568129B2 (en) * 2017-02-16 2023-01-31 North Carolina State University Spreadsheet recalculation algorithm for directed acyclic graph processing
US11119992B2 (en) * 2017-04-25 2021-09-14 Petuum, Inc. System for automated data engineering for large scale machine learning
US10635789B1 (en) * 2017-06-21 2020-04-28 Amazon Technologies, Inc. Request authorization using recipe-based service coordination
US10891275B2 (en) * 2017-12-26 2021-01-12 International Business Machines Corporation Limited data enricher
US10664133B1 (en) * 2018-01-24 2020-05-26 InVisionApp Inc. Automated linking and merging of hierarchical data structures for animated transitions
EP3561689A1 (en) * 2018-04-23 2019-10-30 QlikTech International AB Knowledge graph data structures and uses thereof
US11314807B2 (en) * 2018-05-18 2022-04-26 Xcential Corporation Methods and systems for comparison of structured documents
US10911234B2 (en) 2018-06-22 2021-02-02 Experian Information Solutions, Inc. System and method for a token gateway environment
CN109063114B (en) * 2018-07-27 2020-11-24 华南理工大学广州学院 Heterogeneous data integration method and device for energy cloud platform, terminal and storage medium
US11074230B2 (en) * 2018-09-04 2021-07-27 International Business Machines Corporation Data matching accuracy based on context features
US20200074541A1 (en) 2018-09-05 2020-03-05 Consumerinfo.Com, Inc. Generation of data structures based on categories of matched data items
US10963434B1 (en) 2018-09-07 2021-03-30 Experian Information Solutions, Inc. Data architecture for supporting multiple search models
US11315179B1 (en) 2018-11-16 2022-04-26 Consumerinfo.Com, Inc. Methods and apparatuses for customized card recommendations
US10725748B2 (en) * 2018-11-19 2020-07-28 Microsoft Technology Licensing, Llc Extracting program features for assisting software development
US11556699B2 (en) * 2019-02-04 2023-01-17 Citrix Systems, Inc. Data migration across SaaS applications
US10684966B1 (en) 2019-02-21 2020-06-16 Amazon Technologies, Inc. Orchestrating dataflows with inferred data store interactions
US11238656B1 (en) 2019-02-22 2022-02-01 Consumerinfo.Com, Inc. System and method for an augmented reality experience via an artificial intelligence bot
CN109857924A (en) * 2019-02-28 2019-06-07 重庆科技学院 A kind of big data analysis monitor information processing system and method
JP7042545B2 (en) * 2019-07-29 2022-03-28 国立研究開発法人理化学研究所 Data interpreters, methods and programs, data integration devices, methods and programs, and digital city building systems
US10628630B1 (en) * 2019-08-14 2020-04-21 Appvance Inc. Method and apparatus for generating a state machine model of an application using models of GUI objects and scanning modes
US11556848B2 (en) * 2019-10-21 2023-01-17 International Business Machines Corporation Resolving conflicts between experts' intuition and data-driven artificial intelligence models
EP4035025A1 (en) * 2019-11-06 2022-08-03 Google LLC Method and apparatus for smart and extensible schema matching framework
CN112818593B (en) * 2021-01-22 2023-07-14 中车工业研究院有限公司 Product configuration method and device based on modularized design
CN113157960A (en) * 2021-02-25 2021-07-23 北京金堤科技有限公司 Method and device for acquiring similar data, electronic equipment and computer readable storage medium
US11880377B1 (en) 2021-03-26 2024-01-23 Experian Information Solutions, Inc. Systems and methods for entity resolution
CN113515677B (en) * 2021-07-22 2023-10-27 中移(杭州)信息技术有限公司 Address matching method, device and computer readable storage medium
US11836120B2 (en) * 2021-07-23 2023-12-05 Oracle International Corporation Machine learning techniques for schema mapping
JP2023043079A (en) * 2021-09-15 2023-03-28 株式会社東芝 Information processing device, information processing method, and program
US11816154B2 (en) * 2021-10-21 2023-11-14 EMC IP Holding Company LLC Methods and systems for generating a unified metadata model
US20230229639A1 (en) * 2022-01-18 2023-07-20 Optum, Inc. Predictive recommendations for schema mapping

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6397166B1 (en) * 1998-11-06 2002-05-28 International Business Machines Corporation Method and system for model-based clustering and signal-bearing medium for storing program of same

Cited By (110)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030195890A1 (en) * 2002-04-05 2003-10-16 Oommen John B. Method of comparing the closeness of a target tree to other trees using noisy sub-sequence tree processing
US7287026B2 (en) * 2002-04-05 2007-10-23 Oommen John B Method of comparing the closeness of a target tree to other trees using noisy sub-sequence tree processing
US7533102B2 (en) * 2002-08-29 2009-05-12 International Business Machines Corporation Method and apparatus for converting legacy programming language data structures to schema definitions
US20040044678A1 (en) * 2002-08-29 2004-03-04 International Business Machines Corporation Method and apparatus for converting legacy programming language data structures to schema definitions
US8121976B2 (en) 2002-08-29 2012-02-21 International Business Machines Corporation Method and apparatus for converting legacy programming language data structures to schema definitions
US20040068498A1 (en) * 2002-10-07 2004-04-08 Richard Patchet Parallel tree searches for matching multiple, hierarchical data structures
US7058644B2 (en) * 2002-10-07 2006-06-06 Click Commerce, Inc. Parallel tree searches for matching multiple, hierarchical data structures
US8065608B2 (en) * 2003-09-12 2011-11-22 International Business Machines Corporation System for validating a document conforming to a first schema with respect to a second schema
US20090063952A1 (en) * 2003-09-12 2009-03-05 Mukund Raghavachari System for validating a document conforming to a first schema with respect to a second schema
US20100223268A1 (en) * 2004-08-27 2010-09-02 Yannis Papakonstantinou Searching Digital Information and Databases
US7698267B2 (en) * 2004-08-27 2010-04-13 The Regents Of The University Of California Searching digital information and databases
US20070192306A1 (en) * 2004-08-27 2007-08-16 Yannis Papakonstantinou Searching digital information and databases
US8862594B2 (en) 2004-08-27 2014-10-14 The Regents Of The University Of California Searching digital information and databases
US7620641B2 (en) 2004-12-22 2009-11-17 International Business Machines Corporation System and method for context-sensitive decomposition of XML documents based on schemas with reusable element/attribute declarations
US20060136483A1 (en) * 2004-12-22 2006-06-22 International Business Machines Corporation System and method of decomposition of multiple items into the same table-column pair
US20060136435A1 (en) * 2004-12-22 2006-06-22 International Business Machines Corporation System and method for context-sensitive decomposition of XML documents based on schemas with reusable element/attribute declarations
US20060161560A1 (en) * 2005-01-14 2006-07-20 Fatlens, Inc. Method and system to compare data objects
US20060173753A1 (en) * 2005-01-14 2006-08-03 Fatlens, Inc. Method and system for online shopping
US7440955B2 (en) * 2005-01-14 2008-10-21 Im2, Inc Method and system to compare data objects
US20110131545A1 (en) * 2005-02-18 2011-06-02 Vasile Patrascu Stepwise template integration method and system
US8332806B2 (en) 2005-02-18 2012-12-11 International Business Machines Corporation Stepwise template integration method and system
US8943461B2 (en) 2005-02-18 2015-01-27 International Business Machines Corporation Stepwise template integration method and system
US9052879B2 (en) * 2005-02-18 2015-06-09 International Business Machines Corporation Mapping assurance method and apparatus for integrating systems
US20060190931A1 (en) * 2005-02-18 2006-08-24 Scott George M Mapping assurance method and apparatus for integrating systems
US7711752B2 (en) * 2005-06-10 2010-05-04 Canon Kabushiki Kaisha Information processing apparatus, method of controlling information processing apparatus, computer program, and storage medium
US20060282402A1 (en) * 2005-06-10 2006-12-14 Canon Kabushiki Kaisha Information processing apparatus, method of controlling information processing apparatus, computer program, and storage medium
US20070005658A1 (en) * 2005-07-02 2007-01-04 International Business Machines Corporation System, service, and method for automatically discovering universal data objects
US7818719B2 (en) 2005-07-29 2010-10-19 Microsoft Corporation Extending expression-based syntax for creating object instances
US20070028163A1 (en) * 2005-07-29 2007-02-01 Microsoft Corporation Lightweight application program interface (API) for extensible markup language (XML)
US7409636B2 (en) * 2005-07-29 2008-08-05 Microsoft Corporation Lightweight application program interface (API) for extensible markup language (XML)
US20070028209A1 (en) * 2005-07-29 2007-02-01 Microsoft Corporation Architecture that extends types using extension methods
US20070028212A1 (en) * 2005-07-29 2007-02-01 Microsoft Corporation Extending expression-based syntax for creating object instances
US8370801B2 (en) 2005-07-29 2013-02-05 Microsoft Corporation Architecture that extends types using extension methods
US20070027905A1 (en) * 2005-07-29 2007-02-01 Microsoft Corporation Intelligent SQL generation for persistent object retrieval
US20070027849A1 (en) * 2005-07-29 2007-02-01 Microsoft Corporation Integrating query-related operators in a programming language
US20070027907A1 (en) * 2005-07-29 2007-02-01 Microsoft Corporation Code generation patterns
US20070027862A1 (en) * 2005-07-29 2007-02-01 Microsoft Corporation Anonymous types for statically typed queries
US20070027906A1 (en) * 2005-07-29 2007-02-01 Microsoft Corporation Retrieving and persisting objects from/to relational databases
US20100175048A1 (en) * 2005-07-29 2010-07-08 Microsoft Corporation Architecture that extends types using extension methods
US7743066B2 (en) 2005-07-29 2010-06-22 Microsoft Corporation Anonymous types for statically typed queries
US7631011B2 (en) 2005-07-29 2009-12-08 Microsoft Corporation Code generation patterns
US20070044083A1 (en) * 2005-07-29 2007-02-22 Microsoft Corporation Lambda expressions
US7685567B2 (en) 2005-07-29 2010-03-23 Microsoft Corporation Architecture that extends types using extension methods
US7702686B2 (en) 2005-07-29 2010-04-20 Microsoft Corporation Retrieving and persisting objects from/to relational databases
US20070035558A1 (en) * 2005-08-11 2007-02-15 International Business Machines Corporation Visual model importation
US8711142B2 (en) * 2005-08-11 2014-04-29 International Business Machines Corporation Visual model importation
US20070067343A1 (en) * 2005-09-21 2007-03-22 International Business Machines Corporation Determining the structure of relations and content of tuples from XML schema components
US20070083503A1 (en) * 2005-10-07 2007-04-12 Oracle International Corporation Generating a synonym dictionary representing a mapping of elements in different data models
US7600186B2 (en) * 2005-10-07 2009-10-06 Oracle International Corporation Generating a synonym dictionary representing a mapping of elements in different data models
US20100005074A1 (en) * 2005-10-17 2010-01-07 Steve Endacott System and method for accessing data
US7774300B2 (en) 2005-12-09 2010-08-10 International Business Machines Corporation System and method for data model and content migration in content management applications
US20070136353A1 (en) * 2005-12-09 2007-06-14 International Business Machines Corporation System and method for data model and content migration in content management application
US20070185868A1 (en) * 2006-02-08 2007-08-09 Roth Mary A Method and apparatus for semantic search of schema repositories
US7529758B2 (en) 2006-02-10 2009-05-05 International Business Machines Corporation Method for pre-processing mapping information for efficient decomposition of XML documents
US20080281842A1 (en) * 2006-02-10 2008-11-13 International Business Machines Corporation Apparatus and method for pre-processing mapping information for efficient decomposition of xml documents
US8635594B1 (en) * 2006-03-30 2014-01-21 Emc Corporation Script language for storage management operations
US20090040941A1 (en) * 2006-04-14 2009-02-12 Huawei Technologies Co., Ltd. Method and system for measuring network performance
TWI449908B (en) * 2007-01-26 2014-08-21 Japan Steel Works Ltd Hydrogen residual sensor
US8819079B2 (en) * 2007-02-02 2014-08-26 Rogers Family Trust System and method for defining application definition functionality for general purpose web presences
US10120952B2 (en) 2007-02-02 2018-11-06 Rogers Family Trust System and method for defining application definition functionality for general purpose web presences
US20080189303A1 (en) * 2007-02-02 2008-08-07 Alan Bush System and method for defining application definition functionality for general purpose web presences
WO2011139258A3 (en) * 2007-02-26 2011-12-22 Microsoft Corporation Parameterized types and elements in xml schema
WO2011139258A2 (en) * 2007-02-26 2011-11-10 Microsoft Corporation Parameterized types and elements in xml schema
US8060868B2 (en) 2007-06-21 2011-11-15 Microsoft Corporation Fully capturing outer variables as data objects
US20080320440A1 (en) * 2007-06-21 2008-12-25 Microsoft Corporation Fully capturing outer variables as data objects
US20090006315A1 (en) * 2007-06-29 2009-01-01 Sougata Mukherjea Structured method for schema matching using multiple levels of ontologies
US20090271765A1 (en) * 2008-04-29 2009-10-29 Microsoft Corporation Consumer and producer specific semantics of shared object protocols
US20100077174A1 (en) * 2008-09-19 2010-03-25 Nokia Corporation Memory allocation to store broadcast information
US9043470B2 (en) 2008-09-19 2015-05-26 Core Wireless Licensing, S.a.r.l. Memory allocation to store broadcast information
US8341267B2 (en) * 2008-09-19 2012-12-25 Core Wireless Licensing S.A.R.L. Memory allocation to store broadcast information
CN103345464A (en) * 2008-09-30 2013-10-09 Microsoft Corporation Modular forest automata
US8176085B2 (en) * 2008-09-30 2012-05-08 Microsoft Corporation Modular forest automata
US20100094906A1 (en) * 2008-09-30 2010-04-15 Microsoft Corporation Modular forest automata
US8423523B2 (en) * 2008-11-13 2013-04-16 SAP France S.A. Apparatus and method for utilizing context to resolve ambiguous queries
US20100121837A1 (en) * 2008-11-13 2010-05-13 Business Objects, S.A. Apparatus and Method for Utilizing Context to Resolve Ambiguous Queries
US20120203743A1 (en) * 2008-12-16 2012-08-09 International Business Machines Corporation Re-establishing traceability
US8775481B2 (en) * 2008-12-16 2014-07-08 International Business Machines Corporation Re-establishing traceability
US20100251156A1 (en) * 2009-03-31 2010-09-30 American Express Travel Related Services Company, Inc. Facilitating Discovery and Re-Use of Information Constructs
US9053180B2 (en) 2009-12-17 2015-06-09 International Business Machines Corporation Identifying common data objects representing solutions to a problem in different disciplines
US8793208B2 (en) 2009-12-17 2014-07-29 International Business Machines Corporation Identifying common data objects representing solutions to a problem in different disciplines
US20110153539A1 (en) * 2009-12-17 2011-06-23 International Business Machines Corporation Identifying common data objects representing solutions to a problem in different disciplines
US9460231B2 (en) * 2010-03-26 2016-10-04 British Telecommunications Public Limited Company System of generating new schema based on selective HTML elements
US20130019163A1 (en) * 2010-03-26 2013-01-17 British Telecommunications Public Limited Company System
US8739118B2 (en) 2010-04-08 2014-05-27 Microsoft Corporation Pragmatic mapping specification, compilation and validation
US20120078913A1 (en) * 2010-09-23 2012-03-29 Infosys Technologies Limited System and method for schema matching
US8386493B2 (en) * 2010-09-23 2013-02-26 Infosys Technologies Limited System and method for schema matching
US8583685B2 (en) 2010-11-02 2013-11-12 Alibaba Group Holding Limited Determination of category information using multiple stages
WO2012060866A1 (en) * 2010-11-02 2012-05-10 Alibaba Group Holding Limited Determination of category information using multiple stages
US8832012B2 (en) * 2011-01-14 2014-09-09 Hewlett-Packard Development Company, L.P. System and method for tree discovery
US8730843B2 (en) 2011-01-14 2014-05-20 Hewlett-Packard Development Company, L.P. System and method for tree assessment
US9817918B2 (en) 2011-01-14 2017-11-14 Hewlett Packard Enterprise Development Lp Sub-tree similarity for component substitution
US20120185421A1 (en) * 2011-01-14 2012-07-19 Naren Sundaravaradan System and method for tree discovery
US9589021B2 (en) 2011-10-26 2017-03-07 Hewlett Packard Enterprise Development Lp System deconstruction for component substitution
US9298817B2 (en) 2012-03-28 2016-03-29 International Business Machines Corporation Building an ontology by transforming complex triples
US9489453B2 (en) 2012-03-28 2016-11-08 International Business Machines Corporation Building an ontology by transforming complex triples
US8747115B2 (en) 2012-03-28 2014-06-10 International Business Machines Corporation Building an ontology by transforming complex triples
US20140016038A1 (en) * 2012-05-28 2014-01-16 Tektronix, Inc. Heuristic method for drop frame detection in digital baseband video
US8539001B1 (en) * 2012-08-20 2013-09-17 International Business Machines Corporation Determining the value of an association between ontologies
US8799330B2 (en) 2012-08-20 2014-08-05 International Business Machines Corporation Determining the value of an association between ontologies
RU2617921C2 (en) * 2012-12-25 2017-04-28 Beijing Jingdong Shangke Information Technology Co., Ltd. Category path recognition method and system
US20150161181A1 (en) * 2013-12-09 2015-06-11 Andreas Doms Schema-based application model validation in a database
US9535935B2 (en) * 2013-12-09 2017-01-03 Sap Se Schema-based application model validation in a database
US10956381B2 (en) * 2014-11-14 2021-03-23 Adp, Llc Data migration system
US20160343077A1 (en) * 2015-05-18 2016-11-24 Fmr Llc Probabilistic Analysis Trading Platform Apparatuses, Methods and Systems
US20170069020A1 (en) * 2015-09-04 2017-03-09 Oracle International Corporation Xbrl comparative reporting
US10636086B2 (en) * 2015-09-04 2020-04-28 Oracle International Corporation XBRL comparative reporting
US20200097811A1 (en) * 2018-09-25 2020-03-26 International Business Machines Corporation Reinforcement learning by sharing individual data within dynamic groups
US10936819B2 (en) * 2019-02-19 2021-03-02 International Business Machines Corporation Query-directed discovery and alignment of collections of document passages for improving named entity disambiguation precision
US11132358B2 (en) 2019-02-19 2021-09-28 International Business Machines Corporation Candidate name generation
US11226972B2 (en) 2019-02-19 2022-01-18 International Business Machines Corporation Ranking collections of document passages associated with an entity name by relevance to a query

Also Published As

Publication number Publication date
US20050060332A1 (en) 2005-03-17
US7444330B2 (en) 2008-10-28
US6826568B2 (en) 2004-11-30
US20030120651A1 (en) 2003-06-26

Similar Documents

Publication Publication Date Title
US7444330B2 (en) Methods and systems for model matching
Madhavan et al. Generic schema matching with cupid
Patel-Schneider et al. The Yin/Yang web: XML syntax and RDF semantics
US20040199905A1 (en) System and method for translating data from a source schema to a target schema
Kaushik et al. Exploiting local similarity for indexing paths in graph-structured data
Hao et al. Web services discovery and rank: An information retrieval approach
US6636845B2 (en) Generating one or more XML documents from a single SQL query
US7272595B2 (en) Information search support system, application server, information search method, and program product
US6934712B2 (en) Tagging XML query results over relational DBMSs
Song et al. An ontology-driven framework towards building enterprise semantic information layer
EP1686495B1 (en) Mapping web services to ontologies
US20040172237A1 (en) Creation of structured data from plain text
KR100701104B1 (en) Method of generating database schema to provide integrated view of dispersed information and integrating system of information
US20060161525A1 (en) Method and system for supporting structured aggregation operations on semi-structured data
Patel-Schneider et al. The Yin/Yang Web: A unified model for XML syntax and RDF semantics
US7877400B1 (en) Optimizations of XPaths
Mukkala et al. Current state of ontology matching. A survey of ontology and schema matching
Arocena WebOQL: Exploiting document structure in web queries
US7805424B2 (en) Querying nested documents embedded in compound XML documents
Saake et al. Rule-based schema matching for ontology-based mediators
Broekstra et al. The state of the art on representation and query languages for semistructured data
KR100487738B1 (en) Apparatus and method XML document retrieval supporting XML query language tightly-coupled with database query language
Nottelmann et al. Combining DAML+OIL, XSLT and probabilistic logics for uncertain schema mappings in MIND
Stárka Similarity of xml data
Kotsakis XSD: A hierarchical access method for indexing XML schemata

Legal Events

Date Code Title Description
STCB Information on status: application discontinuation. Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION
AS Assignment. Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:034766/0001. Effective date: 20141014