|Publication number||US8219518 B2|
|Application number||US 11/621,521|
|Publication date||Jul 10, 2012|
|Filing date||Jan 9, 2007|
|Priority date||Jan 9, 2007|
|Also published as||US8903762, US20080168082, US20120271865|
|Inventors||Qi Jin, Hui Liao, Sriram Srinivasan, Lin Xu|
|Original Assignee||International Business Machines Corporation|
This application is related to U.S. Patent Application entitled “System and Method for Generating Code for an Integrated Data System,” Ser. No. 11/372,540, filed on Mar. 10, 2006, U.S. Patent Application entitled “Data Flow System and Method for Heterogeneous Data Integration Environments,” Ser. No. 11/373,685, filed on Mar. 10, 2006, U.S. Patent Application entitled “Dilation of Sub-Flow Operators in a Data Flow,” Ser. No. 11/372,516, filed on Mar. 10, 2006, U.S. Patent Application entitled “Classification and Sequencing of Mixed Data Flows,” Ser. No. 11/373,084, filed on Mar. 10, 2006, U.S. Patent Application entitled “Method and Apparatus for Managing Application Parameters,” Ser. No. 11/548,632, filed on Oct. 11, 2006, U.S. Patent Application entitled “Method and Apparatus for Generating Code for an Extract, Transform, and Load (ETL) Data Flow,” Ser. No. 11/548,659, filed on Oct. 11, 2006, and U.S. Patent Application entitled “Method and Apparatus for Using Set Based Structured Query Language (SQL) to Implement Extract, Transform, and Load (ETL) Splitter Operation,” Ser. No. 11/610,480, filed on Dec. 13, 2006, the disclosures of which are incorporated by reference herein.
The present invention relates generally to data processing, and more particularly to modeling data exchange in a data flow associated with an extract, transform, and load (ETL) process.
Extract, transform, and load (ETL) is a process in data warehousing that involves extracting data from outside sources, transforming the data in accordance with particular business needs, and loading the data into a data warehouse. An ETL process typically begins with a user defining a data flow that defines data transformation activities that extract data from, e.g., flat files or relational tables, transform the data, and load the data into a data warehouse, data mart, or staging table. A data flow, therefore, typically includes a sequence of operations modeled as data flowing from various types of sources, through various transformations, and finally ending in one or more targets, as described in U.S. patent application entitled “Classification and Sequencing of Mixed Data Flows” incorporated by reference above. In the course of execution of a data flow, data sometimes needs to be exchanged or staged at intermediate points within the data flow. The staging of data typically includes saving the data temporarily either in a structured physical storage medium (such as in a simple file) or in database temporary tables or persistent tables. In some cases, it may be optimal to save rows of data in the processing program's memory itself, especially when large and fast caches are present in the system (such “staging” is often referred to as “caching”).
ETL vendors conventionally support data exchange and staging internally inside of an ETL engine in a proprietary fashion, especially if the ETL engine is running outside of a relational database. For example, the DataStage ETL engine permits users to build “stages” of operations (i.e., discrete steps in the transformation sequence) and physically move rows between different stage components in memory. (Note: the term “stage,” as used in the context of the DataStage engine, does not refer to the concept of saving rows to a physical medium, but rather to unique operational steps.) This method typically allows for some types of performance optimizations; however, the rows of data being moved between the different stages are usually in an internal format (stored in internal memory formats in buffer pools), and the only way a user can view the rows of data is to explicitly define a File Target (or a Table Target) in the data flow and force the rows of data to be saved into a file (or a table). That is, only the target of such a data flow can physically export the rows into a user-recognizable format.
Accordingly, a common problem of conventional data exchange and staging techniques is that users are not able to specify staging points explicitly and directly in the middle of a data flow, but only at the end of a transformation sequence using target operators. Target operators typically do not serve as exchange operators, since target operators are destinations. For example, if a user needs to extract rows from a SQL (structured query language) table and pass the rows as input to another type of system which requires a file as input, then the user would have to represent such a process with a first job, defined as a Table Source operation followed by a File Target or Export operation having a specific file name. The user would then have to schedule a second (separate) job to invoke an operation that uses the file as input.
In general, this specification describes methods, systems, and computer program products for generating code from a data flow associated with an extract, transform, and load (ETL) process. In one implementation, the method includes identifying a data exchange requirement between a first operator and a second operator in the data flow. The first operator is a graphical object that represents a first data transformation step in the data flow and is associated with a first type of runtime engine, and the second operator is a graphical object that represents a second data transformation step in the data flow and is associated with a second type of runtime engine. The method further includes generating code to manage data staging between the first operator and the second operator in the data flow associated with the ETL process. The code exchanges data from a format associated with the first type of runtime engine to a format associated with the second type of runtime engine.
Particular implementations can include one or more of the following advantages. In one aspect, a data station operator is provided that can be inserted into a data flow of an ETL process, in which the data station operator represents a staging point in a data flow. The staging is done to store intermediate processed data for the purposes of tracking, debugging, ease of data recovery, and optimization. In one implementation, the data station operator also permits data exchange between two linked operators that are incompatible, within a same single job. Relative to conventional techniques that require two separate jobs to perform a data exchange between two incompatible operators, it is more efficient to use one single job that encompasses both systems, especially if the job is run in parallel and in batches. For example, if upstream producers and downstream consumers work in sync in a parallel and batch-driven mode, the end performance is better.
The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features and advantages will be apparent from the description and drawings, and from the claims.
Like reference symbols in the various drawings indicate like elements.
The present invention relates generally to data processing, and more particularly to modeling data exchange in a data flow associated with an extract, transform, and load (ETL) process. The following description is presented to enable one of ordinary skill in the art to make and use the invention and is provided in the context of a patent application and its requirements. The present invention is not intended to be limited to the implementations shown but is to be accorded the widest scope consistent with the principles and features described herein.
Running on the programmed computer 204 is an integrated development environment 208. The integrated development environment 208 is a software component that assists users (e.g., computer programmers) in developing, creating, editing, and managing code for target platforms.
In one implementation, the integrated development environment 208 includes a code generation system 210 that is operable to generate code to manage data exchange and data staging within a sequence of operations defined in a data flow of an ETL process, as discussed in greater detail below. In one implementation, the code generation system 210 generates code using techniques as described in U.S. Patent Application entitled “System and Method for Generating Code for an Integrated Data System,” Ser. No. 11/372,540, filed on Mar. 10, 2006 (the '540 application), which is incorporated by reference above.
In operation, a data flow 212 (e.g., an ETL data flow) is received by the code generation system 210, and the data flow 212 is converted by the code generation system into a logical operator graph (LOG) 214. The logical operator graph 214 is a normalized, minimalist representation of the data flow 212 that includes a logical, abstract collection of operators (including, e.g., one or more of a splitter operator, join operator, filter operator, table extract operator, bulk load operator, aggregate operator, and so on). In some implementations, all of the contents of the data flow 212 may be used “as-is” by the code generation system 210 and, therefore, the logical operator graph 214 will be the same as the data flow 212. The code generation system 210 converts the logical operator graph 214 into a query graph model (QGM graph) 216. The QGM graph 216 is an internal data model used by the code generation system 210 for analysis and optimization processes, such as chunking (in which a subset of a data flow is broken into several pieces to improve performance) and execution parallelism (in which disparate sets of operations within a data flow are grouped and executed in parallel to yield better performance). After analysis, the QGM graph 216 is converted into an extended plan graph 218. The extended plan graph 218 represents code generated by the code generation system 210 and is sent to a runtime engine (e.g., an ETL engine) for execution.
In one implementation, the integrated development environment 208 includes a data flow graphical editor (not shown) that enables users to build data flows (e.g., data flow 212). In one implementation, the data flow graphical editor provides a new operator, i.e., a data station operator, that a user can directly drag and drop into a data flow to link a preceding (“upstream”) operator and one or more subsequent (“downstream”) operators. The data station operator specifies a data staging point in the data flow. In general, operators are represented in a data flow as graphical objects. In one implementation, the data station operator can be used as a link between a first operator (or operation) associated with a first runtime engine (e.g., a relational database management system) and a second operator associated with a second runtime engine (e.g., a DataStage ETL engine).
In one implementation, the code generation system 210 is operable to automatically place individual data station operators (into a sequence of operations defined by a data flow) whenever a data exchange requirement is identified during the code generation process. In one implementation, the identification and insertion of the data exchange/staging points are seamless to the end user. Accordingly, in such an implementation, the code generation system 210 is operable to automatically generate code that manages data staging and data exchange on an ETL system that is capable of integrating various data processing runtime engines. For example, if a particular runtime engine can work with flat files as well as database tables, then, depending on certain optimization considerations, an exchange may not be necessary. Alternatively, if flat files are determined to be processed faster, the code generation system 210 may decide that file staging from an upstream operation (e.g., one associated with a relational database engine) is more appropriate, or the decision could be made based on current system loads. A dynamic decision (based on various cost-benefit analyses) on whether a data station operator is required may be best made by the code generation system 210. In such cases, any suitable cost-benefit criteria can be implemented. In some cases, however, (expert) users or database administrators may have better knowledge than the code generation system 210 because of an understanding of expected data and expected system stress, e.g., when data is range partitioned and the administrator is aware of which particular database partition nodes would be stressed. In such cases, it may be more appropriate for a user to explicitly override any staging options automatically selected by the code generation system 210 (or for a user to explicitly define a different staging format when the code generation system 210 does not add one by default).
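The automatic placement described above can be illustrated with a minimal Python sketch. It assumes a strictly linear data flow and uses a single criterion (adjacent operators that run on different runtime engines) in place of the cost-benefit analysis, which the description leaves open. The `Operator` class, the engine labels, and the function name `insert_data_stations` are hypothetical and are not taken from the described system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    name: str
    engine: str  # hypothetical engine label, e.g. "sql" or "datastage"


def insert_data_stations(flow):
    """Return a new operator list with a data station on every link whose
    upstream and downstream operators use different runtime engines."""
    result = []
    previous = None
    for op in flow:
        if previous is not None and previous.engine != op.engine:
            # A data exchange requirement: formats differ across engines.
            station_name = f"station_{previous.name}_to_{op.name}"
            result.append(Operator(station_name, "station"))
        result.append(op)
        previous = op
    return result


flow = [
    Operator("table_source", "sql"),
    Operator("filter", "sql"),
    Operator("lookup", "datastage"),
    Operator("bulk_load", "sql"),
]
staged_flow = insert_data_stations(flow)
print([op.name for op in staged_flow])
```

In this sketch a station is added on the `filter` to `lookup` link and on the `lookup` to `bulk_load` link, while the `table_source` to `filter` link is left untouched because both operators share an engine.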
Accordingly, unlike a conventional system in which a user must represent data staging using two or more jobs in order to exchange data from one runtime format (e.g., database table) to another runtime format (e.g., a flat file) in a data flow, the data processing system 200 permits a user to exchange data from one system-format to another in the same single job through the data station operator. Users can, therefore, use such data stations to explicitly identify points of staging or exchange interest, e.g., for diagnostics, for performance improvements, or for overriding any default choices made by the code generation system 210.
User input is received inserting a second operator associated with a second type of runtime engine into the data flow (step 404). In one example, the first operator can be associated with a relational database engine and the second operator can be associated with a DataStage ETL engine. The first operator and the second operator can be transform operators that represent data transformation steps in the data flow. User input is received inserting a data station operator (e.g., data station operator 300) into the data flow between the first operator and the second operator to link the first operator and the second operator (step 406). Thus, the data processing system permits the user to explicitly add a data staging operator into a data flow, in which the data staging operator exchanges data from a format associated with the first runtime engine into a format associated with the second runtime engine in a same single job.
Pre-determined criteria upon which the code generation system (or a user) may decide to insert a data station operator into a data flow include, for example, criteria associated with optimization, error recovery and restart, diagnostics and debugging, and cross-system data exchanges. With regard to optimization, intermediate (calculated) data may be staged to avoid having to perform the same calculation multiple times, especially in cases where the output of a single upstream operation is required by multiple downstream operations. Even when there is only one downstream consumer of the output data of a given operation, it may be prudent to stage rows of the output data, especially to physical storage, in order to either free up memory or avoid stressing an execution system (for example, to avoid running out of database log space). With respect to error recovery and restart, in complex systems, errors during the execution of data flows may occur either due to bad (dirty) data, which may cause database inconsistencies, or due to fatal errors caused by software failures, power loss, etc. In many cases, manual intervention is required to bring databases and other systems back to a consistent state. Thus, in one implementation, the code generation system (or user) inserts data station operators at specific consistency check points in the data flow, so that intermediate results can be staged in a physical medium (for example, in database persistent tables or files). Accordingly, restarts (either manual or automatic) can be performed starting at these check points, which avoids re-executing the operations upstream of the check point.
In terms of diagnostics and debugging, staging may allow administrators to identify the core cause of problems. For example, an administrator can inspect staged rows to find bad data, which may even require the administrator to re-organize ETL processes to first clean such data. Users may also explicitly add data stations in a data flow to aid in debugging of the data flow, e.g., during development and testing cycles. An inspection of such staged rows will provide a validation of whether the corresponding upstream operations did indeed perform as expected. With regard to cross-system exchanges, in a data processing system that is capable of integrating various data processing engines, such as the one described in the '540 application, the data being processed is a mix of various data types and formats that are specific to a given underlying (runtime) data processing engine. Some runtime processing systems may be equipped to process data inside database tables, others may only work with flat files, while others may perform better using Message Queues. In some scenarios, external systems in a different (remote) site may be required to complete part of an operation, e.g., a “Name Address Lookup” facility which may be provided by an online vendor for cleansing customer addresses. Such an external vendor may even require a SOAP-based web service means of data movement.
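The restart behavior described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the described implementation: every step is treated as a check point, an in-memory dictionary stands in for the persistent tables or files that would hold staged rows, and the names `run_flow`, `extract`, `clean`, and `load` are hypothetical.

```python
def run_flow(steps, checkpoints, start_data=None):
    """Run (name, function) steps in order, resuming after the latest
    step whose output is already staged in `checkpoints`."""
    data = start_data
    resume_from = 0
    for index, (name, _) in enumerate(steps):
        if name in checkpoints:
            data = checkpoints[name]
            resume_from = index + 1
    executed = []
    for name, function in steps[resume_from:]:
        data = function(data)
        checkpoints[name] = list(data)  # stage the intermediate rows
        executed.append(name)
    return data, executed


def extract(_):
    return [1, 2, 3, 4]


def clean(rows):
    return [row for row in rows if row % 2 == 0]


load_calls = {"count": 0}


def load(rows):
    load_calls["count"] += 1
    if load_calls["count"] == 1:
        raise RuntimeError("simulated failure during load")
    return [row * 10 for row in rows]


steps = [("extract", extract), ("clean", clean), ("load", load)]
checkpoints = {}  # assumption: stands in for persistent staging storage

try:
    run_flow(steps, checkpoints)
except RuntimeError:
    pass  # the first attempt fails after "clean" has been staged

result, executed = run_flow(steps, checkpoints)
print(result, executed)
```

On the second attempt only the `load` step is executed, because the outputs of `extract` and `clean` are read back from the staged check points.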
Referring back to
Provided below is further discussion regarding implementations of a data station operator and uses thereof.
Model of a Data Station Operator
In one implementation, a data station operator models a data exchange/staging object, and a code generation system generates code that supports staging and data exchange functionalities based on the data station operator. In one implementation, a data station operator is modeled using a data flow operator modeling framework as described in the '540 application. More generally, the concept of an operator is generic to many different ETL or transformation frameworks and, therefore, the concept of a data station operator can be extended to many other types of data processing systems. In one implementation, a data station operator has one input port and one output port, and includes one or more of the following attributes as shown in Table 1 below.
TABLE 1
|Attribute||Definition|
|data station type||The type specifies the format of the staged data . . . e.g., temporary database table, permanent database table, database view, JDBC result set and flat file.|
|pass through flag||The pass through flag is a Boolean flag indicating if the associated data station operator can be ignored. This flag can be used to turn off a data station operator without a user having to physically remove the data station operator from a data flow.|
|name of staging object||The name corresponds to, e.g., a staging table name or staging file name.|
|data station lifetime||The lifetime permits a user to specify when a staging object should be cleaned and removed after a flow execution, i.e., removing a staging object at the end of an ETL flow execution, or keeping the staging object permanently.|
|performance hints||Performance hints include information such as DB partition information according to the source or target tables, index specification, whether to preserve incoming data orders, etc. Allowing a user to specify performance hints gives the user flexibility to control data flow execution.|
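The attributes in Table 1 can be captured in a small data model. The following Python sketch is one possible representation; the class, enum, and field names are hypothetical, and the defaults (pass through off, removal at the end of the flow) are assumptions chosen for illustration only.

```python
from dataclasses import dataclass, field
from enum import Enum


class StationType(Enum):
    TEMPORARY_TABLE = "temporary database table"
    PERMANENT_TABLE = "permanent database table"
    VIEW = "database view"
    JDBC_RESULT_SET = "JDBC result set"
    FLAT_FILE = "flat file"


class Lifetime(Enum):
    REMOVE_AT_END_OF_FLOW = "remove at end of flow execution"
    KEEP_PERMANENTLY = "keep permanently"


@dataclass
class DataStation:
    station_type: StationType
    staging_object_name: str
    pass_through: bool = False  # True means the station is ignored
    lifetime: Lifetime = Lifetime.REMOVE_AT_END_OF_FLOW
    performance_hints: dict = field(default_factory=dict)

    def is_active(self):
        """A station with the pass through flag set is skipped."""
        return not self.pass_through


station = DataStation(
    station_type=StationType.FLAT_FILE,
    staging_object_name="stage_orders.txt",
    performance_hints={"preserve_order": True},
)
print(station.is_active())
```

Setting `pass_through` to `True` turns the station off without removing it from the flow, which mirrors the purpose of the pass through flag in Table 1.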
Advantages of a Data Station Operator
Advantages of a data station operator include the following. With respect to performance, depending on the underlying runtime engine in which an ETL process is executed, staging intermediate data can yield better performance by controlling where and how data flows. For example, when the underlying ETL engine is a database server (e.g., DB2), the execution code of one data flow can be represented as one or several SQL statements. One single SQL statement can contain several levels of nested sub-queries to represent many transform operations. However, one single SQL statement could lead to runtime performance problems on certain DB servers. For example, two common problems can be caused by one long SQL statement: 1) the log size that is required to run the SQL can be large if the number of nested queries reaches a certain level; 2) a single (nested) query is limited by the DB vendor's query processing capability. In some cases, a single SQL statement will not work. In such cases, it is desirable to break the single SQL statement into smaller pieces for better performance.
With regard to data (format) exchange, when a data flow includes a mix of SQL operators and non-SQL operators, it is generally not possible to represent the data flow using one common language. Data flowing through must be “staged” in order to transit from one type of operator to another. For example, consider a data flow that extracts data from a JDBC (Java Database Connectivity) source, goes through a couple of transformations, and then ends with the data being loaded into a target table. The code representing JDBC extraction is a Java program, whereas the transformations and loads can be represented by SQL statements. In such cases, the output row sets from the JDBC extractor are staged into a DB2 table prior to sending the row sets to the following transform node.
A data station operator also permits tracing of data within a data flow. Providing a tracing functionality in a data flow permits users to monitor and track data flow runtime execution, and helps users diagnose problems when errors occur. Providing a data station operator permits a user to explicitly specify a staging point for an operator in a data flow at which a staging table/file will be created to capture all intermediate data that have been processed up to the staging point. Additional diagnostic information for the staging point can also be captured, such as number of rows processed, code being executed, temporary tables/files created, and so on. A data station operator also provides error recovery capability for a data flow. For example, when the execution of a data flow fails, the code generation system, or user, can select to begin a recovery process from a staged point where intermediate processed data is still valid. This permits faster recovery from a failure relative to having to restart from the beginning of a data flow.
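To illustrate breaking one nested SQL statement into smaller staged pieces, the following Python sketch uses the standard `sqlite3` module as a stand-in for a database server such as DB2. The `orders` table, its rows, and the staging table names `stage_valid` and `stage_totals` are hypothetical, and SQLite's `CREATE TEMP TABLE ... AS` syntax is an assumption that would differ on other engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "east", 10.0), (2, "east", 30.0), (3, "west", 5.0), (4, "west", -1.0)],
)

# One statement with two levels of nested sub-queries.
nested_sql = """
SELECT region, total FROM (
    SELECT region, SUM(amount) AS total FROM (
        SELECT region, amount FROM orders WHERE amount > 0
    ) AS valid
    GROUP BY region
) AS totals
WHERE total > 20
ORDER BY region
"""
single_result = conn.execute(nested_sql).fetchall()

# The same logic as three smaller statements linked by staging tables.
conn.execute(
    "CREATE TEMP TABLE stage_valid AS "
    "SELECT region, amount FROM orders WHERE amount > 0"
)
conn.execute(
    "CREATE TEMP TABLE stage_totals AS "
    "SELECT region, SUM(amount) AS total FROM stage_valid GROUP BY region"
)
staged_result = conn.execute(
    "SELECT region, total FROM stage_totals WHERE total > 20 ORDER BY region"
).fetchall()

print(single_result == staged_result)
```

Both forms return the same rows. The staged form trades extra writes for shorter statements whose intermediate results can be inspected or reused.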
Pre-Determined Criteria for Inserting a Data Station Operator into a Data Flow
A data exchange/staging point identifies a position where data exchange/staging is required in a data flow, e.g., either on a link or an output port. In one implementation, a staging point in a data flow is identified when one of the following conditions arises:
In one implementation, when a code generation system chunks a data flow into several small pieces, staging tables and staging files are created and maintained to hold intermediate row sets during an ETL process, e.g., data between extract and transform, between transform and load, or at a chunking point inside a data flow. In one implementation, staging tables are database relational tables, and depending on how staging tables are used, a given staging table can be either a permanent table on an ETL transform database, or a temporary table created in the data transformation session. In one implementation, staging files are flat files that hold intermediate transformed data in text format. Staging tables and staging files can be created on a transform engine. A user can also input other specifications of a staging object, such as (table) spaces, indexes used for staging tables, and locations for staging files.
In one implementation, staging tables are used to hold intermediate row sets during an ETL process. A code generation system can maintain a staging table, including DDL (Data Definition Language) statements and associated table spaces and indexes. The “lifetime” of a staging table (e.g., the duration of a staging table and when the staging table should be deleted) can be externally specified by a user or internally determined by a code generation system depending on the usage of the staging object. For example, if a staging table is generated internally by a code generation system, and is used only for a specific data flow stream, the staging table can be created at the beginning of the data flow execution as a database temporary table, which will be deleted when the session ends. If, however, an internal staging table is used to chunk a data flow into multiple parallel execution pieces, the staging object can be defined as a database permanent table to hold intermediate row sets until the end of an ETL application execution.
In one implementation, staging files are flat text files. A flat file is a text-based ASCII file that is commonly used as a bridge between non-relational data sets and relational database tables. Staging flat files can be generated by a database export utility (such as DB2 SQL export) to export data from relational DB tables, or can be generated using a custom operator interface provided by a code generation system. Flat files can be loaded into target tables through a database load utility such as DB2 load.
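The lifetime decision for an internally generated staging table can be sketched as a small DDL generator. This Python example is illustrative only: the function name `staging_table_ddl` and its parameters are hypothetical, and the statements use SQLite syntax (run here through the standard `sqlite3` module), whereas a server such as DB2 declares temporary tables differently.

```python
import sqlite3


def staging_table_ddl(name, columns, used_for_parallel_chunks):
    """Return (create_statement, drop_statement_or_None).

    A table used by a single data flow stream is created as a temporary
    table that the database removes when the session ends. A table that
    links parallel chunks is permanent and needs an explicit drop at the
    end of the ETL application execution.
    """
    column_list = ", ".join(f"{col} {sql_type}" for col, sql_type in columns)
    if used_for_parallel_chunks:
        return (f"CREATE TABLE {name} ({column_list})", f"DROP TABLE {name}")
    return (f"CREATE TEMP TABLE {name} ({column_list})", None)


columns = [("id", "INTEGER"), ("name", "TEXT")]
temp_create, temp_drop = staging_table_ddl("stage_stream", columns, False)
perm_create, perm_drop = staging_table_ddl("stage_chunk", columns, True)

conn = sqlite3.connect(":memory:")
conn.execute(temp_create)
conn.execute(perm_create)
conn.execute(perm_drop)  # cleanup at the end of the application run
print(temp_create)
print(perm_create)
```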
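A flat-file staging round trip can be sketched with the Python standard library. The helpers `export_to_flat_file` and `load_from_flat_file` are hypothetical stand-ins for a database export utility and a database load utility, and the comma-delimited layout is an assumption; the description does not specify a file format beyond ASCII text.

```python
import csv
import os
import sqlite3
import tempfile


def export_to_flat_file(conn, query, path):
    """Write the rows returned by `query` to a delimited ASCII text file."""
    rows = conn.execute(query).fetchall()
    with open(path, "w", newline="", encoding="ascii") as handle:
        csv.writer(handle).writerows(rows)
    return len(rows)


def load_from_flat_file(conn, path, insert_sql):
    """Read the staged file and insert its rows into a target table."""
    with open(path, newline="", encoding="ascii") as handle:
        rows = list(csv.reader(handle))
    conn.executemany(insert_sql, rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO source VALUES (?, ?)", [(1, "ann"), (2, "bob")])
conn.execute("CREATE TABLE target (id INTEGER, name TEXT)")

with tempfile.TemporaryDirectory() as folder:
    stage_path = os.path.join(folder, "stage_source.csv")
    exported = export_to_flat_file(
        conn, "SELECT id, name FROM source ORDER BY id", stage_path
    )
    loaded = load_from_flat_file(
        conn, stage_path, "INSERT INTO target VALUES (?, ?)"
    )

loaded_rows = conn.execute("SELECT id, name FROM target ORDER BY id").fetchall()
print(exported, loaded, loaded_rows)
```

The staged file exists only inside the temporary directory, so it is removed once the exchange completes, similar to a staging object whose lifetime ends with the flow execution.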
JDBC Result Sets
JDBC result sets are the exchange point between two or more operators. The results of a previous (upstream) operator are represented as JDBC result sets and consumed by following (downstream) operators. JDBC result sets are memory objects and, in one implementation, the handles/names of the memory objects are determined by the code generation system.
Automatically Placed Data Station Operators
For internally generated staging points (e.g., those staging points not explicitly defined by a user), a code generation system can analyze the internal representation of a data flow (e.g., through a QGM), identify staging points, and insert data station operators that chunk the data flow into multiple smaller pieces (or sub-flows). Between these sub-flows, staging tables can be used to temporarily store intermediate transformed result sets. For example, when a chunking point is identified, a QGM can include staging tables/files (e.g., represented as table/file boxes) that link to other QGM nodes. In one implementation, the name of each staging table within a QGM is unique. In one implementation, DDL statements for all staging tables generated within a data flow will be returned.
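The chunking behavior described above can be sketched in Python for a linear list of operators. This is not a QGM implementation: the flow is a plain list of operator names, chunk points are given as list indexes, and the function name `chunk_flow`, the `stage_N` naming scheme, and the single-column `payload` schema are all hypothetical.

```python
import itertools


def chunk_flow(operators, chunk_points):
    """Split a linear operator list into sub-flows after the given indexes.

    Returns (sub_flows, staging_ddl). Each chunk point gets a uniquely
    named staging table that the upstream sub-flow writes and the
    downstream sub-flow reads.
    """
    counter = itertools.count(1)
    sub_flows, staging_ddl, current = [], [], []
    last_index = len(operators) - 1
    for index, op in enumerate(operators):
        current.append(op)
        if index in chunk_points and index < last_index:
            table = f"stage_{next(counter)}"  # unique within this flow
            staging_ddl.append(f"CREATE TABLE {table} (payload TEXT)")
            current.append(f"write:{table}")
            sub_flows.append(current)
            current = [f"read:{table}"]
    sub_flows.append(current)
    return sub_flows, staging_ddl


operators = ["extract", "join", "aggregate", "load"]
sub_flows, staging_ddl = chunk_flow(operators, chunk_points={1})
print(sub_flows)
print(staging_ddl)
```

Here the flow is split after `join`, so the first sub-flow ends by writing `stage_1` and the second begins by reading it. The DDL for every generated staging table is returned alongside the sub-flows.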
In one implementation, a code generation system (e.g., code generation system 210 of
One or more of the method steps described above can be performed by one or more programmable processors executing a computer program to perform functions by operating on input data and generating output. Generally, the invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In one implementation, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
Memory elements 804A-B can include local memory employed during actual execution of the program code, bulk storage, and cache memories that provide temporary storage of at least some program code in order to reduce the number of times the code must be retrieved from bulk storage during execution. As shown, input/output or I/O devices 808A-B (including, but not limited to, keyboards, displays, pointing devices, etc.) are coupled to data processing system 800. I/O devices 808A-B may be coupled to data processing system 800 directly or indirectly through intervening I/O controllers (not shown).
In one implementation, a network adapter 810 is coupled to data processing system 800 to enable data processing system 800 to become coupled to other data processing systems or remote printers or storage devices through communication link 812. Communication link 812 can be a private or public network. Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters.
Various implementations for modeling data exchange in a data flow associated with an extract, transform, and load (ETL) process have been described. Nevertheless, various modifications may be made to the implementations, and those variations would be within the scope of the present invention. For example, with respect to various implementations discussed above, different programming languages (e.g., C) can be used to stage intermediate processing data into a proprietary data format. Accordingly, many modifications may be made without departing from the scope of the following claims.
|Cited Patent||Filing date||Publication date||Applicant||Title|
|US4813013||Dec 22, 1986||Mar 14, 1989||The Cadware Group, Ltd.||Schematic diagram generating system using library of general purpose interactively selectable graphic primitives to create special applications icons|
|US4901221||Apr 14, 1986||Feb 13, 1990||National Instruments, Inc.||Graphical system for modelling a process and associated method|
|US5379423||Jun 1, 1993||Jan 3, 1995||Hitachi, Ltd.||Information life cycle processor and information organizing method using it|
|US5497500||Jun 6, 1995||Mar 5, 1996||National Instruments Corporation||Method and apparatus for more efficient function synchronization in a data flow program|
|US5577253||Mar 6, 1995||Nov 19, 1996||Digital Equipment Corporation||Analyzing inductive expressions in a multilanguage optimizing compiler|
|US5729746||Oct 23, 1995||Mar 17, 1998||Leonard; Ricky Jack||Computerized interactive tool for developing a software product that provides convergent metrics for estimating the final size of the product throughout the development process using the life-cycle model|
|US5850548||Nov 14, 1994||Dec 15, 1998||Borland International, Inc.||System and methods for visual programming based on a high-level hierarchical data flow model|
|US5857180||Jul 21, 1997||Jan 5, 1999||Oracle Corporation||Method and apparatus for implementing parallel operations in a database management system|
|US5920721||Jun 11, 1997||Jul 6, 1999||Digital Equipment Corporation||Compiler generating functionally-alike code sequences in an executable program intended for execution in different run-time environments|
|US5966532||Aug 6, 1997||Oct 12, 1999||National Instruments Corporation||Graphical code generation wizard for automatically creating graphical programs|
|US6014670||Nov 7, 1997||Jan 11, 2000||Informatica Corporation||Apparatus and method for performing data transformations in data warehousing|
|US6044217||Oct 23, 1997||Mar 28, 2000||International Business Machines Corporation||Hierarchical metadata store for an integrated development environment|
|US6098153||Jan 30, 1998||Aug 1, 2000||International Business Machines Corporation||Method and a system for determining an appropriate amount of data to cache|
|US6202043||Feb 8, 1999||Mar 13, 2001||Invention Machine Corporation||Computer based system for imaging and analyzing a process system and indicating values of specific design changes|
|US6208345||Jun 8, 1998||Mar 27, 2001||Adc Telecommunications, Inc.||Visual data integration system and method|
|US6208990||Jul 15, 1998||Mar 27, 2001||Informatica Corporation||Method and architecture for automated optimization of ETL throughput in data warehousing applications|
|US6243710||Jan 21, 1999||Jun 5, 2001||Sun Microsystems, Inc.||Methods and apparatus for efficiently splitting query execution across client and server in an object-relational mapping|
|US6282699||Feb 23, 1999||Aug 28, 2001||National Instruments Corporation||Code node for a graphical programming system which invokes execution of textual code|
|US6434739||Apr 22, 1996||Aug 13, 2002||International Business Machines Corporation||Object oriented framework mechanism for multi-target source code processing|
|US6449619||Jun 23, 1999||Sep 10, 2002||Datamirror Corporation||Method and apparatus for pipelining the transformation of information between heterogeneous sets of data sources|
|US6480842 *||Mar 25, 1999||Nov 12, 2002||Sap Portals, Inc.||Dimension to domain server|
|US6604110 *||Oct 31, 2000||Aug 5, 2003||Ascential Software, Inc.||Automated software code generation from a metadata-based repository|
|US6738964||Mar 10, 2000||May 18, 2004||Texas Instruments Incorporated||Graphical development system and method|
|US6772409 *||Mar 2, 1999||Aug 3, 2004||Acta Technologies, Inc.||Specification to ABAP code converter|
|US6795790||Jun 6, 2002||Sep 21, 2004||Unisys Corporation||Method and system for generating sets of parameter values for test scenarios|
|US6807651||Jun 17, 2002||Oct 19, 2004||Cadence Design Systems, Inc.||Procedure for optimizing mergeability and datapath widths of data flow graphs|
|US6839724 *||Apr 17, 2003||Jan 4, 2005||Oracle International Corporation||Metamodel-based metadata change management|
|US6850925||May 15, 2001||Feb 1, 2005||Microsoft Corporation||Query optimization by sub-plan memoization|
|US6968326||Jul 17, 2003||Nov 22, 2005||Vivecon Corporation||System and method for representing and incorporating available information into uncertainty-based forecasts|
|US6968335 *||Nov 14, 2002||Nov 22, 2005||Sesint, Inc.||Method and system for parallel processing of database queries|
|US6978270 *||Nov 16, 2001||Dec 20, 2005||Ncr Corporation||System and method for capturing and storing operational data concerning an internet service provider's (ISP) operational environment and customer web browsing habits|
|US7003560||Nov 3, 2000||Feb 21, 2006||Accenture Llp||Data warehouse computing system|
|US7031987 *||May 30, 2001||Apr 18, 2006||Oracle International Corporation||Integrating tablespaces with different block sizes|
|US7035786||Oct 26, 2001||Apr 25, 2006||Abu El Ata Nabil A||System and method for multi-phase system development with predictive modeling|
|US7076765||Mar 3, 1999||Jul 11, 2006||Kabushiki Kaisha Toshiba||System for hiding runtime environment dependent part|
|US7103590 *||Aug 24, 2001||Sep 5, 2006||Oracle International Corporation||Method and system for pipelined database table functions|
|US7191183 *||Apr 5, 2002||Mar 13, 2007||Rgi Informatics, Llc||Analytics and data warehousing infrastructure and services|
|US7209925 *||Aug 25, 2003||Apr 24, 2007||International Business Machines Corporation||Method, system, and article of manufacture for parallel processing and serial loading of hierarchical data|
|US7340718||May 8, 2003||Mar 4, 2008||Sap Ag||Unified rendering|
|US7343585||Jan 29, 2003||Mar 11, 2008||Oracle International Corporation||Operator approach for generic dataflow designs|
|US7499917||Jan 28, 2005||Mar 3, 2009||International Business Machines Corporation||Processing cross-table non-Boolean term conditions in database queries|
|US7689576||Mar 10, 2006||Mar 30, 2010||International Business Machines Corporation||Dilation of sub-flow operators in a data flow|
|US7689582||Mar 10, 2006||Mar 30, 2010||International Business Machines Corporation||Data flow system and method for heterogeneous data integration environments|
|US7739267||Mar 10, 2006||Jun 15, 2010||International Business Machines Corporation||Classification and sequencing of mixed data flows|
|US7747563 *||Dec 7, 2007||Jun 29, 2010||Breakaway Technologies, Inc.||System and method of data movement between a data source and a destination|
|US20020046301 *||Aug 13, 2001||Apr 18, 2002||Manugistics, Inc.||System and method for integrating disparate networks for use in electronic communication and commerce|
|US20020078262||Dec 14, 2000||Jun 20, 2002||Curl Corporation||System and methods for providing compatibility across multiple versions of a software system|
|US20020116376||Feb 28, 2002||Aug 22, 2002||Hitachi, Ltd.||Routine executing method in database system|
|US20020170035||Feb 28, 2001||Nov 14, 2002||Fabio Casati||Event-based scheduling method and system for workflow activities|
|US20020198872||Feb 4, 2002||Dec 26, 2002||Sybase, Inc.||Database system providing optimization of group by operator over a union all|
|US20030033437||Apr 9, 2002||Feb 13, 2003||Fischer Jeffrey Michael||Method and system for using integration objects with enterprise business applications|
|US20030037322||Jun 21, 2002||Feb 20, 2003||Kodosky Jeffrey L.||Graphically configuring program invocation relationships by creating or modifying links among program icons in a configuration diagram|
|US20030051226||Jun 13, 2001||Mar 13, 2003||Adam Zimmer||System and method for multiple level architecture by use of abstract application notation|
|US20030101098||Nov 27, 2001||May 29, 2003||Erich Schaarschmidt||Process and device for managing automatic data flow between data processing units for operational order processing|
|US20030110470||May 31, 2002||Jun 12, 2003||Microsoft Corporation||Method and apparatus for providing dynamically scoped variables within a statically scoped computer programming language|
|US20030149556||Feb 21, 2001||Aug 7, 2003||Riess Hugo Christian||Method for modelling and controlling real processes in a data processing equipment and a data processing equipment for carrying out said method|
|US20030154274||Jan 31, 2003||Aug 14, 2003||International Business Machines Corporation||Data communications system, terminal, and program|
|US20030172059||Oct 30, 2002||Sep 11, 2003||Sybase, Inc.||Database system providing methodology for eager and opportunistic property enforcement|
|US20030182651||Mar 21, 2002||Sep 25, 2003||Mark Secrist||Method of integrating software components into an integrated solution|
|US20030229639||Jun 7, 2002||Dec 11, 2003||International Business Machines Corporation||Runtime query optimization for dynamically selecting from multiple plans in a query based upon runtime-evaluated performance criterion|
|US20030233374||Dec 2, 2002||Dec 18, 2003||Ulrich Spinola||Dynamic workflow process|
|US20030236788||Jun 3, 2002||Dec 25, 2003||Nick Kanellos||Life-cycle management engine|
|US20040054684||Nov 12, 2001||Mar 18, 2004||Kay Geels||Method and system for determining sample preparation parameters|
|US20040068479 *||Oct 4, 2002||Apr 8, 2004||International Business Machines Corporation||Exploiting asynchronous access to database operations|
|US20040107414||Oct 1, 2003||Jun 3, 2004||Youval Bronicki||Method, a language and a system for the definition and implementation of software solutions|
|US20040220923||Apr 28, 2004||Nov 4, 2004||Sybase, Inc.||System and methodology for cost-based subquery optimization using a left-deep tree join enumeration algorithm|
|US20040254948 *||Jun 12, 2003||Dec 16, 2004||International Business Machines Corporation||System and method for data ETL in a data warehouse environment|
|US20050022157||Dec 23, 2003||Jan 27, 2005||Rainer Brendle||Application management|
|US20050044527||Aug 22, 2003||Feb 24, 2005||Gerardo Recinto||Code Units based Framework for domain-independent Visual Design and Development|
|US20050055257 *||Sep 4, 2003||Mar 10, 2005||Deniz Senturk||Techniques for performing business analysis based on incomplete and/or stage-based data|
|US20050091664||Sep 10, 2004||Apr 28, 2005||Jay Cook||Method and system for associating parameters of containers and contained objects|
|US20050091684||Sep 22, 2004||Apr 28, 2005||Shunichi Kawabata||Robot apparatus for supporting user's actions|
|US20050097103 *||Sep 17, 2004||May 5, 2005||Netezza Corporation||Performing sequence analysis as a multipart plan storing intermediate results as a relation|
|US20050108209||Nov 19, 2003||May 19, 2005||International Business Machines Corporation||Context quantifier transformation in XML query rewrite|
|US20050131881||Sep 16, 2004||Jun 16, 2005||Bhaskar Ghosh||Executing a parallel single cursor model|
|US20050137852||Jan 8, 2004||Jun 23, 2005||International Business Machines Corporation||Integrated visual and language-based system and method for reusable data transformations|
|US20050149914||Oct 29, 2004||Jul 7, 2005||Codemesh, Inc.||Method of and system for sharing components between programming languages|
|US20050174986||Feb 11, 2004||Aug 11, 2005||Radio Ip Software, Inc.||Method and system for emulating a wireless network|
|US20050174988||Dec 30, 2004||Aug 11, 2005||Bernt Bieber||Method and arrangement for controlling access to sensitive data stored in an apparatus, by another apparatus|
|US20050216497||Mar 26, 2004||Sep 29, 2005||Microsoft Corporation||Uniform financial reporting system interface utilizing staging tables having a standardized structure|
|US20050227216||Nov 19, 2004||Oct 13, 2005||Gupta Puneet K||Method and system for providing access to electronic learning and social interaction within a single application|
|US20050234969||Feb 24, 2005||Oct 20, 2005||Ascential Software Corporation||Services oriented architecture for handling metadata in a data integration platform|
|US20050240354 *||Feb 24, 2005||Oct 27, 2005||Ascential Software Corporation||Service oriented architecture for an extract function in a data integration platform|
|US20050240652||Mar 8, 2005||Oct 27, 2005||International Business Machines Corporation||Application Cache Pre-Loading|
|US20050243604 *||Mar 16, 2005||Nov 3, 2005||Ascential Software Corporation||Migrating integration processes among data integration platforms|
|US20050256892 *||Mar 16, 2005||Nov 17, 2005||Ascential Software Corporation||Regenerating data integration functions for transfer from a data integration platform|
|US20050283473||Sep 21, 2004||Dec 22, 2005||Armand Rousso||Apparatus, method and system of artificial intelligence for data searching applications|
|US20060004863||Jun 8, 2004||Jan 5, 2006||International Business Machines Corporation||Method, system and program for simplifying data flow in a statement with sequenced subexpressions|
|US20060015380||Jun 15, 2005||Jan 19, 2006||Manyworlds, Inc||Method for business lifecycle management|
|US20060036522||Sep 23, 2004||Feb 16, 2006||Michael Perham||System and method for a SEF parser and EDI parser generator|
|US20060047709||Sep 5, 2002||Mar 2, 2006||Belin Sven J||Technology independent information management|
|US20060074621||Aug 31, 2004||Apr 6, 2006||Ophir Rachman||Apparatus and method for prioritized grouping of data representing events|
|US20060074730||Jan 31, 2005||Apr 6, 2006||Microsoft Corporation||Extensible framework for designing workflows|
|US20060101011||Nov 5, 2004||May 11, 2006||International Business Machines Corporation||Method, system and program for executing a query having a union operator|
|US20060112109 *||Nov 23, 2004||May 25, 2006||Chowdhary Pawan R||Adaptive data warehouse meta model|
|US20060167865||Jan 24, 2005||Jul 27, 2006||Sybase, Inc.||Database System with Methodology for Generating Bushy Nested Loop Join Trees|
|US20060174225||Feb 1, 2005||Aug 3, 2006||International Business Machines Corporation||Debugging a High Level Language Program Operating Through a Runtime Engine|
|US20060206869||Apr 21, 2006||Sep 14, 2006||Lewis Brad R||Methods and systems for developing data flow programs|
|US20060212475 *||Nov 24, 2003||Sep 21, 2006||Cheng Nick T||Enterprise information management and business application automation by using the AIMS informationbase architecture|
|US20060218123||Jun 2, 2005||Sep 28, 2006||Sybase, Inc.||System and Methodology for Parallel Query Optimization Using Semantic-Based Partitioning|
|US20060228654||Apr 7, 2005||Oct 12, 2006||International Business Machines Corporation||Solution builder wizard|
|US20070061305 *||Apr 27, 2006||Mar 15, 2007||Soufiane Azizi||System and method of providing date, arithmetic and other relational functions for OLAP sources|
|US20070078812||Sep 30, 2005||Apr 5, 2007||Oracle International Corporation||Delaying evaluation of expensive expressions in a query|
|US20070157191||Dec 29, 2005||Jul 5, 2007||Seeger Frank E||Late and dynamic binding of pattern components|
|US20070169040||Jan 13, 2006||Jul 19, 2007||Microsoft Corporation||Typed intermediate language support for languages with multiple inheritance|
|US20070203893 *||Feb 27, 2006||Aug 30, 2007||Business Objects, S.A.||Apparatus and method for federated querying of unstructured data|
|US20070214111||Mar 10, 2006||Sep 13, 2007||International Business Machines Corporation||System and method for generating code for an integrated data system|
|US20070214171 *||Mar 10, 2006||Sep 13, 2007||International Business Machines Corporation||Data flow system and method for heterogeneous data integration environments|
|US20070214176||Mar 10, 2006||Sep 13, 2007||International Business Machines Corporation||Dilation of sub-flow operators in a data flow|
|US20070244876||Mar 10, 2006||Oct 18, 2007||International Business Machines Corporation||Data flow system and method for heterogeneous data integration environments|
|US20080092112||Oct 11, 2006||Apr 17, 2008||International Business Machines Corporation||Method and Apparatus for Generating Code for an Extract, Transform, and Load (ETL) Data Flow|
|US20080147703||Oct 11, 2006||Jun 19, 2008||International Business Machines Corporation||Method and Apparatus for Managing Application Parameters|
|US20080147707||Dec 13, 2006||Jun 19, 2008||International Business Machines Corporation||Method and apparatus for using set based structured query language (sql) to implement extract, transform, and load (etl) splitter operation|
|1||*||Alkis Simitsis, Mapping Conceptual to Logical Models for ETL Processes, ACM, 2005.|
|2||*||Alkis Simitsis. "Mapping Conceptual to Logical Models for ETL Processes." ACM, 2005, pp. 67-77.|
|3||Arusinski et al., "A Software Port from a Standalone Communications Management Unit to an Integrated Platform", 2002, IEEE, pp. 1-9.|
|4||Carreira et al., "Data Mapper: An Operator for Expressing One-to-Many Data Transformations", Data Warehousing and Knowledge Discovery, Tjoa et al, editors, 7th International Conference DaWaK 2005 Copenhagen, Denmark, Aug. 22-26, 2005, pp. 136-145.|
|5||Carreira et al., "Execution of Data Mappers", IQIS, 2004, pp. 2-9, 2004 ACM 1-58113-902-0/04/0006, Paris, France.|
|6||Ferguson et al., "Platform Independent Translations for a Compilable Ada Abstract Syntax", Feb. 1993 ACM 0-89791-621-2/93/0009-0312 1.50, pp. 312-322.|
|7||Final Office Action for U.S. Appl. No. 11/610,480 dated Apr. 13, 2011.|
|8||Friedrich, II, Meta-Data Version and Configuration Management in Multi-Vendor Environments, SIGMOD, Jun. 14-16, 2005, 6 pgs., Baltimore, MD.|
|9||Gurd et al., "The Manchester Prototype Dataflow Computer", Communications of the ACM, Jan. 1985, pp. 34-52, vol. 28, No. 1.|
|10||Haas et al., "Clio Grows Up: From Research Prototype to Industrial Tool", SIGMOD, Jun. 14-16, 2005, 6 pgs., Baltimore, MD.|
|11||Hernandez et al., "Clio: A schema mapping tool for information integration", IEEE Computer Society, 2005.|
|12||Ives, Zachary G., An Adaptive Query Execution System for Data Integration, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, Jun. 1999, vol. 28, Issue 2, ACM, New York, New York, United States.|
|13||Jardim-Gonçalves et al., "Integration and adoptability of APs: the role of ISO TC184/SC4 standards", International Journal of Computer Applications in Technology, 2003, pp. 105-116, vol. 18, Nos. 1-4.|
|14||Konstantinides, et al., "The Khoros Software Development Environment for Image and Signal Processing," May 1994, IEEE, vol. 3, pp. 243-252.|
|15||Notice of Allowance for U.S. Appl. No. 11/548,659 dated May 13, 2011.|
|16||Office Action for U.S. Appl. No. 11/372,540 dated Mar. 30, 2011.|
|17||Office Action history of U.S. Appl. No. 11/372,516, dates ranging from Apr. 6, 2006 to Nov. 17, 2009.|
|18||Office Action history of U.S. Appl. No. 11/372,540, dates ranging from Mar. 11, 2009 to Sep. 19, 2011.|
|19||Office Action history of U.S. Appl. No. 11/373,084, dates ranging from Feb. 20, 2009 to Feb. 3, 2010.|
|20||Office Action history of U.S. Appl. No. 11/373,685, dates ranging from Jan. 10, 2008 to Nov. 16, 2009.|
|21||Office Action history of U.S. Appl. No. 11/548,632, dates ranging from May 11, 2010 to Jul. 11, 2011.|
|22||Office Action history of U.S. Appl. No. 11/548,659, dates ranging from Nov. 10, 2010 to Sep. 14, 2011.|
|23||Office Action history of U.S. Appl. No. 11/610,480, dates ranging from Sep. 10, 2010 to Aug. 31, 2011.|
|24||Poess et al., "TPC-DS, Taking Decision Support Benchmarking to the Next Level", ACM SIGMOD, Jun. 4-6, 2002, 6 pgs., Madison, WI.|
|25||Rafaieh et al., "Query-based data warehousing tool", DOLAP, Nov. 8, 2002, 8 pgs., McLean, VA.|
|26||Ramu, "Method for Initializing a Platform and Code Independent Library", IBM Technical Disclosure Bulletin, Sep. 1994, pp. 637-638, vol. 37, No. 9.|
|27||Simitsis, "Mapping Conceptual to Logical Models for ETL Processes", ACM Digital Library, 2005, pp. 67-76.|
|28||Stewart et al., "Dynamic Applications from the Ground Up", Haskell '05, Sep. 30, 2005, Tallinn, Estonia, ACM, pp. 27-38.|
|29||Tjoa, et al. (Eds.), "Data Warehousing and Knowledge Discovery," Proceedings of 7th International Conference, DaWaK 2005, Copenhagen, Denmark, Aug. 22-26, 2005, Springer 2005.|
|30||U.S. Appl. No. 09/707,504, filed Nov. 7, 2000, Banavar, et al.|
|31||U.S. Patent Application entitled "Method and Apparatus for Adapting Application Front-Ends to Execute on Heterogeneous Device Platforms", filed Nov. 7, 2000.|
|32||Vassiliadis et al., "A generic and customizable framework for the design of ETL scenarios", Information Systems, Databases: Creation, Management and Utilization, 2005, pp. 492-525, vol. 30, No. 7.|
|34||Werner et al., "Just-in-sequence material supply—a simulation based solution in electronics", Robotics and Computer-Integrated Manufacturing, 2003, pp. 107-111, vol. 19, Nos. 1-2.|
|35||Yu, "Transform Merging of ETL Data Flow Plan", IKE '03 International Conference, 2003, pp. 193-198.|
|36||Zhao et al., "Automated Glue/Wrapper Code Generation in Integration of Distributed and Heterogeneous Software Components", Proceedings of the 8th IEEE International Enterprise Distributed Object Computing Conf. (EDOC 2004), 2004, IEEE, pp. 1-11.|
|Citing Patent||Filing date||Publication date||Applicant||Title|
|US8538912 *||Sep 22, 2010||Sep 17, 2013||Hewlett-Packard Development Company, L.P.||Apparatus and method for an automatic information integration flow optimizer|
|US8751438 *||Apr 13, 2012||Jun 10, 2014||Verizon Patent And Licensing Inc.||Data extraction, transformation, and loading|
|US8782101 *||Jan 20, 2012||Jul 15, 2014||Google Inc.||Transferring data across different database platforms|
|US8812482||Oct 14, 2010||Aug 19, 2014||Vikas Kapoor||Apparatuses, methods and systems for a data translator|
|US8983933||Dec 21, 2012||Mar 17, 2015||Hewlett-Packard Development Company, L.P.||Costs of operations across computing systems|
|US9251226 *||Mar 15, 2013||Feb 2, 2016||International Business Machines Corporation||Data integration using automated data processing based on target metadata|
|US20120072391 *||Sep 22, 2010||Mar 22, 2012||Alkiviadis Simitsis||Apparatus and method for an automatic information integration flow optimizer|
|US20120151434 *||||Jun 14, 2012||Juan Ricardo Rivera Hoyos||System And Method For Generating One Or More Plain Text Source Code Files For Any Computer Language Representing Any Graphically Represented Business Process|
|US20130275360 *||Apr 13, 2012||Oct 17, 2013||Verizon Patent And Licensing Inc.||Data extraction, transformation, and loading|
|US20140279830 *||Mar 15, 2013||Sep 18, 2014||International Business Machines Corporation||Data integration using automated data processing based on target metadata|
|U.S. Classification||707/602, 707/798|
|Jan 23, 2007||AS||Assignment|
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, CALIF
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:JIN, QI;LIAO, HUI;SRINIVASAN, SRIRAM;AND OTHERS;REEL/FRAME:018793/0408
Effective date: 20070105