« PreviousContinue »
METHOD AND APPARATUS FOR DATA
VALIDATION IN MULTIDIMENSIONAL
CROSS REFERENCE TO RELATED 5
This application claims priority from U.S. Provisional Patent Application No. 60/493,783, filed Aug. 11, 2003 and is related to U.S. Provisional Patent Application No. 60/493, 10 812 filed Aug. 11, 2003, and co-pending U.S. application Ser. No. 10/915,570 entitled "Method and Apparatus for Accessing Multidimensional Data," filed Aug. 11, 2004. The contents of each f the above-identified applications are incorporated herein by reference. 15
The present invention relates to the management of information in multidimensional databases and more particularly 20 to methods of identifying the validity of data in multidimensional databases.
The increase in data storage and retrievable capabilities, together with advances in online analytical processing (OLAP) has resulted in unprecedented access to information. Typically, OLAP server products are either multidimensional OLAP (MOLAP) or relational OLAP (ROLAP). 30 Both of these structures can store multidimensional information and have their respective and well known advantages and disadvantages.
In any database containing multidimensional data, ensuring that the data accessed is valid is a resource taxing 35 activity. Typically, the database management system (DBMS) may check the validity of data when it is requested or alternatively may indicate the validity of data in advance, for example through the use of a flag.
Data in a database can be stored with a timestamp, 40 indicating the last time when that piece of data was last written to. Data becomes invalid when any data that it is dependent on (its source data) is updated. Therefore, every time data is queried, either in itself or for use as part of a larger calculation, the DBMS may check the timestamp of 45 all the source data and recalculate the data if necessary. A disadvantage of this method is that the number of database accesses is high, increasing the query or calculation time. An advantage of this method is that if data does not need to be recalculated, the calculation time is minimised. In today's 50 environment when processing speeds have far outmatched IO speeds, this method of data validation may be inefficient, particularly if the source data regularly changes. Also, there is a minor increase in the database storage requirement due to having to store timestamps with the data. 55
Alternatively, whenever source data is updated, all data that is dependant on that source data may be either deleted or flagged as invalid, forcing recalculation of the dependent data if it is queried or used in a larger calculation. A disadvantage of this method is that during data load, large 60 quantities of data must be invalidated, degrading data load performance. However, calculation and query performance is maximised, due to avoiding having to check all the source data of the calculated data prior to reading the calculated data. 65
Therefore, the approach selected for data validation depends on the nature of the data in the database. For
constantly changing databases where calculation performance is not important, invalidation at query/calculation time may be preferred. If the time taken to load data is not important, invalidation at data load time may be preferred. This creates a problem for databases that may not fit into either of these generalisations, with the presently available options for cell validation having high associated efficiency degradation.
It is an object of the present invention to overcome or alleviate problems in management of multidimensional databases at present, or at least to provide the public with a useful alternative.
Further objects of the present invention may become apparent from the following description, given by way of example only.
Calculated Cell: A cell including at least one calculated member.
Calculated Member: A member whose value is dependent on one or more other members and/or a mathematical formula.
Cell: A location in a multidimensional database. A cell is a tuple of members.
Dimension: A set of hierarchically related members.
Input-level cell: A cell whose location contains only members that are not dependent on other members.
Member: A unique position on a dimension that includes in itself or points to data.
OLAP: On-Line Analytical Processing. A category of applications and technologies that allow the collection, storage, manipulation and investigation of multidimensional data.
OLAP Server: An application that provides OLAP functionality over a multidimensional database.
Outline: The set of all dimensions in a multidimensional database.
Source cell: A cell including at least one source member.
Source member: A member on which another member (a calculated member) is dependant.
Throughout this specification, data in the multidimensional database has been referred to by reference to members and cells. However, this terminology is not intended to limit the scope of the invention to any particular data format in a multidimensional database.
Unless the context clearly requires otherwise, throughout the description and the claims, the words "include", "including", and the like, are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense, that is to say, in the sense of "including, but not limited to".
SUMMARY OF THE INVENTION
According to one aspect of the present invention there is provided a method of managing cells formed from a tuple of members in a multidimensional database, the method comprising treating at least one dimension in the database as a plurality of hierarchically structured members divided into at least two levels, thereby defining a validation curve as the location of an intersection of said levels, evaluating in what level cells in the database are located and managing cells contained in a lower level by automatically indicating them as one of valid and invalid at data load time and managing cells contained in an upper level by indicating them as one of valid and invalid according to a method involving at least one of less reading from and writing to the database at data
load time than if all the cells in the upper level were indicated as one of valid and invalid at data load time.
Preferably, the location of at least one validation curve for at least one dimension varies across the dimension.
Preferably, the multidimensional database comprises at 5 least two dimensions in the database treated as a plurality of hierarchically structured members having at least two levels and the method further comprises locating the validation curve within each dimension separately.
Preferably, the method comprises allowing for the adjust- 10 ment of the location of the validation curve of at least one dimension. The method may further comprise automatically adjusting the operation of the multidimensional database by adjusting the location of the validation curve according to historical data. 15
Preferably, each cell comprises at least two members and wherein the method further comprises allowing for the adjustment of the method by which cells are determined to be above or below the validation curve. The method may further comprise automatically adjusting the method by 20 which cells are determined to be above or below the validation curve according to historical data. The historical data may indicate the location of cells in the database that have been queried, used as part of a larger calculation and/or updated and the frequency that cells in the database have 25 been queried, used as part of a larger calculation and/or updated.
Preferably, the method further comprises treating the at least one dimension in the database as a plurality of hierarchically structured members divided into two levels, con- 30 sisting of one upper level and one lower level.
Preferably, cells in said upper level are managed according to at least one of the options:
the cells are permanently indicated as invalid;
the cells are stored with a timestamp indicating the time 35 that the cell was last updated, wherein a calculated cell is evaluated as being one of valid and invalid by comparing the timestamp of the calculated cell with the timestamp of its source cells located above the validation curve and by determining the validity of its source cells located below the 40 validation curve; and
only cells that may be queried are stored in the multidimensional database and the cells are indicated as one of valid and invalid at data load time.
Preferably, all cells in said upper level for all dimensions 45 are managed according to one of the options a)-c).
Preferably, no cells above the validation line are stored in the multidimensional database.
Preferably, option b) is used and if the calculated cell is evaluated to be invalid the method comprises recalculating 50 the calculated cell by evaluating the validity of its source cells, working down the hierarchical structure until a base of valid cells is reached and then working up the hierarchical structure recalculating cells until said calculated cell is recalculated. 55
Preferably, the multidimensional database comprises an index defining its cells, the index including a pointer to the relevant data forming the cells, wherein the validity of a cell is indicated by associating validity information with the index. 60
Preferably, the cells are formed from at least two members from different dimensions.
According to another aspect of the present invention, there is provided a computerised database management system for a multidimensional database comprising cells 65 formed from a tuple of members, the computerised database management system comprising a communication interface
for reading from and writing to a database, computer processing means for controlling the operations of the database management system and computer memory containing an instruction set readable by said computer processing means, wherein when the instruction set is executed by the computer processing means the computerised database management system uses said communication interface to: maintain at least one dimension in the database as a plurality of hierarchically structured members divided into at least two levels, thereby defining a validation curve as the location of an intersection of said levels; and evaluate in what level cells in the database are located and managing cells contained in a lower level by automatically indicating them as one of valid and invalid at data load time and managing cells contained in an upper level by indicating them as one of valid and invalid according to a method involving at least one of less reading from and writing to the database at data load time than if all the cells in the upper level were indicated as one of valid and invalid at data load time.
Preferably, the computerised database management system includes a pluggable component operable to set the location of the validation curve. Preferably, the pluggable component is operable to analyse historical data on the use of the multidimensional database and automatically adjust the location of the validation curve dependent on the historical data.
Preferably, the computerised database management system includes a pluggable component operable to set a method by which cells are determined to be above or below the validation curve, by setting the number of members that need to be above or below the validation curve in order for the cell to be above or below the validation curve. Preferably, the pluggable component is operable to analyse historical data on the use of the multidimensional database and automatically adjust the location of the validation curve dependent on the historical data.
According to another aspect of the present invention, there is provided a method of managing cells formed from a tuple of members in a multidimensional database, the method comprising maintaining an index to the cells, the index defining a hierarchical structure containing pointers to the members that make up each cell, allocating each cell to one of at least two levels in the hierarchical structure and associating with the index a cell validity indicator for each cell, wherein the cell validity indicator is updated upon the occurrence an event, wherein the class of event that causes the cell validity indicator to be updated is dictated by the level in which the cell is allocated.
Preferably, the method further comprises automatically updating the cell validity indicator for cells allocated to the lowermost level when new data is loaded into the database.
Preferably, cells in an upper level of said at least two levels are permanently indicated as invalid.
Preferably, cells in an upper level of said at least two levels are stored with a timestamp indicating the time that the cell was last updated and wherein a calculated cell is evaluated as being one of valid and invalid by comparing the timestamp of the calculated cell with the timestamp of its source cells in the upper level and by determining the validity of its source cells that are located below the validation curve.
Preferably, only cells that may be queried are stored in the multidimensional database and the cells are indicated as one of valid and invalid at data load time.
Further aspects of the present invention may become apparent from the following description, given by way of
example of preferred embodiments only and with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1: Shows a block diagram of a computer system for performing the validation method of the present invention.
FIG. 2: Shows a diagrammatic representation of a multidimensional database structure and a possible location of a validation curve within the structure. 10
FIG. 3: Shows a flow diagram of the steps to perform a calculation using data members in a multidimensional database.
FIGS. 4A, B: Show two examples of possible steps for checking the validity of cells above the validation curve. 15
FIGS. 5A, B: Shows diagrammatically the evaluation of source cells and the calculation of a cell in a multidimensional database.
DETAILED DESCRIPTION OF THE DRAWINGS 20
The present invention relates to a method of managing data in a multidimensional database and in particular to a method of ensuring that queried cells or cells used as part of a larger calculation are valid before being returned or used 25 in the calculation. The method includes separating the upper and lower hierarchical levels of the database and invalidating cells in the lower levels at load time. Cells in the higher levels are invalidated using a different method that may in at least some circumstances decrease or eliminate the need to 30 read from and write to the database during data load time.
Referring first to FIG. 1, a diagrammatic representation of a computer system in which the present invention has been implemented is shown and generally referenced by arrow 500. A database 1 is provided, containing a multidimen- 35 sional data space that may be arranged in a hierarchical manner. Two branches from the hierarchical structure of the database 1 are referenced 10 and 11. Each branch 10, 11 is made up of a plurality of members referenced la-le and 2a-2h respectively. The unit of information in the database 40 that is queried is a cell, defined by a tuple of members (Ml, M2, M3 . . . MN), where Ml to MN represents one member from each of the N dimensions in the database 1. A cell is either populated with a numeric value, or is empty. An empty cell has an implicit value of null. A cell is typically formed 45 by a member from every dimension in the database.
The database 1 may be a single multidimensional database space or alternatively may be made up of a plurality of multidimensional database spaces or treated as one or a plurality of multidimensional database spaces for the pur- 50 poses of applying the present invention thereto. Where there are a plurality of multidimensional database spaces, the invention may be provided to one of the spaces, all of the spaces or a selected number of database spaces.
In communication with the database 1 is a database 55 management system (DBMS). The DBMS 2 includes a calculation sub-system 20, validation sub-system 21 and input/output (IO) sub-system 22. The calculation sub-system 20 performs calculations on data retrieved from the database 1 or otherwise input to the DBMS 2, the validation sub- 60 system 21 controls the validation and invalidation of data in the database 1 and the IO sub-system 22 controls the reading to and writing from the database 1. The DBMS 2 may be an OLAP server.
The database 1 may be queried using a remote computer 65 3 running a suitable application, which may communicate with the DBMS 2 through a local or wide area network 4.
Suitable computer processors, communication interfaces and data storage mechanisms for the computer system 500 shown in FIG. 1 are well known in the relevant arts and therefore will not be described further herein.
A user of the system will input queries for the database and receive the results back from the DBMS 2 by operating the remote computer 3, using an input device such as a keyboard and point and click device. The DBMS 2 then retrieves the relevant information from the database 1, performs any required calculations and returns the result to the remote computer 3 for display to the user. Although in the description herein, it is assumed that the remote computer 3 a "dumb" terminal in relation to managing the database 1, those skilled in the relevant arts will appreciate that the remote computer 3 may perform some of the functions of DBMS 2 described herein, with suitable mechanisms in place to lock data while any particular computer is performing an operation on that data. Also, the processing capabilities of the DBMS 2 may be distributed amongst a plurality of computer processors.
FIG. 2 shows branch 10 and a modified branch 11, referenced 11a, from different dimensions in the database 1. The branches 10, 11 and 11A described herein represent dimensions of a database and are a simplified example, given for the purposes of illustration only. Practical systems will have many dimensions and more complex branch structures. Any combination of members, one from each dimension defines a cell, which may be requested in a query or for a calculation. FIG. 2 shows a hierarchical structure for the data. Alternatively, the actual data in the database 1 may be separate from the data structure, with the data structure indicated by an index. In some embodiments the structure of the data may be managed separately from the hierarchical structure of the index. Methods for managing data in this manner are described in the Applicant's co-pending application No. 60/493,812 entitled "Method And Apparatus For Accessing Multi Dimensional Data" and having the same filing date as this application. The contents of this copending application are hereby incorporated herein in their entirety.
According to the present invention, cells in the database 1 are treated differently according to their level in the index hierarchy. Calculated cells in a lower consolidation level of the database 1 that have a source cell that has changed are indicated as invalid at data load time. Cells in a higher consolidation level in the database 1 are updated using a different strategy that may at least in some circumstances reduce the degradation of in data load performance over indicating these cells as invalid at data load time. In FIG. 2, the separation of the database 1 into levels is shown diagrammatically by a validation curve 12, which is inherently defined by the intersection of the levels. In this example, two levels are defined, treating members la, 2a and 2b as being in the upper consolidation levels of the database 1 and the remaining members in the lower consolidation levels of the database 1.
Whether a cell is above or below the validation curve 12 determines when and how its validity is verified. As mentioned herein above, a cell may be defined by a tuple of members in the database 1 and a single cell may include individual members above and below the validation curve 12. There are therefore various strategies for determining whether a particular cell is above or below the validation curve 12. By choosing an appropriate strategy, tuning of the database 1 to a particular operating environment may be achieved. The choice, in conjunction with an adjustable validation curve (if provided) affects the total number of