Search Images Maps Play YouTube News Gmail Drive More »
Sign in
Screen reader users: click this link for accessible mode. Accessible mode has the same essential features but works better with your reader.

Patents

  1. Advanced Patent Search
Publication numberUS20070226264 A1
Publication typeApplication
Application numberUS 11/386,346
Publication dateSep 27, 2007
Filing dateMar 22, 2006
Priority dateMar 22, 2006
Also published asUS20080189241
Publication number11386346, 386346, US 2007/0226264 A1, US 2007/226264 A1, US 20070226264 A1, US 20070226264A1, US 2007226264 A1, US 2007226264A1, US-A1-20070226264, US-A1-2007226264, US2007/0226264A1, US2007/226264A1, US20070226264 A1, US20070226264A1, US2007226264 A1, US2007226264A1
InventorsGang Luo, Philip Yu
Original AssigneeGang Luo, Yu Philip S
Export CitationBiBTeX, EndNote, RefMan
External Links: USPTO, USPTO Assignment, Espacenet
System and method for real-time materialized view maintenance
US 20070226264 A1
Abstract
There are provided a method, a computer program product, and a system for maintaining a materialized view defined on a relation of a relational database. The method includes the step of performing content-based filtering on the relation to identify an update to the relation as being irrelevant with respect to the materialized view.
Images(8)
Previous page
Next page
Claims(20)
1. A method for maintaining a materialized view defined on a relation of a relational database, the method comprising:
performing content-based filtering on the relation to identify an update to the relation as being irrelevant with respect to the materialized view.
2. The method of claim 1, wherein said performing step utilizes at least one filtering relation data structure to perform the content-based filtering.
3. The method of claim 2, wherein the at least one filtering relation data structure is capable of being shared among multiple materialized views.
4. The method of claim 1, further comprising generating an estimate of at least one of an importance and an effect of the update to the relation.
5. The method of claim 4, further comprising at least one of, performing a load shedding operation on the relational database based upon the estimate, and quantifying the effect of the update being omitted from the materialized view based upon the estimate.
6. The method of claim 1, further comprising localizing an effect of the update on the materialized view.
7. The method of claim 1, further comprising collapsing multiple updates to the relation to improve filtering efficiency.
8. A computer program product comprising a computer usable medium having computer usable program code for maintaining a materialized view defined on a relation of a relational database, said computer program product comprising:
computer usable program code for performing content-based filtering on the relation to identify an update to the relation as being irrelevant with respect to the materialized view.
9. The computer program product of claim 8, wherein said computer usable program code for performing the content-based filtering utilizes at least one filtering relation data structure to perform the content-based filtering, and wherein the at least one filtering relation data structure is capable of being shared among multiple materialized views.
10. The computer program product of claim 8, further comprising computer usable program code for generating an estimate of at least one of an importance and an effect of the update to the relation.
11. The computer program product of claim 10, further comprising computer usable program code for at least one of, performing a load shedding operation on the relational database based upon the estimate, and quantifying the effect of the update being omitted from the materialized view based upon the estimate.
12. The computer program product of claim 8, further comprising computer usable program code for localizing an effect of the update on the materialized view.
13. The computer program product of claim 8, further comprising computer usable program code for collapsing multiple updates to the relation to improve filtering efficiency.
14. A system for maintaining a materialized view defined on a relation of a relational database, the system comprising:
a materialized view manager for performing content-based filtering on the relation to identify an update to the relation as being irrelevant with respect to the materialized view.
15. The system of claim 14, wherein said materialized view manager utilizes at least one filtering relation data structure to perform the content-based filtering.
16. The system of claim 15, wherein the at least one filtering relation data structure is capable of being shared among multiple materialized views.
17. The system of claim 14, wherein said materialized view manager generates an estimate of at least one of an importance and an effect of the update to the relation.
18. The system of claim 17, wherein said materialized view manager at least one of, performs a load shedding operation on the relational database based upon the estimate, and quantifies the effect of the update being omitted from the materialized view based upon the estimate.
19. The system of claim 14, wherein said materialized view manager localizes an effect of the update on the materialized view.
20. The system of claim 14, wherein said materialized view manager collapses multiple updates to the relation to improve filtering efficiency.
Description
BACKGROUND

1. Technical Field

The present invention relates generally to relational databases and, more particularly, to a system and method for real-time materialized view maintenance for relational databases.

2. Description of the Related Art

Recently, there has been a growing trend to use data warehouses to make real-time decisions about a corporation's day-to-day operations. Most major relational database management system (RDBMS) vendors have spent great efforts on real-time data warehousing, including IBM's business intelligence, MICROSOFT's digital nervous system, ORACLE's Oracle10g, NCR's active data warehouse, and COMPAQ's zero-latency enterprise.

A real-time data warehouse needs to handle real-time, online updates in addition to the traditional data warehouse query workload. This raises a problem that is present to a lesser degree in traditional data warehouses, namely when a base relation is updated, maintaining the materialized view(s) defined on it can bring a heavy burden to the corresponding RDBMS.

To mitigate this problem, several methods have been proposed to detect irrelevant updates to a base relation R that do not affect the materialized view MV defined on R. For example, see the following, which are each incorporated by reference herein: Blakeley et al, “Updating Derived Relations: Detecting Irrelevant and Autonomously Computable Updates”, ACM Transactions on Database Systems (TODS), 1989, 14(3), pp. 369-400; Blakeley et al., “Efficiently Updating Materialized Views”, ACM International Conference on Management of Data (SIGMOD), 1986, pp. 61-71; and Levy et al., “Queries Independent of Updates”, International Conference on Very Large Data Bases (VLDB), 1993, pp. 171-181. However, all of these methods are “content-independent” in the sense that they only consider the “where” clause condition in a materialized view's definition while ignoring the content in the other base relations of the materialized view. As a result, these methods make over-conservative decisions and miss a large number of filtering opportunities.

For example, consider the following materialized view MV:

create materialized view MV as

select * from R, S, T

where R.a=S.b and S.c=T.d

and R.e>20 and S.f=“xyz” and T.g=50;

Assume that a materialized view MV records anomaly exists so that very few tuples in R, S, and T satisfy the where clause condition (R.a=S.b and S.c=T.d and R.e>20 and S.f=“xyz” and T.g=50) in the MV's definition. Suppose a tuple tR whose tR.e=30 is inserted into base relation R. Since tR.e>20, the existing prior art methods in the above-referenced articles cannot tell whether or not the MV will change. Therefore, the standard materialized view maintenance method has to be used, as follows. S is checked for a matching tuple(s) tS such that tS.b=tR.a and ts.f=“xyz”. If such a matching tuple ts exists, then T is further checked for matching tuple(s) tT such that tT.d=tS.c and tT.g=50. If both S and T are large and cannot be cached in memory, then such checking can incur a large number of input and output operations and become fairly expensive. However, because of the MV records anomaly, it is most likely that the insertion of tR into R will not affect the MV and, thus, all of the expensive checking is wasted.

SUMMARY

These and other drawbacks and disadvantages of the prior art are addressed by the present principles, which are directed to a system and method for real-time materialized view maintenance.

According to an aspect of the present invention, there is provided a method for maintaining a materialized view defined on a relation of a relational database. The method includes the step of performing content-based filtering on the relation to identify an update to the relation as being irrelevant with respect to the materialized view.

According to another aspect of the present invention, there is provided a computer program product including a computer usable medium having computer usable program code for maintaining a materialized view defined on a relation of a relational database. The computer program product includes computer usable program code for performing content-based filtering on the relation to identify an update to the relation as being irrelevant with respect to the materialized view.

According to yet another aspect of the present invention, there is provided a system for maintaining a materialized view defined on a relation of a relational database. The system includes a materialized view manager for performing content-based filtering on the relation to identify an update to the relation as being irrelevant with respect to the materialized view.

These and other objects, features and advantages will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.

BRIEF DESCRIPTION OF DRAWINGS

The disclosure will provide details in the following description of preferred embodiments with reference to the following figures wherein:

FIG. 1 is a block diagram illustrating an exemplary networked environment to which the present principles may be applied, according to an embodiment thereof;

FIG. 2 is a block diagram illustrating an exemplary computing device to which the present principles may be applied, according to an embodiment thereof;

FIG. 3 is a flow diagram for an exemplary method for maintaining a materialized view defined on a relation of a relational database, according to an embodiment of the present principles;

FIG. 4 is a flow diagram for another exemplary method for maintaining a materialized view defined on a relation of a relational database, according to an embodiment of the present principles;

FIG. 5 is a diagram for three exemplary filtering relations, according to an embodiment of the present principles;

FIG. 6 is a flow diagram for a method for filtering out the irrelevant portion of an update to a materialized view MV, according to an embodiment of the present principles; and

FIG. 7 is a diagram for an exemplary base relation S to which the present invention may be applied, according to an embodiment thereof.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

Embodiments of the present principles are directed to a system and method for real-time materialized view maintenance. Advantageously, embodiments of the present principles may be used to identify irrelevant updates to a relation (a base relation or a derived relation) with respect to a materialized view defined on that relation. The irrelevant updates are identified more accurately and efficiently as compared to prior art approaches for performing the same.

It should be understood that the elements shown in the FIGURES may be implemented in various forms of hardware, software or combinations thereof. Preferably, these elements are implemented in software on one or more appropriately programmed general-purpose digital computers having a processor and memory and input/output interfaces.

Embodiments of the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment including both hardware and software elements. In a preferred embodiment, the present invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.

Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that may include, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.

A data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code to reduce the number of times code is retrieved from bulk storage during execution. Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system either directly or through intervening I/O controllers.

Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modem and Ethernet cards are just a few of the currently available types of network adapters.

Referring now to the drawings in which like numerals represent the same or similar elements and initially to FIG. 1, an exemplary networked environment to which the present principles may be applied, is indicated generally by the reference numeral 100. The environment 100 includes one or more client devices 110 connected to a server 120 via a network 130. The network 130 may include wired and/or wireless links. The server 120 may be connected in signal communication with one or more resources 140. The resources 140 may include one or more local and/or remote sources. The resources 140 may be connected to the server 120 directly and/or via, e.g., one or more networks 140 (including wired and/or wireless links). Each of the client devices 110 may include a materialized view maintenance system 199 (also referred to herein as “materialized view manager”) for maintaining a materialized view as described herein.

Turning to FIG. 2, an exemplary computing device to which the present principles may be applied is indicated generally by the reference numeral 200. It is to be appreciated that elements of the computing device 200 may be employed in any of the client devices 110, the server 120, and/or the resources 140. Moreover, it is to be further appreciated that elements of the computing device 200 may be employed in the materialized view maintenance system 199.

The computing device 200 includes at least one processor (CPU) 202 operatively coupled to other components via a system bus 204. A read only memory (ROM) 206, a random access memory (RAM) 208, a display adapter 210, an I/O adapter 212, a user interface adapter 214, a sound adapter 299, and a network adapter 298, are operatively coupled to the system bus 204.

A display device 216 is operatively coupled to system bus 204 by display adapter 210. A disk storage device (e.g., a magnetic or optical disk storage device) 218 is operatively coupled to system bus 204 by I/O adapter 212.

A mouse 220 and keyboard 222 are operatively coupled to system bus 204 by user interface adapter 214. The mouse 220 and keyboard 222 are used to input and output information to and from system 200.

At least one speaker (herein after “speaker”) 297 is operatively coupled to system bus 204 by sound adapter 299. A (digital and/or analog) modem 296 is operatively coupled to system bus 204 by network adapter 298.

To address the above-mentioned problems of the prior art approaches to maintaining a materialized view on a relation of a relational database, we introduce content-based filtering into materialized view maintenance. In one embodiment, up to four illustrative requirements may be utilized for efficient filtering to identify irrelevant updates to base relations of a materialized view. In an embodiment, we design “filtering relations” that summarize the most relevant information in the base relations and fulfill pre-specified requirements such as, but not limited to, the four illustrative requirements described herein. These filtering relations capture the relationship among multiple join attributes and can be efficiently maintained in real time. Upon an update ΔR to a base relation R that has a materialized view MV defined on it, the RDBMS uses the corresponding filtering relations of the other base relations of the MV to determine whether or not ΔR is irrelevant. The checking of filtering relations is usually significantly faster than checking base relations. Also, compared to the where clause condition in the MV's definition, filtering relations can provide more precise information about whether or not ΔR is irrelevant. In this way, the RDBMS can quickly and more precisely detect irrelevant updates to R and hence reduce the materialized view maintenance overhead.

In an embodiment, one or more of the following four illustrative requirements may be used to design effective summary data structures: compactness; association; a high filtering ratio; and easy maintenance. As noted above, the present invention is not limited to only these four illustrative requirements and, given the teachings of the present invention provided herein, one of ordinary skill in this and related arts will contemplate these and various other requirements for implementing content-based filtering for materialized view maintenance, while maintaining the scope of the present invention. Moreover, different implementations of any of these same four requirements may also be implemented in accordance with the present principles, while maintaining the scope of the present principles.

Consider a base relation R that has a join view JV defined on it. Our goal is to quickly filter out most of the irrelevant updates to R. This filtering process allows false negatives for irrelevant updates but not false positives. In other words, for any update ΔR to R, this filtering process may include the following characteristics.

For example, in one characteristic, if our method says that ΔR is irrelevant, then it must be true that ΔR is irrelevant.

In another characteristic, in the case that ΔR is irrelevant, with high probability p, our method can determine that ΔR is irrelevant; with low probability 1-p, our method indicates that it does not know whether ΔR is irrelevant.

In yet another characteristic, in the case that ΔR is relevant, our method indicates that it does not know whether ΔR is irrelevant.

As noted above, it is preferable to use one or more (and preferably, although not necessarily, all) of the following requirements to design effective summary data structures: compactness; association; a high filtering ratio; and easy maintenance.

Regarding compactness, the summary data structures are preferably small as they are likely to be cached in memory. Thus, compactness can be an issue in achieving real-time results.

Regarding association, the summary data structures preferably capture the relationship among multiple join attributes of a base relation. That is, given a join attribute value (e.g., S.b of the MV in the introduction), we can use the join attribute value to find the associated values of other join attributes (e.g., S.c).

Regarding the high filtering ratio, the summary data structures can preferably quickly and correctly filter out most (or all) of the irrelevant updates to the base relations of a join view.

Regarding easy maintenance, upon updates to the base relations, the summary data structures are preferably efficiently maintained in real time.

There are several existing summary data structures (e.g., bloom filters, multi-attribute B-tree indices, and so forth). However, none of the existing summary data structures satisfies all of the above four properties nor is otherwise suitable for our filtering purposes.

In the following, we first give an overview of our content-based detection method for irrelevant updates. Thereafter, a more detailed description of the content-based detection method is provided.

Consider a join view JV that is defined on base relations R1, R2, . . . , and Rn (n≧2). For each Ri (1≦i≦n), we create a filtering relation FRi that summarizes the most relevant information in Ri. Upon an update ΔRi to a base relation Ri (1≦i≦n) of JV, our content-based method performs the following operations.

Operation O1: Update the filtering relation FRi accordingly.

Operation O2: To detect whether or not ΔRi is irrelevant, use the where clause condition in the JV's definition and techniques such as, but not limited to, those described by the following, which are each incorporated by reference herein: Blakeley et al, “Updating Derived Relations: Detecting Irrelevant and Autonomously Computable Updates”, TODS 14(3): 369-400, 1989; Blakeley et al., “Efficiently Updating Materialized Views”, SIGMOD Conf. 1986: 61-71; and Levy et al., “Queries Independent of Updates”, VLDB 1993: 171-181.

Operation O3: If Operation O2 cannot tell that ΔRi is irrelevant, then check the filtering relations FR1, FR2, . . . FRi−1, FRi+1, FRi+2, . . . , and FRn to determine whether or not ΔRi is irrelevant.

Operation O4: If Operation O3 cannot tell that ΔRi is irrelevant, then check base relations R1, R2, . . . , Ri−2, Ri+1, Ri+2, . . . , and R1 to determine exactly whether or not ΔRi is irrelevant. In the case that ΔRi is relevant, the JV is refreshed. Operation O4 is the standard join view maintenance method.

Turning to FIG. 3, a method for maintaining a materialized view defined on a relation of a relational database is indicated generally by the reference numeral 300. It is to be appreciated that the relation may be a base relation or a derived relation (i.e., a relation derived from a base or other relation). It is to be further appreciated that the method of FIG. 3 is described particularly with respect to Operation O1 and Operation O3 above, which illustrate operations performed in accordance with embodiments of the present principles.

Create one or more filtering relation data structures based on one or more pre-specified requirements (step 305). Optionally, the filtering relation data structure(s) may be created at step 305 to be capable of being shared among multiple materialized views.

Update the filtering relation data structure(s) (e.g., per Operation O1 described herein) (step 306).

Compress, reduce the number of, and/or otherwise modify the filtering relation data structure(s) (step 307). Step 307 may be performed, e.g., to enhance a result of the content-based filtering performed at step 310, as described in further detail herein below.

Perform content-based filtering on the relation, using the filtering relation data structure(s), to identify an update to the relation as being irrelevant with respect to the materialized view (e.g., per Operation O3 described herein) (step 310).

Estimate the importance and/or an effect(s) of the update to the relation (step 315).

Perform a load shedding operation on the relational database based upon a result of the estimate performed at step 315 (step 320).

Quantify the effect of the update being omitted from the materialized view based upon a result of the estimate performed at step 315 (step 325).

Localize the effect of the update on the materialized view (step 330).

Collapse multiple updates to the relation into a single transaction (combined update) to obtain a benefit such as, but not limited to, improving filtering efficiency (step 335). Thus, if the update under consideration can be combined with other corresponding updates to the base relation into a single transaction, then increased efficiency can likely be obtained.

The update that is identified as being irrelevant or a portion thereof identified as being irrelevant is filtered out or otherwise omitted from the materialized view (step 340).

It is to be appreciated that steps 307, 315, 320, 325, 330, and 335 are optional. Thus, one or more of steps 307, 315, 320, 325, 330, and 335 may be omitted in some embodiments of the present principles.

It is to be further appreciated that while in some embodiments, the present principles may be combined with one or more prior art steps and/or approaches for maintaining materialized views, any such steps and/or approaches of the prior art are omitted from FIG. 3 for the sake of brevity.

Turning to FIG. 4, another method for maintaining a materialized view defined on a relation of a relational database is indicated generally by the reference numeral 400. It is to be appreciated that the relation may be a base relation or a derived relation (i.e., a relation derived from a base or other relation). It is to be further appreciated that the method of FIG. 4 is described to illustrate a particular embodiment that involves the present principles (e.g., Operations O1 and O3) and, optionally, various aspects of the prior art (e.g., Operations O2 and O4).

Update the filtering relation(s) (e.g., per Operation O1 described herein) (step 405).

Use the where clause condition in the join view's definition to determine whether or not ΔRi is irrelevant (e.g., per Operation O2 described herein) (step 410).

It is then determined whether or not ΔRi has been deemed irrelevant, based on the where clause (step 415). If so, then the method is terminated.

Otherwise, check the filtering relation(s) to determine whether or not ΔRi is irrelevant (e.g., per Operation O3 described herein) (step 420). It is then determined whether or not ΔRi has been deemed irrelevant, based on the checked filtering relations (step 425). If so, then the method is terminated. Otherwise, check the base relations to determine whether or not ΔRi is irrelevant (e.g., per Operation O4 described herein) (step 430).

A description will now be given regarding a method implementing the present principles, according to an embodiment of the present principles.

Suppose that Cw is the where clause condition in the definition of the join view JV. Cw is rewritten into a conjunction of m terms ci (1≦i≦m). Each term ci belongs to one of the following three categories:

Category 1: For each i (1≦i≦m1), ci is a conjunctive equi-join condition on two base relations Rj and Rk (1≦j<k≦n). That is, ci is of the conjunctive form Rj.a1=Rk.b1

Rj.a2=Rk.b2 . . . Rj.ah=Rk.bh (h≧1). For different i's (1≦i≦m1), either the corresponding j's or the corresponding k's are different.

Category 2: For each i (m1+1≦i≦m2), ci is a selection condition on a single base relation Rj (1≦j≦n). For different i's (m1+1≦i≦m2), the corresponding j's are different.

Category 3: For each i (m2+1≦i≦m), ci is neither a conjunctive equi-join condition on two base relations nor a selection condition on a single base relation.

For example, consider the join view MV mentioned above. The where clause condition in the MV's definition is a conjunction of five terms. The first two terms (R.a=S.b and S.c=T.d) belong to Category 1. The other three terms (R.e>20, S.f=“xyz”, and T.g=50) belong to Category 2. An example term of Category 3 is R.x+S.y>T.z, which does not appear in the where clause condition of the MV's definition.

For each base relation Ri (1≦i≦n), we create a filtering relation FRiDc(Ri)). The projection list D includes all join attributes of Ri that appear in some term of Category 1. That is, for each term cj (1≦j≦m1) that is of the form Ri.a1=Rk.b1

Ri.a2=Rk.b2 . . . Ri.ah=Rk.bh (1≦k≦n, k≠i, h≧1), attributes {a1, a2, . . . , ah}D. Also, we build an index on attributes (a1, a2, . . . , ah) The selection condition C is the term of Category 2 that is on Ri. That is, for some j (m1+1≦j≦m2), if the term cj is a selection condition on Ri, then C=cj. Otherwise (i.e., if no such cj exists), we have C=true.

For example, consider the join view MV mentioned above, which will now be described with respect to FIG. 5. Turning to FIG. 5, three exemplary filtering relations are indicated generally by the reference numeral 500. A first filtering relation FRR is shown with respect to an attribute a and an index on attribute a. A second filtering relation FRS is shown with respect to attributes b and c and respective indexes on attributes b and c. A third filtering relation FRT is shown with respect to an attribute d and an index on attribute d.

In Operation O3, upon an update ΔRi to base relation Ri (1≦i≦n), the updated tuples in Ri are joined with the corresponding filtering relations of the other base relations of the JV (i.e., FR1, FR2, . . . , FRi−1, FRi+1, FRi+2, . . . , and FRn). If no join result tuple is generated, the content-based method in accordance with the present principles determines ΔRi to be irrelevant. Otherwise, the content-based method in accordance with the present principles does not know whether ΔRi is irrelevant unless the other base relations R1, R2, . . . , Ri−1, Ri+1, Ri+2, . . . , and Rn are checked. This is because in checking the filtering relations, the terms in Category 3 are ignored and, hence, we may have false negatives.

When the updated tuples in Ri are joined with the filtering relations FR1, FR2, . . . , FRi−1, FRi+1, FRi+2, . . . , and FRn, in an embodiment, the content-based method in accordance with the present principles only cares whether the join result set JS is empty. Hence, during the join process, two optimizations may be used to reduce the join overhead. In a first optional optimization, some attributes are projected out immediately after they are no longer needed. In a second optional optimization, for certain filtering relations, if there are multiple matching tuples in the filtering relation for an input tuple, then our content-based method only finds the first matching tuple rather than all matching tuples. In other words, for each input tuple to such a filtering relation, our content-based method generates at most one join result tuple. These two optional optimizations essentially compute a subset SS of the projection of JS and ensure that SS

JS=φ. The details of these two optimizations are straightforward and readily determined by one of ordinary skill in this and related art and are, thus, omitted here for the sake of brevity. However, two examples are provided herein after for illustrative purposes.

Consider the join view MV mentioned in the introduction. To illustrate the first optimization, consider an update ΔR to base relation R. In this case, the content-based method in accordance with the present principles only joins πa(ΔR) with the filtering relation FRS. For the join result Jra(ΔR)

a=bFRS, attributes a and b are projected out before Jr is joined with FRT. If either Jr or πc(Jr) c=dFRT is empty, then the content-based method knows that ΔR is irrelevant. Actually in this case, the content-based method can catch all irrelevant updates to the base relations. Thus, if we ignore the overhead of the checking/updating filtering relations, the content-based method avoids all unnecessary join view maintenance overhead in the content-independent method of the prior art. The overhead of checking/updating filtering relations is often minor.

To illustrate the second optimization, suppose tuple ts is inserted into S. In the filtering process, our content-based method joins tuple tS1b, c(tS) first with FRR, and then with FRT. When the content-based method searches in FRR, once the method finds the first tuple tR matching tS1, the method generates the join result tuple tjc(tR

a=btS1), stops the search in FRR, and continues to do the join with FRT. This is because the attributes of FRR do not include the join attribute c with FRT. Therefore, from the perspective of determining whether the join result with FRT is empty, there is no need to obtain more tuples in FRR that match tS1. If no tuple in FRR is found to match tS1, then we know that tuple tS is irrelevant. Similarly, when the content-based method searches in FRT, once the method finds the first tuple matching tj, the method stops the search in FRT.

In the traditional join view maintenance method, the work needed when base relation Ri (1≦i≦n) is updated is as follows:

update Ri;
Operation O2; /* check where clause condition in JV
    definition */
If (Operation O2 fails)
  Operation O4; (expensive) /* maintain JV using base
     relations */

When we say Operation O2 fails, we mean that Operation O2 cannot tell whether the update to Ri is irrelevant.

For comparison, in our content-based detection method, the work needed when base relation Ri (1≦i≦n) is updated is as follows:

update Ri;
Operation O1; (cheap) /* update FRi */
Operation O2; /* check where clause condition in JV
    definition */
If (Operation O2 fails)
  Operation O3; (cheap) /* check filtering relations */
  If (Operation O3 fails)
    Operation O4; (expensive) /* maintain JV using base
       relations */

Usually, due to selection and projection, filtering relations are much smaller than base relations and, thus, more likely to be cached in memory. In this case, checking filtering relations is much faster than checking base relations. If the percentage of the updates to the base relations that are irrelevant are greater than a pre-specified threshold, and using filtering relations can filter out most of the irrelevant updates, then the extra work of (cheap) Operations O1 and O3 is dominated by the work saved in the expensive Operation O4. As a result, the total join view maintenance overhead is greatly reduced. As an example for illustrative purposes, the pre-specified threshold may be, but is not limited to, about 5%. However, as is readily appreciated by one of ordinary skill in this and related arts, the exact percent is dependent upon the base relation size, the materialized view definition, and so forth and, thus varies from case to case.

Note that in order to minimize the sizes of filtering relations (the compactness property), in an embodiment, the terms in Category 3 are not considered in filtering relations and, thus, get ignored in the filtering process. Usually, using the terms in Categories 1 and 2 is sufficient to filter out most of the irrelevant updates.

In some embodiments of content-based filtering in accordance with the present principles, further enhancements may be performed. For example, some enhancements that may be performed for some embodiments include: compressing filtering relations; reducing the number of filtering relations; relaxing the equi-join condition of category 1; filtering out the irrelevant portion of an update; sharing a filtering relation among multiple join views; selectively skipping Operation O3; using information about (intermediate) join results in Operation O3; and load shedding. These enhancements may be used to enhance the compactness, efficiency, and functionality of filtering relations. It is to be appreciated that one or more of the enhancements may be utilized for a given embodiment in accordance with the present principles. These enhancements will be further described in detail herein after.

An embodiment will now be described with respect to compressing filtering relations. The performance advantages of the content-based detection method in accordance with the present principles depend heavily on the sizes of filtering relations. The smaller the filtering relations, the more likely they can be cached in memory and, thus, the greater performance advantages of the content-based detection method. Therefore, it is beneficial to reduce the sizes of filtering relations.

To achieve this size reduction goal, we use the following hashing method. For each term ci (1≦i≦m1) of Category 1 that is of the form Rj.a1=Rk.b1

Rj.a2=Rk.b2 . . . Rj.ah=Rk.bh(1≦j<k≦n, h≧1), if the representation of attributes (a1, a2, . . . , ah) is longer than that of an integer attribute, then we use a hash function H to map each (a1, a2, . . . , ah) into an integer. In the filtering relation FRj of base relation Rj, we store H(a1, a2, . . . , ah) rather than (a1, a2, . . . , ah). Also, in the filtering relation FRk of base relation Rk, we store H(b1, b2, . . . , bh) rather than (b1, b2, . . . , bh).

In practice, a large number of joins are based on key/foreign key attributes and the values of these attributes are usually long strings (e.g., ids). Therefore, hashing can often reduce the sizes of filtering relations significantly.

Suppose a hash function H (or multiple hash functions) has been applied to the filtering relation FRi of base relation Ri (1≦i≦n). Upon an update ΔRi to Ri, the hash function H is first applied to the corresponding join attributes of the updated tuples ΔRi. Then ΔRi is joined with the filtering relations FR1, FR2, . . . , FRi−1. FRi+1, FRi+2, . . . and FRn.

In the above hashing method, due to hash conflicts, we may introduce false negatives in detecting irrelevant updates using filtering relations. However, typical modern computers can represent a large number of distinct integer values (e.g., a 32-bit computer can represent 232 distinct integer values). In practice, if a good hash function is used, the probability of having hash conflicts should be low. As a result, this hashing method should not introduce a large number of false negatives.

An embodiment will now be described with respect to reducing the number of filtering relations. In practice, most updates occur to one (or a few) base relation. The other base relations are rarely updated. In this case, in an embodiment, our content-based method may only keep filtering relations for the rarely updated base relations. No filtering relation may be kept for the most frequently updated base relation. Then for the update to the mostly frequently updated base relation (i.e., for most updates to the base relations), the filtering relation maintenance overhead is avoided. As a tradeoff, when some rarely updated base relation is updated (i.e., for a few updates to the base relations), the content-based detection method is not preferred for use. Rather, we may use the standard join view maintenance method.

Suppose base relation Ri (1≦i≦n) is small enough to be cached in memory in most cases. Also, no hash function has been applied to the corresponding filtering relation FRi. Then there is no need to keep FRi. Rather, in Operation O3, when we check filtering relations for irrelevant updates to some other base relation Rj (1≦j≦n, j≠i), we use base relation Ri and filtering relation FRk's (1≦k≦n, k≠i, k≠j). We may build some indices on the join attributes of Ri. This can save the maintenance overhead of FRi when Ri is updated.

An embodiment will now be described with respect to relaxing the equi-join condition of Category 1. For each term of Category 1, in an embodiment, we restrict the equi-join condition on two base relations Rj and Rk (1≦j<k≦n) to be of conjunctive form. In another embodiment, this condition can be relaxed so that for each term of Category 1, the equi-join condition on Rj and Rk is of disjunctive-conjunctive form ( Rj.ai r,s =Rk.bi r,s ), where t≧1 and hr≧1 (1≦r≦t). Then for each r (1≦r≦t), our content-based method keeps attributes (ai r.1 ,ai r.2 , . . . , ai r. ) in the filtering relation FRj of Rj, and attributes (bi r.1 ,bi r.2 , . . . , bi r. ) in the filtering relation FRk of Rk. Also, in checking filtering relations for irrelevant updates, our content-based method considers the equi-join conditions on two base relations that are of disjunctive-conjunctive form.

An embodiment will now be described with respect to filtering out the irrelevant portion of an update. In the basic algorithm, the entire update ΔRi to base relation Ri (1≦i≦n) is treated as an entity. That is, in Operation O3, ΔRi is first joined with the filtering relations FR1, FR2, . . . , FRi−1, FRi+1, FRi+2, . . . , and FRn. If the join result set is empty, then we know that ΔRi is irrelevant. Otherwise in Operation O4, the entire ΔRi is joined with the base relations R1, R2, . . . , Ri−1, Ri+1, Ri+2, . . . , and Rn.

In general, if ΔRi includes multiple tuples, then some tuples may be irrelevant while others may be relevant. In this case, treating the entire ΔRi as an entity may be too coarse. Another method is to treat each individual tuple in ΔRi as an entity. In Operation O3, the irrelevant tuples in ΔRi are filtered out. Then the remaining tuples in ΔRi are passed to Operation O4.

Turning to FIG. 6, a method for filtering out the irrelevant portion of an update to a materialized view MV is indicated generally by the reference numeral 600. In the method 600, suppose ΔRi contains q tuples ti (1≦i≦q). In Operation O3, for each i (1≦i≦q), the number i is appended as an additional attribute aa to tuple ti (step 605). When ΔRi is joined with the filtering relations FR1, FR2, . . . , FRi−1, FRi+1, FRi+2, . . . , and FRn, aa is never projected out (step 610). After we obtain the join result set Sj from step 610, if Sj≠Ø (step 615), attribute aa is extracted from Sj (step 620). Otherwise, the update ΔRi is indicated as being irrelevant (step 625). Then after duplicate elimination (step 630), the values of aa represent the remaining tuples in ΔRi that need to be passed to Operation O4 (step 635).

An embodiment will now be described with respect to sharing a filtering relation among multiple join views.

Suppose multiple join views are built on the same base relation R. A simple method is to build multiple filtering relations of R, one for each join view. In certain cases, this may introduce redundancy among these filtering relations and cause two problems. First, the probability that the filtering relations are cached in memory is decreased. As a result, Operation O3 becomes more expensive. Second, when R is updated, updating all the filtering relations of R will be costly.

In this case, if possible, it may be better to let multiple join views share the same filtering relation of base relation R. For example, suppose join view JV1 is defined as follows:

create materialized view JV1 as

select * from R1, S, T1

where R1.a=S.b and S.c=T1.d and C1.

C1 is a selection condition on S.f. Join view JV2 is defined as follows:

create materialized view JV2 as

select * from R2, S, T2

where R2.e=S.b and S.f=T2.g and C2.

C2 is a selection condition on S.c. Then for base relation S, we may build only one filtering relation FRSb.c.f(σVC 1 vC 2 (S)) rather than two filtering relations FRSb.cC 1 (S)) and FRS2b.fC 2 (S)). FRs can be used for both JV1 and JV2. Whether FRS is better than FRS1 and FRS2 depends on the overlapping degree of C1and C2.

Turning to FIG. 7, an exemplary base relation S to which the present invention may be applied is indicated generally by the reference numeral 700. Base relation S is built on filtering relations FRS, FRS1, and FRS2, and involves attributes c, b, and f.

An embodiment will now be described with respect to selectively skipping Operation O3. If either a small percentage of the update ΔRi to base relation Ri is irrelevant, or ΔRi is large enough so that hash/sort-merge join becomes the join method of choice for the join with some base relation Rj (1≦j≦n, j≠i), then the content-based method may perform worse than the traditional content-independent method of the prior art. In this case, Operation O3 can be skipped in the content-based method. This is equivalent to using the content-independent method plus updating the filtering relation FRi accordingly. We can easily build an analytical model that can provide a means to determine the upper bound on the size of ΔRi (or lower bound on the percentage of ΔRi that is irrelevant) where performing Operation O3 is beneficial. As an example for illustrative purposes, the “small percentage” of the update ΔRi to base relation Ri that is irrelevant may be, but is not limited to, about 0.5%. However, as is readily appreciated by one of ordinary skill in this and related arts, the exact percent is dependent upon the base relation size, the materialized view definition, and so forth and, thus varies from case to case.

An embodiment will now be described with respect to using the information about (intermediate) join results in Operation O3. Recall that in Operation O3, ΔRi is joined with the filtering relations FR1, FR2, . . . , FRi−1, FRi+1, FRi+2, . . . , and FRn. As a result, we know the (intermediate) join result sizes. If these (intermediate) join result sizes are significantly different from original estimates, we know that the statistics in the database are imprecise.

Then in Operation O4, when the remaining tuples in ΔRi (after filtering) are joined with the base relations R1, R2, . . . , Ri−1, Ri+1, Ri+2, . . . , and Rn, the information that is gained in Operation O3 may be used to choose a better query plan.

For example, consider the join view mentioned in the introduction. The base relation R is updated by ΔR. Suppose that it is believed that each tuple in ΔR has only a few matching tuples in base relation S. As a result, in Operation O4, index nested loops is chosen as the join method for the join with S. However, from the information we gained in Operation O3, we know that each tuple in ΔR has a large number of matching tuples in the filtering relation FRS (and thus also a large number of matching tuples in S). Then in Operation O4, our content-based method may indicate to choose hash join as the join method for the join with S.

An embodiment will now be described with respect to load shedding. Our content-based method can estimate the effect of an update to the base relation on the join view. If the RDBMS is overloaded, then such estimate can provide guidance to a load shedding algorithm. For example, we may ignore those “unimportant” updates to base relations during join view maintenance. Moreover, we may collapse multiple updates into a single transaction so that more efficient algorithms (such as hash/sort-merge join) can be used.

Having described preferred embodiments of a system and method (which are intended to be illustrative and not limiting), it is noted that modifications and variations can be made by persons skilled in the art in light of the above teachings. It is therefore to be understood that changes may be made in the particular embodiments disclosed which are within the scope and spirit of the invention as outlined by the appended claims. Having thus described aspects of the invention, with the details and particularity required by the patent laws, what is claimed and desired protected by Letters Patent is set forth in the appended claims.

Referenced by
Citing PatentFiling datePublication dateApplicantTitle
US7822712 *Oct 18, 2007Oct 26, 2010Google Inc.Incremental data warehouse updating
Classifications
U.S. Classification1/1, 707/999.2
International ClassificationG06F17/30
Cooperative ClassificationG06F17/30451, G06F17/30457
European ClassificationG06F17/30S4P3T1, G06F17/30S4P3T3
Legal Events
DateCodeEventDescription
Apr 4, 2006ASAssignment
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LUO, GANG;YU, PHILIP SHI-LUNG;REEL/FRAME:017429/0249
Effective date: 20060316