US 20020198896 A1 Abstract In a database system, a method of maintaining a self-tuning histogram having a plurality of existing rectangular-shaped buckets arranged in a hierarchical manner and defined by at least two bucket boundaries, a bucket volume, and a bucket frequency. At least one new bucket is created in response to a query on the database. Each new bucket is contained within at least one existing bucket; the new bucket becomes a child bucket and the existing bucket containing it becomes a parent bucket. The boundaries of each new bucket correspond to a region of the database accessed by the query and the frequency of the new bucket is the number of data records returned by the query. Buckets may be merged based on a merge criterion such as similar bucket density when the total number of buckets exceeds a predetermined budget. The boundaries of a new bucket may be shrunk if the boundaries of the new bucket intersect any existing bucket boundaries.
Claims(23) 1. In a database system, a method of maintaining a self-tuning histogram having a plurality of existing buckets arranged in a hierarchical manner and defined by at least two bucket boundaries, a bucket volume, and a bucket frequency comprising the step of creating at least one new bucket in response to a query on the database wherein each new bucket is contained within at least one existing bucket and wherein the new bucket becomes a child bucket and the existing bucket becomes a parent bucket. 2. The method of 3. The method of 4. The method of 5. The method of 6. The method of 7. The method of 8. The method of 9. In a database system, a method of maintaining a self-tuning histogram having a plurality of existing buckets arranged in a hierarchical manner and defined by at least two bucket boundaries, a bucket volume, and a bucket frequency comprising the steps of:
a) examining the results of a query executed on the database; b) creating at least one candidate hole in the histogram based on the results of the query; c) modifying each candidate hole such that the modified hole is completely contained within at least one existing parent bucket and does not partially intersect any existing bucket; and d) creating a new child bucket in the histogram corresponding to each modified hole. 10. The method of 11. The method of 12. The method of 13. The method of 14. The method of 15. The method of 16. A computer readable medium having computer executable instructions for performing steps for maintaining a self-tuning histogram having a plurality of existing buckets arranged in a hierarchical manner and defined by at least two bucket boundaries, a bucket volume, and a bucket frequency, the steps comprising:
a) examining the results of a query executed on the database; b) creating at least one candidate hole in the histogram based on the results of the query; c) modifying each candidate hole such that the modified hole is completely contained within at least one existing parent bucket and does not partially intersect any existing bucket; and d) creating a new child bucket in the histogram corresponding to each modified hole. 17. The computer readable medium of 18. The computer readable medium of 19. The computer readable medium of 20. An apparatus for maintaining a self-tuning histogram having a plurality of existing buckets arranged in a hierarchical manner and defined by at least two bucket boundaries, a bucket volume, and a bucket frequency comprising:
a) means for examining the results of a query executed on the database; b) means for creating at least one candidate hole in the histogram based on the results of the query; c) means for modifying each candidate hole such that the modified hole is completely contained within at least one existing parent bucket and does not partially intersect any existing bucket; and d) means for creating a new child bucket in the histogram corresponding to each modified hole. 21. The apparatus of 22. An apparatus for maintaining a self-tuning histogram having a plurality of existing buckets arranged in a hierarchical manner and defined by at least two bucket boundaries, a bucket volume, and a bucket frequency comprising:
a) a memory device for storing a database comprising multiple data records; b) a computer having one or more processing units for executing a stored computer program, said computer including a rapid access memory store; and c) an interface for coupling the memory device for storing the database to the computer to allow records to be retrieved from the database; wherein d) the stored program has components including i) a component for examining the results of a query executed on the database; ii) a component for creating at least one candidate hole in the histogram based on the results of the query; iii) a component for modifying each candidate hole such that the modified hole is completely contained within at least one existing parent bucket and does not partially intersect any existing bucket; and iv) a component for creating a new child bucket in the histogram corresponding to each modified hole. 23. The apparatus of Description [0001] The present invention relates generally to the field of database systems. More particularly, the present invention relates to the field of histogram construction for database systems. [0002] Computer database systems manage the storage and retrieval of data in a database. A database comprises a set of tables of data along with information about relations between the tables. Tables represent relations over the data. Each table comprises a set of records of data stored in one or more data fields. The records of a table are also referred to as rows, and the data fields of records in a table are also referred to as columns. [0003] A database server processes data manipulation statements or queries, for example, to retrieve, insert, delete, and modify data in a database. Queries are defined by a query language supported by the database system. To enhance performance in processing queries, database servers use information about the data distribution to help access data in a database more efficiently. 
Typical servers comprise a query optimizer which estimates the selectivity of queries and generates efficient execution plans for queries. Query optimizers generate execution plans based on the query and in doing so exploit statistical information on the column(s) of the table(s) referenced in the queries. [0004] Database servers may create histograms on the columns of tables to represent the distribution of data. A histogram is one means of representing the distribution of data in a database. A histogram on a data attribute consists generally of a set of partitions or boundaries which divide the range of data on the attribute into a set of segments or buckets. Also associated with each bucket is a frequency which corresponds to the number of data tuples which fall within the boundaries of the bucket. The frequency associated with a bucket, or bucket frequency, is an indication of the density of data within the bucket's boundaries, and should not be confused with the absolute value of the data within the bucket. [0005] The accuracy of the estimations of the query optimizer is enhanced by the availability of histograms; however, creating and maintaining histograms can incur significant costs, particularly for large databases. This problem is particularly striking for multi-dimensional histograms that capture joint distributions of correlated data attributes. Although multi-dimensional histograms can be highly valuable, the relatively high cost of building and maintaining them often prevents their use. [0006] Query optimization in relational database systems has traditionally relied on single-attribute histograms to compute the selectivity of queries. For queries that involve multiple attributes, most database systems make the attribute value independence assumption, i.e., assume that Prob(A [0007] An alternative to assuming attribute value independence is to use histograms over multiple attributes, which are generally referred to as multidimensional histograms. 
Ideally, multidimensional histograms should consist of buckets that enclose regions of the data domain with close-to-uniform tuple density. At the same time, multidimensional histograms should be sufficiently compact and efficiently computable. Unfortunately, existing multidimensional histogram construction techniques fail to satisfy these requirements robustly across data distributions. [0008] Several techniques exist in the literature to compute selectivity estimators of multidimensional data sets. These techniques include wavelets and discrete cosine transformations, sampling, and multidimensional histograms. The V-optimal(f/f) family of histograms groups contiguous sets of frequencies into buckets and minimizes the variance of the overall frequency approximation. These histograms work well for estimating the result size of tree, function tree, equality join, and selection queries under a definition of optimality that captures the average error over all possible queries and databases. However, these histograms need to record every distinct attribute value inside each bucket, which is impractical. Moreover, the construction algorithm involves an exhaustive and exponential enumeration of all possible histograms. A more practical approach is to restrict the attention to V-optimal(v,f) histograms, which group contiguous sets of values into buckets, minimizing the variance of the overall frequency approximation. A dynamic programming algorithm has been presented for building unidimensional V-optimal(v,f) histograms in O(N [0009] A multidimensional version of the Equi-Depth histogram recursively partitions the data domain into buckets with the same frequency, one dimension at a time. A technique called Mhist is based on MaxDiff(v,a) histograms in which the data domain is iteratively partitioned using a greedy procedure. 
In each step, MaxDiff(v,a) identifies the bucket in most need of partitioning and splits it along the dimension with the highest difference in frequency between consecutive values. GenHist histograms allow unrestricted overlap among buckets. If more than two buckets overlap, the density of tuples in their intersection is approximated as the sum of the data densities of the overlapping buckets. For the technique to work, a tuple that lies in the intersection of many buckets is counted in only one of them (chosen probabilistically). Progressively coarser grids are constructed over the data set and the densest cells are converted into buckets of the histogram. A certain percentage of tuples in those cells is removed to make the resulting distribution smoother. [0010] The above discussed histogram techniques are static in the sense that after the histograms are built, their buckets and frequencies remain fixed regardless of any changes in the data distribution. One technique decides when reorganization is needed by using thresholds that depend on the number of updates over the relation or the accuracy of the histogram. For example, if the average estimation error is above a given value, the whole histogram is discarded and rebuilt from scratch. Some techniques consider histogram refinement as an alternative to periodic reconstruction. One such technique maintains a backing sample and an approximate Equi-Depth histogram in memory. During insertions and deletions, both the sample and the histogram are updated. When the Equi-Depth constraint that all bucket frequencies should be equal is violated beyond a given threshold, some buckets are split and others are merged to restore the Equi-Depth constraint. If no reorganization can restore the constraint, the existing histogram is discarded and a new one is built from the sample. 
Another technique considers dynamic compressed histograms that store some values in singleton or singular buckets, while the rest are partitioned using Equi-Depth into regular buckets. The general idea is to relax histogram constraints up to a certain point, after which the histogram is reorganized so that it satisfies the constraints. For dynamic compressed histograms, aχ [0011] One parametric technique for approximating data distributions that uses feedback from the query execution engine represents the data distribution as a linear combination of model functions. The weighting coefficients of this linear combination are adjusted using feedback information and a least squares technique. This technique is dependent on the choice of model functions and assumes that the data follows some smooth and known distribution. STGrid histograms use query feedback to refine buckets. An STGrid histogram greedily partitions the data domain into disjoint buckets that form a grid, and refines their frequencies using query feedback. After a predetermined number of queries, the histogram is restructured by merging and splitting rows of buckets one at a time (to preserve the grid structure). Accuracy is traded for efficiency in histogram tuning, a goal of this technique. Since STGrid histograms need to maintain the grid structure at all times, and due to the greedy nature of the technique, some locally beneficial splits and merges have the side effect of modifying distant and unrelated regions, hence decreasing the overall accuracy. [0012] Self-tuning histograms for databases have a plurality of existing buckets defined by at least two bucket boundaries, a bucket volume, and a bucket frequency. The results of a query executed on the database are examined and at least one candidate hole in the histogram is created based on the results of the query. 
Each candidate hole is modified such that the modified hole is completely contained within at least one existing parent bucket and does not partially intersect any existing bucket. A new child bucket is created in the histogram corresponding to each modified hole.
[0013] In one implementation, each bucket has a rectangular shape. The boundaries of the candidate hole correspond to a region of the database accessed by the query and the frequency of the candidate hole is a number of data records returned by the query. Buckets are merged based on a merge criterion when the total number of buckets exceeds a predetermined budget. In this implementation, the merge criterion is a similar bucket density, wherein bucket density is based on the bucket frequency divided by the bucket volume. In this implementation, the frequency of the parent bucket is diminished by the frequency of the child bucket.
[0014] FIG. 1 is an exemplary operating environment for practice of the present invention;
[0015] FIG. 2 is a block diagram of components used for practice of an embodiment of the invention;
[0016] FIG. 3 is a block diagram illustrating a histogram constructed in accordance with an embodiment of the present invention;
[0017] FIG. 4 is a flow diagram depicting the steps of a method for practicing an embodiment of the present invention;
[0018] FIG. 5 is a block diagram of the practice of an aspect of the present invention;
[0019] FIG. 6 is a block diagram of the practice of an aspect of the present invention;
[0020] FIG. 7 is a block diagram of the practice of an aspect of the present invention;
[0021] FIG. 8 is a block diagram of the practice of an aspect of the present invention; and
[0022] FIG. 9 is a block diagram of the practice of an aspect of the present invention.
[0023] With reference to FIG. 
1 an exemplary embodiment of the invention is practiced using a general purpose computing device [0024] The system memory includes read only memory (ROM) [0025] The computer [0026] A number of program modules may be stored on the hard disk, magnetic disk [0027] The computer [0028] When used in a LAN networking environment, the computer [0029]FIG. 2 illustrates a block diagram of a database system [0030]FIG. 3 illustrates a histogram [0031] Examining the histogram [0032] The volume of a bucket b is defined as vBox(b)−Σ [0033] Where v(q∩b) denotes the volume of the intersection of q and b (not box(b)). [0034]FIG. 4 is a flow diagram depicting a method [0035] In step [0036] Referring now to FIG. 5, a bucket b with frequency f(b)=100 is shown. The result stream for a query q indicates that T [0037] If the intersection of a query q and a bucket b is rectangular as in Example 1, it is always considered a candidate hole. However, it is not always possible to create a hole in a bucket b to form a new bucket q∩b. This is because the children of b might be taking some of b's space, and therefore the bounding box of q∩b may not be rectangular anymore thus violating the rectangular partitioning constraint imposed on the histogram by the method. In Example 1, FIG. 5, the intersection between q and b's parent b [0038]FIG. 6 shows a four bucket histogram and the progressive shrinking of the initial candidate hole, c=q∩b. The buckets that partially intersect with c, called participants in the algorithm are b [0039] After a candidate hole has been shrunk to an appropriate shape such that it does not intersect with any child of b [0040] In step [0041]FIG. 7 depicts a three bucket histogram H. Given a two bucket budget, buckets b [0042] A penalty function is used to return the cost of merging a pair of buckets. If two buckets b [0043] where dom(D) is the domain of the data set D. 
In other words, the penalty for merging two buckets measures the difference in approximation accuracy between the old, more expressive histogram where both buckets are separate and the new, smaller histogram where the buckets have been collapsed. A merge with a small penalty will result in little difference in approximation for range queries and therefore will be preferred over another merge with higher penalty. Since estimated density of tuples inside a bucket is constant by definition, penalty functions can be calculated efficiently. All regions r [0044] There are two families of merges that correspond to merging adjacent buckets in the tree representation of the histogram: parent-child merges and sibling-sibling merges. In a parent-child merge, a bucket is merged with its parent. In a sibling-sibling merge, two buckets with the same parent are merged possibly taking some of the parent space (since both siblings must be enclosed in a rectangular bounding box). Parent-child merges are useful to eliminate buckets that become too similar to their parents, e.g., when their own children cover all interesting regions and therefore carry all useful information. Sibling-sibling merges are useful to extrapolate frequency distributions to yet unseen regions in the data domain, and also to consolidate buckets with similar density that cover close regions. [0045]FIG. 8 illustrates a parent-child merge of buckets b [0046] where H′ is the histogram that results from merging b [0047]FIG. 9 illustrates a sibling-sibling merge of buckets b [0048] Hence:
[0049] Where H′ is the histogram that results from merging b [0050] As can be seen from the foregoing description, the method of updating histograms of the present invention allows buckets to be nested and tunes the histogram to the specific query workload received by the database system. Buckets are allocated where they are needed the most, as indicated by the workload, which leads to improved query selectivity estimations. [0051] In the foregoing description, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit or scope of the present invention as defined in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
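The nested-bucket bookkeeping described above (rectangular buckets, density as frequency divided by volume, a parent's frequency diminished when a child bucket is created, and a parent-child merge) can be illustrated with a minimal Python sketch. It is not the patented implementation. The names Bucket, drill_hole, and parent_child_merge are hypothetical, and the sketch assumes that a bucket's own volume is the volume of its bounding box minus the bounding boxes of its direct children, that existing children lying entirely inside a new hole move under the new child, and that a parent-child merge simply adds the child's frequency back to the parent. Hole shrinking and merge penalties are not modeled.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class Bucket:
    lo: tuple            # lower corner of the rectangular bounding box
    hi: tuple            # upper corner of the rectangular bounding box
    freq: float          # tuples attributed to this bucket, excluding its children
    children: list = field(default_factory=list)

    def box_volume(self):
        # Volume of the bounding box (product of the side lengths).
        return prod(h - l for l, h in zip(self.lo, self.hi))

    def volume(self):
        # Assumption: own volume excludes the boxes of direct children.
        return self.box_volume() - sum(c.box_volume() for c in self.children)

    def density(self):
        # Bucket density = bucket frequency / bucket volume.
        return self.freq / self.volume()

    def contains(self, lo, hi):
        # True if the box [lo, hi] lies completely inside this bucket's box.
        return all(a <= l and h <= b
                   for a, l, h, b in zip(self.lo, lo, hi, self.hi))


def drill_hole(parent, lo, hi, freq):
    """Create a child bucket for an already-shrunk hole inside `parent`."""
    if not parent.contains(lo, hi):
        raise ValueError("hole must be completely contained in the parent")
    child = Bucket(lo, hi, freq)
    # Assumption: children fully enclosed by the hole move under the new child.
    enclosed = [c for c in parent.children if child.contains(c.lo, c.hi)]
    for c in enclosed:
        parent.children.remove(c)
        child.children.append(c)
    parent.children.append(child)
    # The parent's frequency is diminished by the child's frequency.
    parent.freq = max(parent.freq - freq, 0)
    return child


def parent_child_merge(parent, child):
    """Collapse `child` into `parent` (assumed: frequencies are summed)."""
    parent.children.remove(child)
    parent.children.extend(child.children)
    parent.freq += child.freq
```

As a usage illustration with made-up numbers: starting from a 10 by 10 bucket with frequency 100, drilling a 5 by 5 hole with frequency 60 leaves the parent with frequency 40 over a volume of 75, while the child holds 60 tuples over a volume of 25. A parent-child merge then restores a single bucket with frequency 100.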