US 20080162424 A1 Abstract A system and associated computer product improve the search of multidimensional databases. The present system determines a near-optimal grid index that is used to locate a geometric shape in a spatial database. More particularly, the present system improves the technique of sampling data for defining the grid cell size in a grid for a given data set, minimizing the number times the data set needs to be sampled, thereby reducing the time to compute the cost of alternative grid index parameters.
Claims(8) 1. A computer program product having a plurality of instruction codes stored in a computer-usable storage medium executed by a computer for performing the method of determining an optimal grid index specification for multidimensional data, comprising:
creating a geometry histogram by scanning geometries to determine minimum and maximum extents, computing bucket sizes of buckets, and scanning the geometries to generate the geometry histogram using the buckets, wherein the bucket sizes are computed by dividing a size of a largest maximum bounding rectangle into a number of intervals; computing a set of query region sizes; for each size of a set of query region sizes, computing a minimum performance indicator by, for each of the buckets of the geometry histogram, estimating a number of index entries for each of multiple grid size levels by computing bucket intersections; and returning an optimal grid size for each size of the query region. 2. The computer program product of computing a set of grid sizes at a first level of the multiple grid size levels. 3. The computer program product of computing a matrix of grid size levels for the first level; for each matrix row of the matrix,
computing the number of index entries; and
in response to determining that the number of index entries is less than a minimum number of index entries, setting the minimum number of index entries to the determined number of index entries. 4. The computer program product of computing estimated intersections for each of multiple grid sizes. 5. A system for determining an optimal grid index specification for multidimensional data, comprising:
a processor; and a memory coupled to the processor, the memory comprising a plurality of instruction codes executable by the processor for:
creating a geometry histogram by scanning geometries to determine minimum and maximum extents, computing bucket sizes of buckets, and scanning the geometries to generate the geometry histogram using the buckets, wherein the bucket sizes are computed by dividing a size of a largest maximum bounding rectangle into a number of intervals;
computing a set of query region sizes;
for each size of a set of query region sizes, computing a minimum performance indicator by, for each of the buckets of the geometry histogram, estimating a number of index entries for each of multiple grid size levels by computing bucket intersections; and
returning an optimal grid size for each size of the query region.
6. The system of computing a set of grid sizes at a first level of the multiple grid size levels. 7. The system of computing a matrix of grid size levels for the first level; for each matrix row of the matrix,
computing the number of index entries; and
in response to determining that the number of index entries is less than a minimum number of index entries, setting the minimum number of index entries to the determined number of index entries.
8. The system of computing estimated intersections for each of multiple grid sizes. Description This application is a continuation application of and claims the benefit of “System And Method For Determining An Optimal Grid Index Specification For Multidimensional Data”, having application Ser. No. 11/007,132, filed Dec. 7, 2004, the disclosure of which is incorporated herein by reference in its entirety. This application relates to co-pending U.S. patent application Ser. No. 10/144,058, entitled “Systems, Methods, and Computer Program Products to Improve Indexing of Multidimensional Databases,” filed on May 10, 2002, and to co-pending U.S. patent application Ser. No. 10/141,919, entitled “Reducing Index Size for Multi-Level Grid Indexes,” filed on May 10, 2002, both of which are assigned to the assignee as the present application, each of which is incorporated herein by reference in its entirety. This application relates to co-pending U.S. patent application Ser. No. 12/021,193, which is a continuation application of and claims the benefit of “Method to Improve Indexing of Multidimensional Databases”, having U.S. patent application Ser. No. 10/144,058, filed May 10, 2002, and each of which is incorporated herein by reference in its entirety. This application relates to co-pending U.S. U.S. patent application Ser. No. 10/792,446, entitled “Index Exploitation For Spatial Data,” filed on Mar. 2, 2004, by David Adler, Attorney's Docket Number SVL920040011US1, assigned to the assignee of the present invention, which is incorporated herein by reference in its entirety. This application relates to co-pending U.S. U.S. patent application Ser. No. 11/255,357, entitled “Reducing Index Size for Multi-Level Grid Indexes,” filed on Oct. 20, 2005, by David Adler et al., Attorney's Docket Number SVL920020013US2, which is a continuation application of U.S. patent application Ser. No. 10/141,919, filed May 10, 2002, assigned to the assignee of the present invention, and each of which is incorporated herein by reference in its entirety. This application relates to co-pending U.S. U.S. patent application Ser. No. 11/255,296, entitled “Reducing Index Size for Multi-Level Grid Indexes,” filed on Oct. 20, 2005, by David Adler et al., Attorney's Docket Number SVL920020013US3, which is a divisional application of Ser. No. 10/141,919, filed May 10,2002, assigned to the assignee of the present invention, and each of which is incorporated herein by reference in its entirety. This application relates to co-pending U.S. U.S. patent application Ser. No. 11/931,786, entitled “Reducing Index Size for Multi-Level Grid Indexes,” filed on Oct. 30, 2007, by David Adler et al., Attorney's Docket Number SVL920020013US4, which is a continuation application of U.S. patent application Ser. No. 10/141,919, filed May 10, 2002, assigned to the assignee of the present invention, and each of which is incorporated herein by reference in its entirety. This application relates to co-pending U.S. U.S. patent application Ser. No. 12/020,474, entitled “REDUCING INDEX SIZE FOR MULTI-LEVEL GRID INDEXES”, filed on Jan. 25, 2008, by David Adler et al., Attorney's Docket Number SVL920020013US5, which is a continuation of U.S. patent application Ser. No. 11/931,786, entitled “Reducing Index Size for Multi-Level Grid Indexes,” filed on Oct. 30, 2007, by David Adler et al., Attorney's Docket Number SVL920020013US4, which is a continuation application of U.S. patent application Ser. No. 10/141,919, filed May 10, 2002, assigned to the assignee of the present invention, and each of which is incorporated herein by reference in its entirety. This application relates to U.S. Pat. No. 7,143,098, issued on Nov. 28, 2006, having U.S. patent application Ser. No. 10/144,389, filed May 10, 2002, which is incorporated herein by reference in its entirety. This application relates to co-pending U.S. U.S. patent application Ser. No. 11/255,297, entitled “Systems, Methods, and Computer Program Products to Reduce Computer Processing in Grid Cell Size Determination for Indexing of Multidimensional Databases”, filed on Oct. 20, 2005, which is a continuation application of and claims the benefit of “Systems, Methods, and Computer Program Products to Reduce Computer Processing in Grid Cell Size Determination for Indexing of Multidimensional Databases”, U.S. Pat. No. 7,143,098, issued on Nov. 28, 2006, having U.S. patent application Ser. No. 10/144,389, filed May 10, 2002, and each of which is incorporated herein by reference in its entirety. The present invention generally relates to the field of indexing computer-based multidimensional data. More specifically, the present invention relates to reducing data collection used in the determination of the grid cell size when grid-indexing techniques are applied to multidimensional data on a computer system. Databases are computerized information storage and retrieval systems. A relational database management system (RDBMS) is a database management system (DBMS) that uses relational techniques for storing and retrieving data. Relational databases are organized into tables that comprise rows, all rows having the same columns of data. Each column maintains information on a particular type of data for the data records that comprise the rows. The rows are formally called tuples or records. A database typically comprises many tables and each table typically comprises tuples and columns. The tables are typically stored on direct access storage devices (DASD), such as magnetic or optical disk drives for semi-permanent storage. Tables are searched using, for example, a Structured Query Language (SQL), which specifies search operations or predicates to perform on columns of tables in the database. The search operations qualify rows in the database tables that satisfy the search conditions. Relational database management system (RDBMS) software using a Structured Query Language (SQL) interface is well known in the art. The SQL interface has evolved into a standard language for RDBMS software and has been adopted as such by both the American National Standards Institute (ANSI) and the International Standards Organization (ISO). Indexes are used with database implementations to provide good application query performance. Indexes are typically constructed using the data values in one or more columns of an RDBMS table row (e.g., using information such as product number, customer name, address, etc.). This information is represented by bit strings that define numeric or character values. An RDBMS may implement a B-tree index that generates a binary tree based on the bit string values. When a query includes values of columns contained in an index, the B-tree index can be scanned quickly to find the candidate rows with these column values. Indexing techniques are used to quickly access data. Spatial data is typically information associated with geometric shapes such as lines, points, poly-lines, polygons, and surfaces. Consequently, spatial data is typically represented by one or more coordinates comprising pairs of numeric values (x, y) corresponding to, for example, locations on the earth. Spatial data is often very large and may have two, three, or more dimensions. Queries against spatial data typically are more complex than identifying a specific row or a set of rows with values between a simple range. The indexing techniques for traditional alphanumeric data are typically based on a linear ordering of key values in a single dimension. B-tree indexing is one of the most common techniques used but this is only suited for single dimension data, not multi-dimensional data such as spatial data. Various indexing techniques have been developed specifically for multi-dimensional data. Grid indexing is one of these indexing techniques associated with searching spatial multidimensional data. Although this technology has proven to be useful, it would be desirable to present additional improvements. The grid cell size used in grid indexing strongly affects the efficiency of accessing spatial data by techniques that employ grid indexing. A problem has been to refine the determination of particular grid cell sizes and thereby reduce the overhead associated with searching a spatial data set via grid indexing using conventional techniques. More particularly, a problem has been to reduce the computational processing to perform the sampling that occurs during statistics collection. Such data is used to determine the proper grid cell size. Those skilled in the art will appreciate the technique of accessing spatial data by determining overlap of a geometric shape with a grid cell matrix. A grid index contains one index entry for each grid cell that overlap a geometric shape. The storage and processing increases with the number of grid cells that overlap a geometric shape. This aspect would suggest large grid cell sizes compared to geometric shape sizes in order to approach a one-to-one relationship of index entries to geometric shapes. Typical spatial queries are based on finding geometric shapes which overlap a rectangular query region. The grid index technique will scan all index entries in the grid cells which overlap the query region. As the grid size increases, more index entries outside the query region will need to be examined and discarded. An optimal grid size must determine the appropriate trade-off between these two opposing considerations. A geometric shape that is typically the subject of spatial data may be approximated by a rectangle. When a rectangle bounds the geometric shape with a minimum enclosure, it is referred to as a “minimum-bounding rectangle.” A minimum-bounding rectangle is defined to approximate a geometric shape located in a space. Coordinates located on a grid reference the minimum-bounding rectangle and the approximated geometric shape that represent the location of the minimum-bounding rectangle. For example, the coordinates on a grid that correspond to the corners of the minimum-bounding rectangle are stored and used to reference the minimum-bounding rectangle. An index enables fast access of a certain subset of data contained in a larger set of data. The index comprises a data structure and the techniques used to build, maintain, and search the data structure for the purpose of accessing a subset of data. For example, an index may define a data structure that is used to access a specific geometric shape included in a set of spatial data. The particular index of the present example may define a data structure that contains references to the minimum-bounding rectangles associated with various geometric shapes in a spatial data set. By accessing locator references associated with the minimum-bounding rectangles, the process of accessing particular geometric shapes in a spatial data set is simplified. Conventional techniques have typically required significant resources to locate a geometric shape in a spatial database. The lack of an efficient process for determining an index that facilitates streamlined location of minimum-bounding rectangles and associated geometric shapes has contributed to inefficient access of information in spatial databases with grid indexing. One problem has been to minimize the amount of data that is processed in order to determine an efficient grid cell size. That is, there exists a need to reduce from the processing required to perform sampling during statistics collection. What is therefore needed is a system, a computer program product, and an associated method to improve the determination of the grid cell size when grid-indexing techniques are applied to spatial data on a computer system for significantly reducing the processing time. The need for such a solution has heretofore remained unsatisfied. SUMMARY OF THE INVENTION The present invention satisfies this need, and presents a system, a computer program product, and an associated method (collectively referred to herein as “the system” or “the present system”) for determining an optimal grid index specification for multidimensional data. Conventional techniques require significant resources to locate a geometric shape in a spatial database. The present system advantageously improves the technique of indexing data in a multidimensional database. Such data is used to locate a geometric shape in a spatial database by associating the geometric shape with one or more grid cells. More particularly, the present system minimizes the processing to collect geometric shape statistics. These data are used to determine an efficient grid cell size in a manner such that the grid indexing technique of the present system efficiently locates a particular minimum-bounding rectangle. The minimum-bounding rectangle is associated with one or more grid cells and facilitates the location of geometric shapes. The present system reduces the amount of computer processing required for sampling data during statistics collection associated with determining efficient grid cell sizes. In one embodiment, the present system uses geometric shape statistics to estimate the number of index entries associated with geometric shapes in the spatial database at a first grid level. The present system estimates the number of index entries associated with two or more grid levels, thereby reducing the overall number of index entries used to determine an index performance indicator, “NeQr”. The index performance indicator, NeQr, is an estimate of the grid index performance. The number of index entries and the grid cell sizes are used to estimate the number of index entries associated with the first grid level. While accessing the number of index entries, the present system obtains the number of index entries and the grid cell sizes associated with two or more grid levels. By reducing the number of index entries that identify associations between grid cells and geometric shapes, computer processing used during indexing data in a multidimensional database is minimized. Therefore the present system provides a technique for improving searches that use indexing techniques and operate on databases that comprise complex data such as, for example, spatial data. The techniques of the present system are especially advantageous when applied to grid-indexing techniques associated with geometric shapes that are represented by spatial data in spatial databases. However, the present system is not restricted to techniques applied to spatial databases and can be used with techniques for searching other multidimensional databases. The present system improves conventional techniques used to determine the grid cell size and facilitate efficient indexing of a spatial database. More particularly, the present system generates a reduced number of index entries when defining the grid cell size in a grid for a given data set compared with conventional techniques. By reducing the number of index entries, the determination of a minimum value of the index performance indicator, NeQr, is improved over conventional techniques by minimizing the number of index entries evaluated during a search for a geometric shape. The minimum value of the index performance indicator, NeQr, further represents the minimum number of index entries generated for geometric shapes that overlap a query window. Once the minimum number of index entries is determined, space can be partitioned by a grid into the appropriate number of grid cells to support efficient grid indexing, where space may be represented by data. The present system determines the index performance indicator, NeQr, by generating and processing statistics associated with the data, such as the spatial data. The present system constructs a histogram of average geometry sizes that can then be used subsequently to compute the index performance indicator, NeQr, of candidate grid index values. A minimum index performance indicator and associated grid index values can then be identified. One embodiment for partitioning space into grid cells, for the purpose of accessing spatial multidimensional data, comprises ascribing different levels to the partitioned space. The plurality of levels may represent partitions of the space in varying levels of granularity. The present system operates on these varying levels of granularity. The various features of system As shown in Further, a multidimensional data cube A technique for partitioning space into grids such as a grid A geometric shape identifier (ID) The determination of the index performance indicator NeQr As illustrated in For example, the geometric shape A, The geometric shape ID The index data structure Similarly, geometric shape B, - SELECT
- MIN(shape . . . xmin) AS xmin,
- MIN(shape . . . ymin) AS ymin,
- MAX(shape . . . xmax) AS xmax,
- MAX(shape . . . ymax) AS ymax,
- MAX(
- CASE WHEN shape . . . xmax−shape . . . xmin>shape . . . ymax−shape . . . ymin
- THEN shape . . . xmax−shape . . . xmin
- ELSE shape. .ymax - shape. .ymin
- END) AS maxmbr,
- CASE WHEN shape . . . xmax−shape . . . xmin>shape . . . ymax−shape . . . ymin
- AVERAGE(
- CASE WHEN shape . . . xmax−shape . . . xmin>shape . . . ymax−shape . . . ymin
- THEN shape . . . xmax−shape . . . xmin
- ELSE shape . . . ymax−shape . . . ymin
- CASE WHEN shape . . . xmax−shape . . . xmin>shape . . . ymax−shape . . . ymin
- END) AS avgmbr
- FROM schema.table
System - scaleFactor=pow(10.0, 4−int(log10(maxmbr)))
- WITH mbrs(mbr) AS (
- SELECT
- CASE WHEN shape . . . xmax−shape . . . xmin>shape . . . ymax−shape . . . ymin
- THEN shape. .xmax - shape. .xmin
- ELSE shape. .ymax - shape. .ymin
- END*:scaleFactor
- AS mbr
- FROM sde.counties),
- buckets (bucket) as (
- SELECT
- CASE
- WHEN mbr<10.0 then ceil(mbr/0.5)*0.5
- WHEN mbr<100.0 then ceil(mbr/5)*5
- WHEN mbr<1000.0 then ceil(mbr/50)*50
- WHEN mbr<10000.0 then ceil(mbr/500)*500
- WHEN mbr<100000.0 then ceil(mbr/5000)*5000
- ELSE 100000
- END AS bucket
- FROM mbrs)
- CASE
- SELECT
- bucket/:scaleFactor AS bucket,
- COUNT(*) AS count
- FROM groups
- GROUP BY bucket
- ORDER by bucket
- CASE WHEN shape . . . xmax−shape . . . xmin>shape . . . ymax−shape . . . ymin
System System When a grid cell size G System For a space of dimension “d” The total cost “NeQr” is then the sum of “NeQr[i]” over the grid sizes for all levels. In step An index performance indicator NeQrMin Method The number of levels, NumLevels Statistics are computed on a per-level basis (step System Therefore, if the result of decision step The RAM While the program Generally, the computer programs and operating systems are all tangibly embodied in a computer-usable medium, such as the memory Moreover, the computer programs The user interface When operating in accordance with one embodiment of system System It is to be understood that the specific embodiments of the invention that have been described are merely illustrative of certain applications of the principle of system Referenced by
Classifications
Legal Events
Rotate |