Ron is a senior software engineer for Informix. He can be contacted at gutman@ informix.com.
If you're one of those programmers who thinks that work done 10 years ago is hopelessly out of date, here's a fact to consider -- David Hilbert published his space-filling curve in 1891. As Ron shows this month, it's still cutting-edge stuff.
-- Tim Kientzle
Imagine a database table of homes for sale. To find homes in the $140,000 to $150,000 price range, you could use the SQL query in Example 1(a). Such one-dimensional queries are efficiently handled by the B-tree indexes provided by most DBMS systems. However, a B-tree index is less efficient for retrieving all homes in a geographical area, as in Example 1(b). Using a B-tree index on longitude, the DBMS would retrieve all homes with a longitude in the desired range of longitudes, then eliminate those with a latitude outside the desired range of latitudes (Figure 1). In the process, the DBMS would retrieve far more data than required before obtaining the final result. Indexing on latitude would not help.
Ideally, the DBMS engine would provide a different type of index designed for efficient retrieval of multidimensional data. Informix, for example, offers R-tree indexes (see "R-trees: A Dynamic Index Structure for Spatial Searching," by A. Guttman, Proceedings of the SIGMOD Conference, June 1984). R-trees offer superior performance and flexibility, but adding a new type of index to a DBMS engine is not easy. Oracle's Spatial Cartridge (http:// ntsolutions.oracle .com/products/sdo/html/sdo_twp.htm) relies on a different approach, one that can be implemented without modifying the DBMS engine.
Like Oracle's approach, the technique I examine here works in conjunction with a B-tree or any other one-dimensional index. It relies on the properties of space-filling curves to achieve good performance and can be implemented by the application developer, because no modification to the DBMS is required.
Spatial Orderings
A space-filling curve can help index points on a map by placing the many points of the map's region into some suitable order, like beads on a string. Using this ordering, you can translate points of the 2D region into values suitable for B-tree indexing. In other words, the space-filling curve turns a 2D problem into a one-dimensional problem.
To approximate a space-filling curve, divide the area of interest into small squares (cells) and place them in some predetermined order. Figure 2 shows three successively detailed approximations of the well-known space-filling curve invented by David Hilbert. The Hilbert curve is the limit of the infinite sequence of these approximations. To distinguish between an infinite space-filling curve and a finite approximation of it, I use the term "spatial ordering" for these approximations.
To use a B-tree for spatial indexing, use the spatial ordering to convert 2D attributes into a one-dimensional "spatial key." The spatial key is just a number that reflects the position of the cell within the spatial ordering. Figure 3(a) shows the numberings of the cells in a Hilbert ordering.
Designing and Creating the Database
When you design the database, first divide the region into small cells. There is more than one way to do this, but the most common employs a "quadtree" hierarchy. Assuming the region to be divided is a square, a quadtree divides the square into four smaller squares equal in size. Each of those squares is further divided into four still smaller squares. The division process is repeated recursively until the cells are appropriately small. The end result is a grid of cells. Figure 2 shows the first, second, and third steps in this process.
Next, put the grid cells into some order. Figure 3 shows three possible orders -- row order, Morton order (also called "Z-order"), and Hilbert order (based on the Hilbert curve). Row order suffers a serious disadvantage because it is not based on a hierarchy such as a quadtree. Morton and Hilbert order are both based on a quadtree hierarchy, so they are natural allies for grid cells derived from a quadtree.
Finally, number the cells according to the spatial order. Assign number 0 to the first cell in the ordering, number 1 to the second cell, and so on. Figure 3 shows the cells numbered according to Hilbert, Morton, and row orders. These numbers are the spatial keys -- the one-dimensional attribute that will be indexed by the B-tree.
How small should the cells be? As small as you can make them without overflowing the integer size for the spatial keys. If the spatial keys are unsigned 32-bit integers, then you can divide the quadtree to a depth of 16; you will need every possible 32-bit integer value to assign each cell a number, but you will have enough values.
The numbering of the cells is cast in stone at design time. Once you've stored spatial keys in the database, the query capability will work only if it assumes the same numbering scheme.
When you create your database table of points, include a column to contain this spatial key, which I will refer to as skey. The database will have a B-tree index on the skey column, which you must be careful to update every time you insert a new item or change the coordinates of some entry.
Query Time
When a buyer wants to see a map of available homes in a particular area, the software first determines which cells lie completely or partially inside the area of interest (called the "query region"), then constructs a query asking for the houses with the corresponding skeys values. This returns all the houses in those cells. Because some of the cells were partly outside of the query region, the final step is to filter out the houses that lie outside of the query region.
An Implementation
The queryRanges function given in space.h and space.c (available electronically; see "Resource Center," page 5) accepts a Box structure containing the latitudes and longitudes of the boundaries of a rectangular area. The function returns a list where each item is a range of keys. The keys are grouped into ranges because the number of keys could be large enough to cause performance headaches and certainly much larger than you could expect the DBMS to deal with. Each range of keys includes a beginning key and an ending key. Any house with a key equal to the beginning key or the ending key of a range, or any key value between the two, should be retrieved from the database.
The queryRanges function is designed to be used both for searching (when you need to know all the keys in a large area) and updating a single record (when it computes the key for a particular location). In the latter case, the two points in the Box structure are the same. Figure 4 shows a query region with its ranges of Hilbert keys and the equivalent SQL.
Performance
Computing the spatial keys does not present a major performance issue. Rather, the question is how fast the database server can retrieve the data using the spatial keys you have computed. Two things determine the speed of retrieval when using spatial keys:
- The number and complexity of the queries sent to the server.
- The amount of data read from disk and transmitted to the client program.
When a database server processes a query, it begins by parsing it and creating a plan for executing the query. For spatial queries, more ranges lead to more query overhead, because each range involves some overhead in query compilation and in accessing the index.
On the other hand, more ranges mean less data read from the disk. Ranges of spatial keys are a way of describing the query region, but the description is approximate. The region described must always include all of the original query region so as to include all of the desired data, but it may include some undesired data. More ranges provide a tighter, more accurate description of the query region and the retrieval of less undesired data. Fewer ranges reduces the query overhead.
Clearly, what you want is a good trade-off between the number of ranges and the accuracy of the region represented by those ranges -- a continuous space-filling curve offers the best trade off.
Morton versus Hilbert
Figures 4 and 5 show ranges computed with the discontinuous Morton curve and the continuous Hilbert curve; space.h and space.c implement both. The compile-time constant, HILBERT, enables the Hilbert implementation.
In general, the number of ranges is proportional to the length of the boundary of the query region. When the Hilbert curve is used to produce ranges for a rectangular query region, the number of ranges averages about (W+H)/2, where W and H are dimensions of the rectangle in units equal to the width of the cells. For the Morton curve, the average is closer to W+H. The jumps in the Morton curve explain this difference.
Though Hilbert clearly performs better, some say that the additional complexity is not worthwhile. Presumably, it is too hard to code. You can decide for yourself by looking at the C code where the compile-time constant is used to enable/disable the Hilbert curve.
Large and Small Query Regions
A space-filling curve is like a fractal; it has the property of self-similarity, that is, if you zoom in, you find that the details look a lot like the big picture. This property gives a spatial ordering obtained using a space-filling curve an important advantage over a spatial ordering that is not based on a space-filling curve, such as row order illustrated in Figure 3(c). The methods outlined so far will work well when all of the query regions are roughly the same size. However, if the spatial ordering is not based on a space-filling curve, queries of greatly varying size would present problems. Large queries would require large numbers of ranges, resulting in poor performance.
The number of ranges on average is proportional to the perimeter of the query region. Suppose the cells are 100 meters across. A query 1 kilometer long and high will require 10 ranges on average. That's okay. A query 20 kilometers long and high will require 200 ranges on average. That's not so good. Suppose the query is a query for airports. Users would not expect much data to be retrieved so they wouldn't expect the query to take long. But 200 ranges will entail a significant amount of overhead in database processing.
However, because space-filling curves are based on a hierarchy (a quadtree hierarchy in the case of the Hilbert and Morton curves), they offer a solution. The quadtree hierarchy includes cells of varying size. Each leaf cell is assigned a single key. Each higher-level cell corresponds to a single range of keys.
Figures 4 and 6 show how you can use this relationship between the curve and the quadtree to generate about the same number of ranges regardless of the size of the query region. The query region in Figure 4 is covered by three ranges. Figure 6(a) is a larger query region with seven ranges. Figure 6(b) shows the same query but with leaf cells aggregated into cells twice as wide and high to reduce the number of ranges. The number of ranges is about the same as for the smaller query of Figure 4.
You pay a small price for this benefit; the excess area covered by the aggregated ranges is greater than if the aggregation is not used. The ranges include cells that do not overlap the query region. However, you choose depth to keep the excess roughly in proportion to the area of the query region, so the overall solution is acceptable.
Regardless of the size of a query region, it can be represented with reasonable accuracy by a reasonable number of ranges without any change to the keys in the database.
The depth parameter of queryRanges is used to control the amount of aggregation. It says how far down in the quadtree hierarchy to go, that is, what size cells to use when generating ranges. The highest level cell in the hierarchy is at level 0. If depth is 3, then queryRanges will descend three levels to cells that are 1/8 as wide and high as the top-level cell. In Figures 4, 5, and 6(a), depth is 3. In Figure 6(b), depth is 2.
How queryRanges Works
The function queryRanges calls the recursive function subtreeRanges, which does the real work. The subtreeRanges function explores the quadtree of cells to find cells that overlap with the query region. The recursion begins with the top quadtree cell, which is passed to subtreeRanges by queryRanges.
When called, subtreeRanges compares two regions, the query region represented by the structure *pQuery and the quadtree cell represented by the structure *pCell. The Box structure is used to express the rectangular limits of both the cells and the query region. If the query region and the cell do not overlap, subtreeRanges returns without doing anything. If they do overlap, then one of two things will happen:
- If the level of the cell is less than the level given by the depth parameter (that is, the cell is higher in the quadtree than the lowest cells to be looked at by subtreeRanges), then subtreeRanges constructs the four child cells of the cell and calls itself four times to explore the subtree for each child cell.
- If the level of the cell is equal to or greater than depth, then the subtree will not be explored. Instead, subtreeRanges computes the range of keys for the cell and adds that range to the list of ranges at *pRangeList. If the cell is a leaf cell, then the range will consist of a single key. (The constant DEPTH determines the leaf depth.)
To compute the lowest and highest keys within a cell, first look at the relationship between a key and the quadtree. Each level in the quadtree hierarchy corresponds to two bits in a key. The highest two bits specify one of the four cells at level 1. The next highest two bits specify the level 2 child of the level 1 cell, and so on. The lowest order two bits picks one leaf cell out of four. These relationships hold for both Morton and Hilbert orderings.
The structure Cell contains a field, keyPrefix, that contains the bits that are common to all keys for cells in the subtree of the cell. Each of the subtree keys can be generated by shifting keyPrefix to the left and filling in the low-order bits. But you only need the first key and the last key in the range of keys. The first key is just keyPrefix shifted left, the low-order bits set to 0. The last key can be computed by adding the total number of keys within the cell to the first key and subtracting 1. The number of keys within the cell is simply a power of 4. At the leaf level, it is 1. One level above that it is 4, and two levels above it is 16. The power of 4 is easily computed with a shift.
So shifts and adds are all that are needed to compute the range from the cell's depth and prefix. In addition, you need to compute a child's prefix from the parent's prefix. That is done by shifting the parent prefix left two bits and adding the two bits for the child cell.
When using Morton order, it is as simple as that. For Hilbert order, all of this applies, but additional complexity arises from the changing orientations of its basic U shape. The Morton order is built from Z shapes that all have the same orientation. The U shape in a Hilbert ordering is rotated into various positions -- sometimes on its side and sometimes upside down -- to maintain continuity.
To deal with these rotations, the relative order of four sibling cells in the Hilbert order is computed and stored in the local array hilbertOrder, which is used to compute the key prefixes for the cells. Two static tables are used to compute the ordering stored in hilbertOrder. The table hilbertIndexTable provides the position of a cell among its siblings given its position among its sibling in the Morton order and the orientation of its parent's U shape. The table hilbertRotationTable provides the orientation of the cell's U shape given that of the parent and the cell's position among its siblings.
Admittedly, the code for Hilbert is tricky, but it's not much code for the benefit yielded.
As Figure 4 illustrates, a range of keys need not correspond to a single quadtree cell. Some ranges consist of several adjoining cells. Because you want to minimize the number of ranges produced by queryRanges, you want to ensure that one range is produced in such cases, not separate ranges for each cell. This is achieved by taking advantage of the fact that queryRanges visits cells in order of their appearance in the spatial ordering and the fact that ranges are consequently placed in the range list in ascending order. When a range is added to the list, it is compared with the last range added to the list. If there are no intervening key values between them, the two ranges are combined into one range representing exactly the same set of cells. The order of visitation ensures that no opportunities to combine will be missed.
The difference in effectiveness between the Morton and Hilbert orders arises in the combining of ranges. For Hilbert, ranges will be combined much more often.
Application in an RDBMS Environment
These techniques can be used to provide spatial indexing for spatial objects stored in a relational database that provides only B-trees for indexing. Of course, there are some practical issues to consider.
The natural way to turn a set of ranges into a SQL query is to use the SQL OR operator to combine the ranges in a single SQL query. However, many database query engines will fail to use the B-tree index on the spatial key when presented with a query like this. However, they will do the right thing when presented with several SQL queries combined into a single query with the UNION operator. The effect is the same, except that failure to use the B-tree would result in incredibly bad performance.
Conceivably, some database engines will choke if presented with too many ranges in a single query (whether using the OR or the UNION operator). It may then be necessary to either reduce the number of ranges or use multiple queries, each with a small number of ranges.
Possible Extensions
The techniques I've outlined here are suitable for many applications, particularly those in which the objects are points. For wider application, there are many possible extensions, for example, support for lines and areas as query regions or database objects. Some applications may require keys longer than 32 bits to ensure adequate resolution when the database has densely packed objects in a large space. Other optimizations can be implemented to produce better results using fewer resources.
Some applications may have 3D requirements. For example, imagine a database of crimes with locations and times. A query to retrieve crimes in a particular neighborhood during a particular week would call for the indexing of a 3D space with two dimensions of space and one dimension of time. The space-filling curves would wind through this 3D space.
Other Applications
Space-filling curves can do more than query regions. They are handy for "spatial clustering," ordering the data on disk by spatial key so it can be scooped up quickly once your spatial indexing technique has located it. So even when the database server has an R-tree built in (the ideal), space-filling curves can play a role in the retrieval of the data. Space-filling curves are useful even inside the R-tree. A variant of the R-tree called the "Hilbert R-tree" uses the Hilbert curve to help maintain spatial organization within the tree.
Another operation used in spatial databases is the spatial join that locates pairs of objects with some common spatial attribute (for instance, they are within one mile of each other). Usually, the objects come from two tables. Sorting the two tables by spatial key facilitates the search.
Farther afield, space-filling curves have been proposed for solving a problem known as the "Planar Traveling Salesman Problem," which asks you to choose an order for visiting a number of points in a plane so that the total distance traveled is minimized. It is expensive to compute an optimal solution, but a very inexpensive solution is to visit the points in the order they appear on a space-filling curve. After finding an approximate solution quickly, known algorithms can improve on the solution.
Other possible applications include compression of raster images and dithering -- a graphics technique that mixes pixels of varying colors and shades to give the illusion of colors and shades not supported by the display.
Acknowledgments
Thanks to Rick Gutman and reviewers at Informix and Etak for their help in producing this article.
DDJ
Copyright © 1999, Dr. Dobb's Journal